Some checks failed
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Backend (Go) (push) Failing after 4m34s
Veza CI / Rust (Stream Server) (push) Successful in 5m37s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 1m7s
Phase-1 of the active/active backend story. HAProxy in front of two
backend-api containers + two stream-server containers; sticky cookie
pins WS sessions to one backend, URI hash routes track_id to one
streamer for HLS cache locality.
Day 19 acceptance asks for: kill backend-api-1, HAProxy fails over, WS
sessions reconnect to backend-api-2 without loss. The smoke test wires
that gate; phase-2 (W5) will add keepalived for an LB pair.
- infra/ansible/roles/haproxy/
  * Install HAProxy + render haproxy.cfg with frontend (HTTP, optional
    HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky
    cookie SERVERID), stream_pool (URI hash + consistent jump hash).
    A sketch of the rendered pools follows this item.
  * Active health check GET /api/v1/health every 5s; fall=3, rise=2,
    on-marked-down shutdown-sessions + slowstart 30s on recovery.
  * Stats socket bound to 127.0.0.1:9100 for the future prometheus
    haproxy_exporter sidecar.
  * Mozilla Intermediate TLS cipher list; only effective when a cert
    is mounted.
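
  A minimal sketch of what the two rendered pools could look like. The
  pool names, cookie name, and check timings come from this role's
  description; ports, server addresses, and the exact directive choices
  are illustrative assumptions, not the real template output:

    backend api_pool
        balance roundrobin
        option httpchk GET /api/v1/health
        cookie SERVERID insert indirect nocache
        default-server check inter 5s fall 3 rise 2 slowstart 30s on-marked-down shutdown-sessions
        server backend-api-1 backend-api-1:8080 cookie backend-api-1   # port assumed
        server backend-api-2 backend-api-2:8080 cookie backend-api-2

    backend stream_pool
        balance uri              # same track_id URI -> same streamer
        hash-type consistent     # consistent hashing keeps remaps minimal on pool changes
        server stream-server-1 stream-server-1:8000 check   # port assumed
        server stream-server-2 stream-server-2:8000 check
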
- infra/ansible/roles/backend_api/
  * Scaffolding for the multi-instance Go API. Creates the veza-api
    system user, the /opt/veza/backend-api dir, the /etc/veza env dir,
    /var/log/veza, and a hardened systemd unit pointing at the binary.
  * Binary deployment is OUT of scope (documented in the README) — the
    Go binary is built outside Ansible (Makefile target) and pushed
    via incus file push; a sketch of that flow follows this item.
    CI → ansible-pull integration is W5+.
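
  A rough sketch of that out-of-band flow. The Makefile target name,
  binary path, and systemd unit name here are assumptions for
  illustration, not necessarily what the repo uses:

    # Build the Go binary on the host (assumed Makefile target name).
    make build-backend-api

    # Push it into each instance and restart the (assumed) unit.
    for inst in backend-api-1 backend-api-2; do
        incus file push bin/backend-api "$inst/opt/veza/backend-api/backend-api"
        incus exec "$inst" -- systemctl restart veza-backend-api
    done
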
- infra/ansible/playbooks/haproxy.yml: provisions the haproxy Incus
  container + applies the common baseline + the role (rough shape below).
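
  Roughly (the play layout and the baseline role's name are assumptions):

    - hosts: haproxy
      become: true
      roles:
        - common      # shared baseline applied to every node
        - haproxy     # this commit's role
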
- infra/ansible/inventory/lab.yml: 3 new groups:
  * haproxy (single LB node)
  * backend_api_instances (backend-api-{1,2})
  * stream_server_instances (stream-server-{1,2})
  The HAProxy template reads these groups directly to populate its
  upstream blocks; it falls back to the static haproxy_backend_api_fallback
  list if the group is missing (for in-isolation tests; see the fragment
  below).
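
  The group-with-fallback lookup could look roughly like this in the
  template (the server-line details are illustrative):

    {# Prefer the live inventory group; fall back to the static list. #}
    {% for host in groups.get('backend_api_instances', haproxy_backend_api_fallback) %}
    server {{ host }} {{ host }}:8080 cookie {{ host }}
    {% endfor %}
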
- infra/ansible/tests/test_backend_failover.sh (usage example below)
  * step 0: pre-flight — both backends UP per the HAProxy stats socket.
  * step 1: 5 baseline GET /api/v1/health through the LB → all 200.
  * step 2: incus stop --force backend-api-1; record t0.
  * step 3: poll HAProxy stats until backend-api-1 is DOWN
    (timeout 30s; expected ~15s = fall × interval).
  * step 4: 5 GET requests during the down window — all must return 200
    (served by backend-api-2). Fails if any returns non-200.
  * step 5: incus start backend-api-1; poll until UP again.
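
  To point the gate at a non-default LB or victim, use the env
  overrides the script already exposes:

    HAPROXY_HOST=haproxy.lxd HAPROXY_PORT=80 KILL_BACKEND=backend-api-1 \
        bash infra/ansible/tests/test_backend_failover.sh
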
Acceptance (Day 19): smoke test passes; HAProxy sticky cookie
keeps WS sessions on the same backend until that backend dies, at
which point the cookie is ignored and the request rebalances.
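
A quick manual spot-check of the pinning (the SERVERID cookie name is
from the role; the cookie value format shown is an assumption):

    # The first response should set the sticky cookie naming the chosen backend.
    curl -si "http://haproxy.lxd/api/v1/health" | grep -i '^set-cookie: SERVERID'

    # Replaying the cookie keeps hitting that backend until it dies,
    # after which HAProxy ignores it and rebalances.
    curl -s --cookie 'SERVERID=backend-api-1' "http://haproxy.lxd/api/v1/health"
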
W4 progress : Day 16 done · Day 17 done · Day 18 done · Day 19 done ·
Day 20 (k6 nightly load test) pending.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
157 lines
5.8 KiB
Bash
Executable file
#!/usr/bin/env bash
# test_backend_failover.sh — verify HAProxy fails over from backend-api-1
# to backend-api-2 when the first instance dies, with no client-visible
# error window beyond the health-check fall.
#
# Sequence:
#   1. Pre-flight: both backends UP per HAProxy stats.
#   2. Issue 5 GET /api/v1/health through HAProxy; all should return 200.
#      (The SERVERID cookie in the response indicates which backend was chosen.)
#   3. incus stop --force backend-api-1 (or whatever KILL_BACKEND names).
#   4. Poll HAProxy stats until the killed server is marked DOWN
#      (typically within fall × interval = 3 × 5 s = 15 s).
#   5. Issue another 5 GET /api/v1/health; all must return 200, served
#      by the surviving backend.
#   6. incus start backend-api-1; poll until UP again.
#
# v1.0.9 W4 Day 19 — acceptance for the verification gate.
#
# Usage:
#   bash infra/ansible/tests/test_backend_failover.sh
#
# Exit codes:
#   0 — failover happened, no errors during the window
#   1 — pool not healthy at start
#   2 — failover took too long OR errors observed during the window
#   3 — required tool missing
set -euo pipefail

HAPROXY_HOST=${HAPROXY_HOST:-haproxy.lxd}
HAPROXY_PORT=${HAPROXY_PORT:-80}
KILL_BACKEND=${KILL_BACKEND:-backend-api-1}
SURVIVING_BACKEND=${SURVIVING_BACKEND:-backend-api-2}
HEALTH_PATH=${HEALTH_PATH:-/api/v1/health}
DOWN_TIMEOUT_SECONDS=${DOWN_TIMEOUT_SECONDS:-30}
UP_TIMEOUT_SECONDS=${UP_TIMEOUT_SECONDS:-60}

log() { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*" >&2; }
# $1 = message, $2 = exit code (defaults to 2). Use $1, not $*, so the
# exit-code argument never leaks into the logged message.
fail() { log "FAIL: $1"; exit "${2:-2}"; }

require() {
    command -v "$1" >/dev/null 2>&1 || fail "required tool missing on host: $1" 3
}

require incus
require curl
require date
# -----------------------------------------------------------------------------
# Helper: ask the HAProxy admin socket for a server's status (UP / DOWN /
# DRAIN / MAINT). The socket is bound to loopback inside the haproxy
# container, so we reach it via incus exec; socat must exist in there.
# -----------------------------------------------------------------------------
server_status() {
    local server=$1
    incus exec haproxy -- bash -c \
        "echo 'show stat' | socat /run/haproxy/admin.sock - \
         | awk -F, -v s=\"$server\" '\$2 == s {print \$18; exit}'"
}

# One GET through the LB; prints the HTTP status code, or 000 when the
# connection itself fails (so the caller can count it as an error).
curl_via_lb() {
    local code
    code=$(curl --max-time 5 -sS -o /dev/null -w "%{http_code}" \
        "http://${HAPROXY_HOST}:${HAPROXY_PORT}${HEALTH_PATH}" || echo 000)
    echo "$code"
}
# -----------------------------------------------------------------------------
# 1. Pre-flight — both backends must be UP.
# -----------------------------------------------------------------------------
log "step 0: pre-flight — querying HAProxy admin socket"
status_kill=$(server_status "$KILL_BACKEND")
status_survive=$(server_status "$SURVIVING_BACKEND")
log "  $KILL_BACKEND : $status_kill"
log "  $SURVIVING_BACKEND : $status_survive"
if [ "$status_kill" != "UP" ] || [ "$status_survive" != "UP" ]; then
    fail "pool not fully UP at start — refusing to test from a degraded baseline" 1
fi

# -----------------------------------------------------------------------------
# 2. Sanity — 5 successful requests through the LB.
# -----------------------------------------------------------------------------
log "step 1: 5 baseline requests through HAProxy"
for i in 1 2 3 4 5; do
    code=$(curl_via_lb)
    log "  request $i → HTTP $code"
    if [ "$code" != "200" ]; then
        fail "baseline request $i returned HTTP $code, want 200" 1
    fi
done

# -----------------------------------------------------------------------------
# 3. Kill the backend container.
# -----------------------------------------------------------------------------
log "step 2: stopping $KILL_BACKEND — start failover timer"
t0=$(date +%s)
incus stop --force "$KILL_BACKEND"

# -----------------------------------------------------------------------------
# 4. Poll until HAProxy marks the killed server DOWN.
# -----------------------------------------------------------------------------
log "step 3: polling HAProxy until $KILL_BACKEND is DOWN (timeout ${DOWN_TIMEOUT_SECONDS}s)"
deadline=$((t0 + DOWN_TIMEOUT_SECONDS))
killed_down=0
while [ "$(date +%s)" -lt "$deadline" ]; do
    s=$(server_status "$KILL_BACKEND")
    if [ "$s" = "DOWN" ] || [ "$s" = "MAINT" ]; then
        killed_down=1
        break
    fi
    sleep 1
done
elapsed=$(( $(date +%s) - t0 ))
if [ "$killed_down" -eq 0 ]; then
    fail "$KILL_BACKEND not marked DOWN within ${DOWN_TIMEOUT_SECONDS}s" 2
fi
log "  $KILL_BACKEND went DOWN in ${elapsed}s"

# -----------------------------------------------------------------------------
# 5. 5 requests through the LB — all must succeed via the surviving backend.
# -----------------------------------------------------------------------------
log "step 4: 5 requests through HAProxy with $KILL_BACKEND down"
errors=0
for i in 1 2 3 4 5; do
    code=$(curl_via_lb)
    log "  request $i → HTTP $code"
    if [ "$code" != "200" ]; then
        errors=$((errors + 1))
    fi
done
if [ "$errors" -gt 0 ]; then
    fail "$errors of 5 requests failed during failover — survivor isn't catching all traffic" 2
fi

# -----------------------------------------------------------------------------
# 6. Restart the killed backend and confirm it rejoins as UP.
# -----------------------------------------------------------------------------
log "step 5: restarting $KILL_BACKEND"
incus start "$KILL_BACKEND" || true
log "  polling until $KILL_BACKEND is UP again (timeout ${UP_TIMEOUT_SECONDS}s)"
deadline=$(( $(date +%s) + UP_TIMEOUT_SECONDS ))
recovered=0
while [ "$(date +%s)" -lt "$deadline" ]; do
    s=$(server_status "$KILL_BACKEND")
    if [ "$s" = "UP" ]; then
        recovered=1
        break
    fi
    sleep 2
done
if [ "$recovered" -eq 0 ]; then
    log "WARN: $KILL_BACKEND did not return to UP within ${UP_TIMEOUT_SECONDS}s — manual check needed"
else
    log "  $KILL_BACKEND back UP"
fi

log "PASS: HAProxy failover OK ($KILL_BACKEND down in ${elapsed}s, no client-visible errors during the window)"
exit 0