veza/infra/ansible/tests/test_backend_failover.sh
senke a9541f517b
feat(infra): haproxy sticky WS + backend_api multi-instance scaffold (W4 Day 19)
Phase-1 of the active/active backend story. HAProxy sits in front of two
backend-api containers and two stream-server containers; a sticky cookie
pins WS sessions to one backend, and URI hashing routes each track_id to
one streamer for HLS cache locality.

Day 19 acceptance: kill backend-api-1, HAProxy fails over, and WS
sessions reconnect to backend-api-2 without loss. The smoke test wires
that gate; phase-2 (W5) will add keepalived for an LB pair.

- infra/ansible/roles/haproxy/
  * Install HAProxy + render haproxy.cfg with frontend (HTTP, optional
    HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky
    cookie SERVERID), stream_pool (URI-hash + consistent jump-hash).
  * Active health check GET /api/v1/health every 5s; fall=3, rise=2;
    on-marked-down shutdown-sessions, slowstart 30s on recovery.
  * Stats socket bound to 127.0.0.1:9100 for the future prometheus
    haproxy_exporter sidecar.
  * Mozilla Intermediate TLS cipher list; only effective when a cert
    is mounted.
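
  The rendered pools might look roughly like this (a sketch only:
  server names, ports, and the .lxd addresses are illustrative, not the
  template's actual output):

```
backend api_pool
    balance roundrobin
    # Sticky cookie: pins a WS session to one backend until it dies;
    # then the cookie is ignored and the request rebalances.
    cookie SERVERID insert indirect nocache
    option httpchk GET /api/v1/health
    default-server inter 5s fall 3 rise 2 slowstart 30s on-marked-down shutdown-sessions
    server backend-api-1 backend-api-1.lxd:8080 check cookie backend-api-1
    server backend-api-2 backend-api-2.lxd:8080 check cookie backend-api-2

backend stream_pool
    # URI hash + consistent hashing: a given track_id keeps landing on
    # the same streamer, preserving HLS cache locality.
    balance uri
    hash-type consistent
    server stream-server-1 stream-server-1.lxd:8081 check
    server stream-server-2 stream-server-2.lxd:8081 check
```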

- infra/ansible/roles/backend_api/
  * Scaffolding for the multi-instance Go API. Creates veza-api
    system user, /opt/veza/backend-api dir, /etc/veza env dir,
    /var/log/veza, and a hardened systemd unit pointing at the binary.
  * Binary deployment is OUT of scope (documented in README) — the
    Go binary is built outside Ansible (Makefile target) and pushed
    via incus file push. CI → ansible-pull integration lands in W5+.
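
  The hardened unit could look roughly like this (unit name, env file
  name, and the exact hardening directives are illustrative; only the
  user and paths come from the role description above):

```
# /etc/systemd/system/veza-api.service — illustrative sketch
[Unit]
Description=Veza backend API
After=network-online.target
Wants=network-online.target

[Service]
User=veza-api
Group=veza-api
EnvironmentFile=-/etc/veza/backend-api.env
ExecStart=/opt/veza/backend-api/backend-api
Restart=on-failure
RestartSec=2
# Hardening (illustrative subset)
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadWritePaths=/var/log/veza

[Install]
WantedBy=multi-user.target
```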

- infra/ansible/playbooks/haproxy.yml: provisions the haproxy Incus
  container, then applies the common baseline and the role.

- infra/ansible/inventory/lab.yml: 3 new groups:
  * haproxy (single LB node)
  * backend_api_instances (backend-api-{1,2})
  * stream_server_instances (stream-server-{1,2})
  The HAProxy template reads these groups directly to populate its
  upstream blocks; it falls back to the static haproxy_backend_api_fallback
  list if the group is missing (for in-isolation tests).
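
  A sketch of the three groups in lab.yml (hostnames match the lab
  containers named above; any connection vars are omitted and the exact
  layout of the real inventory may differ):

```
haproxy:
  hosts:
    haproxy:
backend_api_instances:
  hosts:
    backend-api-1:
    backend-api-2:
stream_server_instances:
  hosts:
    stream-server-1:
    stream-server-2:
```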

- infra/ansible/tests/test_backend_failover.sh
  * step 0: pre-flight — both backends UP per HAProxy stats socket.
  * step 1: 5 baseline GET /api/v1/health through the LB → all 200.
  * step 2: incus stop --force backend-api-1; record t0.
  * step 3: poll HAProxy stats until backend-api-1 is DOWN
    (timeout 30s; expected ~15s = fall × interval).
  * step 4: 5 GET requests during the down window — all must return 200
    (served by backend-api-2). Fails if any returns non-200.
  * step 5: incus start backend-api-1; poll until UP again.
Acceptance (Day 19): smoke test passes; HAProxy's sticky cookie
keeps WS sessions on the same backend until that backend dies, at
which point the cookie is ignored and the request rebalances.

W4 progress : Day 16 done · Day 17 done · Day 18 done · Day 19 done ·
Day 20 (k6 nightly load test) pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:32:48 +02:00


#!/usr/bin/env bash
# test_backend_failover.sh — verify HAProxy fails over from backend-api-1
# to backend-api-2 when the first instance dies, with no client-visible
# error window beyond the health-check fall.
#
# Sequence:
#   1. Pre-flight: both backends UP per HAProxy stats.
#   2. Issue 5 GET /api/v1/health through HAProxy; all should return 200.
#      Capture the SERVERID cookie to know which backend was chosen.
#   3. incus stop --force backend-api-1 (or whichever backend the cookie pinned).
#   4. Poll HAProxy stats until the killed server is marked DOWN
#      (typically within fall × interval = 3 × 5 s = 15 s).
#   5. Issue another 5 GET /api/v1/health; all must return 200, served
#      by the surviving backend.
#   6. incus start backend-api-1; poll until UP again.
#
# v1.0.9 W4 Day 19 — acceptance for the verification gate.
#
# Usage:
#   bash infra/ansible/tests/test_backend_failover.sh
#
# Exit codes:
#   0 — failover happened, no errors during the window
#   1 — pool not healthy at start
#   2 — failover took too long OR errors observed during the window
#   3 — required tool missing
set -euo pipefail
HAPROXY_HOST=${HAPROXY_HOST:-haproxy.lxd}
HAPROXY_PORT=${HAPROXY_PORT:-80}
KILL_BACKEND=${KILL_BACKEND:-backend-api-1}
SURVIVING_BACKEND=${SURVIVING_BACKEND:-backend-api-2}
HEALTH_PATH=${HEALTH_PATH:-/api/v1/health}
DOWN_TIMEOUT_SECONDS=${DOWN_TIMEOUT_SECONDS:-30}
UP_TIMEOUT_SECONDS=${UP_TIMEOUT_SECONDS:-60}
log() { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*" >&2; }
fail() { log "FAIL: $*"; exit "${2:-2}"; }
require() {
  command -v "$1" >/dev/null 2>&1 || fail "required tool missing on host: $1" 3
}
require incus
require curl
require date
# -----------------------------------------------------------------------------
# Helper: ask the HAProxy admin socket for a server's status (UP / DOWN /
# DRAIN / MAINT). The socket is bound to loopback inside the haproxy container.
# -----------------------------------------------------------------------------
server_status() {
  local server=$1
  incus exec haproxy -- bash -c \
    "echo 'show stat' | socat /run/haproxy/admin.sock - \
    | awk -F, -v s=\"$server\" '\$2 == s {print \$18; exit}'"
}
# GET the health endpoint through the LB; echo the HTTP status code,
# or 000 if curl itself failed (timeout, connection refused, ...).
curl_via_lb() {
  local code
  code=$(curl --max-time 5 -sS -o /dev/null -w "%{http_code}" \
    "http://${HAPROXY_HOST}:${HAPROXY_PORT}${HEALTH_PATH}" || echo 000)
  echo "$code"
}
# -----------------------------------------------------------------------------
# 1. Pre-flight — both backends must be UP.
# -----------------------------------------------------------------------------
log "step 0: pre-flight — querying HAProxy admin socket"
status_kill=$(server_status "$KILL_BACKEND")
status_survive=$(server_status "$SURVIVING_BACKEND")
log "  $KILL_BACKEND : $status_kill"
log "  $SURVIVING_BACKEND : $status_survive"
if [ "$status_kill" != "UP" ] || [ "$status_survive" != "UP" ]; then
  fail "pool not fully UP at start — refusing to test from a degraded baseline" 1
fi
# -----------------------------------------------------------------------------
# 2. Sanity — 5 successful requests through the LB.
# -----------------------------------------------------------------------------
log "step 1: 5 baseline requests through HAProxy"
for i in 1 2 3 4 5; do
  code=$(curl_via_lb)
  log "  request $i → HTTP $code"
  if [ "$code" != "200" ]; then
    fail "baseline request $i returned HTTP $code, want 200" 1
  fi
done
# -----------------------------------------------------------------------------
# 3. Kill the backend container.
# -----------------------------------------------------------------------------
log "step 2: stopping $KILL_BACKEND — start failover timer"
t0=$(date +%s)
incus stop --force "$KILL_BACKEND"
# -----------------------------------------------------------------------------
# 4. Poll until HAProxy marks the killed server DOWN.
# -----------------------------------------------------------------------------
log "step 3: polling HAProxy until $KILL_BACKEND is DOWN (timeout ${DOWN_TIMEOUT_SECONDS}s)"
deadline=$((t0 + DOWN_TIMEOUT_SECONDS))
killed_down=0
while [ "$(date +%s)" -lt "$deadline" ]; do
  s=$(server_status "$KILL_BACKEND")
  if [ "$s" = "DOWN" ] || [ "$s" = "MAINT" ]; then
    killed_down=1
    break
  fi
  sleep 1
done
elapsed=$(( $(date +%s) - t0 ))
if [ "$killed_down" -eq 0 ]; then
  fail "$KILL_BACKEND not marked DOWN within ${DOWN_TIMEOUT_SECONDS}s" 2
fi
log " $KILL_BACKEND went DOWN in ${elapsed}s"
# -----------------------------------------------------------------------------
# 5. 5 requests through the LB — all must succeed via the surviving backend.
# -----------------------------------------------------------------------------
log "step 4: 5 requests through HAProxy with $KILL_BACKEND down"
errors=0
for i in 1 2 3 4 5; do
  code=$(curl_via_lb)
  log "  request $i → HTTP $code"
  if [ "$code" != "200" ]; then
    errors=$((errors + 1))
  fi
done
if [ "$errors" -gt 0 ]; then
  fail "$errors of 5 requests failed during failover — survivor isn't catching all traffic" 2
fi
# -----------------------------------------------------------------------------
# 6. Restart the killed backend and confirm it rejoins as UP.
# -----------------------------------------------------------------------------
log "step 5: restarting $KILL_BACKEND"
incus start "$KILL_BACKEND" || true
log " polling until $KILL_BACKEND is UP again (timeout ${UP_TIMEOUT_SECONDS}s)"
deadline=$(( $(date +%s) + UP_TIMEOUT_SECONDS ))
recovered=0
while [ "$(date +%s)" -lt "$deadline" ]; do
  s=$(server_status "$KILL_BACKEND")
  if [ "$s" = "UP" ]; then
    recovered=1
    break
  fi
  sleep 2
done
if [ "$recovered" -eq 0 ]; then
  log "WARN: $KILL_BACKEND did not return to UP within ${UP_TIMEOUT_SECONDS}s — manual check needed"
else
  log "  $KILL_BACKEND back UP"
fi
log "PASS: HAProxy fail-over OK ($KILL_BACKEND down in ${elapsed}s, no client-visible errors during the window)"
exit 0