# `haproxy` role — TLS termination + sticky-WS load balancer

Single Incus container in front of the active/active backend-api fleet and the stream-server fleet. v1.0.9, W4 Day 19 — phase-1 of the HA story (single-host LB; phase-2 adds keepalived for an LB pair).

## Topology

```
          :80 / :443
               │
        ┌──────▼─────────┐
        │ haproxy.lxd    │  (this role)
        │ HTTP + WS      │
        │ TLS terminate  │
        │ sticky cookie  │
        └─┬───────┬──────┘
          │       │
   ┌──────┘       └──────┐
   ▼                     ▼
┌──────────────┐  ┌──────────────┐
│ api_pool     │  │ stream_pool  │
│  ─────────   │  │  ─────────   │
│ backend-api-1│  │ stream-srv-1 │
│ backend-api-2│  │ stream-srv-2 │
│ (port 8080)  │  │ (port 8082)  │
│ Round-robin  │  │ URI-hash     │
│ Sticky cookie│  │ (track_id)   │
└──────────────┘  └──────────────┘
```

## Why these balance modes

- **api_pool: `balance roundrobin` + `cookie SERVERID insert indirect`.** The Go API is stateless (sessions live in Redis), so any backend can serve any request. The cookie keeps a logged-in user pinned to one backend through the session, which makes WebSocket upgrades land on the same instance that authenticated the user — avoiding a Redis round-trip on every WS hello.
- **stream_pool: `balance uri whole` + `hash-type consistent`.** The Rust streamer keeps a hot HLS-segment cache in process. URI-hash routes the same track_id to the same node; consistent hashing means adding or removing a node only displaces ~`1/N` of the keys, not the entire pool.

## Failover behaviour

- Health check `GET /api/v1/health` (or `/health` for stream) every `haproxy_health_check_interval_ms` ms (default 5 s). 3 consecutive failures = down; 2 consecutive successes = back up.
- `on-marked-down shutdown-sessions`: when a backend drops, all its in-flight TCP/WS sessions are cut. Clients reconnect; their cookie still points at the dead backend, so HAProxy ignores the stale pin and re-balances. WebSocket clients on the frontend (chat, presence) MUST handle the close + reconnect — that's already wired in `apps/web/src/features/chat/services/websocket.ts`.
- `slowstart {{ haproxy_graceful_drain_seconds }}s`: when a backend recovers, its weight ramps up linearly over 30 s instead of the server immediately taking its full share of the traffic. Smoothes the post-restart latency spike. See the config sketch below for how these directives fit together.
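The two stanzas below are a minimal hand-written sketch, not the role's rendered template: host names such as `backend-api-1.lxd` are assumptions, and the literal values (ports, intervals, cookie name) mirror the defaults in the table further down.

```
# Illustrative sketch only; the real config is rendered from the role's template.
backend api_pool
    balance roundrobin
    # Sticky cookie: pins a session to the backend that set it
    cookie VEZA_SERVERID insert indirect
    option httpchk GET /api/v1/health
    # inter = haproxy_health_check_interval_ms, fall/rise = down/up thresholds,
    # slowstart = haproxy_graceful_drain_seconds
    default-server check inter 5000 fall 3 rise 2 slowstart 30s on-marked-down shutdown-sessions
    server backend-api-1 backend-api-1.lxd:8080 cookie backend-api-1
    server backend-api-2 backend-api-2.lxd:8080 cookie backend-api-2

backend stream_pool
    # Same URI (track_id) always hashes to the same node; consistent hashing
    # limits key movement to ~1/N of the keys when the pool changes size
    balance uri whole
    hash-type consistent
    option httpchk GET /health
    default-server check inter 5000 fall 3 rise 2 slowstart 30s on-marked-down shutdown-sessions
    server stream-srv-1 stream-srv-1.lxd:8082
    server stream-srv-2 stream-srv-2.lxd:8082
```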
## Defaults

| variable                           | default         | meaning                                            |
| ---------------------------------- | --------------- | -------------------------------------------------- |
| `haproxy_listen_http`              | `80`            | HTTP listener                                      |
| `haproxy_listen_https`             | `443`           | HTTPS listener (only bound when cert set)          |
| `haproxy_tls_cert_path`            | `""`            | path to PEM (cert+key concat). Empty = HTTP only.  |
| `haproxy_backend_api_port`         | `8080`          | upstream port for backend-api                      |
| `haproxy_stream_server_port`       | `8082`          | upstream port for stream-server                    |
| `haproxy_health_check_interval_ms` | `5000`          | active-check cadence                               |
| `haproxy_health_check_fall`        | `3`             | failed checks before "down"                        |
| `haproxy_health_check_rise`        | `2`             | successful checks before "up"                      |
| `haproxy_graceful_drain_seconds`   | `30`            | post-recovery weight ramp-up                       |
| `haproxy_sticky_cookie_name`       | `VEZA_SERVERID` | cookie name for backend stickiness                 |

## Operations

```bash
# Health view (admin socket, loopback only):
sudo socat /run/haproxy/admin.sock - <<< "show servers state"
sudo socat /run/haproxy/admin.sock - <<< "show stat"

# Disable a server gracefully (drains existing connections,
# new requests skip it; useful before a planned restart):
echo "set server api_pool/backend-api-1 state drain" | sudo socat /run/haproxy/admin.sock -
# ...wait haproxy_graceful_drain_seconds, then on the backend host:
#   sudo systemctl restart veza-backend-api
echo "set server api_pool/backend-api-1 state ready" | sudo socat /run/haproxy/admin.sock -

# Stats UI for a human (browser only; bound to localhost):
ssh -L 9100:localhost:9100 haproxy.lxd
# then open http://localhost:9100/stats

# Live log tail (HAProxy logs to journald via /dev/log):
sudo journalctl -u haproxy -f
```

## Failover smoke test

```bash
bash infra/ansible/tests/test_backend_failover.sh
```

Sequence: verifies the api_pool is healthy at start, kills `backend-api-1`, polls HAProxy until the server is marked DOWN, asserts the next request still gets a 200 (served by `backend-api-2`), restarts the killed container, and asserts it rejoins as healthy. Suitable for the W2 game-day drill on day 24.

## What this role does NOT cover

- **TLS cert provisioning.** Phase-1 lab: HTTP only. Phase-2 mounts a Let's Encrypt cert from Caddy's data dir or provisions one directly via certbot (see the PEM sketch after this list). mTLS to the backends is W5 territory.
- **Multi-LB HA.** Single HAProxy node — if it dies, the cluster is dark. Phase-2 adds keepalived + a floating VIP.
- **Rate limiting.** The Gin middleware does that today; pushing it to the LB is a v1.1 optimisation.
- **WebSocket auth header passing.** HAProxy passes `Sec-WebSocket-*` headers through unchanged; Gin's middleware authenticates the upgrade request. No extra config needed.
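For phase-2 reference, `haproxy_tls_cert_path` expects a single PEM file containing the certificate chain followed by the private key. A minimal sketch of producing that file from a certbot lineage; the domain and destination path are placeholders, not project values:

```bash
# Concatenate chain + key into the single PEM HAProxy expects.
# /etc/letsencrypt/live/<domain>/ is certbot's standard layout; the
# domain and destination below are placeholders.
DOMAIN=example.com
sudo cat /etc/letsencrypt/live/${DOMAIN}/fullchain.pem \
         /etc/letsencrypt/live/${DOMAIN}/privkey.pem |
  sudo tee /etc/haproxy/certs/${DOMAIN}.pem > /dev/null
sudo chmod 600 /etc/haproxy/certs/${DOMAIN}.pem
# Point haproxy_tls_cert_path at the combined file, then re-run the role
# so the HTTPS listener gets bound.
```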