# `haproxy` role — TLS termination + sticky-WS load balancer
Single Incus container in front of the active/active backend-api fleet and the stream-server fleet. v1.0.9, W4 Day 19 — phase-1 of the HA story (single-host LB; phase-2 adds keepalived for an LB pair).
## Topology
```
          :80 / :443
              │
      ┌───────▼────────┐
      │  haproxy.lxd   │  (this role)
      │  HTTP + WS     │
      │  TLS terminate │
      │  sticky cookie │
      └──┬─────────┬───┘
         │         │
     ┌───┘         └───┐
     ▼                 ▼
┌──────────────┐  ┌──────────────┐
│   api_pool   │  │ stream_pool  │
│  ─────────   │  │  ─────────   │
│ backend-api-1│  │ stream-srv-1 │
│ backend-api-2│  │ stream-srv-2 │
│ (port 8080)  │  │ (port 8082)  │
│ Round-robin  │  │ URI-hash     │
│ Sticky cookie│  │ (track_id)   │
└──────────────┘  └──────────────┘
```
## Why these balance modes
- **api_pool: `balance roundrobin` + `cookie SERVERID insert indirect`.** The Go API is stateless (sessions live in Redis), so any backend can serve any request. The cookie keeps a logged-in user pinned to one backend for the duration of the session, which makes WebSocket upgrades land on the same instance that authenticated the user — avoiding a Redis round-trip on every WS hello.
- **stream_pool: `balance uri whole` + `hash-type consistent`.** The Rust streamer keeps a hot HLS-segment cache in process. URI-hash routes the same track_id to the same node; consistent hashing means adding or removing a node only displaces ~`1/N` of the keys, not the entire pool.
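
Rendered into `haproxy.cfg`, the two modes might look roughly like this (a sketch, not the role's actual template — the `.lxd` addresses and per-server cookie values are illustrative):

```
backend api_pool
    balance roundrobin
    cookie VEZA_SERVERID insert indirect nocache
    option httpchk GET /api/v1/health
    server backend-api-1 backend-api-1.lxd:8080 check cookie api1
    server backend-api-2 backend-api-2.lxd:8080 check cookie api2

backend stream_pool
    balance uri whole
    hash-type consistent
    option httpchk GET /health
    server stream-srv-1 stream-srv-1.lxd:8082 check
    server stream-srv-2 stream-srv-2.lxd:8082 check
```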
## Failover behaviour
- Health check `GET /api/v1/health` (or `/health` for stream) every `haproxy_health_check_interval_ms` ms (default 5000 ms = 5 s). 3 consecutive failures = down; 2 consecutive successes = back up.
- `on-marked-down shutdown-sessions`: when a backend drops, all its in-flight TCP/WS sessions are cut. Clients reconnect; if their cookie targets the dead backend, HAProxy ignores the stale pin and re-balances. WebSocket clients on the frontend (chat, presence) MUST handle the close + reconnect — that's already wired in `apps/web/src/features/chat/services/websocket.ts`.
- `slowstart {{ haproxy_graceful_drain_seconds }}s`: when a backend recovers, its weight ramps up linearly over 30 s instead of the server immediately taking its full share of traffic. Smooths the post-restart latency spike.
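
In `haproxy.cfg` terms, the three behaviours above collapse into a few per-server directives — a sketch with the role's default values inlined (the real template would interpolate the variables; addresses are illustrative):

```
backend api_pool
    option httpchk GET /api/v1/health
    default-server check inter 5000 fall 3 rise 2 slowstart 30s on-marked-down shutdown-sessions
    server backend-api-1 backend-api-1.lxd:8080 cookie api1
    server backend-api-2 backend-api-2.lxd:8080 cookie api2
```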
## Defaults
| variable                           | default         | meaning                                           |
| ---------------------------------- | --------------- | ------------------------------------------------- |
| `haproxy_listen_http`              | `80`            | HTTP listener                                     |
| `haproxy_listen_https`             | `443`           | HTTPS listener (only bound when cert set)         |
| `haproxy_tls_cert_path`            | `""`            | path to PEM (cert+key concat). Empty = HTTP only. |
| `haproxy_backend_api_port`         | `8080`          | upstream port for backend-api                     |
| `haproxy_stream_server_port`       | `8082`          | upstream port for stream-server                   |
| `haproxy_health_check_interval_ms` | `5000`          | active-check cadence                              |
| `haproxy_health_check_fall`        | `3`             | failed checks before "down"                       |
| `haproxy_health_check_rise`        | `2`             | successful checks before "up"                     |
| `haproxy_graceful_drain_seconds`   | `30`            | post-recovery weight ramp-up                      |
| `haproxy_sticky_cookie_name`       | `VEZA_SERVERID` | cookie name for backend stickiness                |
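
Overrides follow the usual Ansible precedence; for example, enabling TLS and tightening the check cadence in inventory (values are illustrative, and the file path is an assumption about this repo's layout):

```yaml
# group_vars/haproxy.yml (hypothetical location)
haproxy_tls_cert_path: /etc/haproxy/certs/site.pem  # non-empty => :443 is bound
haproxy_health_check_interval_ms: 2000              # check every 2 s instead of 5 s
```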
## Operations
```bash
# Health view (admin socket, loopback only):
sudo socat /run/haproxy/admin.sock - <<< "show servers state"
sudo socat /run/haproxy/admin.sock - <<< "show stat"

# Disable a server gracefully (drains existing connections,
# new requests skip it; useful before a planned restart):
echo "set server api_pool/backend-api-1 state drain" | sudo socat /run/haproxy/admin.sock -
# ...wait haproxy_graceful_drain_seconds, then on the backend host:
# sudo systemctl restart veza-backend-api
echo "set server api_pool/backend-api-1 state ready" | sudo socat /run/haproxy/admin.sock -

# Stats UI for a human (browser only; bound to localhost):
ssh -L 9100:localhost:9100 haproxy.lxd
# then open http://localhost:9100/stats

# Live log tail (HAProxy logs to journald via /dev/log):
sudo journalctl -u haproxy -f
```
## Failover smoke test

```bash
bash infra/ansible/tests/test_backend_failover.sh
```
Sequence: verifies the api_pool is healthy at start, kills `backend-api-1`, polls HAProxy until the server is marked DOWN, asserts the next request still gets a 200 (served by `backend-api-2`), restarts the killed container, and asserts it rejoins as healthy. Suitable for the W2 game-day day 24 drill.
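
The "polls HAProxy until DOWN" step reduces to reading the `status` column of the `show stat` CSV from the admin socket. A minimal sketch of that parsing (pool/server names are this role's; the sample CSV is truncated to the first 18 columns, which is where `status` lives in HAProxy's stats format):

```bash
#!/usr/bin/env bash
# server_status POOL SERVER CSV — print the `status` field (column 18
# of HAProxy's `show stat` CSV output) for one server in one pool.
server_status() {
  local pool="$1" server="$2" csv="$3"
  awk -F, -v p="$pool" -v s="$server" '$1 == p && $2 == s { print $18 }' <<< "$csv"
}

# Live use would be:  csv=$(echo "show stat" | sudo socat /run/haproxy/admin.sock -)
# Truncated sample for illustration:
csv='# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status
api_pool,backend-api-1,0,0,0,1,,42,0,0,0,0,,0,0,0,0,DOWN
api_pool,backend-api-2,0,0,1,2,,99,0,0,0,0,,0,0,0,0,UP'

server_status api_pool backend-api-1 "$csv"   # prints DOWN
```

The smoke test loops on exactly this check until it sees `DOWN` (or times out).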
## What this role does NOT cover
- **TLS cert provisioning.** Phase-1 lab: HTTP only. Phase-2 mounts a Let's Encrypt cert from Caddy's data dir or directly via certbot. mTLS to the backends is W5 territory.
- **Multi-LB HA.** Single HAProxy node — if it dies, the cluster is dark. Phase-2 adds keepalived + a floating VIP.
- **Rate limiting.** The Gin middleware does that today; pushing it to the LB is a v1.1 optimisation.
- **WebSocket auth header passing.** HAProxy passes `Sec-WebSocket-*` headers through unchanged; Gin's middleware authenticates the upgrade request. No extra config needed.