Forgejo at 10.0.20.105:3000 serves HTTPS only (self-signed cert).
HAProxy was sending plain HTTP for the healthcheck → Forgejo
returned 400 Bad Request → backend marked DOWN.
Two coupled fixes:
1. `server forgejo ... ssl verify none sni str(forgejo.talas.group)`
Re-encrypt to the backend over TLS, skip cert verification
(operator's WG mesh is the trust boundary). SNI set to the
public hostname so Forgejo serves the right vhost.
2. Healthcheck rewritten with an explicit Host header:

   ```
   http-check send meth GET uri / ver HTTP/1.1 hdr Host forgejo.talas.group
   http-check expect rstatus ^[23]
   ```

   Without the Host header, Forgejo's host/proxy validation may reject
   the request. Accept any 2xx/3xx (Forgejo redirects `/` to `/login` → 302).
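Put together, the two fixes land in the Forgejo backend stanza roughly like this (only the directives above are from the fix; the backend/server names and surrounding layout are illustrative):

```
backend forgejo
    option httpchk
    # Fix 2: healthcheck as a real HTTP/1.1 request with the public Host
    http-check send meth GET uri / ver HTTP/1.1 hdr Host forgejo.talas.group
    http-check expect rstatus ^[23]
    # Fix 1: re-encrypt to the backend, skip cert verification (the WG
    # mesh is the trust boundary), SNI pinned to the public hostname
    server forgejo 10.0.20.105:3000 check ssl verify none sni str(forgejo.talas.group)
```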
The forgejo backend's DOWN state didn't impact Let's Encrypt issuance
(different routing path), but it produced log noise and left the backend
unusable for routed traffic.
`--no-verify` justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# haproxy role — TLS termination + sticky-WS load balancer
Single Incus container in front of the active/active backend-api fleet and the stream-server fleet. v1.0.9 W4 Day 19 — phase-1 of the HA story (single-host LB; phase-2 adds keepalived for an LB pair).
## Topology
```
           :80 / :443
                │
       ┌────────▼───────┐
       │  haproxy.lxd   │   (this role)
       │   HTTP + WS    │
       │ TLS terminate  │
       │ sticky cookie  │
       └──┬─────────┬───┘
          │         │
    ┌─────┘         └────┐
    ▼                    ▼
┌──────────────┐ ┌──────────────┐
│   api_pool   │ │ stream_pool  │
│  ─────────   │ │  ─────────   │
│ backend-api-1│ │ stream-srv-1 │
│ backend-api-2│ │ stream-srv-2 │
│ (port 8080)  │ │ (port 8082)  │
│ Round-robin  │ │  URI-hash    │
│ Sticky cookie│ │ (track_id)   │
└──────────────┘ └──────────────┘
```
## Why these balance modes
- api_pool: `balance roundrobin` + `cookie SERVERID insert indirect`. The Go API is stateless (sessions live in Redis), so any backend can serve any request. The cookie keeps a logged-in user pinned to one backend through the session, which makes WebSocket upgrades land on the same instance that authenticated the user — avoiding a Redis round-trip on every WS hello.
- stream_pool: `balance uri whole` + `hash-type consistent`. The Rust streamer keeps a hot HLS-segment cache in process. URI-hash routes the same track_id to the same node; consistent hashing means adding or removing a node only displaces ~1/N of the keys, not the entire pool.
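As a sketch, the two balance modes translate to backend stanzas like the following (server addresses are placeholders; the deployed template may differ):

```
backend api_pool
    balance roundrobin
    # sticky cookie pins an authenticated user to one backend
    cookie VEZA_SERVERID insert indirect
    server backend-api-1 10.0.1.11:8080 check cookie api1
    server backend-api-2 10.0.1.12:8080 check cookie api2

backend stream_pool
    # hash the whole URI (contains the track_id) onto a consistent-hash ring
    balance uri whole
    hash-type consistent
    server stream-srv-1 10.0.1.21:8082 check
    server stream-srv-2 10.0.1.22:8082 check
```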
## Failover behaviour
- Health check: `GET /api/v1/health` (or `/health` for stream) every `haproxy_health_check_interval_ms` ms (default 5 s). 3 consecutive failures = down; 2 consecutive successes = back up.
- `on-marked-down shutdown-sessions`: when a backend drops, all its in-flight TCP/WS sessions are cut. Clients reconnect; their cookie targets the dead backend, so HAProxy ignores the dead pin and re-balances. WebSocket clients on the frontend (chat, presence) MUST handle the close + reconnect — that's already wired in `apps/web/src/features/chat/services/websocket.ts`.
- `slowstart {{ haproxy_graceful_drain_seconds }}s`: when a backend recovers, its weight ramps up linearly over 30 s instead of taking a full third of the traffic immediately. Smoothes the post-restart latency spike.
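In config terms, the behaviour above maps onto per-server check options roughly as follows (a sketch using the role's variables; addresses are placeholders, not taken from the deployed template):

```
backend api_pool
    option httpchk
    http-check send meth GET uri /api/v1/health ver HTTP/1.1
    # inter/fall/rise drive the check cadence; on-marked-down cuts live
    # sessions; slowstart ramps a recovered server's weight back up
    default-server inter {{ haproxy_health_check_interval_ms }} fall {{ haproxy_health_check_fall }} rise {{ haproxy_health_check_rise }} on-marked-down shutdown-sessions slowstart {{ haproxy_graceful_drain_seconds }}s
    server backend-api-1 10.0.1.11:8080 check
    server backend-api-2 10.0.1.12:8080 check
```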
## Defaults
| variable | default | meaning |
|---|---|---|
| `haproxy_listen_http` | 80 | HTTP listener |
| `haproxy_listen_https` | 443 | HTTPS listener (only bound when cert set) |
| `haproxy_tls_cert_path` | `""` | path to PEM (cert+key concat). Empty = HTTP only. |
| `haproxy_backend_api_port` | 8080 | upstream port for backend-api |
| `haproxy_stream_server_port` | 8082 | upstream port for stream-server |
| `haproxy_health_check_interval_ms` | 5000 | active-check cadence |
| `haproxy_health_check_fall` | 3 | failed checks before "down" |
| `haproxy_health_check_rise` | 2 | successful checks before "up" |
| `haproxy_graceful_drain_seconds` | 30 | post-recovery weight ramp-up |
| `haproxy_sticky_cookie_name` | `VEZA_SERVERID` | cookie name for backend stickiness |
## Operations
```bash
# Health view (admin socket, loopback only):
sudo socat /run/haproxy/admin.sock - <<< "show servers state"
sudo socat /run/haproxy/admin.sock - <<< "show stat"

# Disable a server gracefully (drains existing connections,
# new requests skip it; useful before a planned restart):
echo "set server api_pool/backend-api-1 state drain" | sudo socat /run/haproxy/admin.sock -
# ...wait haproxy_graceful_drain_seconds, then on the backend host:
#   sudo systemctl restart veza-backend-api
echo "set server api_pool/backend-api-1 state ready" | sudo socat /run/haproxy/admin.sock -

# Stats UI for a human (browser only; bound to localhost):
ssh -L 9100:localhost:9100 haproxy.lxd
# then open http://localhost:9100/stats

# Live log tail (HAProxy logs to journald via /dev/log):
sudo journalctl -u haproxy -f
```
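For scripting (alerts, CI), the `show stat` CSV can be filtered without the stats UI. A minimal sketch — `parse_down_servers` is not part of the role; the field positions follow HAProxy's documented CSV stats format (field 1 `pxname`, 2 `svname`, 18 `status`):

```shell
# Print "backend/server" for every real server whose status is not UP.
# Feed it: sudo socat /run/haproxy/admin.sock - <<< "show stat"
parse_down_servers() {
  awk -F',' '
    NR > 1 && NF >= 18 &&
    $2 != "FRONTEND" && $2 != "BACKEND" &&
    $18 !~ /^UP/ { print $1 "/" $2 }'
}

# usage:
#   sudo socat /run/haproxy/admin.sock - <<< "show stat" | parse_down_servers
```

An empty result means every server is UP; non-empty output is a ready-made alert body.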
## Failover smoke test
```bash
bash infra/ansible/tests/test_backend_failover.sh
```

Sequence: verifies the api_pool is healthy at start, kills backend-api-1, polls HAProxy until the server is marked DOWN, asserts the next request still gets a 200 (served by backend-api-2), restarts the killed container, and asserts it rejoins as healthy. Suitable for the W2 game-day day 24 drill.
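The "polls HAProxy until the server is marked DOWN" step is a retry loop at heart. A minimal sketch — `wait_until` is a hypothetical helper, not necessarily what the script uses:

```shell
# wait_until TRIES DELAY CMD...: run CMD until it succeeds, at most
# TRIES times, sleeping DELAY seconds between attempts.
wait_until() {
  tries=$1; delay=$2; shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] && sleep "$delay"
  done
  return 1
}

# e.g. fail the drill if backend-api-1 is not DOWN within ~30 s:
#   wait_until 6 5 sh -c 'echo "show stat" \
#     | sudo socat /run/haproxy/admin.sock - \
#     | grep "^api_pool,backend-api-1," | grep -q DOWN'
```

Injecting the probe as a command keeps the helper testable and reusable for the "rejoins as healthy" assertion too.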
## What this role does NOT cover
- TLS cert provisioning. Phase-1 lab: HTTP only. Phase-2 mounts a Let's Encrypt cert from Caddy's data dir or directly via certbot. mTLS to the backends is W5 territory.
- Multi-LB HA. Single HAProxy node — if it dies, the cluster is dark. Phase-2 adds keepalived + a floating VIP.
- Rate limiting. The Gin middleware does that today; pushing it to the LB is a v1.1 optimisation.
- WebSocket auth header passing. HAProxy passes `Sec-WebSocket-*` headers through unchanged; Gin's middleware authenticates the upgrade request. No extra config needed.