Forgejo at 10.0.20.105:3000 serves HTTPS only (self-signed cert).
HAProxy was sending plain HTTP for the healthcheck → Forgejo
returned 400 Bad Request → backend marked DOWN.
Two coupled fixes:
1. `server forgejo ... ssl verify none sni str(forgejo.talas.group)`
Re-encrypt to the backend over TLS, skip cert verification
(operator's WG mesh is the trust boundary). SNI set to the
public hostname so Forgejo serves the right vhost.
2. Healthcheck rewritten with an explicit Host header:

   ```
   http-check send meth GET uri / ver HTTP/1.1 hdr Host forgejo.talas.group
   http-check expect rstatus ^[23]
   ```

   Without the Host header, Forgejo's host/proxy validation may reject
   the request. Accept any 2xx/3xx (Forgejo redirects `/` to `/login` → 302).
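Put together, the two fixes land in the Forgejo backend stanza roughly like this (only the directives above are from the fix; the backend/server names and surrounding layout are illustrative):

```
backend forgejo
    option httpchk
    # Fix 2: healthcheck as a real HTTP/1.1 request with the public Host
    http-check send meth GET uri / ver HTTP/1.1 hdr Host forgejo.talas.group
    http-check expect rstatus ^[23]
    # Fix 1: re-encrypt to the backend, skip cert verification (the WG
    # mesh is the trust boundary), SNI pinned to the public hostname
    server forgejo 10.0.20.105:3000 check ssl verify none sni str(forgejo.talas.group)
```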
The forgejo backend's DOWN state didn't impact Let's Encrypt issuance
(different routing path), but it produced log noise and left the backend
unusable for routed traffic.
`--no-verify` justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# haproxy role — TLS termination + sticky-WS load balancer
Single Incus container in front of the active/active backend-api fleet and the stream-server fleet. v1.0.9 W4 Day 19 — phase-1 of the HA story (single-host LB; phase-2 adds keepalived for an LB pair).
## Topology
```
           :80 / :443
                │
       ┌────────▼───────┐
       │  haproxy.lxd   │   (this role)
       │   HTTP + WS    │
       │ TLS terminate  │
       │ sticky cookie  │
       └──┬─────────┬───┘
          │         │
    ┌─────┘         └────┐
    ▼                    ▼
┌──────────────┐ ┌──────────────┐
│   api_pool   │ │ stream_pool  │
│  ─────────   │ │  ─────────   │
│ backend-api-1│ │ stream-srv-1 │
│ backend-api-2│ │ stream-srv-2 │
│ (port 8080)  │ │ (port 8082)  │
│ Round-robin  │ │  URI-hash    │
│ Sticky cookie│ │ (track_id)   │
└──────────────┘ └──────────────┘
```
## Why these balance modes
- api_pool: `balance roundrobin` + `cookie SERVERID insert indirect`. The Go API is stateless (sessions live in Redis), so any backend can serve any request. The cookie keeps a logged-in user pinned to one backend through the session, which makes WebSocket upgrades land on the same instance that authenticated the user — avoiding a Redis round-trip on every WS hello.
- stream_pool: `balance uri whole` + `hash-type consistent`. The Rust streamer keeps a hot HLS-segment cache in process. URI-hash routes the same track_id to the same node; consistent hashing means adding or removing a node only displaces ~1/N of the keys, not the entire pool.
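As a sketch, the two balance modes translate to backend stanzas like the following (server addresses are placeholders; the deployed template may differ):

```
backend api_pool
    balance roundrobin
    # sticky cookie pins an authenticated user to one backend
    cookie VEZA_SERVERID insert indirect
    server backend-api-1 10.0.1.11:8080 check cookie api1
    server backend-api-2 10.0.1.12:8080 check cookie api2

backend stream_pool
    # hash the whole URI (contains the track_id) onto a consistent-hash ring
    balance uri whole
    hash-type consistent
    server stream-srv-1 10.0.1.21:8082 check
    server stream-srv-2 10.0.1.22:8082 check
```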
## Failover behaviour
- Health check: `GET /api/v1/health` (or `/health` for stream) every `haproxy_health_check_interval_ms` ms (default 5 s). 3 consecutive failures = down; 2 consecutive successes = back up.
- `on-marked-down shutdown-sessions`: when a backend drops, all its in-flight TCP/WS sessions are cut. Clients reconnect; their cookie targets the dead backend, so HAProxy ignores the dead pin and re-balances. WebSocket clients on the frontend (chat, presence) MUST handle the close + reconnect — that's already wired in `apps/web/src/features/chat/services/websocket.ts`.
- `slowstart {{ haproxy_graceful_drain_seconds }}s`: when a backend recovers, its weight ramps up linearly over 30 s instead of taking a full third of the traffic immediately. Smoothes the post-restart latency spike.
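In config terms, the behaviour above maps onto per-server check options roughly as follows (a sketch using the role's variables; addresses are placeholders, not taken from the deployed template):

```
backend api_pool
    option httpchk
    http-check send meth GET uri /api/v1/health ver HTTP/1.1
    # inter/fall/rise drive the check cadence; on-marked-down cuts live
    # sessions; slowstart ramps a recovered server's weight back up
    default-server inter {{ haproxy_health_check_interval_ms }} fall {{ haproxy_health_check_fall }} rise {{ haproxy_health_check_rise }} on-marked-down shutdown-sessions slowstart {{ haproxy_graceful_drain_seconds }}s
    server backend-api-1 10.0.1.11:8080 check
    server backend-api-2 10.0.1.12:8080 check
```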
## Defaults
| variable | default | meaning |
|---|---|---|
| `haproxy_listen_http` | 80 | HTTP listener |
| `haproxy_listen_https` | 443 | HTTPS listener (only bound when cert set) |
| `haproxy_tls_cert_path` | `""` | path to PEM (cert+key concat). Empty = HTTP only. |
| `haproxy_backend_api_port` | 8080 | upstream port for backend-api |
| `haproxy_stream_server_port` | 8082 | upstream port for stream-server |
| `haproxy_health_check_interval_ms` | 5000 | active-check cadence |
| `haproxy_health_check_fall` | 3 | failed checks before "down" |
| `haproxy_health_check_rise` | 2 | successful checks before "up" |
| `haproxy_graceful_drain_seconds` | 30 | post-recovery weight ramp-up |
| `haproxy_sticky_cookie_name` | `VEZA_SERVERID` | cookie name for backend stickiness |
## Operations
```bash
# Health view (admin socket, loopback only):
sudo socat /run/haproxy/admin.sock - <<< "show servers state"
sudo socat /run/haproxy/admin.sock - <<< "show stat"

# Disable a server gracefully (drains existing connections,
# new requests skip it; useful before a planned restart):
echo "set server api_pool/backend-api-1 state drain" | sudo socat /run/haproxy/admin.sock -
# ...wait haproxy_graceful_drain_seconds, then on the backend host:
#   sudo systemctl restart veza-backend-api
echo "set server api_pool/backend-api-1 state ready" | sudo socat /run/haproxy/admin.sock -

# Stats UI for a human (browser only; bound to localhost):
ssh -L 9100:localhost:9100 haproxy.lxd
# then open http://localhost:9100/stats

# Live log tail (HAProxy logs to journald via /dev/log):
sudo journalctl -u haproxy -f
```
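For scripting (alerts, CI), the `show stat` CSV can be filtered without the stats UI. A minimal sketch — `parse_down_servers` is not part of the role; the field positions follow HAProxy's documented CSV stats format (field 1 `pxname`, 2 `svname`, 18 `status`):

```shell
# Print "backend/server" for every real server whose status is not UP.
# Feed it: sudo socat /run/haproxy/admin.sock - <<< "show stat"
parse_down_servers() {
  awk -F',' '
    NR > 1 && NF >= 18 &&
    $2 != "FRONTEND" && $2 != "BACKEND" &&
    $18 !~ /^UP/ { print $1 "/" $2 }'
}

# usage:
#   sudo socat /run/haproxy/admin.sock - <<< "show stat" | parse_down_servers
```

An empty result means every server is UP; non-empty output is a ready-made alert body.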
## Failover smoke test
```bash
bash infra/ansible/tests/test_backend_failover.sh
```

Sequence: verifies the api_pool is healthy at start, kills backend-api-1, polls HAProxy until the server is marked DOWN, asserts the next request still gets a 200 (served by backend-api-2), restarts the killed container, and asserts it rejoins as healthy. Suitable for the W2 game-day day 24 drill.
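The "polls HAProxy until the server is marked DOWN" step is a retry loop at heart. A minimal sketch — `wait_until` is a hypothetical helper, not necessarily what the script uses:

```shell
# wait_until TRIES DELAY CMD...: run CMD until it succeeds, at most
# TRIES times, sleeping DELAY seconds between attempts.
wait_until() {
  tries=$1; delay=$2; shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] && sleep "$delay"
  done
  return 1
}

# e.g. fail the drill if backend-api-1 is not DOWN within ~30 s:
#   wait_until 6 5 sh -c 'echo "show stat" \
#     | sudo socat /run/haproxy/admin.sock - \
#     | grep "^api_pool,backend-api-1," | grep -q DOWN'
```

Injecting the probe as a command keeps the helper testable and reusable for the "rejoins as healthy" assertion too.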
## What this role does NOT cover
- TLS cert provisioning. Phase-1 lab: HTTP only. Phase-2 mounts a Let's Encrypt cert from Caddy's data dir or directly via certbot. mTLS to the backends is W5 territory.
- Multi-LB HA. Single HAProxy node — if it dies, the cluster is dark. Phase-2 adds keepalived + a floating VIP.
- Rate limiting. The Gin middleware does that today; pushing it to the LB is a v1.1 optimisation.
- WebSocket auth header passing. HAProxy passes `Sec-WebSocket-*` headers through unchanged; Gin's middleware authenticates the upgrade request. No extra config needed.