veza/infra/ansible/roles/haproxy
senke 5153ab113d refactor(ansible): single edge HAProxy — multi-env + Forgejo + Talas
The 12-record DNS plan ($1 per record at the registrar but only one
public R720 IP) forces the obvious : a single HAProxy on :443 must
serve staging.veza.fr + veza.fr + www.veza.fr + talas.fr +
www.talas.fr + forgejo.talas.group all at once. Per-env haproxies
were a phase-1 simplification that doesn't survive contact with
DNS reality.

Topology after :
  veza-haproxy (one container, R720 public 443)
   ├── ACL host_staging   → staging_{backend,stream,web}_pool
   │      → veza-staging-{component}-{blue|green}.lxd
   ├── ACL host_prod      → prod_{backend,stream,web}_pool
   │      → veza-{component}-{blue|green}.lxd
   ├── ACL host_forgejo   → forgejo_backend → 10.0.20.105:3000
   │      (Forgejo container managed outside the deploy pipeline)
   └── ACL host_talas     → talas_vitrine_backend
          (placeholder 503 until the static site lands)
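
A sketch of how the rendered frontend could express that routing
(ACL and backend names here are illustrative, following this
commit's naming ; the cert path is a placeholder) :

  frontend fe_edge
    bind :443 ssl crt /etc/haproxy/certs/
    acl host_staging  hdr(host) -i staging.veza.fr
    acl host_prod     hdr(host) -i veza.fr www.veza.fr
    acl host_forgejo  hdr(host) -i forgejo.talas.group
    acl host_talas    hdr(host) -i talas.fr www.talas.fr
    acl is_api        path_beg /api/
    use_backend forgejo_backend       if host_forgejo
    use_backend talas_vitrine_backend if host_talas
    use_backend staging_backend_api   if host_staging is_api
    use_backend prod_backend_api      if host_prod is_api
    default_backend default_503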

Changes :

  inventory/{staging,prod}.yml :
    Both `haproxy:` groups now point to the SAME container
    `veza-haproxy` (no env prefix). A comment makes the contract
    explicit so the next reader doesn't try to split it back.
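
    Shape-wise, both inventories now reduce to something like this
    (the grouping layout is illustrative ; the hostname is the one
    named above) :

      all:
        children:
          haproxy:
            hosts:
              veza-haproxy:        # one shared edge container, no env prefix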

  group_vars/all/main.yml :
    NEW : haproxy_env_prefixes (per-env container prefix mapping).
    NEW : haproxy_env_public_hosts (per-env Host-header mapping).
    NEW : haproxy_forgejo_host + haproxy_forgejo_backend.
    NEW : haproxy_talas_hosts + haproxy_talas_vitrine_backend.
    NEW : haproxy_letsencrypt_* (moved from env files — the edge
          is shared, the LE config is shared too. Else the env
          that ran the haproxy role last would clobber the
          domain set).
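
    Roughly the shape of the new shared-edge variables (a sketch
    built from the names above ; only the hostnames and the Forgejo
    address are stated in this commit) :

      haproxy_env_prefixes:
        staging: "veza-staging"
        prod: "veza"
      haproxy_env_public_hosts:
        staging: ["staging.veza.fr"]
        prod: ["veza.fr", "www.veza.fr"]
      haproxy_forgejo_host: "forgejo.talas.group"
      haproxy_forgejo_backend: "10.0.20.105:3000"
      haproxy_talas_hosts: ["talas.fr", "www.talas.fr"]
      # haproxy_letsencrypt_* lives here too now (shared edge, shared LE config)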

  group_vars/{staging,prod}.yml :
    Strip the haproxy_letsencrypt_* block (now in all/main.yml).
    Comment points readers there.

  roles/haproxy/templates/haproxy.cfg.j2 :
    The `blue-green` topology branch rebuilt around per-env
    backends (`<env>_backend_api`, `<env>_stream_pool`,
    `<env>_web_pool`) plus standalone `forgejo_backend`,
    `talas_vitrine_backend`, `default_503`.
    Frontend ACLs : `host_<env>` (hdr(host) -i ...) selects
    which env's backends to use ; path ACLs (`is_api`,
    `is_stream_seg`, etc.) refine within the env.
    Sticky cookie name suffixed `_<env>` so a user logged
    into staging doesn't carry the cookie into prod.
    Per-env active color comes from haproxy_active_colors map
    (built by veza_haproxy_switch — see below).
    Multi-instance branch (lab) untouched.
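
    Condensed, the per-env part of the template looks roughly like
    this (a sketch, not the real template ; the backend is reduced
    to a single API server per env) :

      {% for env, prefix in haproxy_env_prefixes.items() %}
      backend {{ env }}_backend_api
          balance roundrobin
          cookie {{ haproxy_sticky_cookie_name }}_{{ env }} insert indirect
          {% set color = haproxy_active_colors[env] | default('blue') %}
          server api-{{ color }} {{ prefix }}-backend-{{ color }}.lxd:{{ haproxy_backend_api_port }} check
      {% endfor %}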

  roles/veza_haproxy_switch/defaults/main.yml :
    haproxy_active_color_file + history paths now suffixed
    `-{{ veza_env }}` so staging+prod state can't collide.

  roles/veza_haproxy_switch/tasks/main.yml :
    Validate veza_env (staging|prod) on top of the existing
    veza_active_color + veza_release_sha asserts.
    Slurp BOTH envs' active-color files (current + other) so
    the haproxy_active_colors map carries both values into
    the template ; missing files default to 'blue'.
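
    In task form that is approximately (a sketch ; the real role's
    task names and variable plumbing differ) :

      - name: Fail fast on an unknown environment
        ansible.builtin.assert:
          that: "veza_env in ['staging', 'prod']"

      - name: Read both envs' active-color files (missing file => blue)
        ansible.builtin.slurp:
          src: "/var/lib/veza/active-color-{{ item }}"
        register: color_files
        failed_when: false
        loop: ["staging", "prod"]

      - name: Build the per-env color map consumed by haproxy.cfg.j2
        ansible.builtin.set_fact:
          haproxy_active_colors: >-
            {{ haproxy_active_colors | default({})
               | combine({ item.item:
                   (item.content | default('') | b64decode | trim) or 'blue' }) }}
        loop: "{{ color_files.results }}"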

  playbooks/deploy_app.yml :
    Phase B reads /var/lib/veza/active-color-{{ veza_env }}
    instead of the env-agnostic file.

  playbooks/cleanup_failed.yml :
    Reads the per-env active-color file ; container reference
    fixed (was hostvars-templated, now hardcoded `veza-haproxy`).

  playbooks/rollback.yml :
    Fast-mode SHA lookup reads the per-env history file.

Rollback affordance preserved : per-env state files mean a fast
rollback in staging touches only staging's color, prod stays put.
The history files (`active-color-{staging,prod}.history`) keep
the last 5 deploys per env independently.

Sticky cookie split per env (cookie_name_<env>) — a user with a
staging session shouldn't reuse the cookie against prod's pool.

Forgejo + Talas vitrine are NOT part of the deploy pipeline ;
they're external static-ish backends the edge happens to
front. haproxy_forgejo_backend is "10.0.20.105:3000" today
(matches the existing Incus container at that address).
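
A sketch of those two edge-only backends plus the catch-all (the
Forgejo address is the one above ; the 503 placeholder wording is
illustrative) :

  backend forgejo_backend
      server forgejo 10.0.20.105:3000 check

  backend talas_vitrine_backend
      # placeholder until the static vitrine lands
      http-request return status 503 content-type "text/plain" string "talas.fr: nothing here yet"

  backend default_503
      # intentionally empty : no servers, so HAProxy answers 503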

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

haproxy role — TLS termination + sticky-WS load balancer

Single Incus container in front of the active/active backend-api fleet and the stream-server fleet. v1.0.9 W4 Day 19 — phase-1 of the HA story (single-host LB ; phase-2 adds keepalived for an LB pair).

Topology

                         :80 / :443
                              │
                       ┌──────▼─────────┐
                       │   haproxy.lxd  │   (this role)
                       │  HTTP + WS     │
                       │  TLS terminate │
                       │  sticky cookie │
                       └─┬───────┬──────┘
                         │       │
              ┌──────────┘       └──────────┐
              ▼                              ▼
       ┌──────────────┐              ┌──────────────┐
       │ api_pool     │              │ stream_pool  │
       │ ─────────    │              │ ─────────    │
       │ backend-api-1│              │ stream-srv-1 │
       │ backend-api-2│              │ stream-srv-2 │
       │  (port 8080) │              │  (port 8082) │
       │  Round-robin │              │  URI-hash    │
        │ Sticky cookie│              │  (track_id)  │
       └──────────────┘              └──────────────┘

Why these balance modes

  • api_pool : balance roundrobin + cookie SERVERID insert indirect. The Go API is stateless (sessions live in Redis), so any backend can serve any request. The cookie keeps a logged-in user pinned to one backend through the session, which makes WebSocket upgrades land on the same instance that authenticated the user — avoiding a Redis round-trip on every WS hello.
  • stream_pool : balance uri whole + hash-type consistent. The Rust streamer keeps a hot HLS-segment cache in process. URI-hash routes the same track_id to the same node ; consistent hashing means adding or removing a node only displaces ~1/N of the keys, not the entire pool.
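
Stripped of checks and logging, the two pools reduce to roughly this
(server names and addresses are illustrative ; ports are the defaults
listed below) :

backend api_pool
    balance roundrobin
    cookie VEZA_SERVERID insert indirect
    server backend-api-1 backend-api-1.lxd:8080 check cookie backend-api-1
    server backend-api-2 backend-api-2.lxd:8080 check cookie backend-api-2

backend stream_pool
    balance uri whole
    hash-type consistent
    server stream-srv-1 stream-srv-1.lxd:8082 check
    server stream-srv-2 stream-srv-2.lxd:8082 check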

Failover behaviour

  • Health check GET /api/v1/health (or /health for stream) every haproxy_health_check_interval_ms ms (default 5 s). 3 consecutive failures = down ; 2 consecutive successes = back up.
  • on-marked-down shutdown-sessions : when a backend drops, all its in-flight TCP/WS sessions are cut. Clients reconnect ; their cookie still points at the dead backend, so HAProxy ignores the stale pin and re-balances. WebSocket clients on the frontend (chat, presence) MUST handle the close + reconnect — that's already wired in apps/web/src/features/chat/services/websocket.ts.
  • slowstart {{ haproxy_graceful_drain_seconds }}s : when a backend recovers, its weight ramps up linearly over 30 s instead of taking its full share of the traffic the moment it's marked up. Smoothes the post-restart latency spike.
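
In config terms those three behaviours sit on the backend and
server lines, roughly as follows (values are the defaults from the
table below) :

# inside each backend (api_pool shown) :
option httpchk GET /api/v1/health
default-server inter 5000 fall 3 rise 2 slowstart 30s on-marked-down shutdown-sessions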

Defaults

variable                            default         meaning
haproxy_listen_http                 80              HTTP listener
haproxy_listen_https                443             HTTPS listener (only bound when cert set)
haproxy_tls_cert_path               ""              path to PEM (cert+key concat). Empty = HTTP only.
haproxy_backend_api_port            8080            upstream port for backend-api
haproxy_stream_server_port          8082            upstream port for stream-server
haproxy_health_check_interval_ms    5000            active-check cadence
haproxy_health_check_fall           3               failed checks before "down"
haproxy_health_check_rise           2               successful checks before "up"
haproxy_graceful_drain_seconds      30              post-recovery weight ramp-up
haproxy_sticky_cookie_name          VEZA_SERVERID   cookie name for backend stickiness

Operations

# Health view (admin socket, loopback only) :
sudo socat /run/haproxy/admin.sock - <<< "show servers state"
sudo socat /run/haproxy/admin.sock - <<< "show stat"

# Disable a server gracefully (drains existing connections,
# new requests skip it ; useful before a planned restart) :
echo "set server api_pool/backend-api-1 state drain" | sudo socat /run/haproxy/admin.sock -
# ...wait haproxy_graceful_drain_seconds, then on the backend host :
#   sudo systemctl restart veza-backend-api
echo "set server api_pool/backend-api-1 state ready"  | sudo socat /run/haproxy/admin.sock -

# Stats UI for a human (browser only ; bound to localhost) :
ssh -L 9100:localhost:9100 haproxy.lxd
# then open http://localhost:9100/stats

# Live log tail (HAProxy logs to journald via /dev/log) :
sudo journalctl -u haproxy -f

Failover smoke test

bash infra/ansible/tests/test_backend_failover.sh

Sequence : verifies the api_pool is healthy at start, kills backend-api-1, polls HAProxy until the server is marked DOWN, asserts the next request still gets a 200 (served by backend-api-2), restarts the killed container, asserts it rejoins as healthy. Suitable for the W2 game-day day 24 drill.
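
The DOWN-detection step boils down to a poll of the admin socket
along these lines (a sketch, not the script itself ; socket path as
in Operations above) :

# wait up to ~60 s for backend-api-1 to be marked operationally DOWN
for i in $(seq 1 30); do
  state=$(echo "show servers state" | sudo socat /run/haproxy/admin.sock - | awk '$4 == "backend-api-1" {print $6}')
  [ "$state" = "0" ] && break    # srv_op_state 0 = SRV_ST_STOPPED
  sleep 2
done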

What this role does NOT cover

  • TLS cert provisioning. Phase-1 lab : HTTP only. Phase-2 mounts a Let's Encrypt cert from Caddy's data dir or directly via certbot. mTLS to the backends is W5 territory.
  • Multi-LB HA. Single HAProxy node — if it dies, the cluster is dark. Phase-2 adds keepalived + a floating VIP.
  • Rate limiting. The Gin middleware does that today ; pushing it to the LB is a v1.1 optimisation.
  • WebSocket auth header passing. HAProxy passes Sec-WebSocket-* headers through unchanged ; Gin's middleware authenticates the upgrade request. No extra config needed.