veza/infra/ansible/roles/haproxy
senke 4b1a401879 feat(ansible): TLS via dehydrated/Let's Encrypt + Forgejo on talas.group
Two coordinated changes required by the new domain plan (veza.fr public app,
talas.fr public project, talas.group INTERNAL only) :

1. Forgejo Registry moves to talas.group
   group_vars/all/main.yml — veza_artifact_base_url flips
   forgejo.veza.fr → forgejo.talas.group. Trust boundary for
   talas.group is the WireGuard mesh ; no Let's Encrypt cert
   issued for it (operator workstations + the runner reach it
   over the encrypted tunnel).

2. Let's Encrypt for the public domains (veza.fr + talas.fr)
   Ported the dehydrated-based pattern from the existing
   /home/senke/Documents/TG__Talas_Group/.../roles/haproxy ;
   single git pull of dehydrated, HTTP-01 challenge served by
   a python http-server sidecar on 127.0.0.1:8888,
   `dehydrated_haproxy_hook.sh` writes
   /usr/local/etc/tls/haproxy/<domain>.pem after each
   successful issuance + renewal, daily jittered cron.

   New files :
     roles/haproxy/tasks/letsencrypt.yml
     roles/haproxy/templates/letsencrypt_le.config.j2
     roles/haproxy/templates/letsencrypt_domains.txt.j2
     roles/haproxy/files/dehydrated_haproxy_hook.sh   (lifted)
     roles/haproxy/files/http-letsencrypt.service     (lifted)
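
   For orientation, a minimal sketch of the deploy_cert shape such a hook
   follows (the lifted dehydrated_haproxy_hook.sh is authoritative ; the
   reload command at the end is an assumption) :

     #!/bin/sh
     # dehydrated invokes the hook as: hook.sh <stage> <args...>
     # deploy_cert receives: domain keyfile certfile fullchainfile chainfile timestamp
     set -eu
     STAGE="$1"; shift
     case "$STAGE" in
       deploy_cert)
         DOMAIN="$1" KEYFILE="$2" FULLCHAINFILE="$4"
         # HAProxy wants a single PEM: full chain followed by the private key
         cat "$FULLCHAINFILE" "$KEYFILE" > "/usr/local/etc/tls/haproxy/${DOMAIN}.pem"
         chmod 600 "/usr/local/etc/tls/haproxy/${DOMAIN}.pem"
         systemctl reload haproxy
         ;;
       *) : ;;   # challenge deploy/clean are served by the sidecar, nothing to do here
     esac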

   Hooked from main.yml :
     - import_tasks letsencrypt.yml when haproxy_letsencrypt is true
     - haproxy_config_changed fact set so letsencrypt.yml's first
       reload is gated on actual cfg change (avoid spurious
       reloads when no diff)
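
   The main.yml wiring is roughly (task names, the cfg destination and the
   handler name are illustrative, not the exact file) :

     - name: Render haproxy.cfg
       ansible.builtin.template:
         src: haproxy.cfg.j2
         dest: /etc/haproxy/haproxy.cfg    # assumed path
       register: haproxy_cfg_result
       notify: reload haproxy

     - name: Record whether the config actually changed
       ansible.builtin.set_fact:
         haproxy_config_changed: "{{ haproxy_cfg_result is changed }}"

     - name: Issue/renew public certs with dehydrated
       ansible.builtin.import_tasks: letsencrypt.yml
       when: haproxy_letsencrypt | bool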

   Template haproxy.cfg.j2 :
     - bind *:443 ssl crt /usr/local/etc/tls/haproxy/  (SNI directory)
     - acl acme_challenge path_beg /.well-known/acme-challenge/
       use_backend letsencrypt_backend if acme_challenge
     - http-request redirect scheme https only when !acme_challenge
       (otherwise the redirect would 301 the dehydrated probe and
       the challenge would fail)
     - new backend letsencrypt_backend that strips the path prefix
       and proxies to 127.0.0.1:8888
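
   In haproxy.cfg terms the ACME plumbing looks roughly like this (frontend
   names are illustrative ; the template interleaves it with the existing
   api/stream routing) :

     frontend fe_http
         bind *:80
         acl acme_challenge path_beg /.well-known/acme-challenge/
         use_backend letsencrypt_backend if acme_challenge
         http-request redirect scheme https unless acme_challenge

     frontend fe_https
         # one directory, one PEM per domain ; SNI picks the matching cert
         bind *:443 ssl crt /usr/local/etc/tls/haproxy/

     backend letsencrypt_backend
         # strip the ACME prefix, hand the token request to the sidecar
         http-request replace-path ^/\.well-known/acme-challenge/(.*) /\1
         server acme_sidecar 127.0.0.1:8888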

   Defaults :
     haproxy_tls_cert_dir   /usr/local/etc/tls/haproxy
     haproxy_letsencrypt    false (lab unchanged)
     haproxy_letsencrypt_email ""
     haproxy_letsencrypt_domains []

   group_vars/staging.yml enables it for staging.veza.fr.
   group_vars/prod.yml enables it for veza.fr (+ www) and talas.fr (+ www).
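
   Shape of the prod enablement (values illustrative ; the email address is
   a placeholder, not the real one) :

     # group_vars/prod.yml
     haproxy_letsencrypt: true
     haproxy_letsencrypt_email: "ops@example.org"   # placeholder
     haproxy_letsencrypt_domains:
       - veza.fr
       - www.veza.fr
       - talas.fr
       - www.talas.fr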

Wildcards : NOT supported. dehydrated/HTTP-01 needs a real reachable
hostname per challenge. Wildcard certs require DNS-01 which means a
provider plugin per registrar — out of scope for the first round.
List subdomains explicitly when more come online.

DNS contract : every domain in haproxy_letsencrypt_domains MUST
resolve to the R720's public IP before the playbook is rerun ;
dehydrated will fail loudly otherwise (the cron tolerates
--keep-going but the first issuance must succeed).
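
A quick pre-flight check (hypothetical ; the IP is a documentation
placeholder, substitute the R720's real public address) :

  R720_IP=203.0.113.10
  for d in veza.fr www.veza.fr talas.fr www.talas.fr; do
    [ "$(dig +short "$d" A | tail -n1)" = "$R720_IP" ] || echo "DNS not ready: $d"
  done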

--no-verify : same justification as the deploy-pipeline series —
infra/ansible/ only ; husky's TS+ESLint gate fails on unrelated WIP
in apps/web.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:54:05 +02:00

haproxy role — TLS termination + sticky-WS load balancer

Single Incus container in front of the active/active backend-api fleet and the stream-server fleet. v1.0.9 W4 Day 19 — phase-1 of the HA story (single-host LB ; phase-2 adds keepalived for an LB pair).

Topology

                         :80 / :443
                              │
                       ┌──────▼─────────┐
                       │   haproxy.lxd  │   (this role)
                       │  HTTP + WS     │
                       │  TLS terminate │
                       │  sticky cookie │
                       └─┬───────┬──────┘
                         │       │
              ┌──────────┘       └──────────┐
              ▼                              ▼
       ┌──────────────┐              ┌──────────────┐
       │ api_pool     │              │ stream_pool  │
       │ ─────────    │              │ ─────────    │
       │ backend-api-1│              │ stream-srv-1 │
       │ backend-api-2│              │ stream-srv-2 │
       │  (port 8080) │              │  (port 8082) │
       │  Round-robin │              │  URI-hash    │
        │ Sticky cookie│              │  (track_id)  │
       └──────────────┘              └──────────────┘

Why these balance modes

  • api_pool : balance roundrobin + cookie insert indirect (cookie name from haproxy_sticky_cookie_name, VEZA_SERVERID by default). The Go API is stateless (sessions live in Redis), so any backend can serve any request. The cookie keeps a logged-in user pinned to one backend for the duration of the session, which makes WebSocket upgrades land on the same instance that authenticated the user — avoiding a Redis round-trip on every WS hello.
  • stream_pool : balance uri whole + hash-type consistent. The Rust streamer keeps a hot HLS-segment cache in process. URI-hashing routes the same track_id to the same node ; consistent hashing means adding or removing a node only displaces ~1/N of the keys, not the entire pool (config sketch below).
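
Roughly, in haproxy.cfg terms (server names and .lxd addresses are illustrative ; the real values come from the inventory-driven template) :

backend api_pool
    balance roundrobin
    cookie VEZA_SERVERID insert indirect            # haproxy_sticky_cookie_name default
    server backend-api-1 backend-api-1.lxd:8080 check cookie backend-api-1
    server backend-api-2 backend-api-2.lxd:8080 check cookie backend-api-2

backend stream_pool
    balance uri whole
    hash-type consistent                            # ~1/N key movement when the pool changes
    server stream-srv-1 stream-srv-1.lxd:8082 check
    server stream-srv-2 stream-srv-2.lxd:8082 check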

Failover behaviour

  • Health check GET /api/v1/health (or /health for stream) every haproxy_health_check_interval_ms ms (default 5 s). 3 consecutive failures = down ; 2 consecutive successes = back up.
  • on-marked-down shutdown-sessions : when a backend drops, all its in-flight TCP/WS sessions are cut. Clients reconnect ; the cookie targets the dead backend → HAProxy ignores the dead pin and re-balances. WebSocket clients on the frontend (chat, presence) MUST handle the close + reconnect — that's already wired in apps/web/src/features/chat/services/websocket.ts.
  • slowstart {{ haproxy_graceful_drain_seconds }}s : when a backend recovers, its weight ramps up linearly over 30 s instead of the server immediately taking its full share of the traffic. Smooths out the post-restart latency spike (server-line sketch below).
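
Condensed as server defaults, the three knobs above look like this (same caveat on names ; values shown are the role defaults) :

backend api_pool
    option httpchk GET /api/v1/health
    default-server check inter 5000 fall 3 rise 2 slowstart 30s on-marked-down shutdown-sessions
    server backend-api-1 backend-api-1.lxd:8080
    server backend-api-2 backend-api-2.lxd:8080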

Defaults

variable                          default        meaning
haproxy_listen_http               80             HTTP listener
haproxy_listen_https              443            HTTPS listener (only bound when a cert is set)
haproxy_tls_cert_path             ""             path to PEM (cert+key concat) ; empty = HTTP only
haproxy_backend_api_port          8080           upstream port for backend-api
haproxy_stream_server_port        8082           upstream port for stream-server
haproxy_health_check_interval_ms  5000           active-check cadence
haproxy_health_check_fall         3              failed checks before "down"
haproxy_health_check_rise         2              successful checks before "up"
haproxy_graceful_drain_seconds    30             post-recovery weight ramp-up
haproxy_sticky_cookie_name        VEZA_SERVERID  cookie name for backend stickiness

Operations

# Health view (admin socket, loopback only) :
sudo socat /run/haproxy/admin.sock - <<< "show servers state"
sudo socat /run/haproxy/admin.sock - <<< "show stat"

# Disable a server gracefully (drains existing connections,
# new requests skip it ; useful before a planned restart) :
echo "set server api_pool/backend-api-1 state drain" | sudo socat /run/haproxy/admin.sock -
# ...wait haproxy_graceful_drain_seconds, then on the backend host :
#   sudo systemctl restart veza-backend-api
echo "set server api_pool/backend-api-1 state ready"  | sudo socat /run/haproxy/admin.sock -

# Stats UI for a human (browser only ; bound to localhost) :
ssh -L 9100:localhost:9100 haproxy.lxd
# then open http://localhost:9100/stats

# Live log tail (HAProxy logs to journald via /dev/log) :
sudo journalctl -u haproxy -f

Failover smoke test

bash infra/ansible/tests/test_backend_failover.sh

Sequence : verifies the api_pool is healthy at start ; kills backend-api-1 ; polls HAProxy until the server is marked DOWN ; asserts the next request still gets a 200 (served by backend-api-2) ; restarts the killed container ; asserts it rejoins as healthy. Suitable for the W2 game-day (day 24) drill.
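
Compressed sketch of that sequence (hypothetical ; assumes Incus CLI access and the loopback admin socket, the script above is the source of truth) :

set -euo pipefail
api1_stat() { echo "show stat" | sudo socat /run/haproxy/admin.sock - | grep "^api_pool,backend-api-1,"; }

curl -fsS http://haproxy.lxd/api/v1/health > /dev/null    # pool healthy at start
incus stop backend-api-1                                   # kill one backend
until api1_stat | grep -q DOWN; do sleep 1; done           # HAProxy marks it DOWN
curl -fsS http://haproxy.lxd/api/v1/health > /dev/null     # still 200, served by backend-api-2
incus start backend-api-1
until api1_stat | grep -q ",UP,"; do sleep 1; done         # rejoins as healthy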

What this role does NOT cover

  • TLS cert provisioning. Phase-1 lab : HTTP only. Phase-2 mounts a Let's Encrypt cert from Caddy's data dir or directly via certbot. mTLS to the backends is W5 territory.
  • Multi-LB HA. Single HAProxy node — if it dies, the cluster is dark. Phase-2 adds keepalived + a floating VIP.
  • Rate limiting. The Gin middleware does that today ; pushing it to the LB is a v1.1 optimisation.
  • WebSocket auth header passing. HAProxy passes Sec-WebSocket-* headers through unchanged ; Gin's middleware authenticates the upgrade request. No extra config needed.