feat(infra): haproxy sticky WS + backend_api multi-instance scaffold (W4 Day 19)
Some checks failed
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Backend (Go) (push) Failing after 4m34s
Veza CI / Rust (Stream Server) (push) Successful in 5m37s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 1m7s
Phase 1 of the active/active backend story. HAProxy sits in front of two
backend-api containers and two stream-server containers; a sticky cookie
pins WS sessions to one backend, and URI hashing routes each track_id to
one streamer for HLS cache locality.

Day 19 acceptance: kill backend-api-1, HAProxy fails over, and WS
sessions reconnect to backend-api-2 without loss. The smoke test wires
that gate; phase 2 (W5) will add keepalived for an LB pair.
- infra/ansible/roles/haproxy/
  * Install HAProxy + render haproxy.cfg with frontend (HTTP, optional
    HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky
    cookie SERVERID), stream_pool (URI hash + consistent hashing).
  * Active health check GET /api/v1/health every 5s; fall=3, rise=2.
    on-marked-down shutdown-sessions + slowstart 30s on recovery.
  * Stats socket bound to 127.0.0.1:9100 for the future Prometheus
    haproxy_exporter sidecar.
  * Mozilla Intermediate TLS cipher list; only effective when a cert
    is mounted.
- infra/ansible/roles/backend_api/
  * Scaffolding for the multi-instance Go API. Creates the veza-api
    system user, the /opt/veza/backend-api dir, the /etc/veza env dir,
    /var/log/veza, and a hardened systemd unit pointing at the binary.
  * Binary deployment is OUT of scope (documented in README) — the
    Go binary is built outside Ansible (Makefile target) and pushed
    via incus file push. CI → ansible-pull integration is W5+.
- infra/ansible/playbooks/haproxy.yml: provisions the haproxy Incus
  container and applies the common baseline + role.
- infra/ansible/inventory/lab.yml: 3 new groups:
  * haproxy (single LB node)
  * backend_api_instances (backend-api-{1,2})
  * stream_server_instances (stream-server-{1,2})
  The HAProxy template reads these groups directly to populate its
  upstream blocks; it falls back to the static haproxy_backend_api_fallback
  list if the group is missing (for in-isolation tests).
- infra/ansible/tests/test_backend_failover.sh
  * step 0: pre-flight — both backends UP per the HAProxy stats socket.
  * step 1: 5 baseline GET /api/v1/health through the LB → all 200.
  * step 2: incus stop --force backend-api-1; record t0.
  * step 3: poll HAProxy stats until backend-api-1 is DOWN
    (timeout 30s; expected ~15s = fall × interval).
  * step 4: 5 GET requests during the down window — all must return 200
    (served by backend-api-2). Fails if any returns non-200.
  * step 5: incus start backend-api-1; poll until UP again.
Acceptance (Day 19): the smoke test passes; the HAProxy sticky cookie
keeps WS sessions on the same backend until that backend dies, at
which point the cookie is ignored and the request is rebalanced.

W4 progress: Day 16 done · Day 17 done · Day 18 done · Day 19 done ·
Day 20 (k6 nightly load test) pending.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent 44349ec444
commit a9541f517b

13 changed files with 705 additions and 0 deletions
infra/ansible/inventory/lab.yml
@@ -82,6 +82,36 @@ all:
    vars:
      ansible_connection: community.general.incus
      ansible_python_interpreter: /usr/bin/python3

  # v1.0.9 W4 Day 19 — HAProxy in front of the backend-api +
  # stream-server pools. Single LB node in phase-1; keepalived VIP
  # comes in phase-2.
  haproxy:
    hosts:
      haproxy:
    vars:
      ansible_connection: community.general.incus
      ansible_python_interpreter: /usr/bin/python3

  # 2 backend-api Incus containers (active/active behind haproxy).
  # Sessions are Redis-backed so the API is stateless; the HAProxy
  # sticky cookie keeps a logged-in user pinned to one backend
  # through the session for WS upgrade locality.
  backend_api_instances:
    hosts:
      backend-api-1:
      backend-api-2:
    vars:
      ansible_connection: community.general.incus
      ansible_python_interpreter: /usr/bin/python3

  # 2 stream-server Incus containers (active/active behind haproxy).
  # Affinity by track_id hash via HAProxy URI-hash balance for HLS
  # cache locality.
  stream_server_instances:
    hosts:
      stream-server-1:
      stream-server-2:
    vars:
      ansible_connection: community.general.incus
      ansible_python_interpreter: /usr/bin/python3

  # v1.0.9 W3 Day 12: distributed MinIO with EC:2. 4 Incus containers,
  # each providing one drive; a single erasure set tolerates 2 simultaneous
  # node failures.
52 infra/ansible/playbooks/haproxy.yml — Normal file
@@ -0,0 +1,52 @@
# HAProxy playbook — provisions one Incus container `haproxy` and
# lays down the HAProxy config in front of the backend-api +
# stream-server pools.
#
# v1.0.9 W4 Day 19.
#
# Run with:
#   ansible-galaxy collection install community.general
#   ansible-playbook -i inventory/lab.yml playbooks/haproxy.yml
---
- name: Provision Incus container for HAProxy
  hosts: incus_hosts
  become: true
  gather_facts: true
  tasks:
    - name: Launch haproxy container
      ansible.builtin.shell:
        cmd: |
          set -e
          if ! incus info haproxy >/dev/null 2>&1; then
            incus launch images:ubuntu/22.04 haproxy
            for _ in $(seq 1 30); do
              if incus exec haproxy -- cloud-init status 2>/dev/null | grep -q "status: done"; then
                break
              fi
              sleep 1
            done
            incus exec haproxy -- apt-get update
            incus exec haproxy -- apt-get install -y python3 python3-apt
            echo "launched haproxy"
          fi
      args:
        executable: /bin/bash
      register: provision_result
      # The marker is echoed only on the create path, so re-runs report "ok".
      changed_when: "'launched haproxy' in provision_result.stdout"
      tags: [haproxy, provision]

    - name: Refresh inventory so the new container is reachable
      ansible.builtin.meta: refresh_inventory

- name: Apply common baseline
  hosts: haproxy
  become: true
  gather_facts: true
  roles:
    - common

- name: Install + configure HAProxy
  hosts: haproxy
  become: true
  gather_facts: true
  roles:
    - haproxy
41 infra/ansible/roles/backend_api/README.md — Normal file
@@ -0,0 +1,41 @@
# `backend_api` role — runtime baseline for the Go API container

Multi-instance scaffolding for the Go backend API behind HAProxy. v1.0.9 W4 Day 19 — phase 1 of the active/active deploy story.

## What this role DOES

- Creates the `veza-api` system user.
- Lays down `/opt/veza/backend-api`, `/etc/veza`, `/var/log/veza`.
- Renders a hardened systemd unit pointing at the binary path.
- Idempotent; safe to re-apply against an already-running instance.

## What this role does NOT do (deliberately)

- **Build / copy the Go binary.** That happens out-of-band: a `make backend-api-deploy` target builds the binary on the dev host and pushes it via `incus file push backend-api-X /opt/veza/backend-api/veza-api`. CI integration (Forgejo job → ansible-pull) is W5+ work.
- **Render `.env`.** Secrets live in `group_vars/backend_api.vault.yml` (encrypted) and are pushed by a separate task in `playbooks/backend_api.yml`; they don't belong in this role's defaults.
- **Run database migrations.** Migrations are gated by a CI job — running them via Ansible would race with multi-instance deploys.

## Deploying the binary (one-shot, until CI lands)

```bash
# On the dev host:
make -C veza-backend-api build   # produces ./bin/veza-api
for ct in backend-api-1 backend-api-2; do
  incus file push veza-backend-api/bin/veza-api "$ct"/opt/veza/backend-api/veza-api \
    --uid 1001 --gid 1001 --mode 0755
  incus exec "$ct" -- systemctl restart veza-backend-api
done
```

Roll one container at a time so HAProxy never sees both backends down.

## Defaults

| variable                  | default                     | meaning                         |
| ------------------------- | --------------------------- | ------------------------------- |
| `backend_api_user`        | `veza-api`                  | system user                     |
| `backend_api_install_dir` | `/opt/veza/backend-api`     | binary + working dir            |
| `backend_api_binary_name` | `veza-api`                  | binary basename                 |
| `backend_api_listen_port` | `8080`                      | matches HAProxy upstream config |
| `backend_api_env_file`    | `/etc/veza/backend-api.env` | EnvironmentFile= path           |
| `backend_api_log_dir`     | `/var/log/veza`             | tail-friendly log dir           |
15 infra/ansible/roles/backend_api/defaults/main.yml — Normal file
@@ -0,0 +1,15 @@
# backend_api defaults — scaffolding for the multi-instance Go API
# behind HAProxy (v1.0.9 W4 Day 19).
#
# v1.0 lab: the Go binary is built outside Ansible (Makefile target
# in the repo root) and copied into each Incus container via
# `incus file push`. This role only installs the runtime dependencies
# and renders the systemd unit; the binary deploy step remains
# manual until phase-2 wires CI → Forgejo → ansible-pull.
---
backend_api_user: veza-api
backend_api_install_dir: /opt/veza/backend-api
backend_api_binary_name: veza-api
backend_api_listen_port: 8080
backend_api_env_file: /etc/veza/backend-api.env
backend_api_log_dir: /var/log/veza
6 infra/ansible/roles/backend_api/handlers/main.yml — Normal file
@@ -0,0 +1,6 @@
---
- name: Restart veza-backend-api
  ansible.builtin.systemd:
    name: veza-backend-api
    state: restarted
    daemon_reload: true
61 infra/ansible/roles/backend_api/tasks/main.yml — Normal file
@@ -0,0 +1,61 @@
# backend_api role — runtime baseline for the Go API container.
# v1.0.9 W4 Day 19 — multi-instance scaffolding. Binary deploy is
# explicitly out of scope (Makefile + scp/incus push, NOT Ansible).
#
# What this role DOES:
#   - creates the veza-api system user
#   - lays down /opt/veza/backend-api + /etc/veza + /var/log/veza
#   - renders a systemd unit pointing at the binary path
#   - exposes port 8080 (no firewall changes; the Incus bridge is
#     trusted today)
#
# What this role does NOT do (deliberately):
#   - build / copy the Go binary
#   - render .env (the secrets are managed by ansible-vault outside
#     the role; only the env file path is referenced here)
#   - run migrations
---
- name: Create veza-api system user
  ansible.builtin.user:
    name: "{{ backend_api_user }}"
    system: true
    shell: /usr/sbin/nologin
    home: "{{ backend_api_install_dir }}"
    create_home: true
  tags: [backend_api, install]

- name: Ensure install + log directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: "{{ backend_api_user }}"
    group: "{{ backend_api_user }}"
    mode: "0755"
  loop:
    - "{{ backend_api_install_dir }}"
    - "{{ backend_api_log_dir }}"
  tags: [backend_api, install]

- name: Ensure /etc/veza exists for the env file
  ansible.builtin.file:
    path: /etc/veza
    state: directory
    owner: root
    group: "{{ backend_api_user }}"
    mode: "0750"
  tags: [backend_api, config]

- name: Render systemd unit
  ansible.builtin.template:
    src: veza-backend-api.service.j2
    dest: /etc/systemd/system/veza-backend-api.service
    owner: root
    group: root
    mode: "0644"
  notify: Restart veza-backend-api
  tags: [backend_api, service]

- name: Reload systemd daemon
  ansible.builtin.systemd:
    daemon_reload: true
  tags: [backend_api, service]
34 infra/ansible/roles/backend_api/templates/veza-backend-api.service.j2 — Normal file
@@ -0,0 +1,34 @@
# Managed by Ansible — do not edit by hand.
# v1.0.9 W4 Day 19. The {{ backend_api_binary_name }} binary itself
# is deployed out-of-band (Makefile target + incus file push); this
# unit only knows where to find it.
[Unit]
Description=Veza backend API (Go) — instance on {{ ansible_hostname }}
Documentation=https://veza.fr/docs
After=network-online.target
Wants=network-online.target
AssertPathExists={{ backend_api_install_dir }}/{{ backend_api_binary_name }}

[Service]
Type=simple
User={{ backend_api_user }}
Group={{ backend_api_user }}
EnvironmentFile=-{{ backend_api_env_file }}
WorkingDirectory={{ backend_api_install_dir }}
ExecStart={{ backend_api_install_dir }}/{{ backend_api_binary_name }}
Restart=on-failure
RestartSec=5s
LimitNOFILE=65535

# Hardening — same baseline as the other Ansible-managed daemons.
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths={{ backend_api_install_dir }} {{ backend_api_log_dir }}
PrivateTmp=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true

[Install]
WantedBy=multi-user.target
91 infra/ansible/roles/haproxy/README.md — Normal file
@@ -0,0 +1,91 @@
# `haproxy` role — TLS termination + sticky-WS load balancer

Single Incus container in front of the active/active backend-api fleet and the stream-server fleet. v1.0.9 W4 Day 19 — phase 1 of the HA story (single-host LB; phase 2 adds keepalived for an LB pair).

## Topology

```
        :80 / :443
            │
     ┌──────▼─────────┐
     │  haproxy.lxd   │  (this role)
     │  HTTP + WS     │
     │  TLS terminate │
     │  sticky cookie │
     └─┬───────┬──────┘
       │       │
  ┌────┘       └─────────┐
  ▼                      ▼
┌──────────────┐   ┌──────────────┐
│   api_pool   │   │ stream_pool  │
│  ─────────   │   │  ─────────   │
│ backend-api-1│   │ stream-srv-1 │
│ backend-api-2│   │ stream-srv-2 │
│ (port 8080)  │   │ (port 8082)  │
│ Round-robin  │   │  URI-hash    │
│ Sticky cookie│   │  (track_id)  │
└──────────────┘   └──────────────┘
```

## Why these balance modes

- **api_pool: `balance roundrobin` + `cookie SERVERID insert indirect`.** The Go API is stateless (sessions live in Redis), so any backend can serve any request. The cookie keeps a logged-in user pinned to one backend for the duration of the session, which makes WebSocket upgrades land on the same instance that authenticated the user — avoiding a Redis round-trip on every WS hello.
- **stream_pool: `balance uri whole` + `hash-type consistent`.** The Rust streamer keeps a hot HLS-segment cache in process. URI hashing routes the same track_id to the same node; consistent hashing means adding or removing a node only displaces ~`1/N` of the keys, not the entire pool.
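The ~`1/N` displacement claim is easy to sanity-check with a toy consistent-hash sketch. The snippet below uses Google's jump consistent hash — a different algorithm from HAProxy's internal one, used here only to illustrate the property — over hypothetical integer track IDs:

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash (Lamping & Veach): maps a 64-bit key to a
    bucket in [0, num_buckets), moving only ~1/N of keys when N changes."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) % 2**64
        j = int((b + 1) * (2**31) / ((key >> 33) + 1))
    return b

# Scale the stream pool from 2 to 3 nodes and count remapped track_ids.
track_ids = range(10_000)
before = {t: jump_hash(t, 2) for t in track_ids}
after = {t: jump_hash(t, 3) for t in track_ids}
moved = sum(1 for t in track_ids if before[t] != after[t])
print(f"remapped: {moved / 10_000:.1%}")  # ~1/3 of keys, not 100%
```

A naive `hash(uri) % N` would remap roughly `(N-1)/N` of the keys on any pool change, flushing almost every node's HLS cache at once.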

## Failover behaviour

- Health check `GET /api/v1/health` (or `/health` for stream) every `haproxy_health_check_interval_ms` ms (default 5 s). 3 consecutive failures = down; 2 consecutive successes = back up.
- `on-marked-down shutdown-sessions`: when a backend drops, all of its in-flight TCP/WS sessions are cut. Clients reconnect; the cookie targets the dead backend, so HAProxy ignores the dead pin and re-balances. WebSocket clients on the frontend (chat, presence) MUST handle the close + reconnect — that's already wired in `apps/web/src/features/chat/services/websocket.ts`.
- `slowstart {{ haproxy_graceful_drain_seconds }}s`: when a backend recovers, its weight ramps up linearly over 30 s instead of taking its full share of the traffic at once. Smooths the post-restart latency spike.
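With these defaults the worst-case detection time is fall × interval = 3 × 5 s = 15 s, plus up to one interval of check phase. A minimal model of the fall/rise counters (the class and field names are illustrative, not HAProxy internals):

```python
from dataclasses import dataclass

@dataclass
class HealthTracker:
    """Toy model of HAProxy's fall/rise counters for one server."""
    fall: int = 3      # consecutive failures before marking DOWN
    rise: int = 2      # consecutive successes before marking UP
    up: bool = True
    streak: int = 0    # current run of failures (when UP) or successes (when DOWN)

    def observe(self, check_ok: bool) -> bool:
        """Feed one health-check result; returns the server state after it."""
        if self.up:
            self.streak = 0 if check_ok else self.streak + 1
            if self.streak >= self.fall:
                self.up, self.streak = False, 0
        else:
            self.streak = self.streak + 1 if check_ok else 0
            if self.streak >= self.rise:
                self.up, self.streak = True, 0
        return self.up

t = HealthTracker()
results = [t.observe(ok) for ok in [False, False, False, True, True]]
print(results)  # [True, True, False, False, True]: DOWN after 3 fails, UP after 2 passes
```

That 15 s bound is also why the smoke test's DOWN poll uses a 30 s timeout: double the expected detection window.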

## Defaults

| variable                           | default         | meaning                                          |
| ---------------------------------- | --------------- | ------------------------------------------------ |
| `haproxy_listen_http`              | `80`            | HTTP listener                                    |
| `haproxy_listen_https`             | `443`           | HTTPS listener (only bound when a cert is set)   |
| `haproxy_tls_cert_path`            | `""`            | path to PEM (cert+key concat); empty = HTTP only |
| `haproxy_backend_api_port`         | `8080`          | upstream port for backend-api                    |
| `haproxy_stream_server_port`       | `8082`          | upstream port for stream-server                  |
| `haproxy_health_check_interval_ms` | `5000`          | active-check cadence                             |
| `haproxy_health_check_fall`        | `3`             | failed checks before "down"                      |
| `haproxy_health_check_rise`        | `2`             | successful checks before "up"                    |
| `haproxy_graceful_drain_seconds`   | `30`            | post-recovery weight ramp-up                     |
| `haproxy_sticky_cookie_name`       | `VEZA_SERVERID` | cookie name for backend stickiness               |

## Operations

```bash
# Health view via the admin socket (inside the haproxy container):
sudo socat /run/haproxy/admin.sock - <<< "show servers state"
sudo socat /run/haproxy/admin.sock - <<< "show stat"

# Disable a server gracefully (drains existing connections,
# new requests skip it; useful before a planned restart):
echo "set server api_pool/backend-api-1 state drain" | sudo socat /run/haproxy/admin.sock -
# ...wait haproxy_graceful_drain_seconds, then on the backend host:
#   sudo systemctl restart veza-backend-api
echo "set server api_pool/backend-api-1 state ready" | sudo socat /run/haproxy/admin.sock -

# Stats UI for a human (bound to localhost; tunnel in):
ssh -L 9100:localhost:9100 haproxy.lxd
# then open http://localhost:9100/stats

# Live log tail (HAProxy logs to journald via /dev/log):
sudo journalctl -u haproxy -f
```

## Failover smoke test

```bash
bash infra/ansible/tests/test_backend_failover.sh
```

Sequence: verifies the api_pool is healthy at start, kills `backend-api-1`, polls HAProxy until the server is marked DOWN, asserts requests during the window still get a 200 (served by `backend-api-2`), restarts the killed container, and asserts it rejoins as healthy. Suitable for the Day 24 game-day drill.

## What this role does NOT cover

- **TLS cert provisioning.** Phase-1 lab: HTTP only. Phase 2 mounts a Let's Encrypt cert from Caddy's data dir or obtains one directly via certbot. mTLS to the backends is W5 territory.
- **Multi-LB HA.** Single HAProxy node — if it dies, the cluster is dark. Phase 2 adds keepalived + a floating VIP.
- **Rate limiting.** The Gin middleware does that today; pushing it to the LB is a v1.1 optimisation.
- **WebSocket auth header passing.** HAProxy passes `Sec-WebSocket-*` headers through unchanged; Gin's middleware authenticates the upgrade request. No extra config needed.
54 infra/ansible/roles/haproxy/defaults/main.yml — Normal file
@@ -0,0 +1,54 @@
# haproxy defaults — TLS-terminating frontend + backend pools for the
# stateless backend-api fleet and the stream server. v1.0.9 W4 Day 19.
#
# Topology:
#
#   client → :443 HAProxy (TLS) → backend-api-1.lxd:8080
#                                → backend-api-2.lxd:8080
#                                → stream-server-1.lxd:8082 (track_id hash)
#                                → stream-server-2.lxd:8082
#
# WebSocket affinity: HAProxy sets the `SERVERID` cookie on the first
# response; subsequent requests (HTTP + WS upgrade) carry the cookie
# back to the same backend. The cookie survives across page loads, so
# a chat session reconnecting after a 30s pause typically lands on the
# same instance — but if the original instance is offline, the cookie
# is ignored and the next healthy backend takes over.
---
haproxy_version: "2.8"  # Ubuntu 22.04 ships 2.4; we explicitly install 2.8 from a PPA

# Listeners. v1.0 lab: HTTP only (TLS at the edge LB above us, or
# none in lab). Phase 2 enables TLS termination here once we have
# certs in /etc/haproxy/certs/veza.pem.
haproxy_listen_http: 80
haproxy_listen_https: 443
haproxy_listen_stats: 9100  # stats frontend port; bound to loopback only
haproxy_tls_cert_path: ""   # empty = HTTPS frontend disabled

# Backend API pool — port 8080 by default (Gin server in cmd/api).
# The inventory's `backend_api_instances` group drives the upstream
# server list; if absent, the role falls back to the static defaults
# below so the role is testable in isolation.
haproxy_backend_api_port: 8080
haproxy_backend_api_fallback:
  - backend-api-1
  - backend-api-2

# Stream server pool — port 8082 (Rust Axum). Uses URI-hash balance so
# the same track_id consistently lands on the same node, maximising the
# in-process HLS cache hit rate.
haproxy_stream_server_port: 8082
haproxy_stream_server_fallback:
  - stream-server-1
  - stream-server-2

# Health check cadence + drain — Day 19 acceptance asks for 5s checks
# and a 30s drain before removal.
haproxy_health_check_interval_ms: 5000
haproxy_health_check_fall: 3  # 3 failed checks = down
haproxy_health_check_rise: 2  # 2 passed checks = back up
haproxy_graceful_drain_seconds: 30

# Sticky cookie name. Rotating it invalidates existing pins and forces
# a rebalance — useful after a config change that reshapes the pool.
haproxy_sticky_cookie_name: "VEZA_SERVERID"
5 infra/ansible/roles/haproxy/handlers/main.yml — Normal file
@@ -0,0 +1,5 @@
---
- name: Reload haproxy
  ansible.builtin.systemd:
    name: haproxy
    state: reloaded
39 infra/ansible/roles/haproxy/tasks/main.yml — Normal file
@@ -0,0 +1,39 @@
# haproxy role — install HAProxy, render the config, and ensure the
# systemd unit is running. Idempotent.
---
- name: Install HAProxy + curl (the smoke test relies on it)
  ansible.builtin.apt:
    name:
      - haproxy
      - curl
    state: present
    update_cache: true
    cache_valid_time: 3600
  tags: [haproxy, packages]

- name: Ensure /etc/haproxy/certs exists (TLS certs land here)
  ansible.builtin.file:
    path: /etc/haproxy/certs
    state: directory
    owner: root
    group: haproxy
    mode: "0750"
  tags: [haproxy, config]

- name: Render haproxy.cfg
  ansible.builtin.template:
    src: haproxy.cfg.j2
    dest: /etc/haproxy/haproxy.cfg
    owner: root
    group: haproxy
    mode: "0640"
    validate: "haproxy -f %s -c -q"
  notify: Reload haproxy
  tags: [haproxy, config]

- name: Enable + start haproxy
  ansible.builtin.systemd:
    name: haproxy
    state: started
    enabled: true
  tags: [haproxy, service]
120 infra/ansible/roles/haproxy/templates/haproxy.cfg.j2 — Normal file
@@ -0,0 +1,120 @@
# Managed by Ansible — do not edit by hand.
# v1.0.9 W4 Day 19.

global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
    # Persist server state across reloads (paired with
    # load-server-state-from-file in the defaults section).
    server-state-file /var/lib/haproxy/server-state
    # ssl-default-bind-* tightens TLS to modern ciphers; lifted directly
    # from the Mozilla Intermediate profile. Only effective when a TLS
    # cert is mounted (see haproxy_tls_cert_path).
    ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11
    ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305

defaults
    log global
    mode http
    option httplog
    option dontlognull
    option forwardfor          # adds X-Forwarded-For so backend logs see the real client IP
    option http-server-close
    timeout connect 5s
    timeout client 60s
    timeout server 60s
    timeout tunnel 1h          # WS connections are long-lived; bumped from the default
    timeout client-fin 5s
    timeout http-keep-alive 15s
    timeout http-request 10s
    # Restore previous server state on reload so health checks don't
    # restart from scratch and the drain timer survives.
    load-server-state-from-file global

# -----------------------------------------------------------------------
# Stats endpoint — bound to loopback only so the Prometheus haproxy
# exporter (sidecar) can scrape it. Auth lives at the bridge layer.
# -----------------------------------------------------------------------
frontend stats
    bind 127.0.0.1:{{ haproxy_listen_stats }}
    stats enable
    stats uri /stats
    stats refresh 5s
    stats show-node
    stats show-legends
    no log

# -----------------------------------------------------------------------
# Frontend HTTP. v1.0 lab uses HTTP only; the HTTPS bind is rendered
# when haproxy_tls_cert_path is non-empty (Mozilla Intermediate).
# -----------------------------------------------------------------------
frontend veza_http_in
    bind *:{{ haproxy_listen_http }}
{% if haproxy_tls_cert_path %}
    bind *:{{ haproxy_listen_https }} ssl crt {{ haproxy_tls_cert_path }} alpn h2,http/1.1
    http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains"
    http-request redirect scheme https code 301 if !{ ssl_fc }
{% endif %}

    # Path-based routing:
    #   /api/v1/ws/*     → api_pool    (sticky cookie; carries chat WS)
    #   /api/v1/*        → api_pool    (also sticky so 401 → /me round-trips work)
    #   /tracks/* HLS    → stream_pool (URI hash for cache locality)
    #   everything else  → api_pool    (default)
    acl is_track path_beg /tracks/
    acl is_hls   path_end .m3u8 .ts .m4s

    use_backend stream_pool if is_track is_hls
    default_backend api_pool

# -----------------------------------------------------------------------
# Backend api_pool — Gin REST API. Sticky cookie + active health check.
# `cookie ... insert indirect nocache`: HAProxy sets the cookie on the
# first response, the browser sends it back, and subsequent requests
# stick to the same server. WS upgrades inherit it.
# -----------------------------------------------------------------------
backend api_pool
    balance roundrobin
    option httpchk GET /api/v1/health
    http-check expect status 200
    cookie {{ haproxy_sticky_cookie_name }} insert indirect nocache httponly secure
    default-server check inter {{ haproxy_health_check_interval_ms }} fall {{ haproxy_health_check_fall }} rise {{ haproxy_health_check_rise }} on-marked-down shutdown-sessions slowstart {{ haproxy_graceful_drain_seconds }}s

{% set api_hosts = (groups['backend_api_instances'] | default(haproxy_backend_api_fallback)) %}
{% for host in api_hosts %}
    server {{ host }} {{ host }}.lxd:{{ haproxy_backend_api_port }} cookie {{ host }}
{% endfor %}

# -----------------------------------------------------------------------
# Backend stream_pool — Rust Axum HLS. URI hash so the same track_id
# consistently lands on the same node, keeping the in-process HLS
# segment cache warm. `hash-type consistent` means adding/removing a
# node doesn't flush the entire pool.
# -----------------------------------------------------------------------
backend stream_pool
    balance uri whole
    hash-type consistent
    option httpchk GET /health
    http-check expect status 200
    default-server check inter {{ haproxy_health_check_interval_ms }} fall {{ haproxy_health_check_fall }} rise {{ haproxy_health_check_rise }} on-marked-down shutdown-sessions slowstart {{ haproxy_graceful_drain_seconds }}s

{% set stream_hosts = (groups['stream_server_instances'] | default(haproxy_stream_server_fallback)) %}
{% for host in stream_hosts %}
    server {{ host }} {{ host }}.lxd:{{ haproxy_stream_server_port }}
{% endfor %}
157 infra/ansible/tests/test_backend_failover.sh — Executable file
@@ -0,0 +1,157 @@
#!/usr/bin/env bash
|
||||
# test_backend_failover.sh — verify HAProxy fails over from backend-api-1
|
||||
# to backend-api-2 when the first instance dies, with no client-visible
|
||||
# error window beyond the health-check fall.
|
||||
#
|
||||
# Sequence :
|
||||
# 1. Pre-flight : both backends UP per HAProxy stats.
|
||||
# 2. Issue 5 GET /api/v1/health through HAProxy ; all should return 200.
|
||||
# Capture the SERVERID cookie to know which backend was chosen.
|
||||
# 3. incus stop --force backend-api-1 (or whoever the cookie pinned).
|
||||
# 4. Poll HAProxy stats until the killed server is marked DOWN
|
||||
# (typically within fall × interval = 3 × 5 s = 15 s).
|
||||
# 5. Issue another 5 GET /api/v1/health ; all must return 200, served
|
||||
# by the surviving backend.
|
||||
# 6. incus start backend-api-1 ; poll until UP again.
|
||||
#
|
||||
# v1.0.9 W4 Day 19 — acceptance for the verification gate.
|
||||
#
|
||||
# Usage :
|
||||
# bash infra/ansible/tests/test_backend_failover.sh
|
||||
#
|
||||
# Exit codes :
|
||||
# 0 — failover happened, no errors during the window
|
||||
# 1 — pool not healthy at start
|
||||
# 2 — failover took too long OR errors observed during the window
|
||||
# 3 — required tool missing
|
||||
set -euo pipefail

HAPROXY_HOST=${HAPROXY_HOST:-haproxy.lxd}
HAPROXY_PORT=${HAPROXY_PORT:-80}
KILL_BACKEND=${KILL_BACKEND:-backend-api-1}
SURVIVING_BACKEND=${SURVIVING_BACKEND:-backend-api-2}
HEALTH_PATH=${HEALTH_PATH:-/api/v1/health}
DOWN_TIMEOUT_SECONDS=${DOWN_TIMEOUT_SECONDS:-30}
UP_TIMEOUT_SECONDS=${UP_TIMEOUT_SECONDS:-60}

log() { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*" >&2; }
# fail <message> [exit-code] — log only the message (not the code), then exit.
fail() { log "FAIL: $1"; exit "${2:-2}"; }

require() {
  command -v "$1" >/dev/null 2>&1 || fail "required tool missing on host: $1" 3
}

require incus
require curl
require date

# -----------------------------------------------------------------------------
# Helper: ask the HAProxy admin socket for a server's status (UP / DOWN /
# DRAIN / MAINT). Queried via socat over the Unix socket at
# /run/haproxy/admin.sock inside the haproxy container.
# -----------------------------------------------------------------------------
server_status() {
  local server=$1
  incus exec haproxy -- bash -c \
    "echo 'show stat' | socat /run/haproxy/admin.sock - \
     | awk -F, -v s=\"$server\" '\$2 == s {print \$18; exit}'"
}

curl_via_lb() {
  local code
  code=$(curl --max-time 5 -sS -o /dev/null -w "%{http_code}" \
    "http://${HAPROXY_HOST}:${HAPROXY_PORT}${HEALTH_PATH}" || echo 000)
  echo "$code"
}

# -----------------------------------------------------------------------------
# 1. Pre-flight — both backends must be UP.
# -----------------------------------------------------------------------------
log "step 1: pre-flight — querying HAProxy admin socket"
status_kill=$(server_status "$KILL_BACKEND")
status_survive=$(server_status "$SURVIVING_BACKEND")
log "  $KILL_BACKEND : $status_kill"
log "  $SURVIVING_BACKEND : $status_survive"
if [ "$status_kill" != "UP" ] || [ "$status_survive" != "UP" ]; then
  fail "pool not fully UP at start — refusing to test from a degraded baseline" 1
fi

# -----------------------------------------------------------------------------
# 2. Sanity — 5 successful requests through the LB.
# -----------------------------------------------------------------------------
log "step 2: 5 baseline requests through HAProxy"
for i in 1 2 3 4 5; do
  code=$(curl_via_lb)
  log "  request $i → HTTP $code"
  if [ "$code" != "200" ]; then
    fail "baseline request $i returned HTTP $code, want 200" 1
  fi
done

# -----------------------------------------------------------------------------
# 3. Kill the backend container.
# -----------------------------------------------------------------------------
log "step 3: stopping $KILL_BACKEND — start failover timer"
t0=$(date +%s)
incus stop --force "$KILL_BACKEND"

# -----------------------------------------------------------------------------
# 4. Poll until HAProxy marks the killed server DOWN.
# -----------------------------------------------------------------------------
log "step 4: polling HAProxy until $KILL_BACKEND is DOWN (timeout ${DOWN_TIMEOUT_SECONDS}s)"
deadline=$((t0 + DOWN_TIMEOUT_SECONDS))
killed_down=0
while [ "$(date +%s)" -lt "$deadline" ]; do
  s=$(server_status "$KILL_BACKEND")
  if [ "$s" = "DOWN" ] || [ "$s" = "MAINT" ]; then
    killed_down=1
    break
  fi
  sleep 1
done
elapsed=$(( $(date +%s) - t0 ))
if [ "$killed_down" -eq 0 ]; then
  fail "$KILL_BACKEND not marked DOWN within ${DOWN_TIMEOUT_SECONDS}s" 2
fi
log "  $KILL_BACKEND went DOWN in ${elapsed}s"

# -----------------------------------------------------------------------------
# 5. 5 requests through the LB — all must succeed via the surviving backend.
# -----------------------------------------------------------------------------
log "step 5: 5 requests through HAProxy with $KILL_BACKEND down"
errors=0
for i in 1 2 3 4 5; do
  code=$(curl_via_lb)
  log "  request $i → HTTP $code"
  if [ "$code" != "200" ]; then
    errors=$((errors + 1))
  fi
done
if [ "$errors" -gt 0 ]; then
  fail "$errors of 5 requests failed during failover — survivor isn't catching all traffic" 2
fi

# -----------------------------------------------------------------------------
# 6. Restart the killed backend and confirm it rejoins as UP.
# -----------------------------------------------------------------------------
log "step 6: restarting $KILL_BACKEND"
incus start "$KILL_BACKEND" || true
log "  polling until $KILL_BACKEND is UP again (timeout ${UP_TIMEOUT_SECONDS}s)"
deadline=$(( $(date +%s) + UP_TIMEOUT_SECONDS ))
recovered=0
while [ "$(date +%s)" -lt "$deadline" ]; do
  s=$(server_status "$KILL_BACKEND")
  if [ "$s" = "UP" ]; then
    recovered=1
    break
  fi
  sleep 2
done
if [ "$recovered" -eq 0 ]; then
  log "WARN: $KILL_BACKEND did not return to UP within ${UP_TIMEOUT_SECONDS}s — manual check needed"
else
  log "  $KILL_BACKEND back UP"
fi

log "PASS: HAProxy fail-over OK ($KILL_BACKEND down in ${elapsed}s, no client-visible errors during the window)"
exit 0
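A note on the `server_status` helper: it relies on HAProxy's `show stat` CSV layout, where field 2 (`svname`) is the server name and field 18 (`status`) is the health state. The same awk extraction can be sketched offline against a hand-made sample row (the data values are invented; only the column positions matter):

```shell
#!/usr/bin/env bash
# Hypothetical "show stat" CSV row: pxname, svname, then the counter
# columns, with the status (UP/DOWN/...) in field 18 — mirroring the
# parsing done by server_status() in the test above.
sample='api_pool,backend-api-1,0,0,1,2,,100,200,300,,0,,0,0,0,0,UP,1,1'
echo "$sample" | awk -F, -v s="backend-api-1" '$2 == s {print $18; exit}'
# → UP
```

Because the helper matches on `$2`, the same command works unchanged against the full multi-row output of `show stat`; the `exit` keeps only the first matching server.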