feat(infra): haproxy sticky WS + backend_api multi-instance scaffold (W4 Day 19)
Phase-1 of the active/active backend story. HAProxy in front of two
backend-api containers + two stream-server containers; a sticky cookie
pins WS sessions to one backend, URI hash routes track_id to one
streamer for HLS cache locality.
Day 19 acceptance asks for: kill backend-api-1, HAProxy fails over, WS
sessions reconnect to backend-api-2 without loss. The smoke test wires
that gate; phase-2 (W5) will add keepalived for an LB pair.
- infra/ansible/roles/haproxy/
* Install HAProxy + render haproxy.cfg with frontend (HTTP, optional
HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky
cookie SERVERID), stream_pool (URI-hash + consistent jump-hash).
* Active health check GET /api/v1/health every 5s; fall=3, rise=2.
on-marked-down shutdown-sessions + slowstart 30s on recovery.
* Stats socket bound to 127.0.0.1:9100 for the future Prometheus
haproxy_exporter sidecar.
* Mozilla Intermediate TLS cipher list; only effective when a cert
is mounted.
- infra/ansible/roles/backend_api/
* Scaffolding for the multi-instance Go API. Creates veza-api
system user, /opt/veza/backend-api dir, /etc/veza env dir,
/var/log/veza, and a hardened systemd unit pointing at the binary.
* Binary deployment is OUT of scope (documented in README) — the
Go binary is built outside Ansible (Makefile target) and pushed
via incus file push. CI → ansible-pull integration is W5+.
- infra/ansible/playbooks/haproxy.yml: provisions the haproxy Incus
container + applies common baseline + role.
- infra/ansible/inventory/lab.yml: 3 new groups:
* haproxy (single LB node)
* backend_api_instances (backend-api-{1,2})
* stream_server_instances (stream-server-{1,2})
The HAProxy template reads these groups directly to populate its
upstream blocks; falls back to the static haproxy_backend_api_fallback
list if the group is missing (for in-isolation tests).
- infra/ansible/tests/test_backend_failover.sh
* step 0: pre-flight — both backends UP per the HAProxy stats socket.
* step 1: 5 baseline GET /api/v1/health through the LB → all 200.
* step 2: incus stop --force backend-api-1; record t0.
* step 3: poll HAProxy stats until backend-api-1 is DOWN
(timeout 30s; expected ~15s = fall × interval).
* step 4: 5 GET requests during the down window — all must return 200
(served by backend-api-2). Fails if any returns non-200.
* step 5: incus start backend-api-1; poll until UP again.
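The step-4 gate can be factored as a pure check, sketched below. This is an illustration, not the script itself — in test_backend_failover.sh the status codes would come from `curl -s -o /dev/null -w '%{http_code}'` against the LB's /api/v1/health endpoint.

```shell
# Sketch of the step-4 gate: given the HTTP status codes observed
# through the LB during the down window, pass only if every one is 200.
gate_all_200() {
  for code in "$@"; do
    [ "$code" = "200" ] || return 1
  done
  return 0
}

gate_all_200 200 200 200 200 200 && echo "gate PASS"  # all five OK
gate_all_200 200 503 200 200 200 || echo "gate FAIL"  # one 503 trips it
```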
Acceptance (Day 19): smoke test passes; the HAProxy sticky cookie
keeps WS sessions on the same backend until that backend dies, at
which point the cookie is ignored and the request rebalances.
W4 progress: Day 16 done · Day 17 done · Day 18 done · Day 19 done ·
Day 20 (k6 nightly load test) pending.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 09:32:48 +00:00
# haproxy role — install HAProxy 2.8, render the config, ensure the
# systemd unit is running. Idempotent.
#
# Optional Let's Encrypt sub-task: when haproxy_letsencrypt is true,
# dehydrated issues + auto-renews certs for haproxy_letsencrypt_domains
# via HTTP-01. Wildcards are NOT supported (need DNS-01) — list
# subdomains explicitly. Internal services on talas.group should NOT
# use this flow; the trust boundary there is the WireGuard mesh.
---
- name: Install HAProxy + curl (smoke test relies on it)
  ansible.builtin.apt:
    name:
      - haproxy
      - curl
    state: present
    update_cache: true
    cache_valid_time: 3600
  tags: [haproxy, packages]

- name: Ensure /etc/haproxy/certs exists (TLS terminations land here)
  ansible.builtin.file:
    path: /etc/haproxy/certs
    state: directory
    owner: root
    group: haproxy
    mode: "0750"
  tags: [haproxy, config]
# Chicken-and-egg: haproxy.cfg.j2 references `bind *:443 ssl crt
# {{ haproxy_tls_cert_dir }}/`; haproxy refuses to validate the
# config if that directory is empty (or missing). dehydrated creates
# real LE certs there LATER (in letsencrypt.yml). To break the cycle,
# pre-create the dir with a 30-day self-signed placeholder cert.
# The placeholder is overwritten / shadowed once dehydrated lands;
# SNI picks the matching real cert.
- name: Ensure TLS cert dir + placeholder cert exist (gates the haproxy.cfg validate)
  when: haproxy_letsencrypt | default(false)
  block:
    - name: Ensure {{ haproxy_tls_cert_dir }} exists
      ansible.builtin.file:
        path: "{{ haproxy_tls_cert_dir }}"
        state: directory
        owner: root
        group: haproxy
        mode: "0750"

    - name: Generate self-signed placeholder cert if dir is empty
      ansible.builtin.shell: |
        set -e
        if ls "{{ haproxy_tls_cert_dir }}"/*.pem >/dev/null 2>&1; then
          echo "cert already present"
          exit 0
        fi
        openssl req -x509 -nodes -newkey rsa:2048 \
          -keyout /tmp/_placeholder.key \
          -out /tmp/_placeholder.crt \
          -days 30 \
          -subj '/CN=placeholder.veza.local' >/dev/null 2>&1
        cat /tmp/_placeholder.crt /tmp/_placeholder.key \
          > "{{ haproxy_tls_cert_dir }}/_placeholder.pem"
        chmod 0640 "{{ haproxy_tls_cert_dir }}/_placeholder.pem"
        chown root:haproxy "{{ haproxy_tls_cert_dir }}/_placeholder.pem"
        rm -f /tmp/_placeholder.key /tmp/_placeholder.crt
        echo "placeholder cert generated"
      register: placeholder_cert
      changed_when: "'placeholder cert generated' in placeholder_cert.stdout"
  tags: [haproxy, config, letsencrypt]
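The placeholder step can be exercised standalone. The sketch below reproduces its core moves in a throwaway temp dir (standing in for `{{ haproxy_tls_cert_dir }}`): generate a self-signed pair, then concatenate cert + key into the single .pem file HAProxy expects.

```shell
# Standalone sketch of the placeholder logic; paths are illustrative.
dir=$(mktemp -d)
openssl req -x509 -nodes -newkey rsa:2048 \
  -keyout "$dir/ph.key" -out "$dir/ph.crt" \
  -days 30 -subj '/CN=placeholder.veza.local' 2>/dev/null
# HAProxy wants one PEM per cert: certificate first, then the key.
cat "$dir/ph.crt" "$dir/ph.key" > "$dir/_placeholder.pem"
openssl x509 -in "$dir/_placeholder.pem" -noout -subject -enddate
```

Running it prints the placeholder subject and a notAfter date ~30 days out, confirming the bundle parses as a certificate.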
|
|
|
- name: Render haproxy.cfg
|
|
|
|
|
ansible.builtin.template:
|
|
|
|
|
src: haproxy.cfg.j2
|
|
|
|
|
dest: /etc/haproxy/haproxy.cfg
|
|
|
|
|
owner: root
|
|
|
|
|
group: haproxy
|
|
|
|
|
mode: "0640"
|
2026-04-30 14:06:50 +00:00
|
|
|
# No -q so the actual validation error reaches the operator's
|
|
|
|
|
# console. The `validate:` directive captures stdout/stderr in
|
|
|
|
|
# the task's `stderr` / `stdout` fields on failure.
|
|
|
|
|
validate: "haproxy -f %s -c"
|
2026-04-29 13:54:05 +00:00
|
|
|
register: haproxy_config
|
  notify: Reload haproxy
  tags: [haproxy, config]

- name: Set haproxy_config_changed fact (consumed by letsencrypt.yml)
  ansible.builtin.set_fact:
    haproxy_config_changed: "{{ haproxy_config.changed }}"
  tags: [haproxy, config]
|
|
|
- name: Enable + start haproxy
|
|
|
|
|
ansible.builtin.systemd:
|
|
|
|
|
name: haproxy
|
|
|
|
|
state: started
|
|
|
|
|
enabled: true
|
|
|
|
|
tags: [haproxy, service]
|
2026-04-29 13:54:05 +00:00
|
|
|
|
|
|
|
|
- name: Issue + auto-renew Let's Encrypt certs (HTTP-01 via dehydrated)
|
|
|
|
|
ansible.builtin.import_tasks: letsencrypt.yml
|
|
|
|
|
when: haproxy_letsencrypt | default(false)
|
|
|
|
|
tags: [haproxy, letsencrypt]
|