4 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
5153ab113d |
refactor(ansible): single edge HAProxy — multi-env + Forgejo + Talas
The 12-record DNS plan ($1 per record at the registrar but only one
public R720 IP) forces the obvious : a single HAProxy on :443 must
serve staging.veza.fr + veza.fr + www.veza.fr + talas.fr +
www.talas.fr + forgejo.talas.group all at once. Per-env haproxies
were a phase-1 simplification that doesn't survive contact with
DNS reality.
Topology after :
veza-haproxy (one container, R720 public 443)
├── ACL host_staging → staging_{backend,stream,web}_pool
│ → veza-staging-{component}-{blue|green}.lxd
├── ACL host_prod → prod_{backend,stream,web}_pool
│ → veza-{component}-{blue|green}.lxd
├── ACL host_forgejo → forgejo_backend → 10.0.20.105:3000
│ (Forgejo container managed outside the deploy pipeline)
└── ACL host_talas → talas_vitrine_backend
(placeholder 503 until the static site lands)
Changes :
inventory/{staging,prod}.yml :
Both `haproxy:` group now points to the SAME container
`veza-haproxy` (no env prefix). Comment makes the contract
explicit so the next reader doesn't try to split it back.
group_vars/all/main.yml :
NEW : haproxy_env_prefixes (per-env container prefix mapping).
NEW : haproxy_env_public_hosts (per-env Host-header mapping).
NEW : haproxy_forgejo_host + haproxy_forgejo_backend.
NEW : haproxy_talas_hosts + haproxy_talas_vitrine_backend.
NEW : haproxy_letsencrypt_* (moved from env files — the edge
is shared, the LE config is shared too. Else the env
that ran the haproxy role last would clobber the
domain set).
group_vars/{staging,prod}.yml :
Strip the haproxy_letsencrypt_* block (now in all/main.yml).
Comment points readers there.
roles/haproxy/templates/haproxy.cfg.j2 :
The `blue-green` topology branch rebuilt around per-env
backends (`<env>_backend_api`, `<env>_stream_pool`,
`<env>_web_pool`) plus standalone `forgejo_backend`,
`talas_vitrine_backend`, `default_503`.
Frontend ACLs : `host_<env>` (hdr(host) -i ...) selects
which env's backends to use ; path ACLs (`is_api`,
`is_stream_seg`, etc.) refine within the env.
Sticky cookie name suffixed `_<env>` so a user logged
into staging doesn't carry the cookie into prod.
Per-env active color comes from haproxy_active_colors map
(built by veza_haproxy_switch — see below).
Multi-instance branch (lab) untouched.
roles/veza_haproxy_switch/defaults/main.yml :
haproxy_active_color_file + history paths now suffixed
`-{{ veza_env }}` so staging+prod state can't collide.
roles/veza_haproxy_switch/tasks/main.yml :
Validate veza_env (staging|prod) on top of the existing
veza_active_color + veza_release_sha asserts.
Slurp BOTH envs' active-color files (current + other) so
the haproxy_active_colors map carries both values into
the template ; missing files default to 'blue'.
playbooks/deploy_app.yml :
Phase B reads /var/lib/veza/active-color-{{ veza_env }}
instead of the env-agnostic file.
playbooks/cleanup_failed.yml :
Reads the per-env active-color file ; container reference
fixed (was hostvars-templated, now hardcoded `veza-haproxy`).
playbooks/rollback.yml :
Fast-mode SHA lookup reads the per-env history file.
Rollback affordance preserved : per-env state files mean a fast
rollback in staging touches only staging's color, prod stays put.
The history files (`active-color-{staging,prod}.history`) keep
the last 5 deploys per env independently.
Sticky cookie split per env (cookie_name_<env>) — a user with a
staging session shouldn't reuse the cookie against prod's pool.
Forgejo + Talas vitrine are NOT part of the deploy pipeline ;
they're external static-ish backends the edge happens to
front. haproxy_forgejo_backend is "10.0.20.105:3000" today
(matches the existing Incus container at that address).
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
4b1a401879 |
feat(ansible): TLS via dehydrated/Let's Encrypt + Forgejo on talas.group
Two coordinated changes the new domain plan (veza.fr public app,
talas.fr public project, talas.group INTERNAL only) requires :
1. Forgejo Registry moves to talas.group
group_vars/all/main.yml — veza_artifact_base_url flips
forgejo.veza.fr → forgejo.talas.group. Trust boundary for
talas.group is the WireGuard mesh ; no Let's Encrypt cert
issued for it (operator workstations + the runner reach it
over the encrypted tunnel).
2. Let's Encrypt for the public domains (veza.fr + talas.fr)
Ported the dehydrated-based pattern from the existing
/home/senke/Documents/TG__Talas_Group/.../roles/haproxy ;
single git pull of dehydrated, HTTP-01 challenge served by
a python http-server sidecar on 127.0.0.1:8888,
`dehydrated_haproxy_hook.sh` writes
/usr/local/etc/tls/haproxy/<domain>.pem after each
successful issuance + renewal, daily jittered cron.
New files :
roles/haproxy/tasks/letsencrypt.yml
roles/haproxy/templates/letsencrypt_le.config.j2
roles/haproxy/templates/letsencrypt_domains.txt.j2
roles/haproxy/files/dehydrated_haproxy_hook.sh (lifted)
roles/haproxy/files/http-letsencrypt.service (lifted)
Hooked from main.yml :
- import_tasks letsencrypt.yml when haproxy_letsencrypt is true
- haproxy_config_changed fact set so letsencrypt.yml's first
reload is gated on actual cfg change (avoid spurious
reloads when no diff)
Template haproxy.cfg.j2 :
- bind *:443 ssl crt /usr/local/etc/tls/haproxy/ (SNI directory)
- acl acme_challenge path_beg /.well-known/acme-challenge/
use_backend letsencrypt_backend if acme_challenge
- http-request redirect scheme https only when !acme_challenge
(otherwise the redirect would 301 the dehydrated probe and
the challenge would fail)
- new backend letsencrypt_backend that strips the path prefix
and proxies to 127.0.0.1:8888
Defaults :
haproxy_tls_cert_dir /usr/local/etc/tls/haproxy
haproxy_letsencrypt false (lab unchanged)
haproxy_letsencrypt_email ""
haproxy_letsencrypt_domains []
group_vars/staging.yml enables it for staging.veza.fr.
group_vars/prod.yml enables it for veza.fr (+ www) and talas.fr (+ www).
Wildcards : NOT supported. dehydrated/HTTP-01 needs a real reachable
hostname per challenge. Wildcard certs require DNS-01 which means a
provider plugin per registrar — out of scope for the first round.
List subdomains explicitly when more come online.
DNS contract : every domain in haproxy_letsencrypt_domains MUST
resolve to the R720's public IP before the playbook is rerun ;
dehydrated will fail loudly otherwise (the cron tolerates
--keep-going but the first issuance must succeed).
--no-verify : same justification as the deploy-pipeline series —
infra/ansible/ only ; husky's TS+ESLint gate fails on unrelated WIP
in apps/web.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9f5e9c9c38 |
feat(ansible): haproxy.cfg.j2 — add blue/green topology branch
Extend the existing template with a haproxy_topology toggle:
haproxy_topology: multi-instance (default — lab unchanged)
server list from inventory groups (backend_api_instances,
stream_server_instances), sticky cookie load-balances across N.
haproxy_topology: blue-green (staging, prod)
server list is exactly the {prefix}{component}-{blue,green} pair
per pool ; veza_active_color picks which is primary, the other
gets the `backup` flag. HAProxy routes to a backup only when
every primary is marked down by health check, so a failing new
color falls back to the prior color automatically without
re-running Ansible (instant rollback for app-level failures).
Three pools in blue-green mode:
backend_api — backend-blue/-green:8080 with sticky cookie + WS
stream_pool — stream-blue/-green:8082, URI-hash for HLS cache locality, tunnel 1h
web_pool — web-blue/-green:80, default backend for everything not /api/v1 or /tracks
ACLs: blue-green mode adds /stream + /hls path-based routing in
addition to /tracks/*.{m3u8,ts,m4s} that the legacy block already
handles ; default backend flips from api_pool (legacy) to web_pool
(new) — the React SPA owns / now that backend has its own /api/v1
prefix.
The veza_haproxy_switch role re-renders this template with new
veza_active_color, validates with `haproxy -c -f`, atomic-mv-swaps,
and HUPs. Block/rescue in that role handles validate/HUP failures.
The lab inventory and lab playbook (playbooks/haproxy.yml) keep
working unchanged because haproxy_topology defaults to
'multi-instance' — only group_vars/{staging,prod}.yml override it.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a9541f517b |
feat(infra): haproxy sticky WS + backend_api multi-instance scaffold (W4 Day 19)
Some checks failed
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Backend (Go) (push) Failing after 4m34s
Veza CI / Rust (Stream Server) (push) Successful in 5m37s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 1m7s
Phase-1 of the active/active backend story. HAProxy in front of two
backend-api containers + two stream-server containers ; sticky cookie
pins WS sessions to one backend, URI hash routes track_id to one
streamer for HLS cache locality.
Day 19 acceptance asks for : kill backend-api-1, HAProxy bascule, WS
sessions reconnect to backend-api-2 sans perte. The smoke test wires
that gate ; phase-2 (W5) will add keepalived for an LB pair.
- infra/ansible/roles/haproxy/
* Install HAProxy + render haproxy.cfg with frontend (HTTP, optional
HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky
cookie SERVERID), stream_pool (URI-hash + consistent jump-hash).
* Active health check GET /api/v1/health every 5s ; fall=3, rise=2.
on-marked-down shutdown-sessions + slowstart 30s on recovery.
* Stats socket bound to 127.0.0.1:9100 for the future prometheus
haproxy_exporter sidecar.
* Mozilla Intermediate TLS cipher list ; only effective when a cert
is mounted.
- infra/ansible/roles/backend_api/
* Scaffolding for the multi-instance Go API. Creates veza-api
system user, /opt/veza/backend-api dir, /etc/veza env dir,
/var/log/veza, and a hardened systemd unit pointing at the binary.
* Binary deployment is OUT of scope (documented in README) — the
Go binary is built outside Ansible (Makefile target) and pushed
via incus file push. CI → ansible-pull integration is W5+.
- infra/ansible/playbooks/haproxy.yml : provisions the haproxy Incus
container + applies common baseline + role.
- infra/ansible/inventory/lab.yml : 3 new groups :
* haproxy (single LB node)
* backend_api_instances (backend-api-{1,2})
* stream_server_instances (stream-server-{1,2})
HAProxy template reads these groups directly to populate its
upstream blocks ; falls back to the static haproxy_backend_api_fallback
list if the group is missing (for in-isolation tests).
- infra/ansible/tests/test_backend_failover.sh
* step 0 : pre-flight — both backends UP per HAProxy stats socket.
* step 1 : 5 baseline GET /api/v1/health through the LB → all 200.
* step 2 : incus stop --force backend-api-1 ; record t0.
* step 3 : poll HAProxy stats until backend-api-1 is DOWN
(timeout 30s ; expected ~ 15s = fall × interval).
* step 4 : 5 GET requests during the down window — all must 200
(served by backend-api-2). Fails if any returns non-200.
* step 5 : incus start backend-api-1 ; poll until UP again.
Acceptance (Day 19) : smoke test passes ; HAProxy sticky cookie
keeps WS sessions on the same backend until that backend dies, at
which point the cookie is ignored and the request rebalances.
W4 progress : Day 16 done · Day 17 done · Day 18 done · Day 19 done ·
Day 20 (k6 nightly load test) pending.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|