senke/veza - Talas Project: Beyond coding. We Forge.

senke/veza

Author	SHA1	Message	Date
senke	0bd3e563b2	fix(haproxy): incus proxy devices forward R720:80/443 → container

The Orange box NAT correctly forwards :80/:443 → R720 LAN IP, but the R720 host has nothing listening there — haproxy lives in the veza-haproxy container, reachable only on the net-veza bridge (10.0.20.X). Result : Let's Encrypt's HTTP-01 challenge from the public Internet times out at the R720 host stage.

Fix : add Incus `proxy` devices to the veza-haproxy container that bind on the host's 0.0.0.0:80 / 0.0.0.0:443 and forward into the container's local ports. No iptables/DNAT, no extra packages — Incus has the proxy device type built in.

    incus config device add veza-haproxy http proxy \
        listen=tcp:0.0.0.0:80 connect=tcp:127.0.0.1:80
    incus config device add veza-haproxy https proxy \
        listen=tcp:0.0.0.0:443 connect=tcp:127.0.0.1:443

Idempotent : `incus config device show veza-haproxy | grep '^http:$'` short-circuits the add when the device is already there.

Operator setup unchanged : box NAT 80/443 → R720 LAN IP. Ansible now bridges the rest of the path automatically.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:27:37 +02:00
senke	e97b91f010	fix(ansible): don't apply common role to haproxy container + gate ssh.yml on sshd

Two fixes for "haproxy container doesn't have sshd" :

1. playbooks/haproxy.yml — drop the `common` role play. The role's purpose is to harden a full HOST (SSH + fail2ban monitoring auth.log + node_exporter metrics surface). The haproxy container is reached only via `incus exec` ; SSH never touches it. Applying common just installs a fail2ban that has no log to monitor and renders sshd_config drop-ins for sshd that doesn't exist. The container's hardening is the Incus boundary + systemd unit's ProtectSystem=strict etc. (already in the templates).

2. roles/common/tasks/ssh.yml — gate every task on sshd presence. `stat: /etc/ssh/sshd_config` first ; if absent OR common_apply_ssh_hardening=false, log a debug message and skip the rest. Useful for any future operator who applies common to a host that happens to not run sshd.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:57:16 +02:00
senke	b9445faacc	fix(infra): rename veza-net → net-veza everywhere + drop redundant profile

The R720 has 5 managed Incus bridges, organized by trust zone :

    net-ad       10.0.50.0/24  admin
    net-dmz      10.0.10.0/24  DMZ
    net-sandbox  10.0.30.0/24  sandbox
    net-veza     10.0.20.0/24  Veza (forgejo + 12 other containers)
    incusbr0     10.0.0.0/24   default

Veza belongs on `net-veza`. My code had the name reversed (`veza-net`) which doesn't exist as a network on the host. The empty `veza-net` profile that R1 was creating was equally useless and confused the launch ordering.

Changes :

* group_vars/staging.yml
    veza_incus_network : veza-staging-net → net-veza
    veza_incus_subnet : 10.0.21.0/24 → 10.0.20.0/24
  Comment block explains why staging+prod share net-veza in v1.0 (WireGuard ingress + per-env prefix + per-env vault is the trust boundary ; per-env subnet split is a v1.1 hardening) and how to flip to a dedicated bridge later.
* group_vars/prod.yml
    veza_incus_network : veza-net → net-veza
* playbooks/haproxy.yml
    incus launch ... --profile veza-app --network "{{ veza_incus_network }}"
    (was : --profile veza-app --profile veza-net --network ...)
* playbooks/deploy_data.yml + deploy_app.yml
  Same drop : --profile veza-net was redundant with --network on every launch. Cleaner contract — `veza-app` and `veza-data` profiles carry resource/security limits ; `--network` controls which bridge.
* scripts/bootstrap/bootstrap-remote.sh R1
  Stop creating the `veza-net` profile. Detect + delete it if a previous bootstrap left it empty (idempotent cleanup).

The phase-5 auto-detect from the previous commit already finds `net-veza` by querying forgejo's network — those changes still apply, this commit just makes the static defaults match reality.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:58:04 +02:00
senke	ab86ae80fa	fix(ansible): playbooks/haproxy.yml — bootstrap the SHARED veza-haproxy

Two drift-fixes between the bootstrap playbook and the rest of the W5 deploy pipeline :

* Container name : `haproxy` → `veza-haproxy`
  inventory/{staging,prod}.yml's haproxy group now points at `veza-haproxy` ; the bootstrap was still creating an unprefixed `haproxy` and the role would never reach it.
* Base image : `images:ubuntu/22.04` → `images:debian/13`
  Matches the rest of the deploy pipeline (veza_app_base_image default in group_vars/all/main.yml). The role expects Debian-style apt + systemd unit names.
* Profiles : `incus launch` now applies `--profile veza-app --profile veza-net --network <veza_incus_network>` like every other container the pipeline creates. Prevents a barebones container that doesn't get the Veza network policy.
* Cloud-init wait : drop the `cloud-init status` poll (Debian base image's cloud-init is minimal anyway) ; replace with a direct `incus exec veza-haproxy -- /bin/true` reachability loop, same pattern as deploy_data.yml's launch task.

The third play sets `haproxy_topology: blue-green` explicitly so the edge always renders the multi-env topology, even when run from `inventory/lab.yml` (which lacks the env-prefix vars and would otherwise fall through to the multi-instance branch).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:34:38 +02:00
senke	a9541f517b	feat(infra): haproxy sticky WS + backend_api multi-instance scaffold (W4 Day 19)

Phase-1 of the active/active backend story. HAProxy in front of two backend-api containers + two stream-server containers ; sticky cookie pins WS sessions to one backend, URI hash routes track_id to one streamer for HLS cache locality.

Day 19 acceptance asks for : kill backend-api-1, HAProxy fails over, WS sessions reconnect to backend-api-2 without loss. The smoke test wires that gate ; phase-2 (W5) will add keepalived for an LB pair.

- infra/ansible/roles/haproxy/
  * Install HAProxy + render haproxy.cfg with frontend (HTTP, optional HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky cookie SERVERID), stream_pool (URI-hash + consistent jump-hash).
  * Active health check GET /api/v1/health every 5s ; fall=3, rise=2. on-marked-down shutdown-sessions + slowstart 30s on recovery.
  * Stats socket bound to 127.0.0.1:9100 for the future prometheus haproxy_exporter sidecar.
  * Mozilla Intermediate TLS cipher list ; only effective when a cert is mounted.
- infra/ansible/roles/backend_api/
  * Scaffolding for the multi-instance Go API. Creates veza-api system user, /opt/veza/backend-api dir, /etc/veza env dir, /var/log/veza, and a hardened systemd unit pointing at the binary.
  * Binary deployment is OUT of scope (documented in README) — the Go binary is built outside Ansible (Makefile target) and pushed via incus file push. CI → ansible-pull integration is W5+.
- infra/ansible/playbooks/haproxy.yml : provisions the haproxy Incus container + applies common baseline + role.
- infra/ansible/inventory/lab.yml : 3 new groups :
  * haproxy (single LB node)
  * backend_api_instances (backend-api-{1,2})
  * stream_server_instances (stream-server-{1,2})
  HAProxy template reads these groups directly to populate its upstream blocks ; falls back to the static haproxy_backend_api_fallback list if the group is missing (for in-isolation tests).
- infra/ansible/tests/test_backend_failover.sh
  * step 0 : pre-flight — both backends UP per HAProxy stats socket.
  * step 1 : 5 baseline GET /api/v1/health through the LB → all 200.
  * step 2 : incus stop --force backend-api-1 ; record t0.
  * step 3 : poll HAProxy stats until backend-api-1 is DOWN (timeout 30s ; expected ~ 15s = fall × interval).
  * step 4 : 5 GET requests during the down window — all must 200 (served by backend-api-2). Fails if any returns non-200.
  * step 5 : incus start backend-api-1 ; poll until UP again.

Acceptance (Day 19) : smoke test passes ; HAProxy sticky cookie keeps WS sessions on the same backend until that backend dies, at which point the cookie is ignored and the request rebalances.

W4 progress : Day 16 done · Day 17 done · Day 18 done · Day 19 done · Day 20 (k6 nightly load test) pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 11:32:48 +02:00
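
The "expected ~ 15s = fall × interval" figure in step 3 of commit a9541f517b follows from the health-check settings: with fall=3 and a 5s interval, a stopped backend should be marked DOWN after roughly 3 × 5 = 15s. The sketch below illustrates that arithmetic and a polling loop of the kind step 3 describes. It is not the contents of test_backend_failover.sh: the function names are hypothetical, and it assumes the stats socket answers `show stat` with HAProxy's CSV output, where the status is the 18th field.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from the repository.

# Approximate detection time for an active health check:
# `fall` consecutive failed checks spaced `inter` seconds apart.
expected_detection_seconds() {
    fall=$1
    inter=$2
    echo $((fall * inter))
}

# Read `show stat` CSV on stdin and print the status column (field 18)
# for one proxy/server pair, e.g. "UP" or "DOWN".
server_status() {
    awk -F, -v px="$1" -v sv="$2" '$1 == px && $2 == sv { print $18 }'
}

# Run the probe command once per second until it prints the wanted status.
# Returns 0 on match, 1 once `timeout` seconds have elapsed.
wait_for_status() {
    want=$1
    timeout=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if [ "$("$@")" = "$want" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

Against a live load balancer, a probe could be built as `echo "show stat" | socat - TCP:127.0.0.1:9100 | server_status api_pool backend-api-1`, assuming socat is installed and that the rendered config names the backend `api_pool` and the server `backend-api-1`. The loop matches the status exactly, so a transitional value such as `DOWN 1/2` would not count as DOWN.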

5 commits
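
The idempotency guard described in commit 0bd3e563b2 (check `incus config device show` for the device key before adding it) can be sketched as a small shell wrapper. This is an illustration under assumptions, not the Ansible task from the repository: `has_device` and `ensure_proxy_device` are hypothetical names, and only the two `incus config device` subcommands quoted in the commit message are used.

```shell
#!/bin/sh
# Hypothetical helpers; the repository implements this in Ansible.

# Return 0 when the YAML on stdin (as printed by
# `incus config device show <container>`) has a top-level key named $1.
# The anchored pattern keeps "http" from matching an "https:" line.
has_device() {
    grep -q "^$1:\$"
}

# Add a proxy device that listens on the host and forwards to the same
# port inside the container, unless a device with that name already exists.
ensure_proxy_device() {
    container=$1
    name=$2
    port=$3
    if incus config device show "$container" | has_device "$name"; then
        echo "device $name already present, skipping"
        return 0
    fi
    incus config device add "$container" "$name" proxy \
        "listen=tcp:0.0.0.0:$port" "connect=tcp:127.0.0.1:$port"
}
```

Calling `ensure_proxy_device veza-haproxy http 80` and `ensure_proxy_device veza-haproxy https 443` would then be safe to repeat on every run, since the second invocation finds the key and skips the add.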