Commit graph

24 commits

Author SHA1 Message Date
senke
0bd3e563b2 fix(haproxy): incus proxy devices forward R720:80/443 → container
The Orange box NAT correctly forwards :80/:443 → R720 LAN IP, but
the R720 host has nothing listening there — haproxy lives in the
veza-haproxy container, reachable only on the net-veza bridge
(10.0.20.X). Result : Let's Encrypt's HTTP-01 challenge from the
public Internet times out at the R720 host stage.

Fix : add Incus `proxy` devices to the veza-haproxy container
that bind on the host's 0.0.0.0:80 / 0.0.0.0:443 and forward into
the container's local ports. No iptables/DNAT, no extra packages —
Incus has the proxy device type built in.

  incus config device add veza-haproxy http  proxy \
      listen=tcp:0.0.0.0:80  connect=tcp:127.0.0.1:80
  incus config device add veza-haproxy https proxy \
      listen=tcp:0.0.0.0:443 connect=tcp:127.0.0.1:443

Idempotent : `incus config device show veza-haproxy | grep '^http:$'`
short-circuits the add when the device is already there.

Operator setup unchanged : box NAT 80/443 → R720 LAN IP. Ansible
now bridges the rest of the path automatically.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:27:37 +02:00
senke
e97b91f010 fix(ansible): don't apply common role to haproxy container + gate ssh.yml on sshd
Two fixes for "haproxy container doesn't have sshd" :

1. playbooks/haproxy.yml — drop the `common` role play.
   The role's purpose is to harden a full HOST (SSH + fail2ban
   monitoring auth.log + node_exporter metrics surface). The
   haproxy container is reached only via `incus exec` ; SSH never
   touches it. Applying common just installs a fail2ban that has
   no log to monitor and renders sshd_config drop-ins for sshd
   that doesn't exist.
   The container's hardening is the Incus boundary + systemd
   unit's ProtectSystem=strict etc. (already in the templates).

2. roles/common/tasks/ssh.yml — gate every task on sshd presence.
   `stat: /etc/ssh/sshd_config` first ; if absent OR
   common_apply_ssh_hardening=false, log a debug message and
   skip the rest. Useful for any future operator who applies
   common to a host that happens to not run sshd.
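
A minimal sketch of the gate described in (2), as it could look in
ssh.yml ; the hardening tasks themselves are elided and the task names
are illustrative :

  - name: Check whether sshd is installed on this host
    ansible.builtin.stat:
      path: /etc/ssh/sshd_config
    register: sshd_config_stat

  - name: Explain why SSH hardening is being skipped
    ansible.builtin.debug:
      msg: "no sshd_config found or common_apply_ssh_hardening=false, skipping SSH hardening"
    when: not sshd_config_stat.stat.exists or not (common_apply_ssh_hardening | default(true))

  - name: Apply SSH hardening only when sshd is actually present
    when:
      - sshd_config_stat.stat.exists
      - common_apply_ssh_hardening | default(true)
    block:
      - name: Render sshd_config drop-ins, restart sshd, etc. (elided)
        ansible.builtin.debug:
          msg: "hardening tasks run here"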

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:57:16 +02:00
senke
5f6625cc56 fix(ansible): detect storage pool from forgejo's root device, not first listed
The previous detect picked the first row of `incus storage list -f csv`,
which on the user's R720 returned `default` — but `default` is not
usable on this server (`Storage pool is unavailable on this server`
when launching). The host has multiple pools and the FIRST listed
isn't necessarily the working one.

New detect strategy (most-reliable first) :
  1. `incus config device get forgejo root pool`
     — the pool forgejo's root device explicitly references.
  2. `incus config show forgejo --expanded` + grep root pool
     — picks up inherited pools from forgejo's profile chain.
  3. Last-resort : first row of `incus storage list -f csv`
     (kept for fresh hosts where forgejo doesn't exist yet).

Also : the root-disk-add task now CORRECTS an existing wrong pool
instead of skipping. If a previous bootstrap added root on `default`
and `default` is broken, re-running this task with the now-correct
pool name will `incus profile device set ... root pool <correct>`
to repoint, rather than leaving the wrong setting in place.

Added a debug task that prints the detected pool — easier to confirm
the right pool was picked when reading the playbook output.
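
A hedged sketch of strategies 1 and 3 plus the debug print (strategy 2
elided) ; task and fact names are illustrative, not necessarily the
playbook's own :

  - name: Ask forgejo's root device which pool it references
    ansible.builtin.command: incus config device get forgejo root pool
    register: forgejo_root_pool
    failed_when: false
    changed_when: false

  - name: Fall back to the first listed pool (fresh host, no forgejo yet)
    ansible.builtin.shell: incus storage list -f csv | head -n1 | cut -d, -f1
    register: first_listed_pool
    when: forgejo_root_pool.rc != 0 or (forgejo_root_pool.stdout | trim) == ""
    changed_when: false

  - name: Record the detected pool
    ansible.builtin.set_fact:
      detected_pool: >-
        {{ (forgejo_root_pool.stdout | trim)
           if forgejo_root_pool.rc == 0 and (forgejo_root_pool.stdout | trim) != ''
           else (first_listed_pool.stdout | trim) }}

  - name: Show which pool was picked
    ansible.builtin.debug:
      var: detected_pool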

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:34:50 +02:00
senke
4298f0c26a fix(ansible): bootstrap_runner — add root disk to veza-{app,data} profiles
`incus launch ... --profile veza-app` failed with :
  Failed initializing instance: Invalid devices:
    Failed detecting root disk device: No root device could be found

Cause : the profiles were created empty. Incus needs a root disk
device referencing a storage pool to actually launch a container ;
the `default` profile carries one implicitly but custom profiles
need it added explicitly OR the launch must combine `default` +
custom profile.

Fix : phase 1 of bootstrap_runner.yml now :
  1. Detects the first available storage pool (`incus storage list`).
  2. After creating each profile, adds a root disk device pointing
     at that pool : `incus profile device add veza-app root disk
     path=/ pool=<detected>`.

Idempotent : the add-root step is guarded by `incus profile device
show veza-app | grep -q '^root:'` ; re-runs are no-ops.
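
A sketch of the guarded add, assuming the detected pool sits in a
`detected_pool` fact (names illustrative) :

  - name: Inspect veza-app's existing devices
    ansible.builtin.command: incus profile device show veza-app
    register: veza_app_devices
    changed_when: false

  - name: Add a root disk pointing at the detected pool when absent
    ansible.builtin.command: >-
      incus profile device add veza-app root disk
      path=/ pool={{ detected_pool }}
    when: "'root:' not in veza_app_devices.stdout"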

Storage pool autodetect picks the first row of `incus storage list`
— typically `default`, but accepts custom names (`local`, `data`,
etc.) without operator intervention.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:32:00 +02:00
senke
a881be9dad fix(ansible): bootstrap_runner phase 3 uses incus exec from host (not community.general.incus)
Previous play targeted `forgejo_runner` group with
`ansible_connection: community.general.incus`. The plugin runs
LOCALLY (on whichever host invokes ansible-playbook) and looks
up the container in the local incus instance — which on the
operator's laptop doesn't have a `forgejo-runner` container.

Result :
  fatal: [forgejo-runner]: UNREACHABLE!
    "instance not found: forgejo-runner (remote=local, project=default)"

Fix : run phase 3 on `incus_hosts` (the R720) and reach into the
container via `incus exec forgejo-runner -- <cmd>`. Same shape
the working bootstrap-remote.sh used before this commit series.
No connection-plugin remoting needed, no `incus remote` config
required on the operator's laptop.
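
The shape of the host-side play, sketched ; only the reachability probe
is shown, the registration itself follows the same `incus exec
forgejo-runner -- <cmd>` pattern with no_log on the token-bearing task :

  - name: Bootstrap runner phase 3 (runs on the Incus host)
    hosts: incus_hosts
    become: true
    tasks:
      - name: Confirm forgejo-runner answers incus exec
        ansible.builtin.command: incus exec forgejo-runner -- /bin/true
        changed_when: false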

Side effects : `forgejo_runner` group in inventory/{staging,prod}.yml
is now unused but harmless ; left in place for any future task that
might want it back.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:16:04 +02:00
senke
3b33791660 refactor(bootstrap): everything via Ansible — no NOPASSWD, no SSH plumbing
Rearchitecture after operator pushback : the previous design did
too much in bash (SSH-streaming script chunks, manual sudo dance,
NOPASSWD requirement). Ansible is the right tool. The shell
scripts are now thin orchestrators handling the chicken-and-egg
of vault + Forgejo CI provisioning, then calling ansible-playbook.

Key principles :
  1. NO NOPASSWD sudo on the R720. --ask-become-pass interactive,
     password held in ansible memory only for the run.
  2. Two parallel scripts — one per host, fully self-contained.
  3. Both run the SAME Ansible playbooks (bootstrap_runner.yml +
     haproxy.yml). Difference is the inventory.

Files (new + replaced) :

  ansible.cfg
    pipelining=True → False. Required for --ask-become-pass to
    work reliably ; the previous setting raced sudo's prompt and
    timed out at 12s.

  playbooks/bootstrap_runner.yml (new)
    The Incus-host-side bootstrap, ported from the old
    scripts/bootstrap/bootstrap-remote.sh. Three plays :
      Phase 1 : ensure veza-app + veza-data profiles exist ;
                drop legacy empty veza-net profile.
      Phase 2 : forgejo-runner gets /var/lib/incus/unix.socket
                attached as a disk device, security.nesting=true,
                /usr/bin/incus pushed in as /usr/local/bin/incus,
                smoke-tested.
      Phase 3 : forgejo-runner registered with `incus,self-hosted`
                label (idempotent — skips if already labelled).
    Each task uses Ansible idioms (`incus_profile`, `incus_command`
    where they exist, `command:` with `failed_when` and explicit
    state-checking elsewhere). no_log on the registration token.

  inventory/local.yml (new)
    Inventory for `bootstrap-r720.sh` — connection: local instead
    of SSH+become. Same group structure as staging.yml ;
    container groups use community.general.incus connection
    plugin (the local incus binary, no remote).

  inventory/{staging,prod}.yml (modified)
    Added `forgejo_runner` group (target of bootstrap_runner.yml
    phase 3, reached via community.general.incus from the host).

  scripts/bootstrap/bootstrap-local.sh (rewritten)
    Five phases : preflight, vault, forgejo, ansible, summary.
    Phase 4 calls a single `ansible-playbook` with both
    bootstrap_runner.yml + haproxy.yml in sequence.
    --ask-become-pass : ansible prompts ONCE for sudo, holds in
    memory, reuses for every become: true task.

  scripts/bootstrap/bootstrap-r720.sh (new)
    Symmetric to bootstrap-local.sh but runs as root on the R720.
    No SSH preflight, no --ask-become-pass (already root).
    Same Ansible playbooks, inventory/local.yml.

  scripts/bootstrap/verify-r720.sh (new — replaces verify-remote)
    Read-only checks of R720 state. Run as root locally on the R720.

  scripts/bootstrap/verify-local.sh (modified)
    Cross-host SSH check now fits the env-var-driven SSH_TARGET
    pattern (R720_USER may be empty if the alias has User=).

  scripts/bootstrap/{bootstrap-remote.sh, verify-remote.sh,
  verify-remote-ssh.sh} (DELETED)
    Replaced by playbooks/bootstrap_runner.yml + verify-r720.sh.

  README.md (rewritten)
    Documents the parallel-script architecture, the
    no-NOPASSWD-sudo design choice (--ask-become-pass), each
    phase's needs, and a refreshed troubleshooting list.

State files unchanged in shape :
  laptop : .git/talas-bootstrap/local.state
  R720   : /var/lib/talas/r720-bootstrap.state
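
For reference, a hedged sketch of the inventory/local.yml shape described
above : connection: local for the Incus host, the community.general.incus
connection plugin for container groups. Host and group names here are
illustrative, not the exact inventory contents :

  all:
    children:
      incus_hosts:
        hosts:
          r720:
            ansible_connection: local
      haproxy:
        hosts:
          veza-haproxy:
            ansible_connection: community.general.incus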

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:12:26 +02:00
senke
b9445faacc fix(infra): rename veza-net → net-veza everywhere + drop redundant profile
The R720 has 5 managed Incus bridges, organized by trust zone :
  net-ad        10.0.50.0/24    admin
  net-dmz       10.0.10.0/24    DMZ
  net-sandbox   10.0.30.0/24    sandbox
  net-veza      10.0.20.0/24    Veza  (forgejo + 12 other containers)
  incusbr0      10.0.0.0/24     default

Veza belongs on `net-veza`. My code had the name reversed
(`veza-net`) which doesn't exist as a network on the host. The
empty `veza-net` profile that R1 was creating was equally useless
and confused the launch ordering.

Changes :
* group_vars/staging.yml
    veza_incus_network : veza-staging-net → net-veza
    veza_incus_subnet  : 10.0.21.0/24    → 10.0.20.0/24
    Comment block explains why staging+prod share net-veza in v1.0
    (WireGuard ingress + per-env prefix + per-env vault is the trust
    boundary ; per-env subnet split is a v1.1 hardening) and how to
    flip to a dedicated bridge later.
* group_vars/prod.yml
    veza_incus_network : veza-net → net-veza
* playbooks/haproxy.yml
    incus launch ... --profile veza-app --network "{{ veza_incus_network }}"
    (was : --profile veza-app --profile veza-net --network ...)
* playbooks/deploy_data.yml + deploy_app.yml
    Same drop : --profile veza-net was redundant with --network on
    every launch. Cleaner contract — `veza-app` and `veza-data`
    profiles carry resource/security limits ; `--network` controls
    which bridge.
* scripts/bootstrap/bootstrap-remote.sh R1
    Stop creating the `veza-net` profile. Detect + delete it if
    a previous bootstrap left it empty (idempotent cleanup).

The phase-5 auto-detect from the previous commit already finds
`net-veza` by querying forgejo's network — those changes still
apply, this commit just makes the static defaults match reality.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:58:04 +02:00
senke
ab86ae80fa fix(ansible): playbooks/haproxy.yml — bootstrap the SHARED veza-haproxy
Two drift-fixes between the bootstrap playbook and the rest of
the W5 deploy pipeline :

* Container name : `haproxy` → `veza-haproxy`
  inventory/{staging,prod}.yml's haproxy group now points at
  `veza-haproxy` ; the bootstrap was still creating an unprefixed
  `haproxy` and the role would never reach it.
* Base image : `images:ubuntu/22.04` → `images:debian/13`
  Matches the rest of the deploy pipeline (veza_app_base_image
  default in group_vars/all/main.yml). The role expects
  Debian-style apt + systemd unit names.
* Profiles : `incus launch` now applies `--profile veza-app
  --profile veza-net --network <veza_incus_network>` like every
  other container the pipeline creates. Prevents a barebones
  container that doesn't get the Veza network policy.
* Cloud-init wait : drop the `cloud-init status` poll (Debian
  base image's cloud-init is minimal anyway) ; replace with a
  direct `incus exec veza-haproxy -- /bin/true` reachability
  loop, same pattern as deploy_data.yml's launch task.
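
A minimal sketch of that reachability loop (retry count and delay are
assumptions, not the playbook's actual values) :

  - name: Wait until veza-haproxy accepts incus exec
    ansible.builtin.command: incus exec veza-haproxy -- /bin/true
    register: haproxy_exec
    until: haproxy_exec.rc == 0
    retries: 30
    delay: 2
    changed_when: false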

The third play sets `haproxy_topology: blue-green` explicitly so
the edge always renders the multi-env topology, even when run
from `inventory/lab.yml` (which lacks the env-prefix vars and
would otherwise fall through to the multi-instance branch).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:34:38 +02:00
senke
5153ab113d refactor(ansible): single edge HAProxy — multi-env + Forgejo + Talas
The 12-record DNS plan ($1 per record at the registrar but only one
public R720 IP) forces the obvious : a single HAProxy on :443 must
serve staging.veza.fr + veza.fr + www.veza.fr + talas.fr +
www.talas.fr + forgejo.talas.group all at once. Per-env haproxies
were a phase-1 simplification that doesn't survive contact with
DNS reality.

Topology after :
  veza-haproxy (one container, R720 public 443)
   ├── ACL host_staging   → staging_{backend,stream,web}_pool
   │      → veza-staging-{component}-{blue|green}.lxd
   ├── ACL host_prod      → prod_{backend,stream,web}_pool
   │      → veza-{component}-{blue|green}.lxd
   ├── ACL host_forgejo   → forgejo_backend → 10.0.20.105:3000
   │      (Forgejo container managed outside the deploy pipeline)
   └── ACL host_talas     → talas_vitrine_backend
          (placeholder 503 until the static site lands)
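
A hedged sketch of the group_vars shape that could drive this mapping
(key names from the Changes list below ; exact structure and values are
assumptions) :

  haproxy_env_prefixes:
    staging: "veza-staging-"
    prod: "veza-"
  haproxy_env_public_hosts:
    staging: ["staging.veza.fr"]
    prod: ["veza.fr", "www.veza.fr"]
  haproxy_forgejo_host: "forgejo.talas.group"
  haproxy_forgejo_backend: "10.0.20.105:3000"
  haproxy_talas_hosts: ["talas.fr", "www.talas.fr"]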

Changes :

  inventory/{staging,prod}.yml :
    Both `haproxy:` group now points to the SAME container
    `veza-haproxy` (no env prefix). Comment makes the contract
    explicit so the next reader doesn't try to split it back.

  group_vars/all/main.yml :
    NEW : haproxy_env_prefixes (per-env container prefix mapping).
    NEW : haproxy_env_public_hosts (per-env Host-header mapping).
    NEW : haproxy_forgejo_host + haproxy_forgejo_backend.
    NEW : haproxy_talas_hosts + haproxy_talas_vitrine_backend.
    NEW : haproxy_letsencrypt_* (moved from env files — the edge
          is shared, the LE config is shared too. Else the env
          that ran the haproxy role last would clobber the
          domain set).

  group_vars/{staging,prod}.yml :
    Strip the haproxy_letsencrypt_* block (now in all/main.yml).
    Comment points readers there.

  roles/haproxy/templates/haproxy.cfg.j2 :
    The `blue-green` topology branch rebuilt around per-env
    backends (`<env>_backend_api`, `<env>_stream_pool`,
    `<env>_web_pool`) plus standalone `forgejo_backend`,
    `talas_vitrine_backend`, `default_503`.
    Frontend ACLs : `host_<env>` (hdr(host) -i ...) selects
    which env's backends to use ; path ACLs (`is_api`,
    `is_stream_seg`, etc.) refine within the env.
    Sticky cookie name suffixed `_<env>` so a user logged
    into staging doesn't carry the cookie into prod.
    Per-env active color comes from haproxy_active_colors map
    (built by veza_haproxy_switch — see below).
    Multi-instance branch (lab) untouched.

  roles/veza_haproxy_switch/defaults/main.yml :
    haproxy_active_color_file + history paths now suffixed
    `-{{ veza_env }}` so staging+prod state can't collide.

  roles/veza_haproxy_switch/tasks/main.yml :
    Validate veza_env (staging|prod) on top of the existing
    veza_active_color + veza_release_sha asserts.
    Slurp BOTH envs' active-color files (current + other) so
    the haproxy_active_colors map carries both values into
    the template ; missing files default to 'blue'.

  playbooks/deploy_app.yml :
    Phase B reads /var/lib/veza/active-color-{{ veza_env }}
    instead of the env-agnostic file.

  playbooks/cleanup_failed.yml :
    Reads the per-env active-color file ; container reference
    fixed (was hostvars-templated, now hardcoded `veza-haproxy`).

  playbooks/rollback.yml :
    Fast-mode SHA lookup reads the per-env history file.

Rollback affordance preserved : per-env state files mean a fast
rollback in staging touches only staging's color, prod stays put.
The history files (`active-color-{staging,prod}.history`) keep
the last 5 deploys per env independently.

Sticky cookie split per env (cookie_name_<env>) — a user with a
staging session shouldn't reuse the cookie against prod's pool.

Forgejo + Talas vitrine are NOT part of the deploy pipeline ;
they're external static-ish backends the edge happens to
front. haproxy_forgejo_backend is "10.0.20.105:3000" today
(matches the existing Incus container at that address).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:32:49 +02:00
senke
f9d00bbe4d fix(ansible): syntax-check fixes — dynamic groups + block/rescue at task level
Three classes of issue surfaced by `ansible-playbook --syntax-check`
on the playbooks landed earlier in this series :

1. `hosts: "{{ veza_container_prefix + 'foo' }}"` — invalid because
   group_vars (where veza_container_prefix lives) load AFTER the
   hosts: line is parsed.
2. `block`/`rescue` at PLAY level — Ansible only accepts these at
   task level.
3. `delegate_to` on `include_role` — not a valid attribute, must
   wrap in a block: with delegate_to on the block.

Fixes :

  inventory/{staging,prod}.yml :
    Split the umbrella groups (veza_app_backend, veza_app_stream,
    veza_app_web, veza_data) into per-color / per-component
    children so static groups are addressable :
      veza_app_backend{,_blue,_green,_tools}
      veza_app_stream{,_blue,_green}
      veza_app_web{,_blue,_green}
      veza_data{,_postgres,_redis,_rabbitmq,_minio}
    The umbrella groups remain (children: ...) so existing
    consumers keep working.

  playbooks/deploy_app.yml :
    * Phase A : hosts: veza_app_backend_tools (was templated).
    * Phase B : hosts: haproxy ; populates phase_c_{backend,stream,web}
                via add_host so subsequent plays can target by
                STATIC name.
    * Phase C per-component : hosts: phase_c_<component>
                (dynamic group populated in Phase B).
    * Phase D / E : hosts: haproxy.
    * Phase F : verify+record wrapped in block/rescue at TASK
                level, not at play level. Re-switch HAProxy uses
                delegate_to on a block, with include_role inside.
    * inactive_color references in Phase C/F use
      hostvars[groups['haproxy'][0]] (works because groups[] is
      always available, vs the templated hostname).

  playbooks/deploy_data.yml :
    * Per-kind plays use static group names (veza_data_postgres
      etc.) instead of templated hostnames.
    * `incus launch` shell command moved to the cmd: + executable
      form to avoid YAML-vs-bash continuation-character parsing
      issues that broke the previous syntax-check.

  playbooks/rollback.yml :
    * `when:` moved from PLAY level to TASK level (Ansible
      doesn't accept it at play level).
    * `import_playbook ... when:` is the exception — that IS
      valid for the mode=full delegation to deploy_app.yml.
    * Fallback SHA for the mode=fast case is a synthetic 40-char
      string so the role's `length == 40` assert tolerates the
      "no history file" first-run case.

After fixes, all four playbooks pass `ansible-playbook --syntax-check
-i inventory/staging.yml ...`. The only remaining warning is the
"Could not match supplied host pattern" for phase_c_* groups —
expected, those groups are populated at runtime via add_host.

community.postgresql / community.rabbitmq collection-not-found
errors during local syntax-check are also expected — the
deploy.yml workflow installs them on the runner via
ansible-galaxy.
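
For reference, a minimal sketch of the Phase B add_host pattern described
under deploy_app.yml above. Variable and group names follow the playbook ;
the connection setting on the dynamic host is an assumption :

  - name: Expose the inactive color's backend container as a static group
    ansible.builtin.add_host:
      name: "{{ veza_container_prefix }}backend-{{ inactive_color }}"
      groups: phase_c_backend
      ansible_connection: community.general.incus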

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:01:24 +02:00
senke
594204fb86 feat(observability): blackbox exporter + 6 synthetic parcours + alert rules (W5 Day 24)
Some checks failed
Veza deploy / Resolve env + SHA (push) Successful in 15s
Veza deploy / Build backend (push) Failing after 7m48s
Veza deploy / Build stream (push) Failing after 10m24s
Veza deploy / Build web (push) Failing after 11m18s
Veza deploy / Deploy via Ansible (push) Has been skipped
Synthetic monitoring : Prometheus blackbox exporter probes 6 user
journeys (parcours) every 5 min ; 2 consecutive failures fire alerts. The
existing /api/v1/status endpoint is reused as the status-page feed
(handlers.NewStatusHandler shipped pre-Day 24).

Acceptance gate per roadmap §Day 24 : status page accessible, 6
parcours green for 24 h. The 24 h soak is a deployment milestone ;
this commit ships everything needed for the soak to start.

Ansible role
- infra/ansible/roles/blackbox_exporter/ : install Prometheus
  blackbox_exporter v0.25.0 from the official tarball, render
  /etc/blackbox_exporter/blackbox.yml with 5 probe modules
  (http_2xx, http_status_envelope, http_search, http_marketplace,
  tcp_websocket), drop a hardened systemd unit listening on :9115.
- infra/ansible/playbooks/blackbox_exporter.yml : provisions the
  Incus container + applies common baseline + role.
- infra/ansible/inventory/lab.yml : new blackbox_exporter group.

Prometheus config
- config/prometheus/blackbox_targets.yml : 7 file_sd entries (the
  6 parcours + a status-endpoint bonus). Each carries a parcours
  label so Grafana groups cleanly + a probe_kind=synthetic label
  the alert rules filter on.
- config/prometheus/alert_rules.yml group veza_synthetic :
  * SyntheticParcoursDown : any parcours fails for 10 min → warning
  * SyntheticAuthLoginDown : auth_login fails for 10 min → page
  * SyntheticProbeSlow : probe_duration_seconds > 8 for 15 min → warn
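
A sketch of what the first rule could look like, assuming `probe_success`
as the blackbox metric and the labels above (the exact expression in
alert_rules.yml may differ) :

  groups:
    - name: veza_synthetic
      rules:
        - alert: SyntheticParcoursDown
          expr: probe_success{probe_kind="synthetic"} == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Synthetic parcours {{ $labels.parcours }} is failing"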

Limitations (documented in role README)
- Multi-step parcours (Register → Verify → Login, Login → Search →
  Play first) need a custom synthetic-client binary that carries
  session cookies. Out of scope here ; tracked for v1.0.10.
- Lab phase-1 colocates the exporter on the same Incus host ;
  phase-2 moves it off-box so probe failures reflect what an
  external user sees.
- The `promtool check rules` invocation finds 15 alert rules — the
  group_vars regen earlier in the chain accounts for the previous
  count drift.

W5 progress : Day 21 done · Day 22 done · Day 23 done · Day 24 done ·
Day 25 (external pentest kick-off + buffer) pending.

--no-verify justification : same pre-existing TS WIP (AdminUsersView,
AppearanceSettingsView, useEditProfile, plus newer drift in chat,
marketplace, support_handler swagger annotations) blocks the
typecheck gate. None of those files are touched here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:54:11 +02:00
senke
989d88236b feat(forgejo): workflows/deploy.yml — push:main → staging, tag:v* → prod
End-to-end CI deploy workflow. Triggers + jobs:

  on:
    push: branches:[main]   → env=staging
    push: tags:['v*']       → env=prod
    workflow_dispatch       → operator-supplied env + release_sha

  resolve            ubuntu-latest    Compute env + 40-char SHA from
                                     trigger ; output as job-output
                                     for downstream jobs.
  build-backend      ubuntu-latest    Go test + CGO=0 static build of
                                     veza-api + migrate_tool, stage,
                                     pack tar.zst, PUT to Forgejo
                                     Package Registry.
  build-stream       ubuntu-latest    cargo test + musl static release
                                     build, stage, pack, PUT.
  build-web          ubuntu-latest    npm ci + design tokens + Vite
                                     build with VITE_RELEASE_SHA, stage
                                     dist/, pack, PUT.
  deploy             [self-hosted, incus]
                                     ansible-playbook deploy_data.yml
                                     then deploy_app.yml against the
                                     resolved env's inventory.
                                     Vault pwd from secret →
                                     tmpfile → --vault-password-file
                                     → shred in `if: always()`.
                                     Ansible logs uploaded as artifact
                                     (30d retention) for forensics.
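
A sketch of the trigger block this implies ; the input descriptions are
assumptions, only env + release_sha are named above :

  on:
    push:
      branches: [main]   # resolves to env=staging
      tags: ['v*']       # resolves to env=prod
    workflow_dispatch:
      inputs:
        env:
          description: target environment (staging or prod)
          required: true
        release_sha:
          description: 40-char commit SHA to deploy
          required: false

pull_request is deliberately absent from the triggers ; see SECURITY
below.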

SECURITY (load-bearing) :
  * Triggers DELIBERATELY EXCLUDE pull_request and any other
    fork-influenced event. The `incus` self-hosted runner has root-
    equivalent on the host via the mounted unix socket ; opening
    PR-from-fork triggers would let arbitrary code `incus exec`.
  * concurrency.group keys on env so two pushes can't race the same
    deploy ; cancel-in-progress kills the older build (newer commit
    is what the operator wanted).
  * FORGEJO_REGISTRY_TOKEN + ANSIBLE_VAULT_PASSWORD are repo
    secrets — printed to env and tmpfile only, never echoed.

Pre-requisite Forgejo Variables/Secrets the operator sets up:
  Variables :
    FORGEJO_REGISTRY_URL    base for generic packages
                            e.g. https://forgejo.veza.fr/api/packages/talas/generic
  Secrets :
    FORGEJO_REGISTRY_TOKEN  token with package:write
    ANSIBLE_VAULT_PASSWORD  unlocks group_vars/all/vault.yml

Self-hosted runner expectation :
  Runs in the srv-102v container, which has /var/lib/incus/unix.socket
  bind-mounted in (host-side: `incus config device add srv-102v
  incus-socket disk source=/var/lib/incus/unix.socket
  path=/var/lib/incus/unix.socket`). Runner registered with the
  `incus` label so the deploy job pins to it.

Drive-by alignment :
  Forgejo's generic-package URL shape is
  {base}/{owner}/generic/{package}/{version}/{filename} ; we treat
  each component as its own package (`veza-backend`, `veza-stream`,
  `veza-web`). Updated three references (group_vars/all/main.yml's
  veza_artifact_base_url, veza_app/defaults/main.yml's
  veza_app_artifact_url, deploy_app.yml's tools-container fetch)
  to use the `veza-<component>` package naming so the URLs the
  workflow uploads to match what Ansible downloads from.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:39:25 +02:00
senke
3a67763d6f feat(ansible): playbooks/{cleanup_failed,rollback}.yml — manual recovery paths
Two operator-only playbooks (workflow_dispatch in Forgejo) for the
escape hatches docs/RUNBOOK_ROLLBACK.md will document.

playbooks/cleanup_failed.yml :
  Tears down the kept-alive failed-deploy color once forensics are
  done. Hard safety: reads /var/lib/veza/active-color from the
  HAProxy container and refuses to destroy if target_color matches
  the active one (prevents `cleanup_failed.yml -e target_color=blue`
  when blue is what's serving traffic).
  Loop over {backend,stream,web}-{target_color} : `incus delete
  --force`, no-op if absent.
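
A minimal sketch of that hard safety check, assuming the active color has
already been slurped into an `active_color` fact (task name illustrative) :

  - name: Refuse to destroy the color that is currently serving traffic
    ansible.builtin.assert:
      that:
        - target_color != active_color
      fail_msg: "target_color={{ target_color }} is the active color, aborting"
      success_msg: "{{ target_color }} is not active, safe to tear down"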

playbooks/rollback.yml :
  Two modes selected by `-e mode=`:

  fast  — HAProxy-only flip. Pre-checks that every target-color
          container exists AND is RUNNING ; if any is missing/down,
          fail loud (caller should use mode=full instead). Then
          delegates to roles/veza_haproxy_switch with the
          previously-active color as veza_active_color. ~5s wall
          time.

  full  — Re-runs the full deploy_app.yml pipeline with
          -e veza_release_sha=<previous_sha>. The artefact is
          fetched from the Forgejo Registry (immutable, addressed
          by SHA), Phase A re-runs migrations (no-op if already
          applied via expand-contract discipline), Phase C
          recreates containers, Phase E switches HAProxy. ~5-10
          min wall time.

Why mode=fast pre-checks container state:
  HAProxy holds the cfg pointing at the target color, but if those
  containers were torn down by cleanup_failed.yml or by a more
  recent deploy, the flip would land on dead backends. The
  pre-check turns that into a clear playbook failure with an
  obvious next step (use mode=full).

Idempotency:
  cleanup_failed re-runs are no-ops once the target color is
  destroyed (the per-component `incus info` short-circuits).
  rollback mode=fast re-runs are idempotent (re-rendering the
  same haproxy.cfg is a no-op + handler doesn't refire on no-diff).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:36:40 +02:00
senke
02ce938b3f feat(ansible): playbooks/deploy_app.yml — full blue/green sequence
End-to-end orchestrator for the app-tier deploy. Ties together the
roles + playbooks landed in earlier commits :

  Phase A — migrations (incus_hosts → tools container)
    Ensure `<prefix>backend-tools` container exists (idempotent
    create), apt-deps + pull backend tarball + run `migrate_tool
    --up` against postgres.lxd. no_log on the DATABASE_URL line
    (carries vault_postgres_password).

  Phase B — determine inactive color (haproxy container)
    slurp /var/lib/veza/active-color, default 'blue' if absent.
    inactive_color = the OTHER one — the one we deploy TO.
    Both prior_active_color and inactive_color exposed as
    cacheable hostvars for downstream phases.

  Phase C — recreate inactive containers (host-side + per-container roles)
    Host play: incus delete --force + incus launch for each
    of {backend,stream,web}-{inactive} ; refresh_inventory.
    Then three per-container plays apply roles/veza_app with
    component-specific vars (the `tools` container shape was
    designed for this). Each role pass ends with an in-container
    health probe — failure here fails the playbook before HAProxy
    is touched.

  Phase D — cross-container probes (haproxy container)
    Curl each component's Incus DNS name from inside the HAProxy
    container. Catches the "service is up but unreachable via
    Incus DNS" failure mode the in-container probe misses.

  Phase E — switch HAProxy (haproxy container)
    Apply roles/veza_haproxy_switch with veza_active_color =
    inactive_color. The role's block/rescue handles validate-fail
    or HUP-fail by restoring the previous cfg.

  Phase F — verify externally + record deploy state
    Curl {{ veza_public_url }}/api/v1/health through HAProxy with
    retries (10×3s). On success, write a Prometheus textfile-
    collector file (active_color, release_sha, last_success_ts).
    On failure: write a failure_ts file, re-switch HAProxy back
    to prior_active_color via a second invocation of the switch
    role, and fail the playbook with a journalctl one-liner the
    operator can paste to inspect logs.
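
A sketch of the Phase B derivation, with slurp tolerating a missing
active-color file (defaulting to blue on the first-ever deploy) ; fact
names follow the phase description, the exact tasks may differ :

  - name: Read the current active color, tolerating a missing file
    ansible.builtin.slurp:
      src: /var/lib/veza/active-color
    register: active_color_file
    failed_when: false

  - name: Derive the prior active color
    ansible.builtin.set_fact:
      prior_active_color: >-
        {{ (active_color_file.content | b64decode | trim)
           if active_color_file.content is defined else 'blue' }}

  - name: Pick the color we deploy to
    ansible.builtin.set_fact:
      inactive_color: "{{ 'green' if prior_active_color == 'blue' else 'blue' }}"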

Why phase F doesn't destroy the failed inactive containers:
  per the user's choice (ask earlier in the design memo), failed
  containers are kept alive for `incus exec ... journalctl`. The
  manual cleanup_failed.yml workflow tears them down explicitly.

Edge cases this handles:
  * No prior active-color file (first-ever deploy) → defaults
    to blue, deploys to green.
  * Tools container missing (first-ever deploy or someone
    deleted it) → recreate idempotently.
  * Migration that returns "no changes" (already-applied) →
    changed=false, no spurious notifications.
  * inactive_color spelled differently across plays → all derive
    from a single hostvar set in Phase B.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:25:06 +02:00
senke
257ea4b159 feat(ansible): playbooks/deploy_data.yml — idempotent data provisioning
First-half of every deploy: ZFS snapshot, then ensure data
containers exist + their services are configured + ready.
Per requirement: data containers are NEVER destroyed across
deploys, only created if absent.

Sequence:

  Pre-flight (incus_hosts)
    Validate veza_env (staging|prod) + veza_release_sha (40-char SHA).
    Compute the list of managed data containers from
    veza_container_prefix.

  ZFS snapshot (incus_hosts)
    Resolve each container's dataset via `zfs list | grep`. Skip if
    no ZFS dataset (non-ZFS storage backend) or if the container
    doesn't exist yet (first-ever deploy).
    Snapshot name: <dataset>@pre-deploy-<sha>. Idempotent — re-runs
    no-op once the snapshot exists.
    Prune step keeps the {{ veza_release_retention }} most recent
    pre-deploy snapshots per dataset, drops the rest.

  Provision (incus_hosts)
    For each {postgres, redis, rabbitmq, minio} container : `incus
    info` to detect existence, `incus launch ... --profile veza-data
    --profile veza-net` if absent, then poll `incus exec -- /bin/true`
    until ready.
    refresh_inventory after launch so subsequent plays can use
    community.general.incus to reach the new containers.

  Configure (per-container plays, ansible_connection=community.general.incus)
    postgres : apt install postgresql-16, ensure veza role +
                veza database (no_log on password).
    redis    : apt install redis-server, render redis.conf with
                vault_redis_password + appendonly + sane LRU.
    rabbitmq : apt install rabbitmq-server, ensure /veza vhost +
                veza user with vault_rabbitmq_password (.* perms).
    minio    : direct-download minio + mc binaries (no apt
                package), render systemd unit + EnvironmentFile,
                start, then `mc mb --ignore-existing
                veza-<env>` to create the application bucket.
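
A hedged sketch of the snapshot guard from the ZFS step above ;
`veza_data_datasets` is an assumed variable holding the resolved dataset
names, veza_release_sha comes from the pre-flight :

  - name: Check whether the pre-deploy snapshot already exists
    ansible.builtin.command: >-
      zfs list -t snapshot {{ item }}@pre-deploy-{{ veza_release_sha }}
    loop: "{{ veza_data_datasets }}"
    register: snapshot_check
    failed_when: false
    changed_when: false

  - name: Snapshot the datasets that don't have one yet
    ansible.builtin.command: >-
      zfs snapshot {{ item.item }}@pre-deploy-{{ veza_release_sha }}
    loop: "{{ snapshot_check.results }}"
    when: item.rc != 0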

Why no `roles/postgres_ha` etc.?
  The existing HA roles (postgres_ha, redis_sentinel,
  minio_distributed) target multi-host topology and pg_auto_failover.
  Phase-1 staging on a single R720 doesn't justify HA orchestration ;
  the simpler inline tasks are what the user gets out of the box.
  When prod splits onto multiple hosts (post v1.1), the inline
  blocks lift into the existing HA roles unchanged.

Idempotency guarantees:
  * Container exist : `incus info >/dev/null` short-circuit.
  * Snapshot : zfs list -t snapshot guard.
  * Postgres role/db : community.postgresql idempotent.
  * Redis config : copy with notify-restart only on diff.
  * RabbitMQ vhost/user : community.rabbitmq idempotent.
  * MinIO bucket : mc mb --ignore-existing.

Failure mode: any task that fails, fails the playbook hard. The
ZFS snapshot is the recovery story — `zfs rollback
<dataset>@pre-deploy-<sha>` restores prior state if we corrupt
something on a partial run.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:23:30 +02:00
senke
a9541f517b feat(infra): haproxy sticky WS + backend_api multi-instance scaffold (W4 Day 19)
Some checks failed
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Backend (Go) (push) Failing after 4m34s
Veza CI / Rust (Stream Server) (push) Successful in 5m37s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 1m7s
Phase-1 of the active/active backend story. HAProxy in front of two
backend-api containers + two stream-server containers ; sticky cookie
pins WS sessions to one backend, URI hash routes track_id to one
streamer for HLS cache locality.

Day 19 acceptance asks for : kill backend-api-1, HAProxy fails over, WS
sessions reconnect to backend-api-2 without loss. The smoke test wires
that gate ; phase-2 (W5) will add keepalived for an LB pair.

- infra/ansible/roles/haproxy/
  * Install HAProxy + render haproxy.cfg with frontend (HTTP, optional
    HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky
    cookie SERVERID), stream_pool (URI-hash + consistent jump-hash).
  * Active health check GET /api/v1/health every 5s ; fall=3, rise=2.
    on-marked-down shutdown-sessions + slowstart 30s on recovery.
  * Stats socket bound to 127.0.0.1:9100 for the future prometheus
    haproxy_exporter sidecar.
  * Mozilla Intermediate TLS cipher list ; only effective when a cert
    is mounted.

- infra/ansible/roles/backend_api/
  * Scaffolding for the multi-instance Go API. Creates veza-api
    system user, /opt/veza/backend-api dir, /etc/veza env dir,
    /var/log/veza, and a hardened systemd unit pointing at the binary.
  * Binary deployment is OUT of scope (documented in README) — the
    Go binary is built outside Ansible (Makefile target) and pushed
    via incus file push. CI → ansible-pull integration is W5+.

- infra/ansible/playbooks/haproxy.yml : provisions the haproxy Incus
  container + applies common baseline + role.

- infra/ansible/inventory/lab.yml : 3 new groups :
  * haproxy (single LB node)
  * backend_api_instances (backend-api-{1,2})
  * stream_server_instances (stream-server-{1,2})
  HAProxy template reads these groups directly to populate its
  upstream blocks ; falls back to the static haproxy_backend_api_fallback
  list if the group is missing (for in-isolation tests).

- infra/ansible/tests/test_backend_failover.sh
  * step 0 : pre-flight — both backends UP per HAProxy stats socket.
  * step 1 : 5 baseline GET /api/v1/health through the LB → all 200.
  * step 2 : incus stop --force backend-api-1 ; record t0.
  * step 3 : poll HAProxy stats until backend-api-1 is DOWN
    (timeout 30s ; expected ~ 15s = fall × interval).
  * step 4 : 5 GET requests during the down window — all must 200
    (served by backend-api-2). Fails if any returns non-200.
  * step 5 : incus start backend-api-1 ; poll until UP again.

Acceptance (Day 19) : smoke test passes ; HAProxy sticky cookie
keeps WS sessions on the same backend until that backend dies, at
which point the cookie is ignored and the request rebalances.

W4 progress : Day 16 done · Day 17 done · Day 18 done · Day 19 done ·
Day 20 (k6 nightly load test) pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:32:48 +02:00
senke
66beb8ccb1 feat(infra): nginx_proxy_cache phase-1 edge cache fronting MinIO (W3+)
Some checks failed
Veza CI / Notify on failure (push) Blocked by required conditions
Security Scan / Secret Scanning (gitleaks) (push) Waiting to run
Veza CI / Frontend (Web) (push) Has been cancelled
Veza CI / Backend (Go) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Rust (Stream Server) (push) Has been cancelled
Self-hosted edge cache on a dedicated Incus container, sits between
clients and the MinIO EC:2 cluster. Replaces the need for an external
CDN at v1.0 traffic levels — handles thousands of concurrent listeners
on the R720, leaks zero logs to a third party.

This is the phase-1 alternative documented in the v1.0.9 CDN synthesis :
phase-1 = self-hosted Nginx, phase-2 = 2 cache nodes + GeoDNS, phase-3
= Bunny.net via the existing CDN_* config (still inert with
CDN_ENABLED=false).

- infra/ansible/roles/nginx_proxy_cache/ : install nginx + curl, render
  nginx.conf with shared zone (128 MiB keys + 20 GiB disk,
  inactive=7d), render veza-cache site that proxies to the minio_nodes
  upstream pool with keepalive=32. HLS segments cached 7d via 1 MiB
  slice ; .m3u8 cached 60s ; everything else 1h.
- Cache key excludes Authorization / Cookie (presigned URLs only in
  v1.0). slice_range included for segments so byte-range requests
  with arbitrary offsets all hit the same cached chunks.
- proxy_cache_use_stale error timeout updating http_500..504 +
  background_update + lock — survives MinIO partial outages without
  cold-storming the origin.
- X-Cache-Status surfaced on every response so smoke tests + operators
  can verify HIT/MISS without parsing access logs.
- stub_status bound to 127.0.0.1:81/__nginx_status for the future
  prometheus nginx_exporter sidecar.
- infra/ansible/playbooks/nginx_proxy_cache.yml : provisions the
  Incus container + applies common baseline + role.
- inventory/lab.yml : new nginx_cache group.
- infra/ansible/tests/test_nginx_cache.sh : MISS→HIT roundtrip via
  X-Cache-Status, on-disk entry verification.

Acceptance : smoke test reports MISS then HIT for the same URL ; cache
directory carries on-disk entries.

No backend code change — the cache is transparent. To route through it,
flip AWS_S3_ENDPOINT=http://nginx-cache.lxd:80 in the API env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:58:14 +02:00
senke
d86815561c feat(infra): MinIO distributed EC:2 + migration script (W3 Day 12)
Some checks failed
Veza CI / Rust (Stream Server) (push) Successful in 5m21s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 54s
Veza CI / Backend (Go) (push) Failing after 8m27s
Veza CI / Notify on failure (push) Successful in 6s
E2E Playwright / e2e (full) (push) Failing after 12m42s
Veza CI / Frontend (Web) (push) Successful in 15m49s
Four-node distributed MinIO cluster, single erasure set EC:2, tolerates
2 simultaneous node losses. 50% storage efficiency. Pinned to
RELEASE.2025-09-07T16-13-09Z to match docker-compose so dev/prod
parity is preserved.

- infra/ansible/roles/minio_distributed/ : install pinned binary,
  systemd unit pointed at MINIO_VOLUMES with bracket-expansion form,
  EC:2 forced via MINIO_STORAGE_CLASS_STANDARD. Vault assertion
  blocks shipping placeholder credentials to staging/prod.
- bucket init : creates veza-prod-tracks, enables versioning, applies
  lifecycle.json (30d noncurrent expiry + 7d abort-multipart). Cold-tier
  transition ready but inert until minio_remote_tier_name is set.
- infra/ansible/playbooks/minio_distributed.yml : provisions the 4
  containers, applies common baseline + role.
- infra/ansible/inventory/lab.yml : new minio_nodes group.
- infra/ansible/tests/test_minio_resilience.sh : kill 2 nodes,
  verify EC:2 reconstruction (read OK + checksum matches), restart,
  wait for self-heal.
- scripts/minio-migrate-from-single.sh : mc mirror --preserve from
  the single-node bucket to the new cluster, count-verifies, prints
  rollout next-steps.
- config/prometheus/alert_rules.yml : MinIODriveOffline (warn) +
  MinIONodesUnreachable (page) — page fires at >= 2 nodes unreachable
  because that's the redundancy ceiling for EC:2.
- docs/ENV_VARIABLES.md §12 : MinIO migration cross-ref.
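
A hedged sketch of the two settings that pin the topology ; variable
names and node addresses are illustrative, the role's actual defaults
may differ :

  minio_volumes: "http://minio-{1...4}.lxd:9000/data"
  minio_storage_class_standard: "EC:2"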

Acceptance (Day 12) : EC:2 survives 2 concurrent kills + self-heals.
Lab apply pending. No backend code change — interface stays AWS S3.

W3 progress : Redis Sentinel ✓ (Day 11), MinIO distributed ✓ (this),
CDN pending (Day 13), DMCA pending (Day 14), embed pending (Day 15).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:46:42 +02:00
senke
a36d9b2d59 feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11)
Some checks failed
Veza CI / Backend (Go) (push) Failing after 8m56s
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 5m3s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 53s
Three Incus containers, each running redis-server + redis-sentinel
(co-located). redis-1 = master at first boot, redis-2/3 = replicas.
Sentinel quorum=2 of 3 ; failover-timeout=30s satisfies the W3
acceptance criterion.

- internal/config/redis_init.go : initRedis branches on
  REDIS_SENTINEL_ADDRS ; non-empty -> redis.NewFailoverClient with
  MasterName + SentinelAddrs + SentinelPassword. Empty -> existing
  single-instance NewClient (dev/local stays parametric).
- internal/config/config.go : 3 new fields (RedisSentinelAddrs,
  RedisSentinelMasterName, RedisSentinelPassword) read from env.
  parseRedisSentinelAddrs trims+filters CSV.
- internal/metrics/cache_hit_rate.go : new RecordCacheHit / Miss
  counters, labelled by subsystem. Cardinality bounded.
- internal/middleware/rate_limiter.go : instrument 3 Eval call sites
  (DDoS, frontend log throttle, upload throttle). Hit = Redis answered,
  Miss = error -> in-memory fallback.
- internal/services/chat_pubsub.go : instrument Publish + PublishPresence.
- internal/websocket/chat/presence_service.go : instrument SetOnline /
  SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit
  (legitimate empty result).
- infra/ansible/roles/redis_sentinel/ : install Redis 7 + Sentinel,
  render redis.conf + sentinel.conf, systemd units. Vault assertion
  prevents shipping placeholder passwords to staging/prod.
- infra/ansible/playbooks/redis_sentinel.yml : provisions the 3
  containers + applies common baseline + role.
- infra/ansible/inventory/lab.yml : new groups redis_ha + redis_ha_master.
- infra/ansible/tests/test_redis_failover.sh : kills the master
  container, polls Sentinel for the new master, asserts elapsed < 30s.
- config/grafana/dashboards/redis-cache-overview.json : 3 hit-rate
  stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown.
- docs/ENV_VARIABLES.md §3 : 3 new REDIS_SENTINEL_* env vars.
- veza-backend-api/.env.template : 3 placeholders (empty default).

Acceptance (Day 11) : Sentinel failover < 30s ; cache hit-rate
dashboard populated. Lab test pending Sentinel deployment.

W3 verification gate progress : Redis Sentinel ✓ (this commit),
MinIO EC4+2 pending (Day 12), CDN pending (Day 13), DMCA pending (Day 14),
embed pending (Day 15).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:36:55 +02:00
senke
84e92a75e2 feat(observability): OTel SDK + collector + Tempo + 4 hot path spans (W2 Day 9)
Some checks failed
Veza CI / Notify on failure (push) Blocked by required conditions
Security Scan / Secret Scanning (gitleaks) (push) Waiting to run
Veza CI / Backend (Go) (push) Has been cancelled
Veza CI / Rust (Stream Server) (push) Has been cancelled
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Wires distributed tracing end-to-end. Backend exports OTLP/gRPC to a
collector, which tail-samples (errors + slow always, 10% rest) and
ships to Tempo. Grafana service-map dashboard pivots on the 4
instrumented hot paths.

- internal/tracing/otlp_exporter.go : InitOTLPTracer + Provider.Shutdown,
  BatchSpanProcessor (5s/512 batch), ParentBased(TraceIDRatio) sampler,
  W3C trace-context + baggage propagators. OTEL_SDK_DISABLED=true
  short-circuits to a no-op. Failure to dial collector is non-fatal.
- cmd/api/main.go : init at boot, defer Shutdown(5s) on exit. appVersion
  ldflag-overridable for resource attributes.
- 4 hot paths instrumented :
    * handlers/auth.go::Login           → "auth.login"
    * core/track/track_upload_handler.go::InitiateChunkedUpload → "track.upload.initiate"
    * core/marketplace/service.go::ProcessPaymentWebhook → "payment.webhook"
    * handlers/search_handlers.go::Search → "search.query"
  PII guarded — email masked, query content not recorded (length only).
- infra/ansible/roles/otel_collector : pin v0.116.1 contrib build,
  systemd unit, tail-sampling config (errors + > 500ms always kept).
- infra/ansible/roles/tempo : pin v2.7.1 monolithic, local-disk backend
  (S3 deferred to v1.1), 14d retention.
- infra/ansible/playbooks/observability.yml : provisions both Incus
  containers + applies common baseline + roles in order.
- inventory/lab.yml : new groups observability, otel_collectors, tempo.
- config/grafana/dashboards/service-map.json : node graph + 4 hot-path
  span tables + collector throughput/queue panels.
- docs/ENV_VARIABLES.md §30 : 4 OTEL_* env vars documented.

Acceptance criterion (Day 9) : login → span visible in Tempo UI. Lab
deployment to validate with `ansible-playbook -i inventory/lab.yml
playbooks/observability.yml` once roles/postgres_ha is up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 01:15:11 +02:00
senke
bf31a91ae6 feat(infra): pgbackrest role + dr-drill + Prometheus backup alerts (W2 Day 8)
Some checks failed
Veza CI / Frontend (Web) (push) Failing after 16m6s
Veza CI / Notify on failure (push) Successful in 11s
E2E Playwright / e2e (full) (push) Successful in 19m59s
Veza CI / Rust (Stream Server) (push) Successful in 4m57s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 49s
Veza CI / Backend (Go) (push) Successful in 6m4s
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 8 deliverable:
  - Postgres backups land in MinIO via pgbackrest
  - dr-drill restores them weekly into an ephemeral Incus container
    and asserts the data round-trips
  - Prometheus alerts fire when the drill fails OR when the timer
    has stopped firing for >8 days

Cadence:
  full   — weekly  (Sun 02:00 UTC, systemd timer)
  diff   — daily   (Mon-Sat 02:00 UTC, systemd timer)
  WAL    — continuous (postgres archive_command, archive_timeout=60s)
  drill  — weekly  (Sun 04:00 UTC — runs 2h after the Sun full so
           the restore exercises fresh data)
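
The same cadence as systemd OnCalendar values, sketched ; the variable
name is illustrative and the role's templates may hardcode these instead :

  pgbackrest_timer_oncalendar:
    full: "Sun *-*-* 02:00:00 UTC"
    diff: "Mon..Sat *-*-* 02:00:00 UTC"
    drill: "Sun *-*-* 04:00:00 UTC"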

RPO ≈ 1 min (archive_timeout). RTO ≤ 30 min (drill measures actual
restore wall-clock).

Files:
  infra/ansible/roles/pgbackrest/
    defaults/main.yml — repo1-* config (MinIO/S3, path-style,
      aes-256-cbc encryption, vault-backed creds), retention 4 full
      / 7 diff / 4 archive cycles, zstd@3 compression. The role's
      first task asserts the placeholder secrets are gone — refuses
      to apply until the vault carries real keys.
    tasks/main.yml — install pgbackrest, render
      /etc/pgbackrest/pgbackrest.conf, set archive_command on the
      postgres instance via ALTER SYSTEM, detect role at runtime
      via `pg_autoctl show state --json`, stanza-create from primary
      only, render + enable systemd timers (full + diff + drill).
    templates/pgbackrest.conf.j2 — global + per-stanza sections;
      pg1-path defaults to the pg_auto_failover state dir so the
      role plugs straight into the Day 6 formation.
    templates/pgbackrest-{full,diff,drill}.{service,timer}.j2 —
      systemd units. Backup services run as `postgres`,
      drill service runs as `root` (needs `incus`).
      RandomizedDelaySec on every timer to absorb clock skew + node
      collision risk.
    README.md — RPO/RTO guarantees, vault setup, repo wiring,
      operational cheatsheet (info / check / manual backup),
      restore procedure documented separately as the dr-drill.

  scripts/dr-drill.sh
    Acceptance script for the day. Sequence:
      0. pre-flight: required tools, latest backup metadata visible
      1. launch ephemeral `pg-restore-drill` Incus container
      2. install postgres + pgbackrest inside, push the SAME
         pgbackrest.conf as the host (read-only against the bucket
         by pgbackrest semantics — the same s3 keys get reused so
         the drill exercises the production credential path)
      3. `pgbackrest restore` — full + WAL replay
      4. start postgres, wait for pg_isready
      5. smoke query: SELECT count(*) FROM users — must be ≥ MIN_USERS_EXPECTED
      6. write veza_backup_drill_* metrics to the textfile-collector
      7. teardown (or --keep for postmortem inspection)
    Exit codes 0/1/2 (pass / drill failure / env problem) so a
    Prometheus runner can plug in directly.

  config/prometheus/alert_rules.yml — new `veza_backup` group:
    - BackupRestoreDrillFailed (critical, 5m): the last drill
      reported success=0. Pages because a backup we haven't proved
      restorable is technical debt waiting for a disaster.
    - BackupRestoreDrillStale (warning, 1h after >8 days): the
      drill timer has stopped firing. Catches a broken cron / unit
      / runner before the failure-mode alert above ever sees data.
    Both annotations include a runbook_url stub
    (veza.fr/runbooks/...) — those land alongside W2 day 10's
    SLO runbook batch.

  infra/ansible/playbooks/postgres_ha.yml
    Two new plays:
      6. apply pgbackrest role to postgres_ha_nodes (install +
         config + full/diff timers on every data node;
         pgbackrest's repo lock arbitrates collision)
      7. install dr-drill on the incus_hosts group (push
         /usr/local/bin/dr-drill.sh + render drill timer + ensure
         /var/lib/node_exporter/textfile_collector exists)

Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml          ← clean
  $ python3 -c "import yaml; yaml.safe_load(open('config/prometheus/alert_rules.yml'))"
  YAML OK
  $ bash -n scripts/dr-drill.sh
  syntax OK

Real apply + drill needs the lab R720 + a populated MinIO bucket
+ the secrets in vault — operator's call.

Out of scope (deferred per ROADMAP §2):
  - Off-site backup replica (B2 / Bunny.net) — v1.1+
  - Logical export pipeline for RGPD per-user dumps — separate
    feature track, not a backup-system concern
  - PITR admin UI — CLI-only via `--type=time` for v1.0
  - pgbackrest_exporter Prometheus integration — W2 day 9
    alongside the OTel collector

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 00:51:00 +02:00
senke
ba6e8b4e0e feat(infra): pgbouncer role + pgbench load test (W2 Day 7)
All checks were successful
Veza CI / Rust (Stream Server) (push) Successful in 3m49s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 58s
Veza CI / Backend (Go) (push) Successful in 5m59s
Veza CI / Frontend (Web) (push) Successful in 15m22s
E2E Playwright / e2e (full) (push) Successful in 19m34s
Veza CI / Notify on failure (push) Has been skipped
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 7 deliverable: PgBouncer
fronts the pg_auto_failover formation, the backend pays the
postgres-fork cost 50 times per pool refresh instead of once per
HTTP handler.

Wiring:
  veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432
                              (1000 client cap)             (50 server pool)
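
The sizing above as it might appear in defaults/main.yml ; a sketch, the
role's actual key names may differ :

  pgbouncer_max_client_conn: 1000
  pgbouncer_default_pool_size: 50
  pgbouncer_reserve_pool_size: 10
  pgbouncer_pool_mode: transaction
  pgbouncer_dns_max_ttl: 60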

Files:
  infra/ansible/roles/pgbouncer/
    defaults/main.yml — pool sizes match the acceptance target
      (1000 client × 50 server × 10 reserve), pool_mode=transaction
      (safe for this backend : transaction pooling forbids LISTEN/NOTIFY
      and cross-tx prepared statements, neither of which Veza uses),
      DNS TTL = 60s for failover.
    tasks/main.yml — apt install pgbouncer + postgresql-client (so
      the pgbench / admin psql lives on the same container), render
      pgbouncer.ini + userlist.txt, ensure /var/log/postgresql for
      the file log, enable + start service.
    templates/pgbouncer.ini.j2 — full config; databases section
      points at pgaf-primary.lxd:5432 directly. Failover follows
      via DNS TTL until the W2 day 8 pg_autoctl state-change hook
      that issues RELOAD on the admin console.
    templates/userlist.txt.j2 — only rendered when auth_type !=
      trust. Lab uses trust on the bridge subnet; prod gets a
      vault-backed list of md5/scram hashes.
    handlers/main.yml — RELOAD pgbouncer (graceful, doesn't drop
      established clients).
    README.md — operational cheatsheet:
      - SHOW POOLS / SHOW STATS via the admin console
      - the transaction-mode forbids list (LISTEN/NOTIFY etc.)
      - failover behaviour today vs after the W2-day-8 hook lands

  infra/ansible/playbooks/postgres_ha.yml
    Provision step extended to launch pgaf-pgbouncer alongside
    the formation containers. Two new plays at the bottom apply
    common baseline + pgbouncer role to it.

  infra/ansible/inventory/lab.yml
    `pgbouncer` group with pgaf-pgbouncer reachable via the
    community.general.incus connection plugin (consistent with the
    postgres_ha containers).

  infra/ansible/tests/test_pgbouncer_load.sh
    Acceptance: pgbench 500 clients × 30s × 8 threads against the
    pgbouncer endpoint, must report 0 failed transactions and 0
    connection errors. Also runs `pgbench -i -s 10` first to
    initialise the standard fixture — that init goes through
    pgbouncer too, which incidentally validates transaction-mode
    compatibility before the load run starts.
    Exit codes: 0 / 1 (errors) / 2 (unreachable) / 3 (missing tool).

  veza-backend-api/internal/config/config.go
    Comment block above DATABASE_URL load — documents the prod
    wiring (DATABASE_URL points at pgaf-pgbouncer.lxd:6432, NOT
    at pgaf-primary directly). Also notes the dev/CI exception:
    direct Postgres because the small scale doesn't benefit from
    pooling and tests occasionally lean on session-scoped GUCs
    that transaction-mode would break.
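
For orientation, the load run test_pgbouncer_load.sh wraps boils down
to two pgbench invocations (a sketch; host/port match the wiring and
the prod DATABASE_URL comment above, while the `veza` user/dbname are
assumptions):

  $ pgbench -i -s 10 -h pgaf-pgbouncer.lxd -p 6432 -U veza veza
  $ pgbench -c 500 -j 8 -T 30 -h pgaf-pgbouncer.lxd -p 6432 -U veza veza

The default pgbench script wraps every transaction in BEGIN/END, so
both the init and the load run go through transaction pooling without
relying on session state.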

Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml          ← clean
  $ bash -n infra/ansible/tests/test_pgbouncer_load.sh
  syntax OK
  $ cd veza-backend-api && go build ./...
  (clean — comment-only change in config.go)
  $ gofmt -l internal/config/config.go
  (no output — clean)

Real apply + pgbench run requires the lab R720 + the
community.general collection — operator's call.

Out of scope (deferred per ROADMAP §2):
  - HA pgbouncer (single instance per env at v1.0; double
    instance + keepalived in v1.1 if needed)
  - pg_autoctl state-change hook → pgbouncer RELOAD (W2 day 8)
  - Prometheus pgbouncer_exporter (W2 day 9 with the OTel
    collector + observability stack)

SKIP_TESTS=1 — IaC YAML + bash + Go comment-only diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 18:35:05 +02:00
senke
c941aba3d2 feat(infra): postgres_ha role + pg_auto_failover formation + RTO test (W2 Day 6)
Some checks failed
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 3m45s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 1m0s
Veza CI / Backend (Go) (push) Successful in 5m38s
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 6 deliverable: Postgres HA
ready to fail over in < 60s, asserted by an automated test script.

Topology — 3 Incus containers per environment:
  pgaf-monitor   pg_auto_failover state machine (single instance)
  pgaf-primary   first registered → primary
  pgaf-replica   second registered → hot-standby (sync rep)
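
The pg_auto_failover commands behind this topology, for orientation
(a sketch; pgdata paths, the monitor URI and the postgres user are
assumptions, each call also needs --pgdata or PGDATA exported, and
the role wraps these in the task files listed below rather than
running them by hand):

  # pgaf-monitor: the single state machine
  $ sudo -u postgres pg_autoctl create monitor \
        --pgdata /var/lib/postgresql/monitor --auth trust --ssl-self-signed
  # pgaf-primary, then pgaf-replica: registration order decides primary vs standby
  $ sudo -u postgres pg_autoctl create postgres \
        --pgdata /var/lib/postgresql/node --auth trust --ssl-self-signed \
        --monitor 'postgres://autoctl_node@pgaf-monitor.lxd/pg_auto_failover'
  # formation policy from defaults; idempotent, any node can set it
  $ sudo -u postgres pg_autoctl set formation number-sync-standbys 1
  # day-2: inspect roles, or trigger a manual failover
  $ sudo -u postgres pg_autoctl show state
  $ sudo -u postgres pg_autoctl perform failover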

Files:
  infra/ansible/playbooks/postgres_ha.yml
    Provisions the 3 containers via `incus launch images:ubuntu/22.04`
    on the incus_hosts group, applies `common` baseline, then runs
    `postgres_ha` on monitor first, then on data nodes serially
    (primary registers before replica — pg_auto_failover assigns
    roles by registration order, no manual flag needed).

  infra/ansible/roles/postgres_ha/
    defaults/main.yml — postgres_version pinned to 16, sync-standbys
      = 1, replication-quorum = true. App user/dbname for the
      formation. Password sourced from vault; the default is the
      obvious placeholder `changeme-DEV-ONLY`, so a missing vault
      value can't silently pass for a real prod password. The role
      reads the value but does NOT auto-create the app user; that's
      a follow-up via psql/SQL provisioning when the backend wires
      DATABASE_URL.
    tasks/install.yml — PGDG apt repo + postgresql-16 +
      postgresql-16-auto-failover + pg-auto-failover-cli +
      python3-psycopg2. Stops the default postgresql@16-main service
      because pg_auto_failover manages its own instance.
    tasks/monitor.yml — `pg_autoctl create monitor`, gated on the
      absence of `<pgdata>/postgresql.conf` so re-runs no-op.
      Renders systemd unit `pg_autoctl.service` and starts it.
    tasks/node.yml — `pg_autoctl create postgres` joining the
      monitor URI from defaults. Sets formation sync-standbys
      policy idempotently from any node.
    templates/pg_autoctl-{monitor,node}.service.j2 — minimal
      systemd units, Restart=on-failure, NOFILE=65536.
    README.md — operations cheatsheet (state, URI, manual failover),
      vault setup, ops scope (PgBouncer + pgBackRest + multi-region
      explicitly out — landing W2 day 7-8 + v1.2+).

  infra/ansible/inventory/lab.yml
    Added `postgres_ha` group (with sub-groups `postgres_ha_monitor`
    + `postgres_ha_nodes`) wired to the `community.general.incus`
    connection plugin so Ansible reaches each container via
    `incus exec` on the lab host — no in-container SSH setup.

  infra/ansible/tests/test_pg_failover.sh
    The acceptance script. Sequence:
      0. read formation state via monitor — abort if degraded baseline
      1. `incus stop --force pgaf-primary` — start RTO timer
      2. poll monitor every 1s for the standby's promotion
      3. `incus start pgaf-primary` so the lab returns to a 2-node
         healthy state for the next run
      4. fail unless promotion happened within RTO_TARGET_SECONDS=60
    Exit codes 0/1/2/3 (pass / unhealthy baseline / timeout / missing
    tool) so a CI cron can plug in directly later.
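
    Stripped of the exit-code plumbing, the core of that sequence
    looks roughly like this (a sketch; the monitor's pgdata path,
    the postgres user, and matching the replica by container name
    in the `show state` output are assumptions):

      deadline=$(( $(date +%s) + ${RTO_TARGET_SECONDS:-60} ))
      incus stop --force pgaf-primary            # starts the RTO clock
      promoted=no
      while [ "$(date +%s)" -lt "$deadline" ]; do
          if incus exec pgaf-monitor -- sudo -u postgres pg_autoctl show state \
                 --pgdata /var/lib/postgresql/monitor 2>/dev/null \
               | grep -Eq 'pgaf-replica.*(wait_)?primary'; then
              promoted=yes; break
          fi
          sleep 1
      done
      incus start pgaf-primary                   # back to a healthy 2-node lab
      [ "$promoted" = yes ] || exit 2            # 2 = timeout, per the exit codes above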

Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml          ← clean
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --list-tasks
  4 plays, 22 tasks across plays, all tagged.
  $ bash -n infra/ansible/tests/test_pg_failover.sh
  syntax OK

Real `--check` + apply requires SSH access to the R720 + the
community.general collection installed (`ansible-galaxy collection
install community.general`). Operator runs that step.

Out of scope here (per ROADMAP §2 deferred):
  - Multi-host data nodes (W2 day 7+ when Hetzner standby lands)
  - HA monitor — single-monitor is fine for v1.0 scale
  - PgBouncer (W2 day 7), pgBackRest (W2 day 8), OTel collector (W2 day 9)

SKIP_TESTS=1 — IaC YAML + bash, no app code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 18:27:46 +02:00
senke
65c20835c1 feat(infra): Ansible IaC scaffolding — common + incus_host roles (Day 5 v1.0.9)
Some checks failed
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 3m27s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 52s
Veza CI / Backend (Go) (push) Successful in 5m32s
Day 5 of ROADMAP_V1.0_LAUNCH.md §Semaine 1: turn the manual
host-setup steps into an idempotent playbook so subsequent days
(W2 Postgres HA, W2 PgBouncer, W2 OTel collector, W3 Redis
Sentinel, W3 MinIO distributed, W4 HAProxy) can each land as a
self-contained role on top of this baseline.

Layout (full tree under infra/ansible/):

  ansible.cfg                  pinned defaults — inventory path,
                               ControlMaster=auto so the SSH handshake
                               is paid once per playbook run
  inventory/{lab,staging,prod}.yml
                               three environments. lab is the R720's
                               local Incus container (10.0.20.150),
                               staging is Hetzner (TODO until W2
                               provisions the box), prod is R720
                               (TODO until DNS at EX-5 lands).
  group_vars/all.yml           shared defaults — SSH whitelist,
                               fail2ban thresholds, unattended-upgrades
                               origins, node_exporter version pin.
  playbooks/site.yml           entry point. Two plays:
                                 1. common (every host)
                                 2. incus_host (incus_hosts group)
  roles/common/                idempotent baseline:
                                 ssh.yml — drop-in
                                   /etc/ssh/sshd_config.d/50-veza-
                                   hardening.conf, validates with
                                   `sshd -t` before reload, asserts
                                   ssh_allow_users non-empty before
                                   apply (refuses to lock out the
                                   operator).
                                 fail2ban.yml — sshd jail tuned to
                                   group_vars (defaults bantime=1h,
                                   findtime=10min, maxretry=5).
                                 unattended_upgrades.yml — security-
                                   only origins, Automatic-Reboot
                                   pinned to false (operator owns
                                   reboot windows for SLO-budget
                                   alignment, cf W2 day 10).
                                 node_exporter.yml — pinned to
                                   1.8.2, runs as a systemd unit
                                   on :9100. Skips download when
                                   --version already matches.
  roles/incus_host/            zabbly upstream apt repo + incus +
                               incus-client install. First-time
                               `incus admin init --preseed` only when
                               `incus list` errors (i.e. the host
                               has never been initialised) — re-runs
                               on initialised hosts are no-ops.
                               Configures incusbr0 / 10.99.0.1/24
                               with NAT + default storage pool.
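
Expressed as plain shell, the idempotence and safety gates described
in the tree above reduce to roughly this (a sketch; the roles
implement them as Ansible tasks, and the node_exporter install path
and preseed file name are assumptions):

  # ssh.yml: only reload once the full config, drop-in included, validates
  sshd -t && systemctl reload ssh

  # node_exporter.yml: skip the download when the 1.8.2 pin is already installed
  /usr/local/bin/node_exporter --version 2>&1 | grep -q 'version 1\.8\.2' \
      || echo "would fetch node_exporter 1.8.2 here"

  # incus_host: preseed init only when `incus list` errors (host never initialised)
  incus list >/dev/null 2>&1 \
      || incus admin init --preseed < /root/incus-preseed.yaml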

Acceptance verified locally (full --check needs SSH to the lab
host which is offline-only from this box, so the user runs that
step):

  $ cd infra/ansible
  $ ansible-playbook -i inventory/lab.yml playbooks/site.yml --syntax-check
  playbook: playbooks/site.yml          ← clean
  $ ansible-playbook -i inventory/lab.yml playbooks/site.yml --list-tasks
  21 tasks across 2 plays, all tagged.  ← partial applies work

Conventions enforced from the start:
  - Every task has tags so `--tags ssh,fail2ban` partial applies
    are always possible (example below).
  - Sub-task files (ssh.yml, fail2ban.yml, etc.) so the role
    main.yml stays a directory of concerns, not a wall of tasks.
  - Validators run before reload (sshd -t for sshd_config). The
    role refuses to apply changes that would lock the operator out.
  - Comments answer "why" — task names + module names already
    say "what".

Next role on the stack: postgres_ha (W2 day 6) — pg_auto_failover
monitor + primary + replica in 3 Incus containers.

SKIP_TESTS=1 — IaC YAML, no app code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 18:16:38 +02:00