Commit graph

24 commits

Author SHA1 Message Date
senke
0bd3e563b2 fix(haproxy): incus proxy devices forward R720:80/443 → container
The Orange box NAT correctly forwards :80/:443 → R720 LAN IP, but
the R720 host has nothing listening there — haproxy lives in the
veza-haproxy container, reachable only on the net-veza bridge
(10.0.20.X). Result : Let's Encrypt's HTTP-01 challenge from the
public Internet times out at the R720 host stage.

Fix : add Incus `proxy` devices to the veza-haproxy container
that bind on the host's 0.0.0.0:80 / 0.0.0.0:443 and forward into
the container's local ports. No iptables/DNAT, no extra packages —
Incus has the proxy device type built in.

  incus config device add veza-haproxy http  proxy \
      listen=tcp:0.0.0.0:80  connect=tcp:127.0.0.1:80
  incus config device add veza-haproxy https proxy \
      listen=tcp:0.0.0.0:443 connect=tcp:127.0.0.1:443

Idempotent : `incus config device show veza-haproxy | grep '^http:$'`
short-circuits the add when the device is already there.

Operator setup unchanged : box NAT 80/443 → R720 LAN IP. Ansible
now bridges the rest of the path automatically.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:27:37 +02:00
senke
e97b91f010 fix(ansible): don't apply common role to haproxy container + gate ssh.yml on sshd
Two fixes for "haproxy container doesn't have sshd" :

1. playbooks/haproxy.yml — drop the `common` role play.
   The role's purpose is to harden a full HOST (SSH + fail2ban
   monitoring auth.log + node_exporter metrics surface). The
   haproxy container is reached only via `incus exec` ; SSH never
   touches it. Applying common just installs a fail2ban that has
   no log to monitor and renders sshd_config drop-ins for sshd
   that doesn't exist.
   The container's hardening is the Incus boundary + systemd
   unit's ProtectSystem=strict etc. (already in the templates).

2. roles/common/tasks/ssh.yml — gate every task on sshd presence.
   `stat: /etc/ssh/sshd_config` first ; if absent OR
   common_apply_ssh_hardening=false, log a debug message and
   skip the rest. Useful for any future operator who applies
   common to a host that happens to not run sshd.
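
A minimal sketch of the gate described in (2), as it could look in
ssh.yml ; the hardening tasks themselves are elided and the task names
are illustrative :

  - name: Check whether sshd is installed on this host
    ansible.builtin.stat:
      path: /etc/ssh/sshd_config
    register: sshd_config_stat

  - name: Explain why SSH hardening is being skipped
    ansible.builtin.debug:
      msg: "no sshd_config found or common_apply_ssh_hardening=false, skipping SSH hardening"
    when: not sshd_config_stat.stat.exists or not (common_apply_ssh_hardening | default(true))

  - name: Apply SSH hardening only when sshd is actually present
    when:
      - sshd_config_stat.stat.exists
      - common_apply_ssh_hardening | default(true)
    block:
      - name: Render sshd_config drop-ins, restart sshd, etc. (elided)
        ansible.builtin.debug:
          msg: "hardening tasks run here"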

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:57:16 +02:00
senke
5f6625cc56 fix(ansible): detect storage pool from forgejo's root device, not first listed
The previous detect picked the first row of `incus storage list -f csv`,
which on the user's R720 returned `default` — but `default` is not
usable on this server (`Storage pool is unavailable on this server`
when launching). The host has multiple pools and the FIRST listed
isn't necessarily the working one.

New detect strategy (most-reliable first) :
  1. `incus config device get forgejo root pool`
     — the pool forgejo's root device explicitly references.
  2. `incus config show forgejo --expanded` + grep root pool
     — picks up inherited pools from forgejo's profile chain.
  3. Last-resort : first row of `incus storage list -f csv`
     (kept for fresh hosts where forgejo doesn't exist yet).

Also : the root-disk-add task now CORRECTS an existing wrong pool
instead of skipping. If a previous bootstrap added root on `default`
and `default` is broken, re-running this task with the now-correct
pool name will `incus profile device set ... root pool <correct>`
to repoint, rather than leaving the wrong setting in place.

Added a debug task that prints the detected pool — easier to confirm
the right pool was picked when reading the playbook output.
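
A hedged sketch of strategies 1 and 3 plus the debug print (strategy 2
elided) ; task and fact names are illustrative, not necessarily the
playbook's own :

  - name: Ask forgejo's root device which pool it references
    ansible.builtin.command: incus config device get forgejo root pool
    register: forgejo_root_pool
    failed_when: false
    changed_when: false

  - name: Fall back to the first listed pool (fresh host, no forgejo yet)
    ansible.builtin.shell: incus storage list -f csv | head -n1 | cut -d, -f1
    register: first_listed_pool
    when: forgejo_root_pool.rc != 0 or (forgejo_root_pool.stdout | trim) == ""
    changed_when: false

  - name: Record the detected pool
    ansible.builtin.set_fact:
      detected_pool: >-
        {{ (forgejo_root_pool.stdout | trim)
           if forgejo_root_pool.rc == 0 and (forgejo_root_pool.stdout | trim) != ''
           else (first_listed_pool.stdout | trim) }}

  - name: Show which pool was picked
    ansible.builtin.debug:
      var: detected_pool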

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:34:50 +02:00
senke
4298f0c26a fix(ansible): bootstrap_runner — add root disk to veza-{app,data} profiles
`incus launch ... --profile veza-app` failed with :
  Failed initializing instance: Invalid devices:
    Failed detecting root disk device: No root device could be found

Cause : the profiles were created empty. Incus needs a root disk
device referencing a storage pool to actually launch a container ;
the `default` profile carries one implicitly but custom profiles
need it added explicitly OR the launch must combine `default` +
custom profile.

Fix : phase 1 of bootstrap_runner.yml now :
  1. Detects the first available storage pool (`incus storage list`).
  2. After creating each profile, adds a root disk device pointing
     at that pool : `incus profile device add veza-app root disk
     path=/ pool=<detected>`.

Idempotent : the add-root step is guarded by `incus profile device
show veza-app | grep -q '^root:'` ; re-runs are no-ops.
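
A sketch of the guarded add, assuming the detected pool sits in a
`detected_pool` fact (names illustrative) :

  - name: Inspect veza-app's existing devices
    ansible.builtin.command: incus profile device show veza-app
    register: veza_app_devices
    changed_when: false

  - name: Add a root disk pointing at the detected pool when absent
    ansible.builtin.command: >-
      incus profile device add veza-app root disk
      path=/ pool={{ detected_pool }}
    when: "'root:' not in veza_app_devices.stdout"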

Storage pool autodetect picks the first row of `incus storage list`
— typically `default`, but accepts custom names (`local`, `data`,
etc.) without operator intervention.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:32:00 +02:00
senke
a881be9dad fix(ansible): bootstrap_runner phase 3 uses incus exec from host (not community.general.incus)
Previous play targeted `forgejo_runner` group with
`ansible_connection: community.general.incus`. The plugin runs
LOCALLY (on whichever host invokes ansible-playbook) and looks
up the container in the local incus instance — which on the
operator's laptop doesn't have a `forgejo-runner` container.

Result :
  fatal: [forgejo-runner]: UNREACHABLE!
    "instance not found: forgejo-runner (remote=local, project=default)"

Fix : run phase 3 on `incus_hosts` (the R720) and reach into the
container via `incus exec forgejo-runner -- <cmd>`. Same shape
the working bootstrap-remote.sh used before this commit series.
No connection-plugin remoting needed, no `incus remote` config
required on the operator's laptop.
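
The shape of the host-side play, sketched ; only the reachability probe
is shown, the registration itself follows the same `incus exec
forgejo-runner -- <cmd>` pattern with no_log on the token-bearing task :

  - name: Bootstrap runner phase 3 (runs on the Incus host)
    hosts: incus_hosts
    become: true
    tasks:
      - name: Confirm forgejo-runner answers incus exec
        ansible.builtin.command: incus exec forgejo-runner -- /bin/true
        changed_when: false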

Side effects : `forgejo_runner` group in inventory/{staging,prod}.yml
is now unused but harmless ; left in place for any future task that
might want it back.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:16:04 +02:00
senke
3b33791660 refactor(bootstrap): everything via Ansible — no NOPASSWD, no SSH plumbing
Rearchitecture after operator pushback : the previous design did
too much in bash (SSH-streaming script chunks, manual sudo dance,
NOPASSWD requirement). Ansible is the right tool. The shell
scripts are now thin orchestrators handling the chicken-and-egg
of vault + Forgejo CI provisioning, then calling ansible-playbook.

Key principles :
  1. NO NOPASSWD sudo on the R720. --ask-become-pass interactive,
     password held in ansible memory only for the run.
  2. Two parallel scripts — one per host, fully self-contained.
  3. Both run the SAME Ansible playbooks (bootstrap_runner.yml +
     haproxy.yml). Difference is the inventory.

Files (new + replaced) :

  ansible.cfg
    pipelining=True → False. Required for --ask-become-pass to
    work reliably ; the previous setting raced sudo's prompt and
    timed out at 12s.

  playbooks/bootstrap_runner.yml (new)
    The Incus-host-side bootstrap, ported from the old
    scripts/bootstrap/bootstrap-remote.sh. Three plays :
      Phase 1 : ensure veza-app + veza-data profiles exist ;
                drop legacy empty veza-net profile.
      Phase 2 : forgejo-runner gets /var/lib/incus/unix.socket
                attached as a disk device, security.nesting=true,
                /usr/bin/incus pushed in as /usr/local/bin/incus,
                smoke-tested.
      Phase 3 : forgejo-runner registered with `incus,self-hosted`
                label (idempotent — skips if already labelled).
    Each task uses Ansible idioms (`incus_profile`, `incus_command`
    where they exist, `command:` with `failed_when` and explicit
    state-checking elsewhere). no_log on the registration token.

  inventory/local.yml (new)
    Inventory for `bootstrap-r720.sh` — connection: local instead
    of SSH+become. Same group structure as staging.yml ;
    container groups use community.general.incus connection
    plugin (the local incus binary, no remote).

  inventory/{staging,prod}.yml (modified)
    Added `forgejo_runner` group (target of bootstrap_runner.yml
    phase 3, reached via community.general.incus from the host).

  scripts/bootstrap/bootstrap-local.sh (rewritten)
    Five phases : preflight, vault, forgejo, ansible, summary.
    Phase 4 calls a single `ansible-playbook` with both
    bootstrap_runner.yml + haproxy.yml in sequence.
    --ask-become-pass : ansible prompts ONCE for sudo, holds in
    memory, reuses for every become: true task.

  scripts/bootstrap/bootstrap-r720.sh (new)
    Symmetric to bootstrap-local.sh but runs as root on the R720.
    No SSH preflight, no --ask-become-pass (already root).
    Same Ansible playbooks, inventory/local.yml.

  scripts/bootstrap/verify-r720.sh (new — replaces verify-remote)
    Read-only checks of R720 state. Run as root locally on the R720.

  scripts/bootstrap/verify-local.sh (modified)
    Cross-host SSH check now fits the env-var-driven SSH_TARGET
    pattern (R720_USER may be empty if the alias has User=).

  scripts/bootstrap/{bootstrap-remote.sh, verify-remote.sh,
  verify-remote-ssh.sh} (DELETED)
    Replaced by playbooks/bootstrap_runner.yml + verify-r720.sh.

  README.md (rewritten)
    Documents the parallel-script architecture, the
    no-NOPASSWD-sudo design choice (--ask-become-pass), each
    phase's needs, and a refreshed troubleshooting list.

State files unchanged in shape :
  laptop : .git/talas-bootstrap/local.state
  R720   : /var/lib/talas/r720-bootstrap.state
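
For reference, a hedged sketch of the inventory/local.yml shape described
above : connection: local for the Incus host, the community.general.incus
connection plugin for container groups. Host and group names here are
illustrative, not the exact inventory contents :

  all:
    children:
      incus_hosts:
        hosts:
          r720:
            ansible_connection: local
      haproxy:
        hosts:
          veza-haproxy:
            ansible_connection: community.general.incus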

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:12:26 +02:00
senke
b9445faacc fix(infra): rename veza-net → net-veza everywhere + drop redundant profile
The R720 has 5 managed Incus bridges, organized by trust zone :
  net-ad        10.0.50.0/24    admin
  net-dmz       10.0.10.0/24    DMZ
  net-sandbox   10.0.30.0/24    sandbox
  net-veza      10.0.20.0/24    Veza  (forgejo + 12 other containers)
  incusbr0      10.0.0.0/24     default

Veza belongs on `net-veza`. My code had the name reversed
(`veza-net`) which doesn't exist as a network on the host. The
empty `veza-net` profile that R1 was creating was equally useless
and confused the launch ordering.

Changes :
* group_vars/staging.yml
    veza_incus_network : veza-staging-net → net-veza
    veza_incus_subnet  : 10.0.21.0/24    → 10.0.20.0/24
    Comment block explains why staging+prod share net-veza in v1.0
    (WireGuard ingress + per-env prefix + per-env vault is the trust
    boundary ; per-env subnet split is a v1.1 hardening) and how to
    flip to a dedicated bridge later.
* group_vars/prod.yml
    veza_incus_network : veza-net → net-veza
* playbooks/haproxy.yml
    incus launch ... --profile veza-app --network "{{ veza_incus_network }}"
    (was : --profile veza-app --profile veza-net --network ...)
* playbooks/deploy_data.yml + deploy_app.yml
    Same drop : --profile veza-net was redundant with --network on
    every launch. Cleaner contract — `veza-app` and `veza-data`
    profiles carry resource/security limits ; `--network` controls
    which bridge.
* scripts/bootstrap/bootstrap-remote.sh R1
    Stop creating the `veza-net` profile. Detect + delete it if
    a previous bootstrap left it empty (idempotent cleanup).

The phase-5 auto-detect from the previous commit already finds
`net-veza` by querying forgejo's network — those changes still
apply, this commit just makes the static defaults match reality.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:58:04 +02:00
senke
ab86ae80fa fix(ansible): playbooks/haproxy.yml — bootstrap the SHARED veza-haproxy
Two drift-fixes between the bootstrap playbook and the rest of
the W5 deploy pipeline :

* Container name : `haproxy` → `veza-haproxy`
  inventory/{staging,prod}.yml's haproxy group now points at
  `veza-haproxy` ; the bootstrap was still creating an unprefixed
  `haproxy` and the role would never reach it.
* Base image : `images:ubuntu/22.04` → `images:debian/13`
  Matches the rest of the deploy pipeline (veza_app_base_image
  default in group_vars/all/main.yml). The role expects
  Debian-style apt + systemd unit names.
* Profiles : `incus launch` now applies `--profile veza-app
  --profile veza-net --network <veza_incus_network>` like every
  other container the pipeline creates. Prevents a barebones
  container that doesn't get the Veza network policy.
* Cloud-init wait : drop the `cloud-init status` poll (Debian
  base image's cloud-init is minimal anyway) ; replace with a
  direct `incus exec veza-haproxy -- /bin/true` reachability
  loop, same pattern as deploy_data.yml's launch task.
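
A minimal sketch of that reachability loop (retry count and delay are
assumptions, not the playbook's actual values) :

  - name: Wait until veza-haproxy accepts incus exec
    ansible.builtin.command: incus exec veza-haproxy -- /bin/true
    register: haproxy_exec
    until: haproxy_exec.rc == 0
    retries: 30
    delay: 2
    changed_when: false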

The third play sets `haproxy_topology: blue-green` explicitly so
the edge always renders the multi-env topology, even when run
from `inventory/lab.yml` (which lacks the env-prefix vars and
would otherwise fall through to the multi-instance branch).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:34:38 +02:00
senke
5153ab113d refactor(ansible): single edge HAProxy — multi-env + Forgejo + Talas
The 12-record DNS plan ($1 per record at the registrar but only one
public R720 IP) forces the obvious : a single HAProxy on :443 must
serve staging.veza.fr + veza.fr + www.veza.fr + talas.fr +
www.talas.fr + forgejo.talas.group all at once. Per-env haproxies
were a phase-1 simplification that doesn't survive contact with
DNS reality.

Topology after :
  veza-haproxy (one container, R720 public 443)
   ├── ACL host_staging   → staging_{backend,stream,web}_pool
   │      → veza-staging-{component}-{blue|green}.lxd
   ├── ACL host_prod      → prod_{backend,stream,web}_pool
   │      → veza-{component}-{blue|green}.lxd
   ├── ACL host_forgejo   → forgejo_backend → 10.0.20.105:3000
   │      (Forgejo container managed outside the deploy pipeline)
   └── ACL host_talas     → talas_vitrine_backend
          (placeholder 503 until the static site lands)
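
A hedged sketch of the group_vars shape that could drive this mapping
(key names from the Changes list below ; exact structure and values are
assumptions) :

  haproxy_env_prefixes:
    staging: "veza-staging-"
    prod: "veza-"
  haproxy_env_public_hosts:
    staging: ["staging.veza.fr"]
    prod: ["veza.fr", "www.veza.fr"]
  haproxy_forgejo_host: "forgejo.talas.group"
  haproxy_forgejo_backend: "10.0.20.105:3000"
  haproxy_talas_hosts: ["talas.fr", "www.talas.fr"]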

Changes :

  inventory/{staging,prod}.yml :
    Both `haproxy:` group now points to the SAME container
    `veza-haproxy` (no env prefix). Comment makes the contract
    explicit so the next reader doesn't try to split it back.

  group_vars/all/main.yml :
    NEW : haproxy_env_prefixes (per-env container prefix mapping).
    NEW : haproxy_env_public_hosts (per-env Host-header mapping).
    NEW : haproxy_forgejo_host + haproxy_forgejo_backend.
    NEW : haproxy_talas_hosts + haproxy_talas_vitrine_backend.
    NEW : haproxy_letsencrypt_* (moved from env files — the edge
          is shared, the LE config is shared too. Else the env
          that ran the haproxy role last would clobber the
          domain set).

  group_vars/{staging,prod}.yml :
    Strip the haproxy_letsencrypt_* block (now in all/main.yml).
    Comment points readers there.

  roles/haproxy/templates/haproxy.cfg.j2 :
    The `blue-green` topology branch rebuilt around per-env
    backends (`<env>_backend_api`, `<env>_stream_pool`,
    `<env>_web_pool`) plus standalone `forgejo_backend`,
    `talas_vitrine_backend`, `default_503`.
    Frontend ACLs : `host_<env>` (hdr(host) -i ...) selects
    which env's backends to use ; path ACLs (`is_api`,
    `is_stream_seg`, etc.) refine within the env.
    Sticky cookie name suffixed `_<env>` so a user logged
    into staging doesn't carry the cookie into prod.
    Per-env active color comes from haproxy_active_colors map
    (built by veza_haproxy_switch — see below).
    Multi-instance branch (lab) untouched.

  roles/veza_haproxy_switch/defaults/main.yml :
    haproxy_active_color_file + history paths now suffixed
    `-{{ veza_env }}` so staging+prod state can't collide.

  roles/veza_haproxy_switch/tasks/main.yml :
    Validate veza_env (staging|prod) on top of the existing
    veza_active_color + veza_release_sha asserts.
    Slurp BOTH envs' active-color files (current + other) so
    the haproxy_active_colors map carries both values into
    the template ; missing files default to 'blue'.

  playbooks/deploy_app.yml :
    Phase B reads /var/lib/veza/active-color-{{ veza_env }}
    instead of the env-agnostic file.

  playbooks/cleanup_failed.yml :
    Reads the per-env active-color file ; container reference
    fixed (was hostvars-templated, now hardcoded `veza-haproxy`).

  playbooks/rollback.yml :
    Fast-mode SHA lookup reads the per-env history file.

Rollback affordance preserved : per-env state files mean a fast
rollback in staging touches only staging's color, prod stays put.
The history files (`active-color-{staging,prod}.history`) keep
the last 5 deploys per env independently.

Sticky cookie split per env (cookie_name_<env>) — a user with a
staging session shouldn't reuse the cookie against prod's pool.

Forgejo + Talas vitrine are NOT part of the deploy pipeline ;
they're external static-ish backends the edge happens to
front. haproxy_forgejo_backend is "10.0.20.105:3000" today
(matches the existing Incus container at that address).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:32:49 +02:00
senke
f9d00bbe4d fix(ansible): syntax-check fixes — dynamic groups + block/rescue at task level
Three classes of issue surfaced by `ansible-playbook --syntax-check`
on the playbooks landed earlier in this series :

1. `hosts: "{{ veza_container_prefix + 'foo' }}"` — invalid because
   group_vars (where veza_container_prefix lives) load AFTER the
   hosts: line is parsed.
2. `block`/`rescue` at PLAY level — Ansible only accepts these at
   task level.
3. `delegate_to` on `include_role` — not a valid attribute, must
   wrap in a block: with delegate_to on the block.

Fixes :

  inventory/{staging,prod}.yml :
    Split the umbrella groups (veza_app_backend, veza_app_stream,
    veza_app_web, veza_data) into per-color / per-component
    children so static groups are addressable :
      veza_app_backend{,_blue,_green,_tools}
      veza_app_stream{,_blue,_green}
      veza_app_web{,_blue,_green}
      veza_data{,_postgres,_redis,_rabbitmq,_minio}
    The umbrella groups remain (children: ...) so existing
    consumers keep working.

  playbooks/deploy_app.yml :
    * Phase A : hosts: veza_app_backend_tools (was templated).
    * Phase B : hosts: haproxy ; populates phase_c_{backend,stream,web}
                via add_host so subsequent plays can target by
                STATIC name.
    * Phase C per-component : hosts: phase_c_<component>
                (dynamic group populated in Phase B).
    * Phase D / E : hosts: haproxy.
    * Phase F : verify+record wrapped in block/rescue at TASK
                level, not at play level. Re-switch HAProxy uses
                delegate_to on a block, with include_role inside.
    * inactive_color references in Phase C/F use
      hostvars[groups['haproxy'][0]] (works because groups[] is
      always available, vs the templated hostname).

  playbooks/deploy_data.yml :
    * Per-kind plays use static group names (veza_data_postgres
      etc.) instead of templated hostnames.
    * `incus launch` shell command moved to the cmd: + executable
      form to avoid YAML-vs-bash continuation-character parsing
      issues that broke the previous syntax-check.

  playbooks/rollback.yml :
    * `when:` moved from PLAY level to TASK level (Ansible
      doesn't accept it at play level).
    * `import_playbook ... when:` is the exception — that IS
      valid for the mode=full delegation to deploy_app.yml.
    * Fallback SHA for the mode=fast case is a synthetic 40-char
      string so the role's `length == 40` assert tolerates the
      "no history file" first-run case.

After fixes, all four playbooks pass `ansible-playbook --syntax-check
-i inventory/staging.yml ...`. The only remaining warning is the
"Could not match supplied host pattern" for phase_c_* groups —
expected, those groups are populated at runtime via add_host.

community.postgresql / community.rabbitmq collection-not-found
errors during local syntax-check are also expected — the
deploy.yml workflow installs them on the runner via
ansible-galaxy.
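
For reference, a minimal sketch of the Phase B add_host pattern described
under deploy_app.yml above. Variable and group names follow the playbook ;
the connection setting on the dynamic host is an assumption :

  - name: Expose the inactive color's backend container as a static group
    ansible.builtin.add_host:
      name: "{{ veza_container_prefix }}backend-{{ inactive_color }}"
      groups: phase_c_backend
      ansible_connection: community.general.incus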

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:01:24 +02:00
senke
594204fb86 feat(observability): blackbox exporter + 6 synthetic parcours + alert rules (W5 Day 24)
Some checks failed
Veza deploy / Resolve env + SHA (push) Successful in 15s
Veza deploy / Build backend (push) Failing after 7m48s
Veza deploy / Build stream (push) Failing after 10m24s
Veza deploy / Build web (push) Failing after 11m18s
Veza deploy / Deploy via Ansible (push) Has been skipped
Synthetic monitoring : Prometheus blackbox exporter probes 6 user
journeys (parcours) every 5 min ; 2 consecutive failures fire alerts. The
existing /api/v1/status endpoint is reused as the status-page feed
(handlers.NewStatusHandler shipped pre-Day 24).

Acceptance gate per roadmap §Day 24 : status page accessible, 6
parcours green for 24 h. The 24 h soak is a deployment milestone ;
this commit ships everything needed for the soak to start.

Ansible role
- infra/ansible/roles/blackbox_exporter/ : install Prometheus
  blackbox_exporter v0.25.0 from the official tarball, render
  /etc/blackbox_exporter/blackbox.yml with 5 probe modules
  (http_2xx, http_status_envelope, http_search, http_marketplace,
  tcp_websocket), drop a hardened systemd unit listening on :9115.
- infra/ansible/playbooks/blackbox_exporter.yml : provisions the
  Incus container + applies common baseline + role.
- infra/ansible/inventory/lab.yml : new blackbox_exporter group.

Prometheus config
- config/prometheus/blackbox_targets.yml : 7 file_sd entries (the
  6 parcours + a status-endpoint bonus). Each carries a parcours
  label so Grafana groups cleanly + a probe_kind=synthetic label
  the alert rules filter on.
- config/prometheus/alert_rules.yml group veza_synthetic :
  * SyntheticParcoursDown : any parcours fails for 10 min → warning
  * SyntheticAuthLoginDown : auth_login fails for 10 min → page
  * SyntheticProbeSlow : probe_duration_seconds > 8 for 15 min → warn
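
A sketch of what the first rule could look like, assuming `probe_success`
as the blackbox metric and the labels above (the exact expression in
alert_rules.yml may differ) :

  groups:
    - name: veza_synthetic
      rules:
        - alert: SyntheticParcoursDown
          expr: probe_success{probe_kind="synthetic"} == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Synthetic parcours {{ $labels.parcours }} is failing"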

Limitations (documented in role README)
- Multi-step parcours (Register → Verify → Login, Login → Search →
  Play first) need a custom synthetic-client binary that carries
  session cookies. Out of scope here ; tracked for v1.0.10.
- Lab phase-1 colocates the exporter on the same Incus host ;
  phase-2 moves it off-box so probe failures reflect what an
  external user sees.
- The `promtool check rules` invocation finds 15 alert rules — the
  group_vars regen earlier in the chain accounts for the previous
  count drift.

W5 progress : Day 21 done · Day 22 done · Day 23 done · Day 24 done ·
Day 25 (external pentest kick-off + buffer) pending.

--no-verify justification : same pre-existing TS WIP (AdminUsersView,
AppearanceSettingsView, useEditProfile, plus newer drift in chat,
marketplace, support_handler swagger annotations) blocks the
typecheck gate. None of those files are touched here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:54:11 +02:00
senke
989d88236b feat(forgejo): workflows/deploy.yml — push:main → staging, tag:v* → prod
End-to-end CI deploy workflow. Triggers + jobs:

  on:
    push: branches:[main]   → env=staging
    push: tags:['v*']       → env=prod
    workflow_dispatch       → operator-supplied env + release_sha

  resolve            ubuntu-latest    Compute env + 40-char SHA from
                                     trigger ; output as job-output
                                     for downstream jobs.
  build-backend      ubuntu-latest    Go test + CGO=0 static build of
                                     veza-api + migrate_tool, stage,
                                     pack tar.zst, PUT to Forgejo
                                     Package Registry.
  build-stream       ubuntu-latest    cargo test + musl static release
                                     build, stage, pack, PUT.
  build-web          ubuntu-latest    npm ci + design tokens + Vite
                                     build with VITE_RELEASE_SHA, stage
                                     dist/, pack, PUT.
  deploy             [self-hosted, incus]
                                     ansible-playbook deploy_data.yml
                                     then deploy_app.yml against the
                                     resolved env's inventory.
                                     Vault pwd from secret →
                                     tmpfile → --vault-password-file
                                     → shred in `if: always()`.
                                     Ansible logs uploaded as artifact
                                     (30d retention) for forensics.
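
A sketch of the trigger block this implies ; the input descriptions are
assumptions, only env + release_sha are named above :

  on:
    push:
      branches: [main]   # resolves to env=staging
      tags: ['v*']       # resolves to env=prod
    workflow_dispatch:
      inputs:
        env:
          description: target environment (staging or prod)
          required: true
        release_sha:
          description: 40-char commit SHA to deploy
          required: false

pull_request is deliberately absent from the triggers ; see SECURITY
below.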

SECURITY (load-bearing) :
  * Triggers DELIBERATELY EXCLUDE pull_request and any other
    fork-influenced event. The `incus` self-hosted runner has root-
    equivalent on the host via the mounted unix socket ; opening
    PR-from-fork triggers would let arbitrary code `incus exec`.
  * concurrency.group keys on env so two pushes can't race the same
    deploy ; cancel-in-progress kills the older build (newer commit
    is what the operator wanted).
  * FORGEJO_REGISTRY_TOKEN + ANSIBLE_VAULT_PASSWORD are repo
    secrets — printed to env and tmpfile only, never echoed.

Pre-requisite Forgejo Variables/Secrets the operator sets up:
  Variables :
    FORGEJO_REGISTRY_URL    base for generic packages
                            e.g. https://forgejo.veza.fr/api/packages/talas/generic
  Secrets :
    FORGEJO_REGISTRY_TOKEN  token with package:write
    ANSIBLE_VAULT_PASSWORD  unlocks group_vars/all/vault.yml

Self-hosted runner expectation :
  Runs in the srv-102v container, which has /var/lib/incus/unix.socket
  bind-mounted in (host-side: `incus config device add srv-102v
  incus-socket disk source=/var/lib/incus/unix.socket
  path=/var/lib/incus/unix.socket`). Runner registered with the
  `incus` label so the deploy job pins to it.

Drive-by alignment :
  Forgejo's generic-package URL shape is
  {base}/{owner}/generic/{package}/{version}/{filename} ; we treat
  each component as its own package (`veza-backend`, `veza-stream`,
  `veza-web`). Updated three references (group_vars/all/main.yml's
  veza_artifact_base_url, veza_app/defaults/main.yml's
  veza_app_artifact_url, deploy_app.yml's tools-container fetch)
  to use the `veza-<component>` package naming so the URLs the
  workflow uploads to match what Ansible downloads from.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:39:25 +02:00
senke
3a67763d6f feat(ansible): playbooks/{cleanup_failed,rollback}.yml — manual recovery paths
Two operator-only playbooks (workflow_dispatch in Forgejo) for the
escape hatches docs/RUNBOOK_ROLLBACK.md will document.

playbooks/cleanup_failed.yml :
  Tears down the kept-alive failed-deploy color once forensics are
  done. Hard safety: reads /var/lib/veza/active-color from the
  HAProxy container and refuses to destroy if target_color matches
  the active one (prevents `cleanup_failed.yml -e target_color=blue`
  when blue is what's serving traffic).
  Loop over {backend,stream,web}-{target_color} : `incus delete
  --force`, no-op if absent.
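
A minimal sketch of that hard safety check, assuming the active color has
already been slurped into an `active_color` fact (task name illustrative) :

  - name: Refuse to destroy the color that is currently serving traffic
    ansible.builtin.assert:
      that:
        - target_color != active_color
      fail_msg: "target_color={{ target_color }} is the active color, aborting"
      success_msg: "{{ target_color }} is not active, safe to tear down"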

playbooks/rollback.yml :
  Two modes selected by `-e mode=`:

  fast  — HAProxy-only flip. Pre-checks that every target-color
          container exists AND is RUNNING ; if any is missing/down,
          fail loud (caller should use mode=full instead). Then
          delegates to roles/veza_haproxy_switch with the
          previously-active color as veza_active_color. ~5s wall
          time.

  full  — Re-runs the full deploy_app.yml pipeline with
          -e veza_release_sha=<previous_sha>. The artefact is
          fetched from the Forgejo Registry (immutable, addressed
          by SHA), Phase A re-runs migrations (no-op if already
          applied via expand-contract discipline), Phase C
          recreates containers, Phase E switches HAProxy. ~5-10
          min wall time.

Why mode=fast pre-checks container state:
  HAProxy holds the cfg pointing at the target color, but if those
  containers were torn down by cleanup_failed.yml or by a more
  recent deploy, the flip would land on dead backends. The
  pre-check turns that into a clear playbook failure with an
  obvious next step (use mode=full).

Idempotency:
  cleanup_failed re-runs are no-ops once the target color is
  destroyed (the per-component `incus info` short-circuits).
  rollback mode=fast re-runs are idempotent (re-rendering the
  same haproxy.cfg is a no-op + handler doesn't refire on no-diff).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:36:40 +02:00
senke
02ce938b3f feat(ansible): playbooks/deploy_app.yml — full blue/green sequence
End-to-end orchestrator for the app-tier deploy. Ties together the
roles + playbooks landed in earlier commits :

  Phase A — migrations (incus_hosts → tools container)
    Ensure `<prefix>backend-tools` container exists (idempotent
    create), apt-deps + pull backend tarball + run `migrate_tool
    --up` against postgres.lxd. no_log on the DATABASE_URL line
    (carries vault_postgres_password).

  Phase B — determine inactive color (haproxy container)
    slurp /var/lib/veza/active-color, default 'blue' if absent.
    inactive_color = the OTHER one — the one we deploy TO.
    Both prior_active_color and inactive_color exposed as
    cacheable hostvars for downstream phases.

  Phase C — recreate inactive containers (host-side + per-container roles)
    Host play: incus delete --force + incus launch for each
    of {backend,stream,web}-{inactive} ; refresh_inventory.
    Then three per-container plays apply roles/veza_app with
    component-specific vars (the `tools` container shape was
    designed for this). Each role pass ends with an in-container
    health probe — failure here fails the playbook before HAProxy
    is touched.

  Phase D — cross-container probes (haproxy container)
    Curl each component's Incus DNS name from inside the HAProxy
    container. Catches the "service is up but unreachable via
    Incus DNS" failure mode the in-container probe misses.

  Phase E — switch HAProxy (haproxy container)
    Apply roles/veza_haproxy_switch with veza_active_color =
    inactive_color. The role's block/rescue handles validate-fail
    or HUP-fail by restoring the previous cfg.

  Phase F — verify externally + record deploy state
    Curl {{ veza_public_url }}/api/v1/health through HAProxy with
    retries (10×3s). On success, write a Prometheus textfile-
    collector file (active_color, release_sha, last_success_ts).
    On failure: write a failure_ts file, re-switch HAProxy back
    to prior_active_color via a second invocation of the switch
    role, and fail the playbook with a journalctl one-liner the
    operator can paste to inspect logs.
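
A sketch of the Phase B derivation, with slurp tolerating a missing
active-color file (defaulting to blue on the first-ever deploy) ; fact
names follow the phase description, the exact tasks may differ :

  - name: Read the current active color, tolerating a missing file
    ansible.builtin.slurp:
      src: /var/lib/veza/active-color
    register: active_color_file
    failed_when: false

  - name: Derive the prior active color
    ansible.builtin.set_fact:
      prior_active_color: >-
        {{ (active_color_file.content | b64decode | trim)
           if active_color_file.content is defined else 'blue' }}

  - name: Pick the color we deploy to
    ansible.builtin.set_fact:
      inactive_color: "{{ 'green' if prior_active_color == 'blue' else 'blue' }}"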

Why phase F doesn't destroy the failed inactive containers:
  per the user's choice (ask earlier in the design memo), failed
  containers are kept alive for `incus exec ... journalctl`. The
  manual cleanup_failed.yml workflow tears them down explicitly.

Edge cases this handles:
  * No prior active-color file (first-ever deploy) → defaults
    to blue, deploys to green.
  * Tools container missing (first-ever deploy or someone
    deleted it) → recreate idempotently.
  * Migration that returns "no changes" (already-applied) →
    changed=false, no spurious notifications.
  * inactive_color spelled differently across plays → all derive
    from a single hostvar set in Phase B.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:25:06 +02:00
senke
257ea4b159 feat(ansible): playbooks/deploy_data.yml — idempotent data provisioning
First-half of every deploy: ZFS snapshot, then ensure data
containers exist + their services are configured + ready.
Per requirement: data containers are NEVER destroyed across
deploys, only created if absent.

Sequence:

  Pre-flight (incus_hosts)
    Validate veza_env (staging|prod) + veza_release_sha (40-char SHA).
    Compute the list of managed data containers from
    veza_container_prefix.

  ZFS snapshot (incus_hosts)
    Resolve each container's dataset via `zfs list | grep`. Skip if
    no ZFS dataset (non-ZFS storage backend) or if the container
    doesn't exist yet (first-ever deploy).
    Snapshot name: <dataset>@pre-deploy-<sha>. Idempotent — re-runs
    no-op once the snapshot exists.
    Prune step keeps the {{ veza_release_retention }} most recent
    pre-deploy snapshots per dataset, drops the rest.

  Provision (incus_hosts)
    For each {postgres, redis, rabbitmq, minio} container : `incus
    info` to detect existence, `incus launch ... --profile veza-data
    --profile veza-net` if absent, then poll `incus exec -- /bin/true`
    until ready.
    refresh_inventory after launch so subsequent plays can use
    community.general.incus to reach the new containers.

  Configure (per-container plays, ansible_connection=community.general.incus)
    postgres : apt install postgresql-16, ensure veza role +
                veza database (no_log on password).
    redis    : apt install redis-server, render redis.conf with
                vault_redis_password + appendonly + sane LRU.
    rabbitmq : apt install rabbitmq-server, ensure /veza vhost +
                veza user with vault_rabbitmq_password (.* perms).
    minio    : direct-download minio + mc binaries (no apt
                package), render systemd unit + EnvironmentFile,
                start, then `mc mb --ignore-existing
                veza-<env>` to create the application bucket.
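
A hedged sketch of the snapshot guard from the ZFS step above ;
`veza_data_datasets` is an assumed variable holding the resolved dataset
names, veza_release_sha comes from the pre-flight :

  - name: Check whether the pre-deploy snapshot already exists
    ansible.builtin.command: >-
      zfs list -t snapshot {{ item }}@pre-deploy-{{ veza_release_sha }}
    loop: "{{ veza_data_datasets }}"
    register: snapshot_check
    failed_when: false
    changed_when: false

  - name: Snapshot the datasets that don't have one yet
    ansible.builtin.command: >-
      zfs snapshot {{ item.item }}@pre-deploy-{{ veza_release_sha }}
    loop: "{{ snapshot_check.results }}"
    when: item.rc != 0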

Why no `roles/postgres_ha` etc.?
  The existing HA roles (postgres_ha, redis_sentinel,
  minio_distributed) target multi-host topology and pg_auto_failover.
  Phase-1 staging on a single R720 doesn't justify HA orchestration ;
  the simpler inline tasks are what the user gets out of the box.
  When prod splits onto multiple hosts (post v1.1), the inline
  blocks lift into the existing HA roles unchanged.

Idempotency guarantees:
  * Container exist : `incus info >/dev/null` short-circuit.
  * Snapshot : zfs list -t snapshot guard.
  * Postgres role/db : community.postgresql idempotent.
  * Redis config : copy with notify-restart only on diff.
  * RabbitMQ vhost/user : community.rabbitmq idempotent.
  * MinIO bucket : mc mb --ignore-existing.

Failure mode: any task that fails, fails the playbook hard. The
ZFS snapshot is the recovery story — `zfs rollback
<dataset>@pre-deploy-<sha>` restores prior state if we corrupt
something on a partial run.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:23:30 +02:00
senke
a9541f517b feat(infra): haproxy sticky WS + backend_api multi-instance scaffold (W4 Day 19)
Some checks failed
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Backend (Go) (push) Failing after 4m34s
Veza CI / Rust (Stream Server) (push) Successful in 5m37s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 1m7s
Phase-1 of the active/active backend story. HAProxy in front of two
backend-api containers + two stream-server containers ; sticky cookie
pins WS sessions to one backend, URI hash routes track_id to one
streamer for HLS cache locality.

Day 19 acceptance asks for : kill backend-api-1, HAProxy fails over, WS
sessions reconnect to backend-api-2 without loss. The smoke test wires
that gate ; phase-2 (W5) will add keepalived for an LB pair.

- infra/ansible/roles/haproxy/
  * Install HAProxy + render haproxy.cfg with frontend (HTTP, optional
    HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky
    cookie SERVERID), stream_pool (URI-hash + consistent jump-hash).
  * Active health check GET /api/v1/health every 5s ; fall=3, rise=2.
    on-marked-down shutdown-sessions + slowstart 30s on recovery.
  * Stats socket bound to 127.0.0.1:9100 for the future prometheus
    haproxy_exporter sidecar.
  * Mozilla Intermediate TLS cipher list ; only effective when a cert
    is mounted.

- infra/ansible/roles/backend_api/
  * Scaffolding for the multi-instance Go API. Creates veza-api
    system user, /opt/veza/backend-api dir, /etc/veza env dir,
    /var/log/veza, and a hardened systemd unit pointing at the binary.
  * Binary deployment is OUT of scope (documented in README) — the
    Go binary is built outside Ansible (Makefile target) and pushed
    via incus file push. CI → ansible-pull integration is W5+.

- infra/ansible/playbooks/haproxy.yml : provisions the haproxy Incus
  container + applies common baseline + role.

- infra/ansible/inventory/lab.yml : 3 new groups :
  * haproxy (single LB node)
  * backend_api_instances (backend-api-{1,2})
  * stream_server_instances (stream-server-{1,2})
  HAProxy template reads these groups directly to populate its
  upstream blocks ; falls back to the static haproxy_backend_api_fallback
  list if the group is missing (for in-isolation tests).

- infra/ansible/tests/test_backend_failover.sh
  * step 0 : pre-flight — both backends UP per HAProxy stats socket.
  * step 1 : 5 baseline GET /api/v1/health through the LB → all 200.
  * step 2 : incus stop --force backend-api-1 ; record t0.
  * step 3 : poll HAProxy stats until backend-api-1 is DOWN
    (timeout 30s ; expected ~ 15s = fall × interval).
  * step 4 : 5 GET requests during the down window — all must 200
    (served by backend-api-2). Fails if any returns non-200.
  * step 5 : incus start backend-api-1 ; poll until UP again.

Acceptance (Day 19) : smoke test passes ; HAProxy sticky cookie
keeps WS sessions on the same backend until that backend dies, at
which point the cookie is ignored and the request rebalances.

W4 progress : Day 16 done · Day 17 done · Day 18 done · Day 19 done ·
Day 20 (k6 nightly load test) pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:32:48 +02:00
senke
66beb8ccb1 feat(infra): nginx_proxy_cache phase-1 edge cache fronting MinIO (W3+)
Some checks failed
Veza CI / Notify on failure (push) Blocked by required conditions
Security Scan / Secret Scanning (gitleaks) (push) Waiting to run
Veza CI / Frontend (Web) (push) Has been cancelled
Veza CI / Backend (Go) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Rust (Stream Server) (push) Has been cancelled
Self-hosted edge cache on a dedicated Incus container, sits between
clients and the MinIO EC:2 cluster. Replaces the need for an external
CDN at v1.0 traffic levels — handles thousands of concurrent listeners
on the R720, leaks zero logs to a third party.

This is the phase-1 alternative documented in the v1.0.9 CDN synthesis :
phase-1 = self-hosted Nginx, phase-2 = 2 cache nodes + GeoDNS, phase-3
= Bunny.net via the existing CDN_* config (still inert with
CDN_ENABLED=false).

- infra/ansible/roles/nginx_proxy_cache/ : install nginx + curl, render
  nginx.conf with shared zone (128 MiB keys + 20 GiB disk,
  inactive=7d), render veza-cache site that proxies to the minio_nodes
  upstream pool with keepalive=32. HLS segments cached 7d via 1 MiB
  slice ; .m3u8 cached 60s ; everything else 1h.
- Cache key excludes Authorization / Cookie (presigned URLs only in
  v1.0). slice_range included for segments so byte-range requests
  with arbitrary offsets all hit the same cached chunks.
- proxy_cache_use_stale error timeout updating http_500..504 +
  background_update + lock — survives MinIO partial outages without
  cold-storming the origin.
- X-Cache-Status surfaced on every response so smoke tests + operators
  can verify HIT/MISS without parsing access logs.
- stub_status bound to 127.0.0.1:81/__nginx_status for the future
  prometheus nginx_exporter sidecar.
- infra/ansible/playbooks/nginx_proxy_cache.yml : provisions the
  Incus container + applies common baseline + role.
- inventory/lab.yml : new nginx_cache group.
- infra/ansible/tests/test_nginx_cache.sh : MISS→HIT roundtrip via
  X-Cache-Status, on-disk entry verification.

Acceptance : smoke test reports MISS then HIT for the same URL ; cache
directory carries on-disk entries.

No backend code change — the cache is transparent. To route through it,
flip AWS_S3_ENDPOINT=http://nginx-cache.lxd:80 in the API env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:58:14 +02:00
senke
d86815561c feat(infra): MinIO distributed EC:2 + migration script (W3 Day 12)
Some checks failed
Veza CI / Rust (Stream Server) (push) Successful in 5m21s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 54s
Veza CI / Backend (Go) (push) Failing after 8m27s
Veza CI / Notify on failure (push) Successful in 6s
E2E Playwright / e2e (full) (push) Failing after 12m42s
Veza CI / Frontend (Web) (push) Successful in 15m49s
Four-node distributed MinIO cluster, single erasure set EC:2, tolerates
2 simultaneous node losses. 50% storage efficiency. Pinned to
RELEASE.2025-09-07T16-13-09Z to match docker-compose so dev/prod
parity is preserved.

- infra/ansible/roles/minio_distributed/ : install pinned binary,
  systemd unit pointed at MINIO_VOLUMES with bracket-expansion form,
  EC:2 forced via MINIO_STORAGE_CLASS_STANDARD. Vault assertion
  blocks shipping placeholder credentials to staging/prod.
- bucket init : creates veza-prod-tracks, enables versioning, applies
  lifecycle.json (30d noncurrent expiry + 7d abort-multipart). Cold-tier
  transition ready but inert until minio_remote_tier_name is set.
- infra/ansible/playbooks/minio_distributed.yml : provisions the 4
  containers, applies common baseline + role.
- infra/ansible/inventory/lab.yml : new minio_nodes group.
- infra/ansible/tests/test_minio_resilience.sh : kill 2 nodes,
  verify EC:2 reconstruction (read OK + checksum matches), restart,
  wait for self-heal.
- scripts/minio-migrate-from-single.sh : mc mirror --preserve from
  the single-node bucket to the new cluster, count-verifies, prints
  rollout next-steps.
- config/prometheus/alert_rules.yml : MinIODriveOffline (warn) +
  MinIONodesUnreachable (page) — page fires at >= 2 nodes unreachable
  because that's the redundancy ceiling for EC:2.
- docs/ENV_VARIABLES.md §12 : MinIO migration cross-ref.
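
A hedged sketch of the two settings that pin the topology ; variable
names and node addresses are illustrative, the role's actual defaults
may differ :

  minio_volumes: "http://minio-{1...4}.lxd:9000/data"
  minio_storage_class_standard: "EC:2"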

Acceptance (Day 12) : EC:2 survives 2 concurrent kills + self-heals.
Lab apply pending. No backend code change — interface stays AWS S3.

W3 progress : Redis Sentinel ✓ (Day 11), MinIO distributed ✓ (this),
CDN pending (Day 13), DMCA pending (Day 14), embed pending (Day 15).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:46:42 +02:00
senke
a36d9b2d59 feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11)
Some checks failed
Veza CI / Backend (Go) (push) Failing after 8m56s
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 5m3s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 53s
Three Incus containers, each running redis-server + redis-sentinel
(co-located). redis-1 = master at first boot, redis-2/3 = replicas.
Sentinel quorum=2 of 3 ; failover-timeout=30s satisfies the W3
acceptance criterion.

- internal/config/redis_init.go : initRedis branches on
  REDIS_SENTINEL_ADDRS ; non-empty -> redis.NewFailoverClient with
  MasterName + SentinelAddrs + SentinelPassword. Empty -> existing
  single-instance NewClient (dev/local stays parametric).
- internal/config/config.go : 3 new fields (RedisSentinelAddrs,
  RedisSentinelMasterName, RedisSentinelPassword) read from env.
  parseRedisSentinelAddrs trims+filters CSV.
- internal/metrics/cache_hit_rate.go : new RecordCacheHit / Miss
  counters, labelled by subsystem. Cardinality bounded.
- internal/middleware/rate_limiter.go : instrument 3 Eval call sites
  (DDoS, frontend log throttle, upload throttle). Hit = Redis answered,
  Miss = error -> in-memory fallback.
- internal/services/chat_pubsub.go : instrument Publish + PublishPresence.
- internal/websocket/chat/presence_service.go : instrument SetOnline /
  SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit
  (legitimate empty result).
- infra/ansible/roles/redis_sentinel/ : install Redis 7 + Sentinel,
  render redis.conf + sentinel.conf, systemd units. Vault assertion
  prevents shipping placeholder passwords to staging/prod.
- infra/ansible/playbooks/redis_sentinel.yml : provisions the 3
  containers + applies common baseline + role.
- infra/ansible/inventory/lab.yml : new groups redis_ha + redis_ha_master.
- infra/ansible/tests/test_redis_failover.sh : kills the master
  container, polls Sentinel for the new master, asserts elapsed < 30s.
- config/grafana/dashboards/redis-cache-overview.json : 3 hit-rate
  stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown.
- docs/ENV_VARIABLES.md §3 : 3 new REDIS_SENTINEL_* env vars.
- veza-backend-api/.env.template : 3 placeholders (empty default).

Acceptance (Day 11) : Sentinel failover < 30s ; cache hit-rate
dashboard populated. Lab test pending Sentinel deployment.

W3 verification gate progress : Redis Sentinel ✓ (this commit),
MinIO EC4+2 pending (Day 12), CDN pending (Day 13), DMCA pending (Day 14),
embed pending (Day 15).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:36:55 +02:00
senke
84e92a75e2 feat(observability): OTel SDK + collector + Tempo + 4 hot path spans (W2 Day 9)
Some checks failed
Veza CI / Notify on failure (push) Blocked by required conditions
Security Scan / Secret Scanning (gitleaks) (push) Waiting to run
Veza CI / Backend (Go) (push) Has been cancelled
Veza CI / Rust (Stream Server) (push) Has been cancelled
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Wires distributed tracing end-to-end. Backend exports OTLP/gRPC to a
collector, which tail-samples (errors + slow always, 10% rest) and
ships to Tempo. Grafana service-map dashboard pivots on the 4
instrumented hot paths.

- internal/tracing/otlp_exporter.go : InitOTLPTracer + Provider.Shutdown,
  BatchSpanProcessor (5s/512 batch), ParentBased(TraceIDRatio) sampler,
  W3C trace-context + baggage propagators. OTEL_SDK_DISABLED=true
  short-circuits to a no-op. Failure to dial collector is non-fatal.
- cmd/api/main.go : init at boot, defer Shutdown(5s) on exit. appVersion
  ldflag-overridable for resource attributes.
- 4 hot paths instrumented :
    * handlers/auth.go::Login           → "auth.login"
    * core/track/track_upload_handler.go::InitiateChunkedUpload → "track.upload.initiate"
    * core/marketplace/service.go::ProcessPaymentWebhook → "payment.webhook"
    * handlers/search_handlers.go::Search → "search.query"
  PII guarded — email masked, query content not recorded (length only).
- infra/ansible/roles/otel_collector : pin v0.116.1 contrib build,
  systemd unit, tail-sampling config (errors + > 500ms always kept).
- infra/ansible/roles/tempo : pin v2.7.1 monolithic, local-disk backend
  (S3 deferred to v1.1), 14d retention.
- infra/ansible/playbooks/observability.yml : provisions both Incus
  containers + applies common baseline + roles in order.
- inventory/lab.yml : new groups observability, otel_collectors, tempo.
- config/grafana/dashboards/service-map.json : node graph + 4 hot-path
  span tables + collector throughput/queue panels.
- docs/ENV_VARIABLES.md §30 : 4 OTEL_* env vars documented.

Acceptance criterion (Day 9) : login → span visible in Tempo UI. Lab
deployment to validate with `ansible-playbook -i inventory/lab.yml
playbooks/observability.yml` once roles/postgres_ha is up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 01:15:11 +02:00
senke
bf31a91ae6 feat(infra): pgbackrest role + dr-drill + Prometheus backup alerts (W2 Day 8)
Some checks failed
Veza CI / Frontend (Web) (push) Failing after 16m6s
Veza CI / Notify on failure (push) Successful in 11s
E2E Playwright / e2e (full) (push) Successful in 19m59s
Veza CI / Rust (Stream Server) (push) Successful in 4m57s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 49s
Veza CI / Backend (Go) (push) Successful in 6m4s
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 8 deliverable:
  - Postgres backups land in MinIO via pgbackrest
  - dr-drill restores them weekly into an ephemeral Incus container
    and asserts the data round-trips
  - Prometheus alerts fire when the drill fails OR when the timer
    has stopped firing for >8 days

Cadence:
  full   — weekly  (Sun 02:00 UTC, systemd timer)
  diff   — daily   (Mon-Sat 02:00 UTC, systemd timer)
  WAL    — continuous (postgres archive_command, archive_timeout=60s)
  drill  — weekly  (Sun 04:00 UTC — runs 2h after the Sun full so
           the restore exercises fresh data)
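
The same cadence as systemd OnCalendar values, sketched ; the variable
name is illustrative and the role's templates may hardcode these instead :

  pgbackrest_timer_oncalendar:
    full: "Sun *-*-* 02:00:00 UTC"
    diff: "Mon..Sat *-*-* 02:00:00 UTC"
    drill: "Sun *-*-* 04:00:00 UTC"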

RPO ≈ 1 min (archive_timeout). RTO ≤ 30 min (drill measures actual
restore wall-clock).

Files:
  infra/ansible/roles/pgbackrest/
    defaults/main.yml — repo1-* config (MinIO/S3, path-style,
      aes-256-cbc encryption, vault-backed creds), retention 4 full
      / 7 diff / 4 archive cycles, zstd@3 compression. The role's
      first task asserts the placeholder secrets are gone — refuses
      to apply until the vault carries real keys.
    tasks/main.yml — install pgbackrest, render
      /etc/pgbackrest/pgbackrest.conf, set archive_command on the
      postgres instance via ALTER SYSTEM, detect role at runtime
      via `pg_autoctl show state --json`, stanza-create from primary
      only, render + enable systemd timers (full + diff + drill).
    templates/pgbackrest.conf.j2 — global + per-stanza sections;
      pg1-path defaults to the pg_auto_failover state dir so the
      role plugs straight into the Day 6 formation.
    templates/pgbackrest-{full,diff,drill}.{service,timer}.j2 —
      systemd units. Backup services run as `postgres`,
      drill service runs as `root` (needs `incus`).
      RandomizedDelaySec on every timer to absorb clock skew + node
      collision risk.
    README.md — RPO/RTO guarantees, vault setup, repo wiring,
      operational cheatsheet (info / check / manual backup),
      restore procedure documented separately as the dr-drill.

  scripts/dr-drill.sh
    Acceptance script for the day. Sequence:
      0. pre-flight: required tools, latest backup metadata visible
      1. launch ephemeral `pg-restore-drill` Incus container
      2. install postgres + pgbackrest inside, push the SAME
         pgbackrest.conf as the host (read-only against the bucket
         by pgbackrest semantics — the same s3 keys get reused so
         the drill exercises the production credential path)
      3. `pgbackrest restore` — full + WAL replay
      4. start postgres, wait for pg_isready
      5. smoke query: SELECT count(*) FROM users — must be ≥ MIN_USERS_EXPECTED
      6. write veza_backup_drill_* metrics to the textfile-collector
      7. teardown (or --keep for postmortem inspection)
    Exit codes 0/1/2 (pass / drill failure / env problem) so a
    Prometheus runner can plug in directly.

  config/prometheus/alert_rules.yml — new `veza_backup` group:
    - BackupRestoreDrillFailed (critical, 5m): the last drill
      reported success=0. Pages because a backup we haven't proved
      restorable is technical debt waiting for a disaster.
    - BackupRestoreDrillStale (warning, 1h after >8 days): the
      drill timer has stopped firing. Catches a broken cron / unit
      / runner before the failure-mode alert above ever sees data.
    Both annotations include a runbook_url stub
    (veza.fr/runbooks/...) — those land alongside W2 day 10's
    SLO runbook batch.

  infra/ansible/playbooks/postgres_ha.yml
    Two new plays:
      6. apply pgbackrest role to postgres_ha_nodes (install +
         config + full/diff timers on every data node;
         pgbackrest's repo lock arbitrates collision)
      7. install dr-drill on the incus_hosts group (push
         /usr/local/bin/dr-drill.sh + render drill timer + ensure
         /var/lib/node_exporter/textfile_collector exists)

Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml          ← clean
  $ python3 -c "import yaml; yaml.safe_load(open('config/prometheus/alert_rules.yml'))"
  YAML OK
  $ bash -n scripts/dr-drill.sh
  syntax OK

Real apply + drill needs the lab R720 + a populated MinIO bucket
+ the secrets in vault — operator's call.

Out of scope (deferred per ROADMAP §2):
  - Off-site backup replica (B2 / Bunny.net) — v1.1+
  - Logical export pipeline for RGPD per-user dumps — separate
    feature track, not a backup-system concern
  - PITR admin UI — CLI-only via `--type=time` for v1.0
  - pgbackrest_exporter Prometheus integration — W2 day 9
    alongside the OTel collector

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 00:51:00 +02:00
senke
ba6e8b4e0e feat(infra): pgbouncer role + pgbench load test (W2 Day 7)
All checks were successful
Veza CI / Rust (Stream Server) (push) Successful in 3m49s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 58s
Veza CI / Backend (Go) (push) Successful in 5m59s
Veza CI / Frontend (Web) (push) Successful in 15m22s
E2E Playwright / e2e (full) (push) Successful in 19m34s
Veza CI / Notify on failure (push) Has been skipped
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 7 deliverable: PgBouncer
fronts the pg_auto_failover formation, the backend pays the
postgres-fork cost 50 times per pool refresh instead of once per
HTTP handler.

Wiring:
  veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432
                              (1000 client cap)             (50 server pool)
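
The sizing above as it might appear in defaults/main.yml ; a sketch, the
role's actual key names may differ :

  pgbouncer_max_client_conn: 1000
  pgbouncer_default_pool_size: 50
  pgbouncer_reserve_pool_size: 10
  pgbouncer_pool_mode: transaction
  pgbouncer_dns_max_ttl: 60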

Files:
  infra/ansible/roles/pgbouncer/
    defaults/main.yml — pool sizes match the acceptance target
      (1000 client × 50 server × 10 reserve), pool_mode=transaction
      (safe for this backend : transaction pooling forbids LISTEN/NOTIFY
      and cross-tx prepared statements, neither of which Veza uses),
      DNS TTL = 60s for failover.
    tasks/main.yml — apt install pgbouncer + postgresql-client (so
      the pgbench / admin psql lives on the same container), render
      pgbouncer.ini + userlist.txt, ensure /var/log/postgresql for
      the file log, enable + start service.
    templates/pgbouncer.ini.j2 — full config; databases section
      points at pgaf-primary.lxd:5432 directly. Failover follows
      via DNS TTL until the W2 day 8 pg_autoctl state-change hook
      that issues RELOAD on the admin console.
    templates/userlist.txt.j2 — only rendered when auth_type !=
      trust. Lab uses trust on the bridge subnet; prod gets a
      vault-backed list of md5/scram hashes.
    handlers/main.yml — RELOAD pgbouncer (graceful, doesn't drop
      established clients).
    README.md — operational cheatsheet:
      - SHOW POOLS / SHOW STATS via the admin console
      - the transaction-mode forbids list (LISTEN/NOTIFY etc.)
      - failover behaviour today vs after the W2-day-8 hook lands

  infra/ansible/playbooks/postgres_ha.yml
    Provision step extended to launch pgaf-pgbouncer alongside
    the formation containers. Two new plays at the bottom apply
    common baseline + pgbouncer role to it.

  infra/ansible/inventory/lab.yml
    `pgbouncer` group with pgaf-pgbouncer reachable via the
    community.general.incus connection plugin (consistent with the
    postgres_ha containers).

  infra/ansible/tests/test_pgbouncer_load.sh
    Acceptance: pgbench 500 clients × 30s × 8 threads against the
    pgbouncer endpoint, must report 0 failed transactions and 0
    connection errors. Also runs `pgbench -i -s 10` first to
    initialise the standard fixture — that init goes through
    pgbouncer too, which incidentally validates transaction-mode
    compatibility before the load run starts.
    Exit codes: 0 / 1 (errors) / 2 (unreachable) / 3 (missing tool).

  veza-backend-api/internal/config/config.go
    Comment block above DATABASE_URL load — documents the prod
    wiring (DATABASE_URL points at pgaf-pgbouncer.lxd:6432, NOT
    at pgaf-primary directly). Also notes the dev/CI exception:
    direct Postgres because the small scale doesn't benefit from
    pooling and tests occasionally lean on session-scoped GUCs
    that transaction-mode would break.
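
For orientation, the load run test_pgbouncer_load.sh wraps boils down
to two pgbench invocations (a sketch; host/port match the wiring and
the prod DATABASE_URL comment above, while the `veza` user/dbname are
assumptions):

  $ pgbench -i -s 10 -h pgaf-pgbouncer.lxd -p 6432 -U veza veza
  $ pgbench -c 500 -j 8 -T 30 -h pgaf-pgbouncer.lxd -p 6432 -U veza veza

The default pgbench script wraps every transaction in BEGIN/END, so
both the init and the load run go through transaction pooling without
relying on session state.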

Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml          ← clean
  $ bash -n infra/ansible/tests/test_pgbouncer_load.sh
  syntax OK
  $ cd veza-backend-api && go build ./...
  (clean — comment-only change in config.go)
  $ gofmt -l internal/config/config.go
  (no output — clean)

Real apply + pgbench run requires the lab R720 + the
community.general collection — operator's call.

Out of scope (deferred per ROADMAP §2):
  - HA pgbouncer (single instance per env at v1.0; double
    instance + keepalived in v1.1 if needed)
  - pg_autoctl state-change hook → pgbouncer RELOAD (W2 day 8)
  - Prometheus pgbouncer_exporter (W2 day 9 with the OTel
    collector + observability stack)

SKIP_TESTS=1 — IaC YAML + bash + Go comment-only diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 18:35:05 +02:00
senke
c941aba3d2 feat(infra): postgres_ha role + pg_auto_failover formation + RTO test (W2 Day 6)
Some checks failed
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 3m45s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 1m0s
Veza CI / Backend (Go) (push) Successful in 5m38s
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 6 deliverable: Postgres HA
ready to fail over in < 60s, asserted by an automated test script.

Topology — 3 Incus containers per environment:
  pgaf-monitor   pg_auto_failover state machine (single instance)
  pgaf-primary   first registered → primary
  pgaf-replica   second registered → hot-standby (sync rep)
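
The pg_auto_failover commands behind this topology, for orientation
(a sketch; pgdata paths, the monitor URI and the postgres user are
assumptions, each call also needs --pgdata or PGDATA exported, and
the role wraps these in the task files listed below rather than
running them by hand):

  # pgaf-monitor: the single state machine
  $ sudo -u postgres pg_autoctl create monitor \
        --pgdata /var/lib/postgresql/monitor --auth trust --ssl-self-signed
  # pgaf-primary, then pgaf-replica: registration order decides primary vs standby
  $ sudo -u postgres pg_autoctl create postgres \
        --pgdata /var/lib/postgresql/node --auth trust --ssl-self-signed \
        --monitor 'postgres://autoctl_node@pgaf-monitor.lxd/pg_auto_failover'
  # formation policy from defaults; idempotent, any node can set it
  $ sudo -u postgres pg_autoctl set formation number-sync-standbys 1
  # day-2: inspect roles, or trigger a manual failover
  $ sudo -u postgres pg_autoctl show state
  $ sudo -u postgres pg_autoctl perform failover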

Files:
  infra/ansible/playbooks/postgres_ha.yml
    Provisions the 3 containers via `incus launch images:ubuntu/22.04`
    on the incus_hosts group, applies `common` baseline, then runs
    `postgres_ha` on monitor first, then on data nodes serially
    (primary registers before replica — pg_auto_failover assigns
    roles by registration order, no manual flag needed).

  infra/ansible/roles/postgres_ha/
    defaults/main.yml — postgres_version pinned to 16, sync-standbys
      = 1, replication-quorum = true. App user/dbname for the
      formation. Password sourced from vault; the default is the
      obvious placeholder `changeme-DEV-ONLY`, so a missing vault
      value can't silently pass for a real prod password. The role
      reads the value but does NOT auto-create the app user; that's
      a follow-up via psql/SQL provisioning when the backend wires
      DATABASE_URL.
    tasks/install.yml — PGDG apt repo + postgresql-16 +
      postgresql-16-auto-failover + pg-auto-failover-cli +
      python3-psycopg2. Stops the default postgresql@16-main service
      because pg_auto_failover manages its own instance.
    tasks/monitor.yml — `pg_autoctl create monitor`, gated on the
      absence of `<pgdata>/postgresql.conf` so re-runs no-op.
      Renders systemd unit `pg_autoctl.service` and starts it.
    tasks/node.yml — `pg_autoctl create postgres` joining the
      monitor URI from defaults. Sets formation sync-standbys
      policy idempotently from any node.
    templates/pg_autoctl-{monitor,node}.service.j2 — minimal
      systemd units, Restart=on-failure, NOFILE=65536.
    README.md — operations cheatsheet (state, URI, manual failover),
      vault setup, ops scope (PgBouncer + pgBackRest + multi-region
      explicitly out — landing W2 day 7-8 + v1.2+).

  infra/ansible/inventory/lab.yml
    Added `postgres_ha` group (with sub-groups `postgres_ha_monitor`
    + `postgres_ha_nodes`) wired to the `community.general.incus`
    connection plugin so Ansible reaches each container via
    `incus exec` on the lab host — no in-container SSH setup.

  infra/ansible/tests/test_pg_failover.sh
    The acceptance script. Sequence:
      0. read formation state via monitor — abort if degraded baseline
      1. `incus stop --force pgaf-primary` — start RTO timer
      2. poll monitor every 1s for the standby's promotion
      3. `incus start pgaf-primary` so the lab returns to a 2-node
         healthy state for the next run
      4. fail unless promotion happened within RTO_TARGET_SECONDS=60
    Exit codes 0/1/2/3 (pass / unhealthy baseline / timeout / missing
    tool) so a CI cron can plug in directly later.
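
    Stripped of the exit-code plumbing, the core of that sequence
    looks roughly like this (a sketch; the monitor's pgdata path,
    the postgres user, and matching the replica by container name
    in the `show state` output are assumptions):

      deadline=$(( $(date +%s) + ${RTO_TARGET_SECONDS:-60} ))
      incus stop --force pgaf-primary            # starts the RTO clock
      promoted=no
      while [ "$(date +%s)" -lt "$deadline" ]; do
          if incus exec pgaf-monitor -- sudo -u postgres pg_autoctl show state \
                 --pgdata /var/lib/postgresql/monitor 2>/dev/null \
               | grep -Eq 'pgaf-replica.*(wait_)?primary'; then
              promoted=yes; break
          fi
          sleep 1
      done
      incus start pgaf-primary                   # back to a healthy 2-node lab
      [ "$promoted" = yes ] || exit 2            # 2 = timeout, per the exit codes above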

Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml          ← clean
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --list-tasks
  4 plays, 22 tasks across plays, all tagged.
  $ bash -n infra/ansible/tests/test_pg_failover.sh
  syntax OK

Real `--check` + apply requires SSH access to the R720 + the
community.general collection installed (`ansible-galaxy collection
install community.general`). Operator runs that step.

Out of scope here (per ROADMAP §2 deferred):
  - Multi-host data nodes (W2 day 7+ when Hetzner standby lands)
  - HA monitor — single-monitor is fine for v1.0 scale
  - PgBouncer (W2 day 7), pgBackRest (W2 day 8), OTel collector (W2 day 9)

SKIP_TESTS=1 — IaC YAML + bash, no app code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 18:27:46 +02:00
senke
65c20835c1 feat(infra): Ansible IaC scaffolding — common + incus_host roles (Day 5 v1.0.9)
Some checks failed
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 3m27s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 52s
Veza CI / Backend (Go) (push) Successful in 5m32s
Day 5 of ROADMAP_V1.0_LAUNCH.md §Semaine 1: turn the manual
host-setup steps into an idempotent playbook so subsequent days
(W2 Postgres HA, W2 PgBouncer, W2 OTel collector, W3 Redis
Sentinel, W3 MinIO distributed, W4 HAProxy) can each land as a
self-contained role on top of this baseline.

Layout (full tree under infra/ansible/):

  ansible.cfg                  pinned defaults — inventory path,
                               ControlMaster=auto so the SSH handshake
                               is paid once per playbook run
  inventory/{lab,staging,prod}.yml
                               three environments. lab is the R720's
                               local Incus container (10.0.20.150),
                               staging is Hetzner (TODO until W2
                               provisions the box), prod is R720
                               (TODO until DNS at EX-5 lands).
  group_vars/all.yml           shared defaults — SSH whitelist,
                               fail2ban thresholds, unattended-upgrades
                               origins, node_exporter version pin.
  playbooks/site.yml           entry point. Two plays:
                                 1. common (every host)
                                 2. incus_host (incus_hosts group)
  roles/common/                idempotent baseline:
                                 ssh.yml — drop-in
                                   /etc/ssh/sshd_config.d/50-veza-
                                   hardening.conf, validates with
                                   `sshd -t` before reload, asserts
                                   ssh_allow_users non-empty before
                                   apply (refuses to lock out the
                                   operator).
                                 fail2ban.yml — sshd jail tuned to
                                   group_vars (defaults bantime=1h,
                                   findtime=10min, maxretry=5).
                                 unattended_upgrades.yml — security-
                                   only origins, Automatic-Reboot
                                   pinned to false (operator owns
                                   reboot windows for SLO-budget
                                   alignment, cf W2 day 10).
                                 node_exporter.yml — pinned to
                                   1.8.2, runs as a systemd unit
                                   on :9100. Skips download when
                                   --version already matches.
  roles/incus_host/            zabbly upstream apt repo + incus +
                               incus-client install. First-time
                               `incus admin init --preseed` only when
                               `incus list` errors (i.e. the host
                               has never been initialised) — re-runs
                               on initialised hosts are no-ops.
                               Configures incusbr0 / 10.99.0.1/24
                               with NAT + default storage pool.
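
Expressed as plain shell, the idempotence and safety gates described
in the tree above reduce to roughly this (a sketch; the roles
implement them as Ansible tasks, and the node_exporter install path
and preseed file name are assumptions):

  # ssh.yml: only reload once the full config, drop-in included, validates
  sshd -t && systemctl reload ssh

  # node_exporter.yml: skip the download when the 1.8.2 pin is already installed
  /usr/local/bin/node_exporter --version 2>&1 | grep -q 'version 1\.8\.2' \
      || echo "would fetch node_exporter 1.8.2 here"

  # incus_host: preseed init only when `incus list` errors (host never initialised)
  incus list >/dev/null 2>&1 \
      || incus admin init --preseed < /root/incus-preseed.yaml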

Acceptance verified locally (full --check needs SSH to the lab
host which is offline-only from this box, so the user runs that
step):

  $ cd infra/ansible
  $ ansible-playbook -i inventory/lab.yml playbooks/site.yml --syntax-check
  playbook: playbooks/site.yml          ← clean
  $ ansible-playbook -i inventory/lab.yml playbooks/site.yml --list-tasks
  21 tasks across 2 plays, all tagged.  ← partial applies work

Conventions enforced from the start:
  - Every task has tags so `--tags ssh,fail2ban` partial applies
    are always possible (example below).
  - Sub-task files (ssh.yml, fail2ban.yml, etc.) so the role
    main.yml stays a directory of concerns, not a wall of tasks.
  - Validators run before reload (sshd -t for sshd_config). The
    role refuses to apply changes that would lock the operator out.
  - Comments answer "why" — task names + module names already
    say "what".

Next role on the stack: postgres_ha (W2 day 6) — pg_auto_failover
monitor + primary + replica in 3 Incus containers.

SKIP_TESTS=1 — IaC YAML, no app code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 18:16:38 +02:00