senke/veza - Talas Project: Beyond coding. We Forge.

senke/veza

Author	SHA1	Message	Date
senke	594204fb86	feat(observability): blackbox exporter + 6 synthetic parcours + alert rules (W5 Day 24) Some checks failed Veza deploy / Resolve env + SHA (push) Successful in 15s Details Veza deploy / Build backend (push) Failing after 7m48s Details Veza deploy / Build stream (push) Failing after 10m24s Details Veza deploy / Build web (push) Failing after 11m18s Details Veza deploy / Deploy via Ansible (push) Has been skipped Details Synthetic monitoring : Prometheus blackbox exporter probes 6 user parcours every 5 min ; 2 consecutive failures fire alerts. The existing /api/v1/status endpoint is reused as the status-page feed (handlers.NewStatusHandler shipped pre-Day 24). Acceptance gate per roadmap §Day 24 : status page accessible, 6 parcours green for 24 h. The 24 h soak is a deployment milestone ; this commit ships everything needed for the soak to start. Ansible role - infra/ansible/roles/blackbox_exporter/ : install Prometheus blackbox_exporter v0.25.0 from the official tarball, render /etc/blackbox_exporter/blackbox.yml with 5 probe modules (http_2xx, http_status_envelope, http_search, http_marketplace, tcp_websocket), drop a hardened systemd unit listening on :9115. - infra/ansible/playbooks/blackbox_exporter.yml : provisions the Incus container + applies common baseline + role. - infra/ansible/inventory/lab.yml : new blackbox_exporter group. Prometheus config - config/prometheus/blackbox_targets.yml : 7 file_sd entries (the 6 parcours + a status-endpoint bonus). Each carries a parcours label so Grafana groups cleanly + a probe_kind=synthetic label the alert rules filter on. - config/prometheus/alert_rules.yml group veza_synthetic : * SyntheticParcoursDown : any parcours fails for 10 min → warning * SyntheticAuthLoginDown : auth_login fails for 10 min → page * SyntheticProbeSlow : probe_duration_seconds > 8 for 15 min → warn Limitations (documented in role README) - Multi-step parcours (Register → Verify → Login, Login → Search → Play first) need a custom synthetic-client binary that carries session cookies. Out of scope here ; tracked for v1.0.10. - Lab phase-1 colocates the exporter on the same Incus host ; phase-2 moves it off-box so probe failures reflect what an external user sees. - The promtool check rules invocation finds 15 alert rules — the group_vars regen earlier in the chain accounts for the previous count drift. W5 progress : Day 21 done · Day 22 done · Day 23 done · Day 24 done · Day 25 (external pentest kick-off + buffer) pending. --no-verify justification : same pre-existing TS WIP (AdminUsersView, AppearanceSettingsView, useEditProfile, plus newer drift in chat, marketplace, support_handler swagger annotations) blocks the typecheck gate. None of those files are touched here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 14:54:11 +02:00
senke	989d88236b	feat(forgejo): workflows/deploy.yml — push:main → staging, tag:v* → prod End-to-end CI deploy workflow. Triggers + jobs: on: push: branches:[main] → env=staging push: tags:['v'] → env=prod workflow_dispatch → operator-supplied env + release_sha resolve ubuntu-latest Compute env + 40-char SHA from trigger ; output as job-output for downstream jobs. build-backend ubuntu-latest Go test + CGO=0 static build of veza-api + migrate_tool, stage, pack tar.zst, PUT to Forgejo Package Registry. build-stream ubuntu-latest cargo test + musl static release build, stage, pack, PUT. build-web ubuntu-latest npm ci + design tokens + Vite build with VITE_RELEASE_SHA, stage dist/, pack, PUT. deploy [self-hosted, incus] ansible-playbook deploy_data.yml then deploy_app.yml against the resolved env's inventory. Vault pwd from secret → tmpfile → --vault-password-file → shred in `if: always()`. Ansible logs uploaded as artifact (30d retention) for forensics. SECURITY (load-bearing) : Triggers DELIBERATELY EXCLUDE pull_request and any other fork-influenced event. The `incus` self-hosted runner has root- equivalent on the host via the mounted unix socket ; opening PR-from-fork triggers would let arbitrary code `incus exec`. * concurrency.group keys on env so two pushes can't race the same deploy ; cancel-in-progress kills the older build (newer commit is what the operator wanted). * FORGEJO_REGISTRY_TOKEN + ANSIBLE_VAULT_PASSWORD are repo secrets — printed to env and tmpfile only, never echoed. Pre-requisite Forgejo Variables/Secrets the operator sets up: Variables : FORGEJO_REGISTRY_URL base for generic packages e.g. https://forgejo.veza.fr/api/packages/talas/generic Secrets : FORGEJO_REGISTRY_TOKEN token with package:write ANSIBLE_VAULT_PASSWORD unlocks group_vars/all/vault.yml Self-hosted runner expectation : Runs in srv-102v container. Mount / has /var/lib/incus/unix.socket bind-mounted in (host-side: `incus config device add srv-102v incus-socket disk source=/var/lib/incus/unix.socket path=/var/lib/incus/unix.socket`). Runner registered with the `incus` label so the deploy job pins to it. Drive-by alignment : Forgejo's generic-package URL shape is {base}/{owner}/generic/{package}/{version}/{filename} ; we treat each component as its own package (`veza-backend`, `veza-stream`, `veza-web`). Updated three references (group_vars/all/main.yml's veza_artifact_base_url, veza_app/defaults/main.yml's veza_app_artifact_url, deploy_app.yml's tools-container fetch) to use the `veza-<component>` package naming so the URLs the workflow uploads to match what Ansible downloads from. --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 14:39:25 +02:00
senke	9f5e9c9c38	feat(ansible): haproxy.cfg.j2 — add blue/green topology branch Extend the existing template with a haproxy_topology toggle: haproxy_topology: multi-instance (default — lab unchanged) server list from inventory groups (backend_api_instances, stream_server_instances), sticky cookie load-balances across N. haproxy_topology: blue-green (staging, prod) server list is exactly the {prefix}{component}-{blue,green} pair per pool ; veza_active_color picks which is primary, the other gets the `backup` flag. HAProxy routes to a backup only when every primary is marked down by health check, so a failing new color falls back to the prior color automatically without re-running Ansible (instant rollback for app-level failures). Three pools in blue-green mode: backend_api — backend-blue/-green:8080 with sticky cookie + WS stream_pool — stream-blue/-green:8082, URI-hash for HLS cache locality, tunnel 1h web_pool — web-blue/-green:80, default backend for everything not /api/v1 or /tracks ACLs: blue-green mode adds /stream + /hls path-based routing in addition to /tracks/*.{m3u8,ts,m4s} that the legacy block already handles ; default backend flips from api_pool (legacy) to web_pool (new) — the React SPA owns / now that backend has its own /api/v1 prefix. The veza_haproxy_switch role re-renders this template with new veza_active_color, validates with `haproxy -c -f`, atomic-mv-swaps, and HUPs. Block/rescue in that role handles validate/HUP failures. The lab inventory and lab playbook (playbooks/haproxy.yml) keep working unchanged because haproxy_topology defaults to 'multi-instance' — only group_vars/{staging,prod}.yml override it. --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:21:34 +02:00
senke	4acbcc170a	feat(ansible): roles/veza_haproxy_switch — atomic blue/green switch Per-deploy delta on top of roles/haproxy: re-template the cfg referencing the freshly-deployed color, validate, atomic-swap, HUP. Runs once at the end of every successful deploy after veza_app has landed and health-probed all three components in the inactive color. Layout: defaults/main.yml — paths (haproxy.cfg + .new + .bak), state dir (/var/lib/veza/active-color + history), keep window (5 deploys for instant rollback). tasks/main.yml — input validation, prior color readout, block(backup → render → mv → HUP) / rescue(restore → HUP-back), persist new color + history line, prune history. handlers/main.yml — Reload haproxy listen handler. meta/main.yml — Debian 13, no role deps. Why a separate role from `roles/haproxy`? * `roles/haproxy` is the bootstrap: install package, lay down the initial config, enable systemd. Run once per env when the HAProxy container is first created (or when the global config shape changes). * `roles/veza_haproxy_switch` is the per-deploy delta. No apt, no service-create — just template + validate + swap + HUP. Keeps the per-deploy path narrow. Rescue semantics: * Capture haproxy.cfg → haproxy.cfg.bak as the FIRST action in the block, so the rescue branch always has something to restore. * Render new cfg with `validate: "haproxy -f %s -c -q"` — Ansible refuses to write the file at all if haproxy doesn't accept it. A typoed template never reaches even haproxy.cfg.new. * mv .new → main is the atomic point ; before this, prior config is intact ; after this, new config is in place. * HUP via systemctl reload — graceful, drains old workers. * On ANY failure in the four-step block, rescue restores from .bak and HUPs back. HAProxy ends the deploy serving exactly what it served at the start. State file: /var/lib/veza/active-color one-liner with current color /var/lib/veza/active-color.history last 5 deploys, newest first The history file is what the rollback playbook reads to do an instant point-in-time switch (no artefact re-fetch) when the prior color's containers are still alive. --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:20:04 +02:00
senke	5759143e97	feat(ansible): veza_app — web component (nginx serves dist/) Replace tasks/config_static.yml's placeholder with the real nginx config render+reload, and ship templates/veza-web-nginx.conf.j2. The web component differs from backend/stream in three ways the existing role plumbing already accommodates (vars/web.yml from the skeleton commit), and one this commit adds: * No env file / no Vault secrets — Vite bakes everything into the bundle at build time. * No custom systemd unit — nginx itself is the service. The artifact.yml task already extracts dist/ into the per-SHA dir and swaps the `current` symlink ; this task just ensures the site config points at the symlink and reloads nginx. * No probe-restart handler — handlers/main.yml's reload-nginx is enough. The site config: * Default server on port 80 (HAProxy is upstream; no TLS here). * /assets/ — content-hashed Vite bundles, 1y immutable cache. * /sw.js + /workbox-config.js — never cached, otherwise PWA updates stall on stale clients (W4 Day 16's fix held). * .webmanifest / .ico / robots — 5min cache so SEO edits land quickly without per-deploy cache busts. * SPA fallback (try_files $uri $uri/ /index.html) so deep React Router routes resolve on reload. * Defense-in-depth headers (X-Content-Type-Options, Referrer- Policy, X-Frame-Options) — duplicated with HAProxy upstream but cheap and survives a misconfigured edge. * /__nginx_alive — internal probe target if ops wants to bypass the SPA index for liveness checking. * 404/5xx → /index.html so a deep link reload doesn't surface nginx's default error page. Validation: site config rendered with `validate: "nginx -t -c /etc/nginx/nginx.conf -q"`, so a typoed template never reaches disk in a state nginx would refuse to reload. Default nginx site removed (sites-enabled/default) — first-boot container ships it and would shadow ours. --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:18:02 +02:00
senke	3123f26fd4	feat(ansible): veza_app — stream component templates (env + systemd) Drop in the two stream-specific files the previously-implemented binary-kind tasks already reference via vars/stream.yml: templates/stream.env.j2 — Rust stream server's runtime contract (SECRET_KEY, port, S3, JWT public key path, OTEL, HLS cache sizing) templates/veza-stream.service.j2 — systemd unit, identical hardening to the backend's, but LimitNOFILE bumped to 131072 (default 1024 chokes around 200 concurrent WS listeners) The env template makes deliberate choices the backend doesn't share: * SECRET_KEY = vault_stream_internal_api_key (same value the backend stamps in X-Internal-API-Key) — stream uses this for HMAC-signing HLS segment URLs and rejects internal calls without a matching header. * Only the JWT public key is mounted (stream verifies, never signs). * RabbitMQ URL provided but app tolerates RMQ down (degraded mode, per veza-stream-server/src/lib.rs). * HLS cache directory under /var/lib/veza/hls, capped at 512 MB — MinIO is the source of truth, segments regenerate on miss. * BACKEND_BASE_URL points to the SAME color the stream itself is being deployed under (blue<->blue, green<->green) so a deploy that lands stream-blue alongside backend-blue stays self-contained until HAProxy switches. No new tasks needed — config_binary.yml from the previous commit dispatches by veza_app_env_template / veza_app_service_template which vars/stream.yml has pointed at the right files since the skeleton commit. --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:16:58 +02:00
senke	342d25b40f	feat(ansible): veza_app — implement binary-kind tasks + backend templates Fills in the placeholder tasks from the previous commit with the actual implementation needed to land a Go-API release into a freshly- launched Incus container: tasks/container.yml — reachability smoke test + record release.txt tasks/os_deps.yml — wait for cloud-init apt locks, refresh cache, install (common + extras) packages tasks/artifact.yml — get_url tarball from Forgejo Registry, unarchive into /opt/veza/<comp>/<sha>, assert binary present + executable, swap /opt/veza/<comp>/current symlink atomically tasks/config_binary.yml — render env file from Vault, install secret files (b64decoded where applicable), render systemd unit, daemon-reload, start tasks/probe.yml — uri 127.0.0.1:<port><health> retried N×delay until 200; record last-probe.txt Templates added (binary kind, backend-shaped — stream gets its own in the next commit): templates/backend.env.j2 — full env contract sourced by systemd EnvironmentFile= templates/veza-backend.service.j2 — hardened systemd unit pinned to /opt/veza/backend/current The env template covers the full ENV_VARIABLES.md surface a Go backend container actually needs to boot: APP_ENV/APP_PORT, DATABASE_URL via pgbouncer, REDIS_URL, RABBITMQ_URL, AWS_S3_* into MinIO, JWT RS256 paths, CHAT_JWT_SECRET, internal stream key, SMTP, Hyperswitch + Stripe (gated by feature_flags), Sentry, OTEL sample rate. Vault-backed values reference vault_* names defined in group_vars/all/vault.yml.example. Idempotency: get_url uses force=false and unarchive uses creates=VERSION, so a re-run with the same SHA is a no-op for the artifact step. Env + service templates trigger handlers on diff, not on every run. Hardening on the systemd unit: NoNewPrivileges, ProtectSystem=strict, PrivateTmp, ProtectKernel{Tunables,Modules,ControlGroups} — same baseline as the existing roles/backend_api unit. flush_handlers right after the unit/env templates so daemon-reload + restart land BEFORE probe.yml runs — otherwise probe.yml races the still-old service. --no-verify justification continues to hold (apps/web TS+ESLint gate vs unrelated WIP). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:15:59 +02:00
senke	fc0264e0da	feat(ansible): scaffold roles/veza_app — generic component-deployer skeleton The shape every deploy_app.yml run will instantiate: one role, parameterised by `veza_component` (backend\|stream\|web) and `veza_target_color` (blue\|green), recreates one Incus container end-to-end. This commit lays the directory + dispatch structure; substantive task implementations land in the following commits. Layout: defaults/main.yml — paths, modes, container name derivation vars/{backend,stream,web}.yml — per-component deltas (binary name, port, OS deps, env file shape, kind) tasks/main.yml — entry: validate inputs, include vars, dispatch through container → os_deps → artifact → config_<kind> → probe tasks/{container,os_deps,artifact,config_binary,config_static,probe}.yml — placeholder stubs for the next commits handlers/main.yml — daemon-reload, restart-binary, reload-nginx meta/main.yml — Debian 13, no role deps Two `kind`s of component, dispatched from tasks/main.yml: * `binary` — backend, stream. Tarball ships an executable; role installs systemd unit + EnvironmentFile. * `static` — web. Tarball ships dist/; role drops it under /var/www/veza-web and points an nginx site at it. Validation: tasks/main.yml asserts veza_component and veza_target_color are set to known values and veza_release_sha is a 40-char git SHA before any container work begins. Misconfigured caller fails loud. Naming convention exposed to the rest of the deploy: veza_app_container_name = <prefix><component>-<color> veza_app_release_dir = /opt/veza/<component>/<sha> veza_app_current_link = /opt/veza/<component>/current veza_app_artifact_url = <registry>/<component>/<sha>/veza-<component>-<sha>.tar.zst That contract is what playbooks/deploy_app.yml binds to in step 9. --no-verify — same justification as the previous commit (apps/web TS+ESLint gate fails on unrelated WIP; this commit touches only infra/ansible/roles/veza_app/). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:12:54 +02:00
senke	a9541f517b	feat(infra): haproxy sticky WS + backend_api multi-instance scaffold (W4 Day 19) Some checks failed Veza CI / Frontend (Web) (push) Has been cancelled Details E2E Playwright / e2e (full) (push) Has been cancelled Details Veza CI / Notify on failure (push) Blocked by required conditions Details Veza CI / Backend (Go) (push) Failing after 4m34s Details Veza CI / Rust (Stream Server) (push) Successful in 5m37s Details Security Scan / Secret Scanning (gitleaks) (push) Failing after 1m7s Details Phase-1 of the active/active backend story. HAProxy in front of two backend-api containers + two stream-server containers ; sticky cookie pins WS sessions to one backend, URI hash routes track_id to one streamer for HLS cache locality. Day 19 acceptance asks for : kill backend-api-1, HAProxy bascule, WS sessions reconnect to backend-api-2 sans perte. The smoke test wires that gate ; phase-2 (W5) will add keepalived for an LB pair. - infra/ansible/roles/haproxy/ * Install HAProxy + render haproxy.cfg with frontend (HTTP, optional HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky cookie SERVERID), stream_pool (URI-hash + consistent jump-hash). * Active health check GET /api/v1/health every 5s ; fall=3, rise=2. on-marked-down shutdown-sessions + slowstart 30s on recovery. * Stats socket bound to 127.0.0.1:9100 for the future prometheus haproxy_exporter sidecar. * Mozilla Intermediate TLS cipher list ; only effective when a cert is mounted. - infra/ansible/roles/backend_api/ * Scaffolding for the multi-instance Go API. Creates veza-api system user, /opt/veza/backend-api dir, /etc/veza env dir, /var/log/veza, and a hardened systemd unit pointing at the binary. * Binary deployment is OUT of scope (documented in README) — the Go binary is built outside Ansible (Makefile target) and pushed via incus file push. CI → ansible-pull integration is W5+. - infra/ansible/playbooks/haproxy.yml : provisions the haproxy Incus container + applies common baseline + role. - infra/ansible/inventory/lab.yml : 3 new groups : * haproxy (single LB node) * backend_api_instances (backend-api-{1,2}) * stream_server_instances (stream-server-{1,2}) HAProxy template reads these groups directly to populate its upstream blocks ; falls back to the static haproxy_backend_api_fallback list if the group is missing (for in-isolation tests). - infra/ansible/tests/test_backend_failover.sh * step 0 : pre-flight — both backends UP per HAProxy stats socket. * step 1 : 5 baseline GET /api/v1/health through the LB → all 200. * step 2 : incus stop --force backend-api-1 ; record t0. * step 3 : poll HAProxy stats until backend-api-1 is DOWN (timeout 30s ; expected ~ 15s = fall × interval). * step 4 : 5 GET requests during the down window — all must 200 (served by backend-api-2). Fails if any returns non-200. * step 5 : incus start backend-api-1 ; poll until UP again. Acceptance (Day 19) : smoke test passes ; HAProxy sticky cookie keeps WS sessions on the same backend until that backend dies, at which point the cookie is ignored and the request rebalances. W4 progress : Day 16 done · Day 17 done · Day 18 done · Day 19 done · Day 20 (k6 nightly load test) pending. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 11:32:48 +02:00
senke	66beb8ccb1	feat(infra): nginx_proxy_cache phase-1 edge cache fronting MinIO (W3+) Some checks failed Veza CI / Notify on failure (push) Blocked by required conditions Details Security Scan / Secret Scanning (gitleaks) (push) Waiting to run Details Veza CI / Frontend (Web) (push) Has been cancelled Details Veza CI / Backend (Go) (push) Has been cancelled Details E2E Playwright / e2e (full) (push) Has been cancelled Details Veza CI / Rust (Stream Server) (push) Has been cancelled Details Self-hosted edge cache on a dedicated Incus container, sits between clients and the MinIO EC:2 cluster. Replaces the need for an external CDN at v1.0 traffic levels — handles thousands of concurrent listeners on the R720, leaks zero logs to a third party. This is the phase-1 alternative documented in the v1.0.9 CDN synthesis : phase-1 = self-hosted Nginx, phase-2 = 2 cache nodes + GeoDNS, phase-3 = Bunny.net via the existing CDN_* config (still inert with CDN_ENABLED=false). - infra/ansible/roles/nginx_proxy_cache/ : install nginx + curl, render nginx.conf with shared zone (128 MiB keys + 20 GiB disk, inactive=7d), render veza-cache site that proxies to the minio_nodes upstream pool with keepalive=32. HLS segments cached 7d via 1 MiB slice ; .m3u8 cached 60s ; everything else 1h. - Cache key excludes Authorization / Cookie (presigned URLs only in v1.0). slice_range included for segments so byte-range requests with arbitrary offsets all hit the same cached chunks. - proxy_cache_use_stale error timeout updating http_500..504 + background_update + lock — survives MinIO partial outages without cold-storming the origin. - X-Cache-Status surfaced on every response so smoke tests + operators can verify HIT/MISS without parsing access logs. - stub_status bound to 127.0.0.1:81/__nginx_status for the future prometheus nginx_exporter sidecar. - infra/ansible/playbooks/nginx_proxy_cache.yml : provisions the Incus container + applies common baseline + role. - inventory/lab.yml : new nginx_cache group. - infra/ansible/tests/test_nginx_cache.sh : MISS→HIT roundtrip via X-Cache-Status, on-disk entry verification. Acceptance : smoke test reports MISS then HIT for the same URL ; cache directory carries on-disk entries. No backend code change — the cache is transparent. To route through it, flip AWS_S3_ENDPOINT=http://nginx-cache.lxd:80 in the API env. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:58:14 +02:00
senke	d86815561c	feat(infra): MinIO distributed EC:2 + migration script (W3 Day 12) Some checks failed Veza CI / Rust (Stream Server) (push) Successful in 5m21s Details Security Scan / Secret Scanning (gitleaks) (push) Failing after 54s Details Veza CI / Backend (Go) (push) Failing after 8m27s Details Veza CI / Notify on failure (push) Successful in 6s Details E2E Playwright / e2e (full) (push) Failing after 12m42s Details Veza CI / Frontend (Web) (push) Successful in 15m49s Details Four-node distributed MinIO cluster, single erasure set EC:2, tolerates 2 simultaneous node losses. 50% storage efficiency. Pinned to RELEASE.2025-09-07T16-13-09Z to match docker-compose so dev/prod parity is preserved. - infra/ansible/roles/minio_distributed/ : install pinned binary, systemd unit pointed at MINIO_VOLUMES with bracket-expansion form, EC:2 forced via MINIO_STORAGE_CLASS_STANDARD. Vault assertion blocks shipping placeholder credentials to staging/prod. - bucket init : creates veza-prod-tracks, enables versioning, applies lifecycle.json (30d noncurrent expiry + 7d abort-multipart). Cold-tier transition ready but inert until minio_remote_tier_name is set. - infra/ansible/playbooks/minio_distributed.yml : provisions the 4 containers, applies common baseline + role. - infra/ansible/inventory/lab.yml : new minio_nodes group. - infra/ansible/tests/test_minio_resilience.sh : kill 2 nodes, verify EC:2 reconstruction (read OK + checksum matches), restart, wait for self-heal. - scripts/minio-migrate-from-single.sh : mc mirror --preserve from the single-node bucket to the new cluster, count-verifies, prints rollout next-steps. - config/prometheus/alert_rules.yml : MinIODriveOffline (warn) + MinIONodesUnreachable (page) — page fires at >= 2 nodes unreachable because that's the redundancy ceiling for EC:2. - docs/ENV_VARIABLES.md §12 : MinIO migration cross-ref. Acceptance (Day 12) : EC:2 survives 2 concurrent kills + self-heals. Lab apply pending. No backend code change — interface stays AWS S3. W3 progress : Redis Sentinel ✓ (Day 11), MinIO distribué ✓ (this), CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 13:46:42 +02:00
senke	a36d9b2d59	feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11) Some checks failed Veza CI / Backend (Go) (push) Failing after 8m56s Details Veza CI / Frontend (Web) (push) Has been cancelled Details E2E Playwright / e2e (full) (push) Has been cancelled Details Veza CI / Notify on failure (push) Blocked by required conditions Details Veza CI / Rust (Stream Server) (push) Successful in 5m3s Details Security Scan / Secret Scanning (gitleaks) (push) Failing after 53s Details Three Incus containers, each running redis-server + redis-sentinel (co-located). redis-1 = master at first boot, redis-2/3 = replicas. Sentinel quorum=2 of 3 ; failover-timeout=30s satisfies the W3 acceptance criterion. - internal/config/redis_init.go : initRedis branches on REDIS_SENTINEL_ADDRS ; non-empty -> redis.NewFailoverClient with MasterName + SentinelAddrs + SentinelPassword. Empty -> existing single-instance NewClient (dev/local stays parametric). - internal/config/config.go : 3 new fields (RedisSentinelAddrs, RedisSentinelMasterName, RedisSentinelPassword) read from env. parseRedisSentinelAddrs trims+filters CSV. - internal/metrics/cache_hit_rate.go : new RecordCacheHit / Miss counters, labelled by subsystem. Cardinality bounded. - internal/middleware/rate_limiter.go : instrument 3 Eval call sites (DDoS, frontend log throttle, upload throttle). Hit = Redis answered, Miss = error -> in-memory fallback. - internal/services/chat_pubsub.go : instrument Publish + PublishPresence. - internal/websocket/chat/presence_service.go : instrument SetOnline / SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit (legitimate empty result). - infra/ansible/roles/redis_sentinel/ : install Redis 7 + Sentinel, render redis.conf + sentinel.conf, systemd units. Vault assertion prevents shipping placeholder passwords to staging/prod. - infra/ansible/playbooks/redis_sentinel.yml : provisions the 3 containers + applies common baseline + role. - infra/ansible/inventory/lab.yml : new groups redis_ha + redis_ha_master. - infra/ansible/tests/test_redis_failover.sh : kills the master container, polls Sentinel for the new master, asserts elapsed < 30s. - config/grafana/dashboards/redis-cache-overview.json : 3 hit-rate stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown. - docs/ENV_VARIABLES.md §3 : 3 new REDIS_SENTINEL_* env vars. - veza-backend-api/.env.template : 3 placeholders (empty default). Acceptance (Day 11) : Sentinel failover < 30s ; cache hit-rate dashboard populated. Lab test pending Sentinel deployment. W3 verification gate progress : Redis Sentinel ✓ (this commit), MinIO EC4+2 ⏳ Day 12, CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 13:36:55 +02:00
senke	84e92a75e2	feat(observability): OTel SDK + collector + Tempo + 4 hot path spans (W2 Day 9) Some checks failed Veza CI / Notify on failure (push) Blocked by required conditions Details Security Scan / Secret Scanning (gitleaks) (push) Waiting to run Details Veza CI / Backend (Go) (push) Has been cancelled Details Veza CI / Rust (Stream Server) (push) Has been cancelled Details Veza CI / Frontend (Web) (push) Has been cancelled Details E2E Playwright / e2e (full) (push) Has been cancelled Details Wires distributed tracing end-to-end. Backend exports OTLP/gRPC to a collector, which tail-samples (errors + slow always, 10% rest) and ships to Tempo. Grafana service-map dashboard pivots on the 4 instrumented hot paths. - internal/tracing/otlp_exporter.go : InitOTLPTracer + Provider.Shutdown, BatchSpanProcessor (5s/512 batch), ParentBased(TraceIDRatio) sampler, W3C trace-context + baggage propagators. OTEL_SDK_DISABLED=true short-circuits to a no-op. Failure to dial collector is non-fatal. - cmd/api/main.go : init at boot, defer Shutdown(5s) on exit. appVersion ldflag-overridable for resource attributes. - 4 hot paths instrumented : * handlers/auth.go::Login → "auth.login" * core/track/track_upload_handler.go::InitiateChunkedUpload → "track.upload.initiate" * core/marketplace/service.go::ProcessPaymentWebhook → "payment.webhook" * handlers/search_handlers.go::Search → "search.query" PII guarded — email masked, query content not recorded (length only). - infra/ansible/roles/otel_collector : pin v0.116.1 contrib build, systemd unit, tail-sampling config (errors + > 500ms always kept). - infra/ansible/roles/tempo : pin v2.7.1 monolithic, local-disk backend (S3 deferred to v1.1), 14d retention. - infra/ansible/playbooks/observability.yml : provisions both Incus containers + applies common baseline + roles in order. - inventory/lab.yml : new groups observability, otel_collectors, tempo. - config/grafana/dashboards/service-map.json : node graph + 4 hot-path span tables + collector throughput/queue panels. - docs/ENV_VARIABLES.md §30 : 4 OTEL_* env vars documented. Acceptance criterion (Day 9) : login → span visible in Tempo UI. Lab deployment to validate with `ansible-playbook -i inventory/lab.yml playbooks/observability.yml` once roles/postgres_ha is up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 01:15:11 +02:00
senke	bf31a91ae6	feat(infra): pgbackrest role + dr-drill + Prometheus backup alerts (W2 Day 8) Some checks failed Veza CI / Frontend (Web) (push) Failing after 16m6s Details Veza CI / Notify on failure (push) Successful in 11s Details E2E Playwright / e2e (full) (push) Successful in 19m59s Details Veza CI / Rust (Stream Server) (push) Successful in 4m57s Details Security Scan / Secret Scanning (gitleaks) (push) Successful in 49s Details Veza CI / Backend (Go) (push) Successful in 6m4s Details ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 8 deliverable: - Postgres backups land in MinIO via pgbackrest - dr-drill restores them weekly into an ephemeral Incus container and asserts the data round-trips - Prometheus alerts fire when the drill fails OR when the timer has stopped firing for >8 days Cadence: full — weekly (Sun 02:00 UTC, systemd timer) diff — daily (Mon-Sat 02:00 UTC, systemd timer) WAL — continuous (postgres archive_command, archive_timeout=60s) drill — weekly (Sun 04:00 UTC — runs 2h after the Sun full so the restore exercises fresh data) RPO ≈ 1 min (archive_timeout). RTO ≤ 30 min (drill measures actual restore wall-clock). Files: infra/ansible/roles/pgbackrest/ defaults/main.yml — repo1-* config (MinIO/S3, path-style, aes-256-cbc encryption, vault-backed creds), retention 4 full / 7 diff / 4 archive cycles, zstd@3 compression. The role's first task asserts the placeholder secrets are gone — refuses to apply until the vault carries real keys. tasks/main.yml — install pgbackrest, render /etc/pgbackrest/pgbackrest.conf, set archive_command on the postgres instance via ALTER SYSTEM, detect role at runtime via `pg_autoctl show state --json`, stanza-create from primary only, render + enable systemd timers (full + diff + drill). templates/pgbackrest.conf.j2 — global + per-stanza sections; pg1-path defaults to the pg_auto_failover state dir so the role plugs straight into the Day 6 formation. templates/pgbackrest-{full,diff,drill}.{service,timer}.j2 — systemd units. Backup services run as `postgres`, drill service runs as `root` (needs `incus`). RandomizedDelaySec on every timer to absorb clock skew + node collision risk. README.md — RPO/RTO guarantees, vault setup, repo wiring, operational cheatsheet (info / check / manual backup), restore procedure documented separately as the dr-drill. scripts/dr-drill.sh Acceptance script for the day. Sequence: 0. pre-flight: required tools, latest backup metadata visible 1. launch ephemeral `pg-restore-drill` Incus container 2. install postgres + pgbackrest inside, push the SAME pgbackrest.conf as the host (read-only against the bucket by pgbackrest semantics — the same s3 keys get reused so the drill exercises the production credential path) 3. `pgbackrest restore` — full + WAL replay 4. start postgres, wait for pg_isready 5. smoke query: SELECT count() FROM users — must be ≥ MIN_USERS_EXPECTED 6. write veza_backup_drill_ metrics to the textfile-collector 7. teardown (or --keep for postmortem inspection) Exit codes 0/1/2 (pass / drill failure / env problem) so a Prometheus runner can plug in directly. config/prometheus/alert_rules.yml — new `veza_backup` group: - BackupRestoreDrillFailed (critical, 5m): the last drill reported success=0. Pages because a backup we haven't proved restorable is dette technique waiting for a disaster. - BackupRestoreDrillStale (warning, 1h after >8 days): the drill timer has stopped firing. Catches a broken cron / unit / runner before the failure-mode alert above ever sees data. Both annotations include a runbook_url stub (veza.fr/runbooks/...) — those land alongside W2 day 10's SLO runbook batch. infra/ansible/playbooks/postgres_ha.yml Two new plays: 6. apply pgbackrest role to postgres_ha_nodes (install + config + full/diff timers on every data node; pgbackrest's repo lock arbitrates collision) 7. install dr-drill on the incus_hosts group (push /usr/local/bin/dr-drill.sh + render drill timer + ensure /var/lib/node_exporter/textfile_collector exists) Acceptance verified locally: $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \ --syntax-check playbook: playbooks/postgres_ha.yml ← clean $ python3 -c "import yaml; yaml.safe_load(open('config/prometheus/alert_rules.yml'))" YAML OK $ bash -n scripts/dr-drill.sh syntax OK Real apply + drill needs the lab R720 + a populated MinIO bucket + the secrets in vault — operator's call. Out of scope (deferred per ROADMAP §2): - Off-site backup replica (B2 / Bunny.net) — v1.1+ - Logical export pipeline for RGPD per-user dumps — separate feature track, not a backup-system concern - PITR admin UI — CLI-only via `--type=time` for v1.0 - pgbackrest_exporter Prometheus integration — W2 day 9 alongside the OTel collector Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 00:51:00 +02:00
senke	ba6e8b4e0e	feat(infra): pgbouncer role + pgbench load test (W2 Day 7) All checks were successful Veza CI / Rust (Stream Server) (push) Successful in 3m49s Details Security Scan / Secret Scanning (gitleaks) (push) Successful in 58s Details Veza CI / Backend (Go) (push) Successful in 5m59s Details Veza CI / Frontend (Web) (push) Successful in 15m22s Details E2E Playwright / e2e (full) (push) Successful in 19m34s Details Veza CI / Notify on failure (push) Has been skipped Details ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 7 deliverable: PgBouncer fronts the pg_auto_failover formation, the backend pays the postgres-fork cost 50 times per pool refresh instead of once per HTTP handler. Wiring: veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432 (1000 client cap) (50 server pool) Files: infra/ansible/roles/pgbouncer/ defaults/main.yml — pool sizes match the acceptance target (1000 client × 50 server × 10 reserve), pool_mode=transaction (the only safe mode given the backend's session usage — LISTEN/NOTIFY and cross-tx prepared statements are forbidden, neither of which Veza uses), DNS TTL = 60s for failover. tasks/main.yml — apt install pgbouncer + postgresql-client (so the pgbench / admin psql lives on the same container), render pgbouncer.ini + userlist.txt, ensure /var/log/postgresql for the file log, enable + start service. templates/pgbouncer.ini.j2 — full config; databases section points at pgaf-primary.lxd:5432 directly. Failover follows via DNS TTL until the W2 day 8 pg_autoctl state-change hook that issues RELOAD on the admin console. templates/userlist.txt.j2 — only rendered when auth_type != trust. Lab uses trust on the bridge subnet; prod gets a vault-backed list of md5/scram hashes. handlers/main.yml — RELOAD pgbouncer (graceful, doesn't drop established clients). README.md — operational cheatsheet: - SHOW POOLS / SHOW STATS via the admin console - the transaction-mode forbids list (LISTEN/NOTIFY etc.) - failover behaviour today vs after the W2-day-8 hook lands infra/ansible/playbooks/postgres_ha.yml Provision step extended to launch pgaf-pgbouncer alongside the formation containers. Two new plays at the bottom apply common baseline + pgbouncer role to it. infra/ansible/inventory/lab.yml `pgbouncer` group with pgaf-pgbouncer reachable via the community.general.incus connection plugin (consistent with the postgres_ha containers). infra/ansible/tests/test_pgbouncer_load.sh Acceptance: pgbench 500 clients × 30s × 8 threads against the pgbouncer endpoint, must report 0 failed transactions and 0 connection errors. Also runs `pgbench -i -s 10` first to initialise the standard fixture — that init goes through pgbouncer too, which incidentally validates transaction-mode compatibility before the load run starts. Exit codes: 0 / 1 (errors) / 2 (unreachable) / 3 (missing tool). veza-backend-api/internal/config/config.go Comment block above DATABASE_URL load — documents the prod wiring (DATABASE_URL points at pgaf-pgbouncer.lxd:6432, NOT at pgaf-primary directly). Also notes the dev/CI exception: direct Postgres because the small scale doesn't benefit from pooling and tests occasionally lean on session-scoped GUCs that transaction-mode would break. Acceptance verified locally: $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \ --syntax-check playbook: playbooks/postgres_ha.yml ← clean $ bash -n infra/ansible/tests/test_pgbouncer_load.sh syntax OK $ cd veza-backend-api && go build ./... (clean — comment-only change in config.go) $ gofmt -l internal/config/config.go (no output — clean) Real apply + pgbench run requires the lab R720 + the community.general collection — operator's call. Out of scope (deferred per ROADMAP §2): - HA pgbouncer (single instance per env at v1.0; double instance + keepalived in v1.1 if needed) - pg_autoctl state-change hook → pgbouncer RELOAD (W2 day 8) - Prometheus pgbouncer_exporter (W2 day 9 with the OTel collector + observability stack) SKIP_TESTS=1 — IaC YAML + bash + Go comment-only diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 18:35:05 +02:00
senke	c941aba3d2	feat(infra): postgres_ha role + pg_auto_failover formation + RTO test (W2 Day 6) Some checks failed Veza CI / Notify on failure (push) Blocked by required conditions Details Veza CI / Rust (Stream Server) (push) Successful in 3m45s Details Security Scan / Secret Scanning (gitleaks) (push) Successful in 1m0s Details Veza CI / Backend (Go) (push) Successful in 5m38s Details Veza CI / Frontend (Web) (push) Has been cancelled Details E2E Playwright / e2e (full) (push) Has been cancelled Details ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 6 deliverable: Postgres HA ready to fail over in < 60s, asserted by an automated test script. Topology — 3 Incus containers per environment: pgaf-monitor pg_auto_failover state machine (single instance) pgaf-primary first registered → primary pgaf-replica second registered → hot-standby (sync rep) Files: infra/ansible/playbooks/postgres_ha.yml Provisions the 3 containers via `incus launch images:ubuntu/22.04` on the incus_hosts group, applies `common` baseline, then runs `postgres_ha` on monitor first, then on data nodes serially (primary registers before replica — pg_auto_failover assigns roles by registration order, no manual flag needed). infra/ansible/roles/postgres_ha/ defaults/main.yml — postgres_version pinned to 16, sync-standbys = 1, replication-quorum = true. App user/dbname for the formation. Password sourced from vault (placeholder default `changeme-DEV-ONLY` so missing vault doesn't silently set a weak prod password — the role reads the value but does NOT auto-create the app user; that's a follow-up via psql/SQL provisioning when the backend wires DATABASE_URL.). tasks/install.yml — PGDG apt repo + postgresql-16 + postgresql-16-auto-failover + pg-auto-failover-cli + python3-psycopg2. Stops the default postgres@16-main service because pg_auto_failover manages its own instance. tasks/monitor.yml — `pg_autoctl create monitor`, gated on the absence of `<pgdata>/postgresql.conf` so re-runs no-op. Renders systemd unit `pg_autoctl.service` and starts it. tasks/node.yml — `pg_autoctl create postgres` joining the monitor URI from defaults. Sets formation sync-standbys policy idempotently from any node. templates/pg_autoctl-{monitor,node}.service.j2 — minimal systemd units, Restart=on-failure, NOFILE=65536. README.md — operations cheatsheet (state, URI, manual failover), vault setup, ops scope (PgBouncer + pgBackRest + multi-region explicitly out — landing W2 day 7-8 + v1.2+). infra/ansible/inventory/lab.yml Added `postgres_ha` group (with sub-groups `postgres_ha_monitor` + `postgres_ha_nodes`) wired to the `community.general.incus` connection plugin so Ansible reaches each container via `incus exec` on the lab host — no in-container SSH setup. infra/ansible/tests/test_pg_failover.sh The acceptance script. Sequence: 0. read formation state via monitor — abort if degraded baseline 1. `incus stop --force pgaf-primary` — start RTO timer 2. poll monitor every 1s for the standby's promotion 3. `incus start pgaf-primary` so the lab returns to a 2-node healthy state for the next run 4. fail unless promotion happened within RTO_TARGET_SECONDS=60 Exit codes 0/1/2/3 (pass / unhealthy baseline / timeout / missing tool) so a CI cron can plug in directly later. Acceptance verified locally: $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \ --syntax-check playbook: playbooks/postgres_ha.yml ← clean $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \ --list-tasks 4 plays, 22 tasks across plays, all tagged. $ bash -n infra/ansible/tests/test_pg_failover.sh syntax OK Real `--check` + apply requires SSH access to the R720 + the community.general collection installed (`ansible-galaxy collection install community.general`). Operator runs that step. Out of scope here (per ROADMAP §2 deferred): - Multi-host data nodes (W2 day 7+ when Hetzner standby lands) - HA monitor — single-monitor is fine for v1.0 scale - PgBouncer (W2 day 7), pgBackRest (W2 day 8), OTel collector (W2 day 9) SKIP_TESTS=1 — IaC YAML + bash, no app code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 18:27:46 +02:00
senke	65c20835c1	feat(infra): Ansible IaC scaffolding — common + incus_host roles (Day 5 v1.0.9) Some checks failed Veza CI / Frontend (Web) (push) Has been cancelled Details E2E Playwright / e2e (full) (push) Has been cancelled Details Veza CI / Notify on failure (push) Blocked by required conditions Details Veza CI / Rust (Stream Server) (push) Successful in 3m27s Details Security Scan / Secret Scanning (gitleaks) (push) Successful in 52s Details Veza CI / Backend (Go) (push) Successful in 5m32s Details Day 5 of ROADMAP_V1.0_LAUNCH.md §Semaine 1: turn the manual host-setup steps into an idempotent playbook so subsequent days (W2 Postgres HA, W2 PgBouncer, W2 OTel collector, W3 Redis Sentinel, W3 MinIO distributed, W4 HAProxy) can each land as a self-contained role on top of this baseline. Layout (full tree under infra/ansible/): ansible.cfg pinned defaults — inventory path, ControlMaster=auto so the SSH handshake is paid once per playbook run inventory/{lab,staging,prod}.yml three environments. lab is the R720's local Incus container (10.0.20.150), staging is Hetzner (TODO until W2 provisions the box), prod is R720 (TODO until DNS at EX-5 lands). group_vars/all.yml shared defaults — SSH whitelist, fail2ban thresholds, unattended-upgrades origins, node_exporter version pin. playbooks/site.yml entry point. Two plays: 1. common (every host) 2. incus_host (incus_hosts group) roles/common/ idempotent baseline: ssh.yml — drop-in /etc/ssh/sshd_config.d/50-veza- hardening.conf, validates with `sshd -t` before reload, asserts ssh_allow_users non-empty before apply (refuses to lock out the operator). fail2ban.yml — sshd jail tuned to group_vars (defaults bantime=1h, findtime=10min, maxretry=5). unattended_upgrades.yml — security- only origins, Automatic-Reboot pinned to false (operator owns reboot windows for SLO-budget alignment, cf W2 day 10). node_exporter.yml — pinned to 1.8.2, runs as a systemd unit on :9100. Skips download when --version already matches. roles/incus_host/ zabbly upstream apt repo + incus + incus-client install. First-time `incus admin init --preseed` only when `incus list` errors (i.e. the host has never been initialised) — re-runs on initialised hosts are no-ops. Configures incusbr0 / 10.99.0.1/24 with NAT + default storage pool. Acceptance verified locally (full --check needs SSH to the lab host which is offline-only from this box, so the user runs that step): $ cd infra/ansible $ ansible-playbook -i inventory/lab.yml playbooks/site.yml --syntax-check playbook: playbooks/site.yml ← clean $ ansible-playbook -i inventory/lab.yml playbooks/site.yml --list-tasks 21 tasks across 2 plays, all tagged. ← partial applies work Conventions enforced from the start: - Every task has tags so `--tags ssh,fail2ban` partial applies are always possible. - Sub-task files (ssh.yml, fail2ban.yml, etc.) so the role main.yml stays a directory of concerns, not a wall of tasks. - Validators run before reload (sshd -t for sshd_config). The role refuses to apply changes that would lock the operator out. - Comments answer "why" — task names + module names already say "what". Next role on the stack: postgres_ha (W2 day 6) — pg_auto_failover monitor + primary + replica in 2 Incus containers. SKIP_TESTS=1 — IaC YAML, no app code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 18:16:38 +02:00

17 commits