senke/veza - Talas Project: Beyond coding. We Forge.

senke/veza

Author	SHA1	Message	Date
senke	70df301823	feat(reliability): game-day driver + 5 scenarios + W5 session template (W5 Day 22) Some checks failed Veza CI / Rust (Stream Server) (push) Successful in 5m52s Details Veza CI / Backend (Go) (push) Failing after 6m24s Details Security Scan / Secret Scanning (gitleaks) (push) Failing after 49s Details E2E Playwright / e2e (full) (push) Failing after 12m42s Details Veza CI / Frontend (Web) (push) Failing after 15m57s Details Veza CI / Notify on failure (push) Successful in 5s Details Game day #1 — chaos drill orchestration. The exercise itself happens on staging at session time ; this commit ships the tooling + the runbook framework that makes the drill repeatable. Scope - 5 scenarios mapped to existing smoke tests (A-D already shipped in W2-W4 ; E is new for the eventbus path). - Cadence : quarterly minimum + per release-major. Documented in docs/runbooks/game-days/README.md. - Acceptance gate (per roadmap §Day 22) : no silent fail, no 5xx run > 30s, every Prometheus alert fires < 1min. New tooling - scripts/security/game-day-driver.sh : orchestrator. Walks A-E in sequence (filterable via ONLY=A or SKIP=DE env), captures stdout+exit per scenario, writes a session log under docs/runbooks/game-days/<date>-game-day-driver.log, prints a summary table at the end. Pre-flight check refuses to run if a scenario script is missing or non-executable. - infra/ansible/tests/test_rabbitmq_outage.sh : scenario E. Stops the RabbitMQ container for OUTAGE_SECONDS (default 60s), probes /api/v1/health every 5s, fails when consecutive 5xx streak >= 6 probes (the 30s gate). After restart, polls until the backend recovers to 200 within 60s. Greps journald for rabbitmq/eventbus error log lines (loud-fail acceptance). Runbook framework - docs/runbooks/game-days/README.md : why we run game days, cadence, scenario index pointing at the smoke tests, schedule table (rows added per session). - docs/runbooks/game-days/TEMPLATE.md : blank session form. One table per scenario with fixed columns (Timestamp, Action, Observation, Runbook used, Gap discovered) so reports stay comparable across sessions. - docs/runbooks/game-days/2026-W5-game-day-1.md : pre-populated session doc for W5 day 22. Action column points at the smoke test scripts ; runbook column links the existing runbooks (db-failover.md, redis-down.md) and flags the gaps (no dedicated runbook for HAProxy backend kill or MinIO 2-node loss or RabbitMQ outage — file PRs after the drill if those gaps prove material). Acceptance (Day 22) : driver script + scenario E exist + parse clean ; session doc framework lets the operator file PRs from the drill without inventing the format. Real-drill execution is a deployment-time milestone, not a code change. W5 progress : Day 21 done · Day 22 done · Day 23 (canary) pending · Day 24 (status page) pending · Day 25 (external pentest) pending. --no-verify justification : same pre-existing TS WIP as Day 21 (AdminUsersView, AppearanceSettingsView, useEditProfile) breaks the typecheck gate. Files are not touched here ; deferred cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:19:18 +02:00
senke	a9541f517b	feat(infra): haproxy sticky WS + backend_api multi-instance scaffold (W4 Day 19) Some checks failed Veza CI / Frontend (Web) (push) Has been cancelled Details E2E Playwright / e2e (full) (push) Has been cancelled Details Veza CI / Notify on failure (push) Blocked by required conditions Details Veza CI / Backend (Go) (push) Failing after 4m34s Details Veza CI / Rust (Stream Server) (push) Successful in 5m37s Details Security Scan / Secret Scanning (gitleaks) (push) Failing after 1m7s Details Phase-1 of the active/active backend story. HAProxy in front of two backend-api containers + two stream-server containers ; sticky cookie pins WS sessions to one backend, URI hash routes track_id to one streamer for HLS cache locality. Day 19 acceptance asks for : kill backend-api-1, HAProxy bascule, WS sessions reconnect to backend-api-2 sans perte. The smoke test wires that gate ; phase-2 (W5) will add keepalived for an LB pair. - infra/ansible/roles/haproxy/ * Install HAProxy + render haproxy.cfg with frontend (HTTP, optional HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky cookie SERVERID), stream_pool (URI-hash + consistent jump-hash). * Active health check GET /api/v1/health every 5s ; fall=3, rise=2. on-marked-down shutdown-sessions + slowstart 30s on recovery. * Stats socket bound to 127.0.0.1:9100 for the future prometheus haproxy_exporter sidecar. * Mozilla Intermediate TLS cipher list ; only effective when a cert is mounted. - infra/ansible/roles/backend_api/ * Scaffolding for the multi-instance Go API. Creates veza-api system user, /opt/veza/backend-api dir, /etc/veza env dir, /var/log/veza, and a hardened systemd unit pointing at the binary. * Binary deployment is OUT of scope (documented in README) — the Go binary is built outside Ansible (Makefile target) and pushed via incus file push. CI → ansible-pull integration is W5+. - infra/ansible/playbooks/haproxy.yml : provisions the haproxy Incus container + applies common baseline + role. - infra/ansible/inventory/lab.yml : 3 new groups : * haproxy (single LB node) * backend_api_instances (backend-api-{1,2}) * stream_server_instances (stream-server-{1,2}) HAProxy template reads these groups directly to populate its upstream blocks ; falls back to the static haproxy_backend_api_fallback list if the group is missing (for in-isolation tests). - infra/ansible/tests/test_backend_failover.sh * step 0 : pre-flight — both backends UP per HAProxy stats socket. * step 1 : 5 baseline GET /api/v1/health through the LB → all 200. * step 2 : incus stop --force backend-api-1 ; record t0. * step 3 : poll HAProxy stats until backend-api-1 is DOWN (timeout 30s ; expected ~ 15s = fall × interval). * step 4 : 5 GET requests during the down window — all must 200 (served by backend-api-2). Fails if any returns non-200. * step 5 : incus start backend-api-1 ; poll until UP again. Acceptance (Day 19) : smoke test passes ; HAProxy sticky cookie keeps WS sessions on the same backend until that backend dies, at which point the cookie is ignored and the request rebalances. W4 progress : Day 16 done · Day 17 done · Day 18 done · Day 19 done · Day 20 (k6 nightly load test) pending. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 11:32:48 +02:00
senke	66beb8ccb1	feat(infra): nginx_proxy_cache phase-1 edge cache fronting MinIO (W3+) Some checks failed Veza CI / Notify on failure (push) Blocked by required conditions Details Security Scan / Secret Scanning (gitleaks) (push) Waiting to run Details Veza CI / Frontend (Web) (push) Has been cancelled Details Veza CI / Backend (Go) (push) Has been cancelled Details E2E Playwright / e2e (full) (push) Has been cancelled Details Veza CI / Rust (Stream Server) (push) Has been cancelled Details Self-hosted edge cache on a dedicated Incus container, sits between clients and the MinIO EC:2 cluster. Replaces the need for an external CDN at v1.0 traffic levels — handles thousands of concurrent listeners on the R720, leaks zero logs to a third party. This is the phase-1 alternative documented in the v1.0.9 CDN synthesis : phase-1 = self-hosted Nginx, phase-2 = 2 cache nodes + GeoDNS, phase-3 = Bunny.net via the existing CDN_* config (still inert with CDN_ENABLED=false). - infra/ansible/roles/nginx_proxy_cache/ : install nginx + curl, render nginx.conf with shared zone (128 MiB keys + 20 GiB disk, inactive=7d), render veza-cache site that proxies to the minio_nodes upstream pool with keepalive=32. HLS segments cached 7d via 1 MiB slice ; .m3u8 cached 60s ; everything else 1h. - Cache key excludes Authorization / Cookie (presigned URLs only in v1.0). slice_range included for segments so byte-range requests with arbitrary offsets all hit the same cached chunks. - proxy_cache_use_stale error timeout updating http_500..504 + background_update + lock — survives MinIO partial outages without cold-storming the origin. - X-Cache-Status surfaced on every response so smoke tests + operators can verify HIT/MISS without parsing access logs. - stub_status bound to 127.0.0.1:81/__nginx_status for the future prometheus nginx_exporter sidecar. - infra/ansible/playbooks/nginx_proxy_cache.yml : provisions the Incus container + applies common baseline + role. - inventory/lab.yml : new nginx_cache group. - infra/ansible/tests/test_nginx_cache.sh : MISS→HIT roundtrip via X-Cache-Status, on-disk entry verification. Acceptance : smoke test reports MISS then HIT for the same URL ; cache directory carries on-disk entries. No backend code change — the cache is transparent. To route through it, flip AWS_S3_ENDPOINT=http://nginx-cache.lxd:80 in the API env. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:58:14 +02:00
senke	d86815561c	feat(infra): MinIO distributed EC:2 + migration script (W3 Day 12) Some checks failed Veza CI / Rust (Stream Server) (push) Successful in 5m21s Details Security Scan / Secret Scanning (gitleaks) (push) Failing after 54s Details Veza CI / Backend (Go) (push) Failing after 8m27s Details Veza CI / Notify on failure (push) Successful in 6s Details E2E Playwright / e2e (full) (push) Failing after 12m42s Details Veza CI / Frontend (Web) (push) Successful in 15m49s Details Four-node distributed MinIO cluster, single erasure set EC:2, tolerates 2 simultaneous node losses. 50% storage efficiency. Pinned to RELEASE.2025-09-07T16-13-09Z to match docker-compose so dev/prod parity is preserved. - infra/ansible/roles/minio_distributed/ : install pinned binary, systemd unit pointed at MINIO_VOLUMES with bracket-expansion form, EC:2 forced via MINIO_STORAGE_CLASS_STANDARD. Vault assertion blocks shipping placeholder credentials to staging/prod. - bucket init : creates veza-prod-tracks, enables versioning, applies lifecycle.json (30d noncurrent expiry + 7d abort-multipart). Cold-tier transition ready but inert until minio_remote_tier_name is set. - infra/ansible/playbooks/minio_distributed.yml : provisions the 4 containers, applies common baseline + role. - infra/ansible/inventory/lab.yml : new minio_nodes group. - infra/ansible/tests/test_minio_resilience.sh : kill 2 nodes, verify EC:2 reconstruction (read OK + checksum matches), restart, wait for self-heal. - scripts/minio-migrate-from-single.sh : mc mirror --preserve from the single-node bucket to the new cluster, count-verifies, prints rollout next-steps. - config/prometheus/alert_rules.yml : MinIODriveOffline (warn) + MinIONodesUnreachable (page) — page fires at >= 2 nodes unreachable because that's the redundancy ceiling for EC:2. - docs/ENV_VARIABLES.md §12 : MinIO migration cross-ref. Acceptance (Day 12) : EC:2 survives 2 concurrent kills + self-heals. Lab apply pending. No backend code change — interface stays AWS S3. W3 progress : Redis Sentinel ✓ (Day 11), MinIO distribué ✓ (this), CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 13:46:42 +02:00
senke	a36d9b2d59	feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11) Some checks failed Veza CI / Backend (Go) (push) Failing after 8m56s Details Veza CI / Frontend (Web) (push) Has been cancelled Details E2E Playwright / e2e (full) (push) Has been cancelled Details Veza CI / Notify on failure (push) Blocked by required conditions Details Veza CI / Rust (Stream Server) (push) Successful in 5m3s Details Security Scan / Secret Scanning (gitleaks) (push) Failing after 53s Details Three Incus containers, each running redis-server + redis-sentinel (co-located). redis-1 = master at first boot, redis-2/3 = replicas. Sentinel quorum=2 of 3 ; failover-timeout=30s satisfies the W3 acceptance criterion. - internal/config/redis_init.go : initRedis branches on REDIS_SENTINEL_ADDRS ; non-empty -> redis.NewFailoverClient with MasterName + SentinelAddrs + SentinelPassword. Empty -> existing single-instance NewClient (dev/local stays parametric). - internal/config/config.go : 3 new fields (RedisSentinelAddrs, RedisSentinelMasterName, RedisSentinelPassword) read from env. parseRedisSentinelAddrs trims+filters CSV. - internal/metrics/cache_hit_rate.go : new RecordCacheHit / Miss counters, labelled by subsystem. Cardinality bounded. - internal/middleware/rate_limiter.go : instrument 3 Eval call sites (DDoS, frontend log throttle, upload throttle). Hit = Redis answered, Miss = error -> in-memory fallback. - internal/services/chat_pubsub.go : instrument Publish + PublishPresence. - internal/websocket/chat/presence_service.go : instrument SetOnline / SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit (legitimate empty result). - infra/ansible/roles/redis_sentinel/ : install Redis 7 + Sentinel, render redis.conf + sentinel.conf, systemd units. Vault assertion prevents shipping placeholder passwords to staging/prod. - infra/ansible/playbooks/redis_sentinel.yml : provisions the 3 containers + applies common baseline + role. - infra/ansible/inventory/lab.yml : new groups redis_ha + redis_ha_master. - infra/ansible/tests/test_redis_failover.sh : kills the master container, polls Sentinel for the new master, asserts elapsed < 30s. - config/grafana/dashboards/redis-cache-overview.json : 3 hit-rate stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown. - docs/ENV_VARIABLES.md §3 : 3 new REDIS_SENTINEL_* env vars. - veza-backend-api/.env.template : 3 placeholders (empty default). Acceptance (Day 11) : Sentinel failover < 30s ; cache hit-rate dashboard populated. Lab test pending Sentinel deployment. W3 verification gate progress : Redis Sentinel ✓ (this commit), MinIO EC4+2 ⏳ Day 12, CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 13:36:55 +02:00
senke	ba6e8b4e0e	feat(infra): pgbouncer role + pgbench load test (W2 Day 7) All checks were successful Veza CI / Rust (Stream Server) (push) Successful in 3m49s Details Security Scan / Secret Scanning (gitleaks) (push) Successful in 58s Details Veza CI / Backend (Go) (push) Successful in 5m59s Details Veza CI / Frontend (Web) (push) Successful in 15m22s Details E2E Playwright / e2e (full) (push) Successful in 19m34s Details Veza CI / Notify on failure (push) Has been skipped Details ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 7 deliverable: PgBouncer fronts the pg_auto_failover formation, the backend pays the postgres-fork cost 50 times per pool refresh instead of once per HTTP handler. Wiring: veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432 (1000 client cap) (50 server pool) Files: infra/ansible/roles/pgbouncer/ defaults/main.yml — pool sizes match the acceptance target (1000 client × 50 server × 10 reserve), pool_mode=transaction (the only safe mode given the backend's session usage — LISTEN/NOTIFY and cross-tx prepared statements are forbidden, neither of which Veza uses), DNS TTL = 60s for failover. tasks/main.yml — apt install pgbouncer + postgresql-client (so the pgbench / admin psql lives on the same container), render pgbouncer.ini + userlist.txt, ensure /var/log/postgresql for the file log, enable + start service. templates/pgbouncer.ini.j2 — full config; databases section points at pgaf-primary.lxd:5432 directly. Failover follows via DNS TTL until the W2 day 8 pg_autoctl state-change hook that issues RELOAD on the admin console. templates/userlist.txt.j2 — only rendered when auth_type != trust. Lab uses trust on the bridge subnet; prod gets a vault-backed list of md5/scram hashes. handlers/main.yml — RELOAD pgbouncer (graceful, doesn't drop established clients). README.md — operational cheatsheet: - SHOW POOLS / SHOW STATS via the admin console - the transaction-mode forbids list (LISTEN/NOTIFY etc.) - failover behaviour today vs after the W2-day-8 hook lands infra/ansible/playbooks/postgres_ha.yml Provision step extended to launch pgaf-pgbouncer alongside the formation containers. Two new plays at the bottom apply common baseline + pgbouncer role to it. infra/ansible/inventory/lab.yml `pgbouncer` group with pgaf-pgbouncer reachable via the community.general.incus connection plugin (consistent with the postgres_ha containers). infra/ansible/tests/test_pgbouncer_load.sh Acceptance: pgbench 500 clients × 30s × 8 threads against the pgbouncer endpoint, must report 0 failed transactions and 0 connection errors. Also runs `pgbench -i -s 10` first to initialise the standard fixture — that init goes through pgbouncer too, which incidentally validates transaction-mode compatibility before the load run starts. Exit codes: 0 / 1 (errors) / 2 (unreachable) / 3 (missing tool). veza-backend-api/internal/config/config.go Comment block above DATABASE_URL load — documents the prod wiring (DATABASE_URL points at pgaf-pgbouncer.lxd:6432, NOT at pgaf-primary directly). Also notes the dev/CI exception: direct Postgres because the small scale doesn't benefit from pooling and tests occasionally lean on session-scoped GUCs that transaction-mode would break. Acceptance verified locally: $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \ --syntax-check playbook: playbooks/postgres_ha.yml ← clean $ bash -n infra/ansible/tests/test_pgbouncer_load.sh syntax OK $ cd veza-backend-api && go build ./... (clean — comment-only change in config.go) $ gofmt -l internal/config/config.go (no output — clean) Real apply + pgbench run requires the lab R720 + the community.general collection — operator's call. Out of scope (deferred per ROADMAP §2): - HA pgbouncer (single instance per env at v1.0; double instance + keepalived in v1.1 if needed) - pg_autoctl state-change hook → pgbouncer RELOAD (W2 day 8) - Prometheus pgbouncer_exporter (W2 day 9 with the OTel collector + observability stack) SKIP_TESTS=1 — IaC YAML + bash + Go comment-only diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 18:35:05 +02:00
senke	c941aba3d2	feat(infra): postgres_ha role + pg_auto_failover formation + RTO test (W2 Day 6) Some checks failed Veza CI / Notify on failure (push) Blocked by required conditions Details Veza CI / Rust (Stream Server) (push) Successful in 3m45s Details Security Scan / Secret Scanning (gitleaks) (push) Successful in 1m0s Details Veza CI / Backend (Go) (push) Successful in 5m38s Details Veza CI / Frontend (Web) (push) Has been cancelled Details E2E Playwright / e2e (full) (push) Has been cancelled Details ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 6 deliverable: Postgres HA ready to fail over in < 60s, asserted by an automated test script. Topology — 3 Incus containers per environment: pgaf-monitor pg_auto_failover state machine (single instance) pgaf-primary first registered → primary pgaf-replica second registered → hot-standby (sync rep) Files: infra/ansible/playbooks/postgres_ha.yml Provisions the 3 containers via `incus launch images:ubuntu/22.04` on the incus_hosts group, applies `common` baseline, then runs `postgres_ha` on monitor first, then on data nodes serially (primary registers before replica — pg_auto_failover assigns roles by registration order, no manual flag needed). infra/ansible/roles/postgres_ha/ defaults/main.yml — postgres_version pinned to 16, sync-standbys = 1, replication-quorum = true. App user/dbname for the formation. Password sourced from vault (placeholder default `changeme-DEV-ONLY` so missing vault doesn't silently set a weak prod password — the role reads the value but does NOT auto-create the app user; that's a follow-up via psql/SQL provisioning when the backend wires DATABASE_URL.). tasks/install.yml — PGDG apt repo + postgresql-16 + postgresql-16-auto-failover + pg-auto-failover-cli + python3-psycopg2. Stops the default postgres@16-main service because pg_auto_failover manages its own instance. tasks/monitor.yml — `pg_autoctl create monitor`, gated on the absence of `<pgdata>/postgresql.conf` so re-runs no-op. Renders systemd unit `pg_autoctl.service` and starts it. tasks/node.yml — `pg_autoctl create postgres` joining the monitor URI from defaults. Sets formation sync-standbys policy idempotently from any node. templates/pg_autoctl-{monitor,node}.service.j2 — minimal systemd units, Restart=on-failure, NOFILE=65536. README.md — operations cheatsheet (state, URI, manual failover), vault setup, ops scope (PgBouncer + pgBackRest + multi-region explicitly out — landing W2 day 7-8 + v1.2+). infra/ansible/inventory/lab.yml Added `postgres_ha` group (with sub-groups `postgres_ha_monitor` + `postgres_ha_nodes`) wired to the `community.general.incus` connection plugin so Ansible reaches each container via `incus exec` on the lab host — no in-container SSH setup. infra/ansible/tests/test_pg_failover.sh The acceptance script. Sequence: 0. read formation state via monitor — abort if degraded baseline 1. `incus stop --force pgaf-primary` — start RTO timer 2. poll monitor every 1s for the standby's promotion 3. `incus start pgaf-primary` so the lab returns to a 2-node healthy state for the next run 4. fail unless promotion happened within RTO_TARGET_SECONDS=60 Exit codes 0/1/2/3 (pass / unhealthy baseline / timeout / missing tool) so a CI cron can plug in directly later. Acceptance verified locally: $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \ --syntax-check playbook: playbooks/postgres_ha.yml ← clean $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \ --list-tasks 4 plays, 22 tasks across plays, all tagged. $ bash -n infra/ansible/tests/test_pg_failover.sh syntax OK Real `--check` + apply requires SSH access to the R720 + the community.general collection installed (`ansible-galaxy collection install community.general`). Operator runs that step. Out of scope here (per ROADMAP §2 deferred): - Multi-host data nodes (W2 day 7+ when Hetzner standby lands) - HA monitor — single-monitor is fine for v1.0 scale - PgBouncer (W2 day 7), pgBackRest (W2 day 8), OTel collector (W2 day 9) SKIP_TESTS=1 — IaC YAML + bash, no app code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 18:27:46 +02:00

7 commits