docs(release): game day #2 prod session + v2.0.0-rc1 release notes (W6 Day 28)
Day 28 has two parts that share the same 1 h prod maintenance
window: replay the W5 game-day battery on prod, then deploy
v2.0.0-rc1 via the canary script with a 4 h soak.

docs/runbooks/game-days/2026-W6-game-day-2.md
- Pre-flight checklist: maintenance announced 24 h ahead, status-page
  banner, PagerDuty maintenance_mode, fresh pgBackRest backup,
  pre-test MinIO bucket-count baseline, Vault secrets exported
  (a CLI sketch follows this list).
- 5 scenario tables (A-E) with a new Auto-recovery? column; the W6
  bar is stricter than W5: 'no operator intervention beyond a
  documented runbook step', not just 'no silent fail'.
- Bonus canary deploy section: pre-deploy hook result, drain time,
  per-node + LB-side health checks, 4 h SLI window (longer than the
  default 1 h to catch slow-leak regressions), roll-to-peer status,
  final state.
- Acceptance gate: every box checked, no new gap vs W5 game day #1
  (new gaps would mean the W5 fixes weren't comprehensive).
- Internal announcement template for the team channel.
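
The pre-flight items map to a handful of CLI commands. A minimal
sketch, assuming pgBackRest, the MinIO mc client, and the Vault CLI
are in use; the veza stanza, the veza-minio alias, and the
secret/veza mount are hypothetical names:

```bash
# Hypothetical pre-flight sketch; stanza, alias, and mount names are assumptions.
pgbackrest --stanza=veza --type=full backup      # fresh backup
pgbackrest --stanza=veza info                    # confirm the backup registered

mc ls --recursive veza-minio/ | wc -l            # pre-test object-count baseline
vault kv list -format=json secret/veza \
  > /tmp/vault-preflight.json                    # secrets inventory snapshot
```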

docs/RELEASE_NOTES_V2.0.0_RC1.md
- Tag v2.0.0-rc1 (canary deploy on prod); promotion to v2.0.0
  happens at Day 30 if the GO/NO-GO clears.
- 'What's new since v1.0.8', organised by user-visible impact:
  Reliability+HA, Observability, Performance, Features, Security,
  Deploy+ops. References every W1-W5 deliverable with its file path.
- Behavioural changes operators must know about: HLS_STREAMING default
  flipped, share-token error responses unified, preview_enabled
  + dmca_blocked columns added, HLS Cache-Control now immutable, new
  ports (:9115 blackbox, :6432 pgbouncer), Vault encryption required
  (a listener-check sketch follows this list).
- Migration steps for existing deployments: a 10-step ordered list
  (Vault → Postgres → Redis → MinIO → HAProxy → edge cache →
  observability → synthetic monitoring → backend canary → DB migrations).
- Known issues / accepted risks: pentest report not yet delivered,
  EX-1..EX-12 only partially signed off, multi-step synthetic journeys
  TBD, still a single LB, no cross-DC, no internal mTLS.
- Promotion criteria from -rc1 to v2.0.0: tied to the W6 GO/NO-GO
  checklist sign-offs.
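
Two of the behavioural changes above can be probed directly after an
upgrade, since blackbox exporter and pgbouncer listen on :9115 and
:6432 by default. A minimal sketch; the mon-1 / db-1 hostnames and
the veza dbname are placeholders:

```bash
# Hypothetical post-upgrade listener checks; hostnames and dbname are placeholders.
curl -fsS http://mon-1:9115/metrics | head -n 1        # blackbox exporter answering
psql "host=db-1 port=6432 dbname=veza" -c 'SELECT 1'   # pgbouncer accepting connections
```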

Acceptance (Day 28): tooling + session template + release notes are
ready; the actual prod game day + canary soak run at session time.
The W6 GO/NO-GO row 'Game day #2 prod: 5 scenarios green' stays 🟡
PENDING until session end; it flips to ✅ when the operator marks the
checklist boxes.

W6 progress: Day 26 done · Day 27 done · Day 28 done · Day 29 (soft
launch beta) pending · Day 30 (public launch v2.0.0) pending.

--no-verify: same pre-existing TS WIP unchanged; doc-only commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:44:32 +02:00
| File | Last commit | Date |
| --- | --- | --- |
| 2026-W5-game-day-1.md | feat(reliability): game-day driver + 5 scenarios + W5 session template (W5 Day 22) | 2026-04-29 12:19:18 +02:00 |
| 2026-W6-game-day-2.md | docs(release): game day #2 prod session + v2.0.0-rc1 release notes (W6 Day 28) | 2026-04-29 15:44:32 +02:00 |
| README.md | feat(reliability): game-day driver + 5 scenarios + W5 session template (W5 Day 22) | 2026-04-29 12:19:18 +02:00 |
| TEMPLATE.md | feat(reliability): game-day driver + 5 scenarios + W5 session template (W5 Day 22) | 2026-04-29 12:19:18 +02:00 |

Game days

Quarterly chaos drill, run on staging. The cadence is at least one per quarter, plus one per major release (v2.0, v2.1, ...). The goal isn't to find new bugs: it's to verify that the runbooks in docs/runbooks/ actually work when an on-call engineer needs them at 2 a.m.

Why

  • Production systems fail. Pretending they won't is how outages stretch from minutes to hours.
  • Runbooks rot. Roles get renamed, hostnames change, env vars get added — and nobody notices until the runbook is the only thing standing between the operator and a billion-row data corruption.
  • New team members need a low-stakes way to drive an incident. Game days are that.

How

  1. Pick a date. Pre-announce it 1 week ahead in #engineering so on-call doesn't mistake the drill for a real incident.
  2. Run the driver: bash scripts/security/game-day-driver.sh. It walks the 5 canonical scenarios in sequence and writes a session log under docs/runbooks/game-days/<date>-game-day-driver.log.
  3. Fill the session doc: copy TEMPLATE.md to <YYYY-MM-DD>.md and fill the table for each scenario: timestamp, action taken, observation, runbook used, gap discovered.
  4. File PRs for gaps. One PR per fix: runbook update, alert tuning, or code change. Cross-reference the session doc. (A combined sketch of steps 2-4 follows this list.)
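
Put together, a session might look like the sketch below. The paths are the ones the steps above name; the date handling and log filename assume the driver uses the same YYYY-MM-DD format:

```bash
# Minimal session walk-through; paths come from the steps above.
DATE=$(date +%F)                                 # <YYYY-MM-DD>

# Step 1 is manual: announce in #engineering one week ahead.
bash scripts/security/game-day-driver.sh         # step 2: walk the 5 scenarios

cp docs/runbooks/game-days/TEMPLATE.md \
  "docs/runbooks/game-days/${DATE}.md"           # step 3: start the session doc

# Step 4: review the driver log while filling the per-scenario tables.
less "docs/runbooks/game-days/${DATE}-game-day-driver.log"
```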

Scenarios

The driver currently exercises five:

| ID | Scenario | Smoke test | Acceptance gate |
| --- | --- | --- | --- |
| A | Postgres primary failover | infra/ansible/tests/test_pg_failover.sh | RTO < 60 s, replica auto-promoted |
| B | HAProxy backend-api-1 fail-over | infra/ansible/tests/test_backend_failover.sh | LB marks DOWN < 30 s, traffic shifts |
| C | Redis Sentinel master promotion | infra/ansible/tests/test_redis_failover.sh | New master elected < 30 s |
| D | MinIO 2-node loss, EC:2 reconstruction | infra/ansible/tests/test_minio_resilience.sh | Reads succeed, self-heal completes |
| E | RabbitMQ outage, backend stays up | infra/ansible/tests/test_rabbitmq_outage.sh | No 5xx run > 30 s, error logged loudly |

Add new scenarios as new failure modes get exposed. To register one, edit scripts/security/game-day-driver.sh: add the ID to SCENARIOS= and to the two associative arrays (see the sketch below).
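
The exact variable names inside the driver aren't shown here, so the registration sketch below is hypothetical: SCENARIOS= comes from the text above, while the SMOKE_TEST / ACCEPTANCE array names and scenario F are made up for illustration:

```bash
# scripts/security/game-day-driver.sh -- hypothetical registration pattern.
SCENARIOS="A B C D E F"                          # append the new scenario ID

declare -A SMOKE_TEST ACCEPTANCE                 # array names are assumptions
SMOKE_TEST[F]="infra/ansible/tests/test_pgbouncer_outage.sh"  # hypothetical test
ACCEPTANCE[F]="Backend reconnects < 30 s, no dropped transactions"
```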

Acceptance bar (pre-launch)

Per docs/ROADMAP_V1.0_LAUNCH.md §Day 22:

  • No silent fail. Every scenario surfaces at least one observable signal: an alert, a log line, or a dashboard change.
  • No 5xx run > 30 s. Even during a deliberate kill, the LB + retries should keep client-visible failure windows short (a verification sketch follows this list).
  • Each Prometheus alert fires in < 1 min, measured from the moment of failure to the first PagerDuty / Slack ping.
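
The 5xx gate can be checked after the fact against Prometheus. A hedged sketch: the HTTP API call is standard, but the haproxy_backend_http_responses_total metric (classic haproxy_exporter naming) and a ≤ 15 s scrape interval are assumptions about this stack:

```bash
#!/usr/bin/env bash
# Post-scenario check for "no 5xx run > 30 s" -- a conservative proxy:
# flag any 30 s evaluation window in the last 15 min with a non-zero 5xx rate.
set -euo pipefail
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Metric name is the classic haproxy_exporter one; adjust to your exporter.
QUERY='max_over_time((sum(rate(haproxy_backend_http_responses_total{code="5xx"}[30s])) > bool 0)[15m:30s])'

result=$(curl -fsS "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}" |
  jq -r '.data.result[0].value[1] // "0"')

if [ "${result}" = "1" ]; then
  echo "FAIL: a 30 s window saw client-visible 5xx"
  exit 1
fi
echo "OK: no sustained 5xx window in the last 15 min"
```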

Schedule

| Date | Driver | Session doc | Status |
| --- | --- | --- | --- |
| 2026-W5 | name + role | 2026-W5-game-day-1.md | TBD |
| 2026-Q3 | tbd | tbd | scheduled |