veza/docs/runbooks
senke 2a5bc11628 fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook
The game-day driver had no notion of inventory: it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever hosts the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging, where chaos is
the point; on prod, an accidental run on a Monday morning would
cause a real outage.

Added:

  scripts/security/game-day-driver.sh
    * INVENTORY env var: defaults to 'staging' so that an unset value
      stays safe. INVENTORY=prod requires CONFIRM_PROD=1 plus an
      interactive type-the-phrase 'KILL-PROD' confirmation. Anything
      other than staging|prod aborts.
    * Backup-freshness pre-flight on prod: reads `pgbackrest info`
      JSON and refuses to run if the most recent backup is more than
      24h old. SKIP_BACKUP_FRESHNESS=1 is the escape hatch, documented
      inline.
    * Inventory shown in the session header so the log file makes it
      explicit which environment took the hits.
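
A minimal sketch of how the inventory gate described above could look.
The function name `check_inventory` and the messages are hypothetical
and not taken from the actual script; only the INVENTORY and
CONFIRM_PROD variables and the 'KILL-PROD' phrase come from this
commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: returns 0 when the run may proceed, 1 otherwise.
# For prod, the confirmation phrase is read from stdin.
check_inventory() {
  local inventory="${INVENTORY:-staging}"   # unset falls back to staging
  local phrase=""
  case "$inventory" in
    staging)
      return 0
      ;;
    prod)
      if [ "${CONFIRM_PROD:-0}" != "1" ]; then
        echo "INVENTORY=prod requires CONFIRM_PROD=1" >&2
        return 1
      fi
      printf 'Type KILL-PROD to continue: ' >&2
      IFS= read -r phrase || true            # EOF leaves phrase empty
      if [ "$phrase" != "KILL-PROD" ]; then
        echo "confirmation phrase mismatch, aborting" >&2
        return 1
      fi
      return 0
      ;;
    *)
      echo "unknown INVENTORY '$inventory' (expected staging|prod)" >&2
      return 1
      ;;
  esac
}
```

Called as `check_inventory || exit 1` before the first scenario, a
gate like this keeps every failure mode (unset flag, wrong phrase,
unknown inventory) on the non-destructive side.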

  docs/runbooks/rabbitmq-down.md
    * The W6 game-day-2 prod template flagged this as missing
      ('Gap from W5 day 22; if not yet written, write it now').
      Mirrors the structure of redis-down.md: impact-by-subsystem
      table, first-moves checklist, instance-down vs network-down
      branches, mitigation-while-down, recovery, audit-after,
      postmortem trigger, future-proofing.
    * Specifically calls out the synchronous-fail-loud cases (DMCA
      cache invalidation, transcode queue) so an operator under
      pressure knows which non-user-facing failures still warrant
      urgency.

Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior watching over
their shoulder.
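
The backup-freshness pre-flight mentioned above could be sketched
roughly as follows. The function names are hypothetical, and the
grep-based extraction assumes that numeric "stop" fields in the
`pgbackrest info --output=json` output appear only under backup
timestamps; a real script would more likely use a JSON parser such
as jq.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the newest backup stop time (epoch seconds)
# found in pgbackrest info JSON read from stdin.
# Assumption: numeric "stop" values occur only in backup timestamps.
latest_backup_stop() {
  grep -oE '"stop":[[:space:]]*[0-9]+' | grep -oE '[0-9]+$' | sort -n | tail -n 1
}

# Hypothetical helper: backup_is_fresh JSON NOW_EPOCH [MAX_AGE_SECONDS]
# Returns 0 if the newest backup is at most MAX_AGE_SECONDS old
# (default 86400 seconds, i.e. 24h), 1 otherwise.
backup_is_fresh() {
  local json="$1" now="$2" max_age="${3:-86400}"
  local stop age
  if [ "${SKIP_BACKUP_FRESHNESS:-0}" = "1" ]; then
    echo "SKIP_BACKUP_FRESHNESS=1 set, skipping backup check" >&2
    return 0
  fi
  stop="$(printf '%s' "$json" | latest_backup_stop || true)"
  if [ -z "$stop" ]; then
    echo "no backup found, refusing to run" >&2
    return 1
  fi
  age=$(( now - stop ))
  if [ "$age" -gt "$max_age" ]; then
    echo "latest backup is ${age}s old (limit ${max_age}s), refusing to run" >&2
    return 1
  fi
  return 0
}
```

On prod this would be invoked along the lines of
`backup_is_fresh "$(pgbackrest info --output=json)" "$(date +%s)"`.
Passing the current time in as an argument keeps the age comparison
testable without a live repository.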

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:32:05 +02:00
Name | Last commit | Date
game-days | docs(release): game day #2 prod session + v2.0.0-rc1 release notes (W6 Day 28) | 2026-04-29 15:44:32 +02:00
api-availability-slo-burn.md | feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10) | 2026-04-28 01:30:34 +02:00
api-latency-slo-burn.md | feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10) | 2026-04-28 01:30:34 +02:00
cert-expiring-soon.md | feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10) | 2026-04-28 01:30:34 +02:00
db-failover.md | feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10) | 2026-04-28 01:30:34 +02:00
DEPLOYMENT.md | chore(release): v0.961 - Playbook (deployment, rollback, and incident runbooks) | 2026-03-02 19:09:46 +01:00
disk-full.md | feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10) | 2026-04-28 01:30:34 +02:00
GRACEFUL_DEGRADATION.md | chore(release): v0.961 - Playbook (deployment, rollback, and incident runbooks) | 2026-03-02 19:09:46 +01:00
INCIDENT_RESPONSE.md | chore(release): v0.961 - Playbook (deployment, rollback, and incident runbooks) | 2026-03-02 19:09:46 +01:00
payment-success-slo-burn.md | feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10) | 2026-04-28 01:30:34 +02:00
rabbitmq-down.md | fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook | 2026-04-30 22:32:05 +02:00
redis-down.md | feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10) | 2026-04-28 01:30:34 +02:00
ROLLBACK.md | chore(release): v0.961 - Playbook (deployment, rollback, and incident runbooks) | 2026-03-02 19:09:46 +01:00
SECRET_ROTATION.md | chore(release): v0.961 - Playbook (deployment, rollback, and incident runbooks) | 2026-03-02 19:09:46 +01:00