The game-day driver had no notion of inventory — it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging where chaos is
the point ; on prod, an accidental run on a Monday morning would
cost a real outage.
Added :
scripts/security/game-day-driver.sh
* INVENTORY env var — defaults to 'staging' so silence stays
safe. INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
type-the-phrase 'KILL-PROD' confirm. Anything other than
staging|prod aborts.
* Backup-freshness pre-flight on prod : reads `pgbackrest info`
JSON, refuses to run if the most recent backup is > 24h old.
SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline.
* Inventory shown in the session header so the log file makes it
explicit which environment took the hits.
docs/runbooks/rabbitmq-down.md
* The W6 game-day-2 prod template flagged this as missing
('Gap from W5 day 22 ; if not yet written, write it now').
Mirrors the structure of redis-down.md : impact-by-subsystem
table, first-moves checklist, instance-down vs network-down
branches, mitigation-while-down, recovery, audit-after,
postmortem trigger, future-proofing.
* Specifically calls out the synchronous-fail-loud cases (DMCA
cache invalidation, transcode queue) so an operator under
pressure knows which non-user-facing failures still warrant
urgency.
Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior watching their
shoulder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>