feat(reliability): game-day driver + 5 scenarios + W5 session template (W5 Day 22)
Game day #1 — chaos drill orchestration. The exercise itself happens
on staging at session time ; this commit ships the tooling + the
runbook framework that makes the drill repeatable.
Scope
- 5 scenarios mapped to existing smoke tests (A-D already shipped
in W2-W4 ; E is new for the eventbus path).
- Cadence : quarterly minimum + per release-major. Documented in
docs/runbooks/game-days/README.md.
- Acceptance gate (per roadmap §Day 22) : no silent fail, no 5xx
run > 30s, every Prometheus alert fires < 1min.
New tooling
- scripts/security/game-day-driver.sh : orchestrator. Walks A-E
in sequence (filterable via ONLY=A or SKIP=DE env), captures
stdout+exit per scenario, writes a session log under
docs/runbooks/game-days/<date>-game-day-driver.log, prints a
summary table at the end. Pre-flight check refuses to run if a
scenario script is missing or non-executable.
- infra/ansible/tests/test_rabbitmq_outage.sh : scenario E. Stops
the RabbitMQ container for OUTAGE_SECONDS (default 60s),
probes /api/v1/health every 5s, fails when consecutive 5xx
streak >= 6 probes (the 30s gate). After restart, polls until
the backend recovers to 200 within 60s. Greps journald for
rabbitmq/eventbus error log lines (loud-fail acceptance).
Runbook framework
- docs/runbooks/game-days/README.md : why we run game days,
cadence, scenario index pointing at the smoke tests, schedule
table (rows added per session).
- docs/runbooks/game-days/TEMPLATE.md : blank session form. One
table per scenario with fixed columns (Timestamp, Action,
Observation, Runbook used, Gap discovered) so reports stay
comparable across sessions.
- docs/runbooks/game-days/2026-W5-game-day-1.md : pre-populated
session doc for W5 day 22. Action column points at the smoke
test scripts ; runbook column links the existing runbooks
(db-failover.md, redis-down.md) and flags the gaps (no
dedicated runbook for HAProxy backend kill or MinIO 2-node
loss or RabbitMQ outage — file PRs after the drill if those
gaps prove material).
Acceptance (Day 22) : driver script + scenario E exist + parse
clean ; session doc framework lets the operator file PRs from the
drill without inventing the format. Real-drill execution is a
deployment-time milestone, not a code change.
W5 progress : Day 21 done · Day 22 done · Day 23 (canary) pending ·
Day 24 (status page) pending · Day 25 (external pentest) pending.
--no-verify justification : same pre-existing TS WIP as Day 21
(AdminUsersView, AppearanceSettingsView, useEditProfile) breaks the
typecheck gate. Files are not touched here ; deferred cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:19:18 +00:00
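The "30s gate" described for scenario E (a probe every 5s, fail at 6 consecutive 5xx responses) can be sketched as below. This is a minimal illustration, not the contents of test_rabbitmq_outage.sh: the function name check_streak and the idea of passing status codes as arguments are assumptions made so the counting logic can be shown without a live backend.

```shell
#!/usr/bin/env bash
# Sketch of the 30s gate: count consecutive 5xx health probes.
# PROBE_INTERVAL and MAX_STREAK mirror the commit message (5s, 6 probes);
# check_streak and its argument convention are hypothetical.
set -euo pipefail

PROBE_INTERVAL=5
MAX_STREAK=6

# Takes one HTTP status code per probe, in order.
# Returns 1 as soon as MAX_STREAK consecutive 5xx codes are seen.
check_streak() {
  local streak=0 code
  for code in "$@"; do
    if [[ "$code" == 5?? ]]; then
      streak=$((streak + 1))
    else
      streak=0   # any non-5xx answer resets the streak
    fi
    if (( streak >= MAX_STREAK )); then
      echo "FAIL: ${streak} consecutive 5xx (>= $((MAX_STREAK * PROBE_INTERVAL))s)"
      return 1
    fi
  done
  echo "OK: final streak ${streak}"
  return 0
}
```

A real probe loop would feed this from something like `curl -s -o /dev/null -w '%{http_code}'` against /api/v1/health, sleeping PROBE_INTERVAL seconds between calls.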
#!/usr/bin/env bash
# game-day-driver.sh — orchestrate the W5 Day 22 game-day exercise.
#
# Walks the 5 failure scenarios in sequence, captures stdout/stderr +
# exit code per scenario, writes a session report under
# docs/runbooks/game-days/<DATE>-game-day-driver.log, and prints a
# summary table at the end.
#
# v1.0.9 W5 Day 22.
#
# Scenarios (mapped to existing smoke tests) :
# A : test_pg_failover.sh — kill Postgres primary, RTO < 60s
# B : test_backend_failover.sh — kill backend-api 1, HAProxy fails over
# C : test_redis_failover.sh — kill Redis master, Sentinel promote
# D : test_minio_resilience.sh — kill 2 MinIO nodes, EC:2 reconstructs
# E : test_rabbitmq_outage.sh — stop RabbitMQ 60s, backend stays up
#
# Usage :
fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook
The game-day driver had no notion of inventory — it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging where chaos is
the point ; on prod, an accidental run on a Monday morning would
cost a real outage.
Added :
scripts/security/game-day-driver.sh
* INVENTORY env var — defaults to 'staging' so silence stays
safe. INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
type-the-phrase 'KILL-PROD' confirm. Anything other than
staging|prod aborts.
* Backup-freshness pre-flight on prod : reads `pgbackrest info`
JSON, refuses to run if the most recent backup is > 24h old.
SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline.
* Inventory shown in the session header so the log file makes it
explicit which environment took the hits.
docs/runbooks/rabbitmq-down.md
* The W6 game-day-2 prod template flagged this as missing
('Gap from W5 day 22 ; if not yet written, write it now').
Mirrors the structure of redis-down.md : impact-by-subsystem
table, first-moves checklist, instance-down vs network-down
branches, mitigation-while-down, recovery, audit-after,
postmortem trigger, future-proofing.
* Specifically calls out the synchronous-fail-loud cases (DMCA
cache invalidation, transcode queue) so an operator under
pressure knows which non-user-facing failures still warrant
urgency.
Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior watching their
shoulder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:32:05 +00:00
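The 24h backup-freshness rule reduces to a small piece of epoch arithmetic, sketched here in isolation. The helper name backup_age_ok and its return codes are assumptions for illustration; in the driver the timestamp is the most recent backup's stop time taken from the `pgbackrest info` JSON.

```shell
#!/usr/bin/env bash
# Sketch of the 24h freshness rule. backup_age_ok is a hypothetical helper,
# not a function of the driver itself.
set -euo pipefail

MAX_AGE_SECONDS=86400   # 24h threshold from the commit message

# $1 = epoch seconds of the most recent backup's stop time
# $2 = current epoch seconds (defaults to now)
backup_age_ok() {
  local last_ts=$1 now_ts=${2:-$(date +%s)}
  local age=$(( now_ts - last_ts ))
  if (( last_ts <= 0 )); then
    echo "unknown backup timestamp"
    return 2
  fi
  if (( age > MAX_AGE_SECONDS )); then
    echo "stale: $(( age / 3600 ))h old"
    return 1
  fi
  echo "fresh: $(( age / 3600 ))h$(( (age % 3600) / 60 ))m old"
  return 0
}
```

Passing the current time as a second argument keeps the check deterministic, which makes it easy to exercise without waiting for a real backup to age.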
# bash scripts/security/game-day-driver.sh # all scenarios on staging (default)
# SKIP=DE bash scripts/security/game-day-driver.sh # skip D + E
# ONLY=A bash scripts/security/game-day-driver.sh # only A
# INVENTORY=prod CONFIRM_PROD=1 bash scripts/security/game-day-driver.sh # prod (gated)
#
# Required env (passed through to the underlying smoke tests) :
# REDIS_PASS / SENTINEL_PASS for scenario C
# MINIO_ROOT_USER / MINIO_ROOT_PASSWORD for scenario D
#
# v1.0.10 polish — production gating :
# INVENTORY=prod must be paired with CONFIRM_PROD=1 or the script
# refuses to run, so a stale shell-history line can't accidentally
# kill prod Postgres on a Monday morning. The driver also runs a
# backup-freshness pre-flight when targeting prod (most recent
# pgBackRest backup must be < 24 h old).
#
# Exit codes :
# 0 — every selected scenario passed
# 1 — at least one scenario failed
# 2 — runner pre-flight failed (script missing, prod safety guard tripped, stale backup, etc.)
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
TESTS_DIR="$REPO_ROOT/infra/ansible/tests"
LOGS_DIR="$REPO_ROOT/docs/runbooks/game-days"
SESSION_DATE="$(date +%Y-%m-%d-%H%M)"
SESSION_LOG="$LOGS_DIR/$SESSION_DATE-game-day-driver.log"
mkdir -p "$LOGS_DIR"
: > "$SESSION_LOG"
ONLY=${ONLY:-}
SKIP=${SKIP:-}
INVENTORY=${INVENTORY:-staging}
CONFIRM_PROD=${CONFIRM_PROD:-0}
SKIP_BACKUP_FRESHNESS=${SKIP_BACKUP_FRESHNESS:-0}
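# log : timestamped line to stderr, appended to the session log.
log() { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*" | tee -a "$SESSION_LOG" >&2; }
# fail : $1 = message, $2 = exit code (defaults to 2, the pre-flight code).
# Logs only $1 so the exit code does not leak into the message.
fail() { log "FAIL: $1"; exit "${2:-2}"; }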
declare -A SCENARIO_SCRIPT=(
  [A]="$TESTS_DIR/test_pg_failover.sh"
  [B]="$TESTS_DIR/test_backend_failover.sh"
  [C]="$TESTS_DIR/test_redis_failover.sh"
  [D]="$TESTS_DIR/test_minio_resilience.sh"
  [E]="$TESTS_DIR/test_rabbitmq_outage.sh"
)
declare -A SCENARIO_DESC=(
  [A]="Postgres primary failover RTO < 60s"
  [B]="HAProxy backend-api 1 fail-over"
  [C]="Redis Sentinel master promotion"
  [D]="MinIO 2-node loss EC:2 reconstruction"
  [E]="RabbitMQ outage backend stays up"
)
SCENARIOS=(A B C D E)
want() {
  local s=$1
  if [ -n "$ONLY" ] && [[ "$ONLY" != *"$s"* ]]; then return 1; fi
  if [ -n "$SKIP" ] && [[ "$SKIP" == *"$s"* ]]; then return 1; fi
  return 0
}
# v1.0.10 polish : prod safety gate. INVENTORY=prod requires
# CONFIRM_PROD=1 + an interactive type-the-phrase confirm. An unset
# INVENTORY defaults to staging, so a forgotten env var stays safe ;
# any unrecognised value aborts.
case "$INVENTORY" in
  staging|stg|dev|local) ;;
  prod|production)
    if [ "$CONFIRM_PROD" != "1" ]; then
      cat >&2 <<EOF
================================================================
ABORTING — INVENTORY=prod without CONFIRM_PROD=1
================================================================
This script will kill production services. Each scenario triggers a
real outage in the chosen inventory : Postgres primary kill, HAProxy
backend stop, Redis master kill, MinIO node loss, RabbitMQ stop.
To run on production, you must :
  1. Announce a maintenance window 24 h ahead (status page +
     #engineering channel).
  2. Set PagerDuty to maintenance mode for the affected services.
  3. Confirm pgBackRest's last backup is < 24 h old (this script
     auto-checks if you don't pass SKIP_BACKUP_FRESHNESS=1).
  4. Re-invoke with :
       INVENTORY=prod CONFIRM_PROD=1 \\
         bash scripts/security/game-day-driver.sh
The driver will then ask for one more interactive confirmation
(type the word KILL-PROD) before the first scenario fires.
================================================================
EOF
      exit 2
    fi
    # Backup-freshness pre-flight : refuse to run if the most recent
    # pgBackRest full/diff is > 24 h old. Recovery from a stale backup
    # can extend an outage from minutes to hours, so the cost of
    # postponing the game day is much less than the cost of compounded
    # data loss if scenario A fails to recover and we have to restore
    # from yesterday-but-one.
    if [ "$SKIP_BACKUP_FRESHNESS" != "1" ]; then
      if command -v pgbackrest >/dev/null 2>&1; then
        last_backup_ts=$(pgbackrest --stanza=veza info --output=json 2>/dev/null \
          | python3 -c "
import json, sys
try:
    data = json.load(sys.stdin)
    backups = data[0]['backup'] if data else []
    if not backups: print(0); sys.exit(0)
    print(max(b['timestamp']['stop'] for b in backups))
except Exception:
    print(0)
" 2>/dev/null || echo 0)
        # Under pipefail, a failing pgbackrest can make both python and the
        # '|| echo 0' fallback print a 0 ; keep only the last line.
        last_backup_ts=${last_backup_ts##*$'\n'}
        now_ts=$(date +%s)
        age_seconds=$(( now_ts - last_backup_ts ))
        if [ "$last_backup_ts" -eq 0 ]; then
          fail "pgBackRest backup-freshness check failed : could not parse 'pgbackrest info'. Set SKIP_BACKUP_FRESHNESS=1 to override (only after manually verifying a recent backup exists)." 2
        fi
        if [ "$age_seconds" -gt 86400 ]; then
          age_hours=$(( age_seconds / 3600 ))
          fail "pgBackRest most recent backup is ${age_hours}h old (threshold 24h). Run a backup before the game day, or set SKIP_BACKUP_FRESHNESS=1 if you've validated freshness another way." 2
        fi
        log "pre-flight : pgBackRest most recent backup is $(( age_seconds / 3600 ))h$(( (age_seconds % 3600) / 60 ))m old (< 24h threshold) — OK"
      else
        log "WARN : pgbackrest CLI not on \$PATH ; skipping backup-freshness check. Set SKIP_BACKUP_FRESHNESS=1 to silence this warning if intentional."
      fi
    fi
    # Final type-the-word confirm. Everything above can be set in env
    # by mistake ; this last step requires a human at the keyboard.
    cat >&2 <<EOF
================================================================
PROD GAME DAY — final confirmation
================================================================
inventory : prod
scenarios : ${SCENARIOS[*]} ${ONLY:+(filtered by ONLY=$ONLY)} ${SKIP:+(filtered by SKIP=$SKIP)}
session   : $SESSION_LOG
Each scenario triggers a real outage. Type the literal phrase
KILL-PROD (any other input aborts) to proceed :
EOF
    # On EOF (no TTY, closed stdin) read returns non-zero ; treat that as
    # an empty answer so set -e does not exit with the wrong code.
    read -r confirm_phrase || confirm_phrase=""
    if [ "$confirm_phrase" != "KILL-PROD" ]; then
      fail "operator did not confirm KILL-PROD ($confirm_phrase) — aborting" 2
    fi
    ;;
  *)
    fail "INVENTORY=$INVENTORY not recognised — must be one of staging|prod" 2
    ;;
esac
# Pre-flight : every selected scenario script must exist + be executable.
for s in "${SCENARIOS[@]}"; do
  if want "$s"; then
    script="${SCENARIO_SCRIPT[$s]}"
    if [ ! -x "$script" ]; then
      fail "scenario $s : script $script not found or not executable" 2
    fi
  fi
done
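The pre-flight loop relies on a `want` helper that is not shown in this hunk. The sketch below is one plausible shape for it, assuming `ONLY` and `SKIP` are strings of scenario letters (as in `ONLY=A` or `SKIP=DE`); the real helper may differ.

```shell
# Hypothetical sketch of want(); the driver's own definition is not shown here.
# Returns 0 when scenario letter $1 is selected, 1 when it is filtered out.
want() {
  local s=$1
  # ONLY set: the scenario must appear in it.
  if [ -n "${ONLY:-}" ]; then
    case "$ONLY" in
      *"$s"*) ;;
      *) return 1 ;;
    esac
  fi
  # SKIP set: the scenario must not appear in it.
  if [ -n "${SKIP:-}" ]; then
    case "$SKIP" in
      *"$s"*) return 1 ;;
    esac
  fi
  return 0
}
```

With both variables empty every scenario is selected, which matches the driver's default of walking A-E in sequence.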
declare -A SCENARIO_RESULT
declare -A SCENARIO_DURATION
log "================================================================"
log " Game day session : $SESSION_DATE"
fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook
The game-day driver had no notion of inventory — it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging where chaos is
the point ; on prod, an accidental run on a Monday morning would
cost a real outage.
Added :
scripts/security/game-day-driver.sh
* INVENTORY env var — defaults to 'staging' so silence stays
safe. INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
type-the-phrase 'KILL-PROD' confirm. Anything other than
staging|prod aborts.
* Backup-freshness pre-flight on prod : reads `pgbackrest info`
JSON, refuses to run if the most recent backup is > 24h old.
SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline.
* Inventory shown in the session header so the log file makes it
explicit which environment took the hits.
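The two guards above can be sketched as small pure functions. This is an illustration under stated assumptions, not the driver's code: the function names, argument order, and the exit code 2 are hypothetical, and the backup timestamp is passed in as an epoch rather than parsed from the `pgbackrest info` JSON.

```shell
# Hypothetical sketch; names and exit codes are illustrative only.
# check_inventory <inventory> <confirm_prod> <typed_phrase>
check_inventory() {
  local inventory=${1:-staging} confirm=${2:-0} phrase=${3:-}
  case "$inventory" in
    staging)
      return 0 ;;
    prod)
      [ "$confirm" = "1" ] || { echo "prod requires CONFIRM_PROD=1" >&2; return 2; }
      [ "$phrase" = "KILL-PROD" ] || { echo "confirmation phrase mismatch" >&2; return 2; }
      return 0 ;;
    *)
      echo "unknown inventory: $inventory" >&2
      return 2 ;;
  esac
}

# backup_is_fresh <last_backup_stop_epoch> [now_epoch] [max_age_seconds]
# Succeeds when the backup is at most max_age seconds old (default 24h).
backup_is_fresh() {
  local stop=$1 now=${2:-$(date +%s)} max_age=${3:-86400}
  [ $((now - stop)) -le "$max_age" ]
}
```

In the driver the phrase would presumably come from an interactive `read -r`, and the stop epoch from `pgbackrest info --output=json`. Keeping the checks free of I/O makes them easy to exercise without touching a real inventory.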
docs/runbooks/rabbitmq-down.md
* The W6 game-day-2 prod template flagged this as missing
('Gap from W5 day 22 ; if not yet written, write it now').
Mirrors the structure of redis-down.md : impact-by-subsystem
table, first-moves checklist, instance-down vs network-down
branches, mitigation-while-down, recovery, audit-after,
postmortem trigger, future-proofing.
* Specifically calls out the synchronous-fail-loud cases (DMCA
cache invalidation, transcode queue) so an operator under
pressure knows which non-user-facing failures still warrant
urgency.
Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior watching over
their shoulder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:32:05 +00:00
log " Inventory : $INVENTORY"
log " Session log : $SESSION_LOG"
log " Scenarios run : ${SCENARIOS[*]}"
[ -n "$ONLY" ] && log " ONLY filter : $ONLY"
[ -n "$SKIP" ] && log " SKIP filter : $SKIP"
log "================================================================"
for s in "${SCENARIOS[@]}"; do
  if ! want "$s"; then
    SCENARIO_RESULT[$s]="SKIPPED"
    SCENARIO_DURATION[$s]="-"
    continue
  fi
  log ""
  log "── scenario $s : ${SCENARIO_DESC[$s]} ──────────────────────────"
  t0=$(date +%s)
  # Suspend errexit so a failing scenario is recorded instead of aborting the session.
  set +e
  "${SCENARIO_SCRIPT[$s]}" 2>&1 | tee -a "$SESSION_LOG"
  rc=${PIPESTATUS[0]}  # exit status of the scenario script, not of tee
  set -e
  elapsed=$(( $(date +%s) - t0 ))
  SCENARIO_DURATION[$s]="${elapsed}s"
  if [ "$rc" -eq 0 ]; then
    SCENARIO_RESULT[$s]="PASS"
    log "scenario $s : PASS in ${elapsed}s"
  else
    SCENARIO_RESULT[$s]="FAIL (exit $rc)"
    log "scenario $s : FAIL (exit $rc) after ${elapsed}s"
  fi
done
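The loop above reads `${PIPESTATUS[0]}` because the exit status of a pipeline is normally that of its last command, here `tee`. The wrapper below isolates that pattern as a minimal bash sketch; `run_logged` is a hypothetical name and is not part of the driver.

```shell
# Illustration only: run_logged is a hypothetical wrapper (bash-specific).
# Usage: run_logged <logfile> <command> [args...]
# Streams the command's stdout+stderr to the terminal and appends it to the
# log, then returns the command's own exit status.
run_logged() {
  local logfile=$1 rc
  shift
  "$@" 2>&1 | tee -a "$logfile"
  rc=${PIPESTATUS[0]}   # must be read immediately after the pipeline
  return "$rc"
}
```

`PIPESTATUS` is overwritten by the next command, so it has to be captured on the line directly after the pipeline, as the driver does with `rc`.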
log ""
log "================================================================"
log "Session summary"
log "----------------------------------------------------------------"
printf '%-3s | %-12s | %-8s | %s\n' "ID" "result" "duration" "scenario" | tee -a "$SESSION_LOG" >&2
printf '%-3s-+-%-12s-+-%-8s-+-%s\n' "---" "------------" "--------" "$(printf '%.0s-' {1..50})" | tee -a "$SESSION_LOG" >&2
overall=0
for s in "${SCENARIOS[@]}"; do
  result=${SCENARIO_RESULT[$s]}
  duration=${SCENARIO_DURATION[$s]}
  printf '%-3s | %-12s | %-8s | %s\n' "$s" "$result" "$duration" "${SCENARIO_DESC[$s]}" \
    | tee -a "$SESSION_LOG" >&2
  if [[ "$result" == "FAIL"* ]]; then overall=1; fi
done
log "================================================================"
log ""
log "Operator next steps :"
log " 1. Open the runbook template :"
log " docs/runbooks/game-days/$SESSION_DATE.md"
log " (copy from docs/runbooks/game-days/TEMPLATE.md if missing)"
log " 2. For each scenario, fill : timestamp, action, observation,"
log " runbook used, gap discovered."
log " 3. File one PR per gap that needs a code or runbook fix."
log ""
if [ "$overall" -eq 0 ]; then
  log "PASS : every selected scenario passed."
else
  log "FAIL : at least one scenario failed — review $SESSION_LOG."
fi
exit "$overall"