# Game day session — ``

> **Driver** : ` ()`
> **Observers** : ``
> **Environment** : staging / lab / prod-canary
> **Goal** : verify the runbooks in `docs/runbooks/` work end-to-end.

## Pre-flight

- [ ] All target services healthy at start (run `kubectl get pods` / `incus list` / Grafana cluster overview)
- [ ] On-call team notified in `#engineering` 1 h before kickoff so a real page doesn't surprise them
- [ ] PagerDuty schedule overridden to silence pages on the test environment (or pre-agree the test pages will be acknowledged silently)
- [ ] Driver script ready : `bash scripts/security/game-day-driver.sh --help`

## Session log

For each scenario, fill the row immediately after running the smoke test.

### Scenario A — Postgres primary failover

| Field             | Value                                                                       |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC     |                                                                             |
| Action            | `incus stop --force pgaf-primary`                                           |
| Observation       | _e.g. failover took 38 s, no client-visible 5xx, alert `PostgresPrimaryUnreachable` fired in 25 s_ |
| Runbook used      | [`db-failover.md`](../db-failover.md)                                       |
| Gap discovered    | _e.g. step 3 mentions a script that no longer exists — file PR to fix_      |

### Scenario B — HAProxy backend-api 1 fail-over

| Field             | Value                                                                       |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC     |                                                                             |
| Action            | `incus stop --force backend-api-1`                                          |
| Observation       |                                                                             |
| Runbook used      | _add path here ; if no runbook exists this is a gap_                        |
| Gap discovered    |                                                                             |

### Scenario C — Redis Sentinel master promotion

| Field             | Value                                                                       |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC     |                                                                             |
| Action            | `incus stop --force redis-1` (or whichever Sentinel reports as master)      |
| Observation       |                                                                             |
| Runbook used      | [`redis-down.md`](../redis-down.md)                                         |
| Gap discovered    |                                                                             |

### Scenario D — MinIO 2-node loss EC:2 reconstruction

| Field             | Value                                                                       |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC     |                                                                             |
| Action            | `KILL_NODES="minio-2 minio-3" bash infra/ansible/tests/test_minio_resilience.sh` |
| Observation       |                                                                             |
| Runbook used      | _add path ; nothing dedicated yet, open issue if needed_                    |
| Gap discovered    |                                                                             |

### Scenario E — RabbitMQ outage backend stays up

| Field             | Value                                                                       |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC     |                                                                             |
| Action            | `incus stop --force rabbitmq` (60 s window)                                 |
| Observation       |                                                                             |
| Runbook used      | _add path ; W5 day 22 to write if missing_                                  |
| Gap discovered    |                                                                             |

## Acceptance gate

- [ ] No silent fail across the 5 scenarios
- [ ] Max consecutive 5xx run ≤ 30 s
- [ ] Every Prometheus alert fired ≤ 1 min after the inducing event
- [ ] Every scenario has a documented runbook (file the gap as a PR if missing)

## PRs filed from this session

Track here so the next session knows what was actioned :

- `` — link
- `<branch> — <title>` — link

## Take-aways

Free-form. What did we learn ? What surprised us ? What will we change for the next drill ?
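
The two numeric thresholds in the acceptance gate (max consecutive 5xx run ≤ 30 s, alert fired ≤ 1 min after the inducing event) can be checked mechanically once the session log is filled in. The following is a minimal Python sketch, not part of the repository's tooling. It assumes you have already collected, for one scenario, a time-ordered list of `(timestamp, HTTP status)` probe samples plus the times of the inducing event and of the alert. The function names, the sample format, and the choice to measure a run from its first to its last 5xx sample are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Thresholds taken from the acceptance gate in this template.
MAX_5XX_RUN = timedelta(seconds=30)
MAX_ALERT_DELAY = timedelta(minutes=1)


def max_consecutive_5xx_run(samples):
    """Return the longest unbroken 5xx run as a timedelta.

    `samples` is an iterable of (datetime, int status) pairs sorted by time
    (hypothetical format). A run is measured from its first 5xx sample to its
    last one, so a single isolated 5xx counts as a run of zero seconds.
    """
    longest = timedelta(0)
    run_start = None
    for ts, status in samples:
        if 500 <= status <= 599:
            if run_start is None:
                run_start = ts
            longest = max(longest, ts - run_start)
        else:
            run_start = None  # any non-5xx response ends the current run
    return longest


def gate_passes(samples, event_at, alert_fired_at):
    """Check the two numeric gate items for a single scenario.

    `alert_fired_at` may be None when no alert fired, which fails the gate.
    """
    if alert_fired_at is None:
        return False
    run_ok = max_consecutive_5xx_run(samples) <= MAX_5XX_RUN
    alert_ok = (alert_fired_at - event_at) <= MAX_ALERT_DELAY
    return run_ok and alert_ok


if __name__ == "__main__":
    # Made-up probe data: one sample every 5 s after the inducing event.
    t0 = datetime(2024, 1, 1, 12, 0, 0)
    statuses = [200, 503, 503, 503, 200, 502, 200]
    samples = [(t0 + timedelta(seconds=5 * i), s) for i, s in enumerate(statuses)]
    print(max_consecutive_5xx_run(samples))
    print(gate_passes(samples, t0, t0 + timedelta(seconds=25)))
```

The remaining gate items (no silent fail, a runbook for every scenario) are judgment calls that stay with the driver and observers; this sketch only covers the two items that reduce to timestamp arithmetic.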