# Game day session — `<YYYY-MM-DD>`

> **Driver**: `<name> (<role>)`
> **Observers**: `<list>`
> **Environment**: staging / lab / prod-canary
> **Goal**: verify the runbooks in `docs/runbooks/` work end-to-end.

## Pre-flight

- [ ] All target services healthy at start (check with `kubectl get pods`, `incus list`, or the Grafana cluster overview)
- [ ] On-call team notified in `#engineering` 1 h before kickoff so a real page doesn't surprise them
- [ ] PagerDuty schedule overridden to silence pages on the test environment (or pre-agree that test pages will be acknowledged silently)
- [ ] Driver script ready: `bash scripts/security/game-day-driver.sh --help`

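The first pre-flight item can be checked mechanically instead of by eye. The sketch below is one possible helper, not part of the driver script: `unhealthy_instances` is a hypothetical name, and it assumes input in the `name,STATUS` CSV shape that `incus list --format csv -c ns` produces.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of game-day-driver.sh).
# Reads "name,STATUS" CSV lines on stdin and prints the name of every
# instance whose status is not RUNNING. Exits non-zero if any were found.
unhealthy_instances() {
  awk -F, 'NF >= 2 && $2 != "RUNNING" { print $1; bad = 1 } END { exit bad }'
}
```

A possible use before kickoff, assuming the Incus host is reachable from the driver's shell: `incus list --format csv -c ns | unhealthy_instances || echo "fix these before starting"`.
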
## Session log

For each scenario, fill in the table immediately after running the smoke test.

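To avoid reconstructing the `Timestamp UTC` field from memory, the driver can print a pre-filled table at the moment the action is run. This is a minimal sketch under assumptions: `scenario_table` is a hypothetical function, and the runbook path passed to it is whatever the driver considers relevant.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a scenario table with the current UTC
# timestamp filled in. $1 = action that was run, $2 = runbook path.
scenario_table() {
  local action="$1" runbook="$2" ts
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf '| Field | Value |\n'
  printf '| ----- | ----- |\n'
  printf '| Timestamp UTC | %s |\n' "$ts"
  printf '| Action | `%s` |\n' "$action"
  printf '| Observation | |\n'
  printf '| Runbook used | %s |\n' "$runbook"
  printf '| Gap discovered | |\n'
}
```

The output can be pasted straight into the session log; the observation and gap cells are left empty on purpose so they are filled in by a human.
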
### Scenario A — Postgres primary failover

| Field | Value |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC | |
| Action | `incus stop --force pgaf-primary` |
| Observation | _e.g. failover took 38 s, no client-visible 5xx, alert `PostgresPrimaryUnreachable` fired in 25 s_ |
| Runbook used | [`db-failover.md`](../db-failover.md) |
| Gap discovered | _e.g. step 3 mentions a script that no longer exists — file PR to fix_ |

### Scenario B — HAProxy backend-api-1 failover

| Field | Value |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC | |
| Action | `incus stop --force backend-api-1` |
| Observation | |
| Runbook used | _add path here; if no runbook exists, this is a gap_ |
| Gap discovered | |

### Scenario C — Redis Sentinel master promotion

| Field | Value |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC | |
| Action | `incus stop --force redis-1` (or whichever Sentinel reports as master) |
| Observation | |
| Runbook used | [`redis-down.md`](../redis-down.md) |
| Gap discovered | |

### Scenario D — MinIO 2-node loss, EC:2 reconstruction

| Field | Value |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC | |
| Action | `KILL_NODES="minio-2 minio-3" bash infra/ansible/tests/test_minio_resilience.sh` |
| Observation | |
| Runbook used | _add path; nothing dedicated yet, open an issue if needed_ |
| Gap discovered | |

### Scenario E — RabbitMQ outage, backend stays up

| Field | Value |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC | |
| Action | `incus stop --force rabbitmq` (60 s window) |
| Observation | |
| Runbook used | _add path; to be written on W5 day 22 if missing_ |
| Gap discovered | |

## Acceptance gate

- [ ] No silent failure across the 5 scenarios
- [ ] Longest run of consecutive 5xx responses ≤ 30 s
- [ ] Every Prometheus alert fired ≤ 1 min after the inducing event
- [ ] Every scenario has a documented runbook (file the gap as a PR if missing)

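The two timing gates above can be evaluated from recorded data rather than estimated. The sketch below assumes a probe log with one `<epoch_seconds> <http_status>` sample per line, sorted by time; that format and the function names `longest_5xx_run` and `alert_within` are illustrative assumptions, not something the driver script is known to produce.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the timing gates.

# Input: "<epoch_seconds> <http_status>" per line, sorted by time.
# Output: seconds between the first and last sample of the longest
# unbroken run of 5xx responses (0 if there is none or only single hits).
longest_5xx_run() {
  awk '
    $2 >= 500 && $2 <= 599 {
      if (!in_run) { start = $1; in_run = 1 }
      if ($1 - start > longest) longest = $1 - start
      next
    }
    { in_run = 0 }
    END { print longest + 0 }
  '
}

# Succeeds when the alert fired within the limit (default 60 s).
# $1 = event epoch, $2 = alert epoch, $3 = optional limit in seconds.
alert_within() {
  local event_ts="$1" alert_ts="$2" limit="${3:-60}"
  [ $(( alert_ts - event_ts )) -le "$limit" ]
}
```

For example, `[ "$(longest_5xx_run < probe.log)" -le 30 ]` would express the 5xx gate, with `probe.log` standing in for wherever the samples were written. Because the run is measured between samples, a coarse probe interval underestimates the real outage, so the probe should sample well below the 30 s threshold.
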
## PRs filed from this session

Track them here so the next session knows what was actioned:

- `<branch> — <title>` — link
- `<branch> — <title>` — link

## Take-aways

Free-form. What did we learn? What surprised us? What will we change for the next drill?