
# Game day session — `<YYYY-MM-DD>`

> **Driver**: `<name> (<role>)`
> **Observers**: `<list>`
> **Environment**: staging / lab / prod-canary
> **Goal**: verify the runbooks in `docs/runbooks/` work end-to-end.

## Pre-flight

- [ ] All target services healthy at start (check `kubectl get pods` / `incus list` / Grafana cluster overview)
- [ ] On-call team notified in `#engineering` 1 h before kickoff so a drill-induced page doesn't surprise them
- [ ] PagerDuty schedule overridden to silence pages on the test environment (or agreed beforehand that test pages will be acknowledged silently)
- [ ] Driver script ready: `bash scripts/security/game-day-driver.sh --help`

## Session log

For each scenario, fill in the table immediately after running the smoke test.

### Scenario A — Postgres primary failover

| Field             | Value                                                                       |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC     |                                                                             |
| Action            | `incus stop --force pgaf-primary`                                           |
| Observation       | _e.g. failover took 38 s, no client-visible 5xx, alert `PostgresPrimaryUnreachable` fired in 25 s_ |
| Runbook used      | [`db-failover.md`](../db-failover.md)                                       |
| Gap discovered    | _e.g. step 3 mentions a script that no longer exists — file PR to fix_       |

### Scenario B — HAProxy `backend-api-1` failover

| Field             | Value                                                                       |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC     |                                                                             |
| Action            | `incus stop --force backend-api-1`                                          |
| Observation       |                                                                             |
| Runbook used      | _add path here; if no runbook exists, this is a gap_                        |
| Gap discovered    |                                                                             |

### Scenario C — Redis Sentinel master promotion

| Field             | Value                                                                       |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC     |                                                                             |
| Action            | `incus stop --force redis-1` (or whichever Sentinel reports as master)      |
| Observation       |                                                                             |
| Runbook used      | [`redis-down.md`](../redis-down.md)                                         |
| Gap discovered    |                                                                             |

### Scenario D — MinIO 2-node loss, EC:2 reconstruction

| Field             | Value                                                                       |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC     |                                                                             |
| Action            | `KILL_NODES="minio-2 minio-3" bash infra/ansible/tests/test_minio_resilience.sh` |
| Observation       |                                                                             |
| Runbook used      | _add path; nothing dedicated yet, open an issue if needed_                  |
| Gap discovered    |                                                                             |

### Scenario E — RabbitMQ outage, backend stays up

| Field             | Value                                                                       |
| ----------------- | --------------------------------------------------------------------------- |
| Timestamp UTC     |                                                                             |
| Action            | `incus stop --force rabbitmq` (60 s window)                                 |
| Observation       |                                                                             |
| Runbook used      | _add path; to be written on W5 day 22 if missing_                           |
| Gap discovered    |                                                                             |

## Acceptance gate

- [ ] No silent failure in any of the 5 scenarios
- [ ] Longest run of consecutive 5xx responses ≤ 30 s
- [ ] Every expected Prometheus alert fired ≤ 1 min after the inducing event
- [ ] Every scenario has a documented runbook (file the gap as a PR if missing)
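
The two numeric thresholds above can be checked mechanically once the session data is collected. The following is a minimal sketch, not part of `game-day-driver.sh`: the function names (`max_5xx_run`, `gate_passes`) and the input shapes (a time-sorted list of `(timestamp, http_status)` probe samples, and a list of `(event_at, fired_at)` pairs per alert) are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def max_5xx_run(samples):
    """Return the longest stretch of back-to-back 5xx responses.

    `samples` is an iterable of (timestamp, http_status) tuples sorted by
    timestamp (hypothetical probe output). A run is measured from its first
    5xx sample to its last 5xx sample; any non-5xx sample ends the run.
    """
    longest = timedelta(0)
    run_start = None
    for ts, status in samples:
        if 500 <= status <= 599:
            if run_start is None:
                run_start = ts
            longest = max(longest, ts - run_start)
        else:
            run_start = None
    return longest


def gate_passes(samples, alerts,
                max_run=timedelta(seconds=30),
                max_delay=timedelta(minutes=1)):
    """Check the two numeric gate items.

    `alerts` is a list of (event_at, fired_at) pairs; fired_at is None when
    the expected alert never fired, which counts as a failure.
    """
    if max_5xx_run(samples) > max_run:
        return False
    return all(
        fired_at is not None and fired_at - event_at <= max_delay
        for event_at, fired_at in alerts
    )


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, 0)  # illustrative timestamp only
    probe = [
        (t0, 200),
        (t0 + timedelta(seconds=5), 503),
        (t0 + timedelta(seconds=20), 502),
        (t0 + timedelta(seconds=40), 200),
    ]
    fired = [(t0 + timedelta(seconds=5), t0 + timedelta(seconds=30))]
    print(max_5xx_run(probe), gate_passes(probe, fired))
```

Measuring a run between its first and last 5xx sample slightly underestimates the real outage when the probe interval is coarse, so a short probe interval (a few seconds) keeps the 30 s check meaningful.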

## PRs filed from this session

Track here so the next session knows what was actioned:

- `<branch> — <title>` — link
- `<branch> — <title>` — link

## Take-aways

Free-form. What did we learn? What surprised us? What will we change for the next drill?