fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook

The game-day driver had no notion of inventory — it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging, where chaos is
the point; on prod, an accidental run on a Monday morning would
cost a real outage.

Added:
scripts/security/game-day-driver.sh

* INVENTORY env var — defaults to 'staging' so silence stays
  safe. INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
  type-the-phrase 'KILL-PROD' confirm. Anything other than
  staging|prod aborts.
* Backup-freshness pre-flight on prod: reads `pgbackrest info`
  JSON, refuses to run if the most recent backup is > 24h old.
  SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline.
* Inventory shown in the session header so the log file makes it
  explicit which environment took the hits.
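The inventory gate reduces to a small case statement. A minimal offline sketch of that logic (the `gate` function name is hypothetical, and this omits the interactive KILL-PROD confirm and backup pre-flight the real driver adds):

```bash
# gate INVENTORY [CONFIRM_PROD] -> returns 0 if allowed, 2 if refused.
# Simplified sketch of the driver's gating, not the driver itself.
gate() {
  case "$1" in
    staging|stg|dev|local) return 0 ;;
    prod|production)       [ "${2:-0}" = "1" ] && return 0 || return 2 ;;
    *)                     return 2 ;;
  esac
}

gate staging && echo "staging: allowed"
gate prod    || echo "prod without CONFIRM_PROD=1: refused"
gate prod 1  && echo "prod with CONFIRM_PROD=1: allowed (interactive confirm still pending)"
```

The unrecognized-inventory branch falling through to refusal is what makes a typo'd `INVENTORY=pord` safe rather than silently staging.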
docs/runbooks/rabbitmq-down.md

* The W6 game-day-2 prod template flagged this as missing
  ('Gap from W5 day 22; if not yet written, write it now').
  Mirrors the structure of redis-down.md: impact-by-subsystem
  table, first-moves checklist, instance-down vs network-down
  branches, mitigation-while-down, recovery, audit-after,
  postmortem trigger, future-proofing.
* Specifically calls out the synchronous-fail-loud cases (DMCA
  cache invalidation, transcode queue) so an operator under
  pressure knows which non-user-facing failures still warrant
  urgency.

Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior looking over
their shoulder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent e780fbcd18
commit 2a5bc11628
2 changed files with 275 additions and 4 deletions
docs/runbooks/rabbitmq-down.md (new file, 164 lines)

@@ -0,0 +1,164 @@
# Runbook — RabbitMQ unavailable

> **Alert**: `RabbitMQUnreachable` (in `config/prometheus/alert_rules.yml`).
> **Owner**: infra on-call.
> **Game-day scenario**: E (`infra/ansible/tests/test_rabbitmq_outage.sh`).
## What breaks when RabbitMQ is down

RabbitMQ is a fan-out broker for asynchronous, non-user-facing work
(transcode jobs, distribution to external platforms, email digests,
DMCA takedown propagation, search index updates). The user-facing
request path does NOT block on RabbitMQ — the API publishes a message
and returns 202 Accepted; the worker picks it up later.
| Subsystem | Effect when RabbitMQ is gone | Severity |
| --- | --- | --- |
| Track upload → HLS transcode | Upload succeeds (S3 write OK), HLS segments don't appear | **MEDIUM** — track playable via fallback `/stream`, not via HLS |
| Distribution to Spotify/SoundCloud | Submission silently queued; users see "pending" forever | MEDIUM — surfaces in distribution dashboard, not in player |
| Email digest (weekly creator stats) | Cron tick logs `publish failed`, retries on next tick | LOW — eventual consistency, no user-visible breakage |
| DMCA takedown event | Track flag flipped in DB synchronously; downstream replay queue stalls | **HIGH** — track is gated immediately (synchronous DB UPDATE), but cache invalidation lags |
| Search index updates | New tracks not searchable until queue drains | LOW — falls back to Postgres FTS |
| Chat messages (WebSocket) | INDEPENDENT — chat is direct WS, no RabbitMQ involvement | NONE |
| Auth, sessions, payments | INDEPENDENT — no RabbitMQ dependency | NONE |
The synchronous-fail-loud cases (DMCA cache invalidation, transcode
queue) are the ones that compound if the outage drags. Most user
flows degrade gracefully.
## First moves

1. **Confirm RabbitMQ is actually down**, not "unreachable from one
   host":

   ```bash
   curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
     | jq '.cluster_name, .object_totals'
   ```

2. **Confirm what changed.** If a deploy fired in the last 30 min,
   suspect the deploy. Check `journalctl -u veza-backend-api -n 200`
   for `amqp` errors with timestamps after the deploy.

3. **Check the queues didn't fill the disk** (most common bring-down
   in development):

   ```bash
   ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'
   ```
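The df check can be scripted against a threshold. A sketch using a canned df line so it runs offline; the 90% cutoff is an assumption for illustration, not a documented limit:

```bash
# Parse the use% column out of a df line and warn past a hypothetical
# 90% threshold. Swap the canned line for:
#   df -h /var/lib/rabbitmq | tail -1
df_line='/dev/sdb1  20G  19G  600M  97% /var/lib/rabbitmq'
use_pct=$(echo "$df_line" | awk '{sub(/%/, "", $5); print $5}')
if [ "$use_pct" -ge 90 ]; then
  echo "disk pressure on /var/lib/rabbitmq: ${use_pct}% used"
fi
```

Useful as a cron'd early warning so the disk-alarm branch below never fires in the first place.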
## RabbitMQ instance is down

```bash
# State on the RabbitMQ host:
ssh rabbitmq.lxd sudo systemctl status rabbitmq-server

# Logs (Erlang verbosity, grep for ERROR/CRASH):
ssh rabbitmq.lxd sudo journalctl -u rabbitmq-server -n 500 \
  | grep -E 'ERROR|CRASH|disk_alarm|memory_alarm'
```

Common causes:

- **Disk alarm.** `/var/lib/rabbitmq` filled — RabbitMQ pauses producers
  when free space drops below `disk_free_limit`. The backend's amqp
  client surfaces this as "blocked". Fix: grow the disk or expire old
  messages with `rabbitmqctl purge_queue <queue>` (last resort, you
  lose what's in there).
- **Memory alarm.** RSS over `vm_memory_high_watermark` × system mem.
  Same effect (producers blocked). Fix: add memory or unblock by
  draining a slow consumer.
- **Process crashed.** Erlang OOM, segfault. `sudo systemctl restart
  rabbitmq-server`; the queues survive (durable=true on every queue
  we declare).
- **Cluster split-brain.** v1.0 is single-node, so this can't happen
  yet. Listed for the v1.1 multi-node config.
## Backend can't reach RabbitMQ

Network or DNS issue, not RabbitMQ's fault.

```bash
# From the API container:
nc -zv rabbitmq.lxd 5672

# DNS:
getent hosts rabbitmq.lxd

# AMQP credentials:
docker exec veza_backend_api env | grep AMQP_URL
```

Likely culprits: Incus bridge restart, password rotation didn't
propagate to the API container's env, security-group change.
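When `AMQP_URL` is the only thing you can read out of the container, a small parser feeds the `nc` and `getent` checks above. Hypothetical helper (not part of the deployed tooling), assuming the usual `amqp://user:pass@host:port/vhost` shape; a URL without an explicit port will confuse it:

```bash
# Extract "host port" from an AMQP URL using pure parameter expansion.
amqp_hostport() {
  local rest=${1#*://}   # drop the amqp:// scheme
  rest=${rest##*@}       # drop user:pass@ credentials
  rest=${rest%%/*}       # drop the /vhost suffix
  echo "${rest%%:*} ${rest##*:}"
}

amqp_hostport 'amqp://veza:s3cret@rabbitmq.lxd:5672/prod'   # rabbitmq.lxd 5672
```

Typical use: `read -r host port <<<"$(amqp_hostport "$AMQP_URL")" && nc -zv "$host" "$port"`.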
## Mitigation while RabbitMQ is down

The backend already handles publish failures gracefully:

- `internal/eventbus/rabbitmq.go` retries with exponential backoff up
  to 30s, then drops to "degraded mode" (publish returns immediately
  with a logged warning, the API call succeeds, the side-effect is
  lost).
- Workers in `internal/workers/` have `WithRetry()` middleware that
  republishes failed deliveries up to 5 times before dead-lettering.

If recovery is going to take > 10 min, set
`EVENTBUS_DEGRADED_LOG_LEVEL=error` (default `warn`) so the
fail-fast logs land in Sentry and operators can audit which messages
were dropped.

**Do NOT** restart the backend to clear the AMQP connection pool;
the reconnect logic (`go.uber.org/zap`-logged in eventbus.go:142)
handles it once RabbitMQ is back.
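The retry cadence matters when judging "still reconnecting" vs "wedged". A sketch of what a capped exponential backoff looks like, assuming a doubling-from-1s shape (the actual schedule in rabbitmq.go may differ):

```bash
# Print successive retry delays (doubling, capped at 30s) until the
# cumulative wait passes the first argument, in seconds.
backoff_schedule() {
  local delay=1 total=0 out=""
  while [ "$total" -lt "$1" ]; do
    out="$out${out:+ }${delay}s"
    total=$(( total + delay ))
    delay=$(( delay * 2 ))
    [ "$delay" -gt 30 ] && delay=30
  done
  echo "$out"
}

backoff_schedule 60   # 1s 2s 4s 8s 16s 30s
```

Under this shape the client settles into a steady 30s cadence within a minute, so a reconnect showing up in the logs every 30s is healthy, not stuck.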
## Recovery

Once RabbitMQ is back up:

1. Verify connectivity from each backend instance:

   ```bash
   docker exec veza_backend_api sh -c 'echo -e "AMQP\x00\x00\x09\x01" | nc -w1 rabbitmq.lxd 5672 | head -c 4'
   ```

   Should return `AMQP`.

2. Watch the queue depth on the management UI:
   `http://rabbitmq.lxd:15672/#/queues`. Expect `transcode_jobs`,
   `distribution_outbox`, `dmca_propagation`, `search_index_updates`
   to drain over the next 5-15 min as the workers catch up.

3. If a queue is stuck > 30 min after recovery, the worker for it is
   wedged — restart that specific worker container:

   ```bash
   docker compose -f docker-compose.prod.yml restart worker-<name>
   ```
## Audit after the outage

1. Sentry filter `tag:eventbus.status=degraded` between outage start
   and end — gives you the count and shape of dropped events.

2. For each dropped DMCA event, manually trigger the cache flush:

   ```bash
   curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
     https://api.veza.fr/api/v1/admin/cache/dmca/flush
   ```

3. For each dropped transcode job, requeue from the tracks table:

   ```bash
   psql "$DATABASE_URL" -c "
   INSERT INTO transcode_jobs (track_id, status, attempts, created_at)
   SELECT id, 'pending', 0, NOW() FROM tracks
   WHERE created_at BETWEEN '<outage_start>' AND '<outage_end>'
     AND hls_status IS NULL;
   "
   ```
## Postmortem trigger

Any RabbitMQ outage > 10 min triggers a postmortem. The non-user-facing
nature makes this less urgent than Redis or Postgres, but the
silent-failure modes (dropped DMCA propagation, missing transcodes)
warrant a write-up so we know what slipped through.
## Future-proofing

- v1.1 will move to a 3-node RabbitMQ cluster behind a load balancer
  for HA. This runbook will then split into "single-node down" (the
  cluster keeps serving) and "cluster split-brain" (rare, but the
  recovery path is different).
- Worker idempotency keys are documented in `docs/api/eventbus.md` —
  any new worker MUST honour them so a replay during recovery doesn't
  double-charge / double-distribute / double-takedown.
scripts/security/game-day-driver.sh

@@ -16,18 +16,26 @@
 # E : test_rabbitmq_outage.sh — stop RabbitMQ 60s, backend stays up
 #
 # Usage :
-# bash scripts/security/game-day-driver.sh # run all scenarios
-# SKIP=DE bash scripts/security/game-day-driver.sh # skip scenarios D + E
-# ONLY=A bash scripts/security/game-day-driver.sh # only run scenario A
+# bash scripts/security/game-day-driver.sh # all scenarios on staging (default)
+# SKIP=DE bash scripts/security/game-day-driver.sh # skip D + E
+# ONLY=A bash scripts/security/game-day-driver.sh # only A
+# INVENTORY=prod CONFIRM_PROD=1 bash scripts/security/game-day-driver.sh # prod (gated)
 #
 # Required env (passed through to the underlying smoke tests) :
 # REDIS_PASS / SENTINEL_PASS for scenario C
 # MINIO_ROOT_USER / MINIO_ROOT_PASSWORD for scenario D
 #
+# v1.0.10 polish — production gating :
+# INVENTORY=prod must be paired with CONFIRM_PROD=1 or the script
+# refuses to run, so a stale shell-history line can't accidentally
+# kill prod Postgres on a Monday morning. The driver also runs a
+# backup-freshness pre-flight when targeting prod (most recent
+# pgBackRest backup must be < 24 h old).
+#
 # Exit codes :
 # 0 — every selected scenario passed
 # 1 — at least one scenario failed
-# 2 — runner pre-flight failed (script missing, etc.)
+# 2 — runner pre-flight failed (script missing, prod safety guard tripped, stale backup, etc.)
 set -euo pipefail

 REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
@@ -41,6 +49,9 @@ mkdir -p "$LOGS_DIR"

ONLY=${ONLY:-}
SKIP=${SKIP:-}
INVENTORY=${INVENTORY:-staging}
CONFIRM_PROD=${CONFIRM_PROD:-0}
SKIP_BACKUP_FRESHNESS=${SKIP_BACKUP_FRESHNESS:-0}

log() { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*" | tee -a "$SESSION_LOG" >&2; }
fail() { log "FAIL: $*"; exit "${2:-2}"; }
@@ -68,6 +79,101 @@ want() {
  return 0
}

# v1.0.10 polish — prod safety gate. INVENTORY=prod requires
# CONFIRM_PROD=1 + an interactive type-the-word confirm. Anything else
# defaults to staging so a forgotten env-var doesn't matter.
case "$INVENTORY" in
  staging|stg|dev|local) ;;
  prod|production)
    if [ "$CONFIRM_PROD" != "1" ]; then
      cat >&2 <<EOF

================================================================
  ABORTING — INVENTORY=prod without CONFIRM_PROD=1
================================================================

This script will kill production services. Each scenario triggers a
real outage in the chosen inventory : Postgres primary kill, HAProxy
backend stop, Redis master kill, MinIO node loss, RabbitMQ stop.

To run on production, you must :

  1. Announce a maintenance window 24 h ahead (status page +
     #engineering channel).
  2. Set PagerDuty to maintenance mode for the affected services.
  3. Confirm pgBackRest's last backup is < 24 h old (this script
     auto-checks if you don't pass SKIP_BACKUP_FRESHNESS=1).
  4. Re-invoke with :

       INVENTORY=prod CONFIRM_PROD=1 \\
         bash scripts/security/game-day-driver.sh

The driver will then ask for one more interactive confirmation
(type the word KILL-PROD) before the first scenario fires.
================================================================
EOF
      exit 2
    fi

    # Backup-freshness pre-flight : refuse to run if the most recent
    # pgBackRest full/diff is > 24 h old. Recovery from a stale backup
    # can extend an outage from minutes to hours, so the cost of
    # postponing the game day is much less than the cost of compounded
    # data loss if scenario A fails to recover and we have to restore
    # from yesterday-but-one.
    if [ "$SKIP_BACKUP_FRESHNESS" != "1" ]; then
      if command -v pgbackrest >/dev/null 2>&1; then
        last_backup_ts=$(pgbackrest --stanza=veza info --output=json 2>/dev/null \
          | python3 -c "
import json, sys
try:
    data = json.load(sys.stdin)
    backups = data[0]['backup'] if data else []
    if not backups: print(0); sys.exit(0)
    print(max(b['timestamp']['stop'] for b in backups))
except Exception:
    print(0)
" 2>/dev/null || echo 0)
        now_ts=$(date +%s)
        age_seconds=$(( now_ts - last_backup_ts ))
        if [ "$last_backup_ts" -eq 0 ]; then
          fail "pgBackRest backup-freshness check failed : could not parse 'pgbackrest info'. Set SKIP_BACKUP_FRESHNESS=1 to override (only after manually verifying a recent backup exists)." 2
        fi
        if [ "$age_seconds" -gt 86400 ]; then
          age_hours=$(( age_seconds / 3600 ))
          fail "pgBackRest most recent backup is ${age_hours}h old (threshold 24h). Run a backup before the game day, or set SKIP_BACKUP_FRESHNESS=1 if you've validated freshness another way." 2
        fi
        log "pre-flight : pgBackRest most recent backup is $(( age_seconds / 3600 ))h $(( (age_seconds % 3600) / 60 ))m old (< 24h threshold) — OK"
      else
        log "WARN : pgbackrest CLI not on \$PATH ; skipping backup-freshness check. Set SKIP_BACKUP_FRESHNESS=1 to silence this warning if intentional."
      fi
    fi

    # Final type-the-word confirm. Everything above can be set in env
    # by mistake ; this last step requires a human at the keyboard.
    cat >&2 <<EOF

================================================================
  PROD GAME DAY — final confirmation
================================================================

  inventory : prod
  scenarios : ${SCENARIOS[*]}${ONLY:+ (filtered by ONLY=$ONLY)}${SKIP:+ (filtered by SKIP=$SKIP)}
  session   : $SESSION_LOG

Each scenario triggers a real outage. Type the literal phrase
KILL-PROD (any other input aborts) to proceed :
EOF
    read -r confirm_phrase
    if [ "$confirm_phrase" != "KILL-PROD" ]; then
      fail "operator did not confirm KILL-PROD ($confirm_phrase) — aborting" 2
    fi
    ;;
  *)
    fail "INVENTORY=$INVENTORY not recognised — must be one of staging|prod" 2
    ;;
esac

# Pre-flight : every selected scenario script must exist + be executable.
for s in "${SCENARIOS[@]}"; do
  if want "$s"; then
@@ -83,6 +189,7 @@ declare -A SCENARIO_DURATION

log "================================================================"
log "Game day session : $SESSION_DATE"
log "Inventory : $INVENTORY"
log "Session log : $SESSION_LOG"
log "Scenarios run : ${SCENARIOS[*]}"
[ -n "$ONLY" ] && log "ONLY filter : $ONLY"