fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook

The game-day driver had no notion of inventory : it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop, Redis
kill, MinIO node loss, RabbitMQ stop) against whatever the underlying
scripts pointed at, with the operator's only protection being "don't
typo a host." That's fine on staging, where chaos is the point ; on
prod, an accidental run on a Monday morning would cause a real outage.

Added :

scripts/security/game-day-driver.sh
  * INVENTORY env var, defaulting to 'staging' so silence stays safe.
    INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
    type-the-phrase 'KILL-PROD' confirm. Any value other than staging
    or prod (or the aliases stg, dev, local, production) aborts.
  * Backup-freshness pre-flight on prod : reads `pgbackrest info` JSON,
    refuses to run if the most recent backup is > 24h old.
    SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline.
  * Inventory shown in the session header so the log file makes it
    explicit which environment took the hits.

docs/runbooks/rabbitmq-down.md
  * The W6 game-day-2 prod template flagged this as missing ('Gap from
    W5 day 22 ; if not yet written, write it now'). Mirrors the
    structure of redis-down.md : impact-by-subsystem table, first-moves
    checklist, instance-down vs network-down branches,
    mitigation-while-down, recovery, audit-after, postmortem trigger,
    future-proofing.
  * Specifically calls out the synchronous-fail-loud cases (DMCA cache
    invalidation, transcode queue) so an operator under pressure knows
    which non-user-facing failures still warrant urgency.

Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior watching over
their shoulder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:32:05 +02:00 · 2026-04-30 22:32:05 +02:00 · 2a5bc11628
commit 2a5bc11628
parent e780fbcd18
2 changed files with 275 additions and 4 deletions
--- a/docs/runbooks/rabbitmq-down.md
+++ b/docs/runbooks/rabbitmq-down.md
@ -0,0 +1,164 @@
+# Runbook — RabbitMQ unavailable
+
+> **Alert** : `RabbitMQUnreachable` (in `config/prometheus/alert_rules.yml`).
+> **Owner** : infra on-call.
+> **Game-day scenario** : E (`infra/ansible/tests/test_rabbitmq_outage.sh`).
+
+## What breaks when RabbitMQ is down
+
+RabbitMQ is a fan-out broker for asynchronous, non-user-facing work
+(transcode jobs, distribution to external platforms, email digests,
+DMCA takedown propagation, search index updates). The user-facing
+request path does NOT block on RabbitMQ — the API publishes a message
+and returns 202 Accepted ; the worker picks it up later.
+
+| Subsystem                            | Effect when RabbitMQ is gone                                       | Severity |
+| ------------------------------------ | ------------------------------------------------------------------ | -------- |
+| Track upload → HLS transcode         | Upload succeeds (S3 write OK), HLS segments don't appear           | **MEDIUM** — track playable via fallback `/stream`, not via HLS |
+| Distribution to Spotify/SoundCloud   | Submission silently queued ; users see "pending" forever           | MEDIUM — surfaces in distribution dashboard, not in player |
+| Email digest (weekly creator stats)  | Cron tick logs `publish failed`, retries on next tick              | LOW — eventual consistency, no user-visible breakage |
+| DMCA takedown event                  | Track flag flipped in DB synchronously ; downstream replay queue stalls | **HIGH** — track is gated immediately (synchronous DB UPDATE), but cache invalidation lags |
+| Search index updates                 | New tracks not searchable until queue drains                       | LOW — falls back to Postgres FTS |
+| Chat messages (WebSocket)            | INDEPENDENT — chat is direct WS, no RabbitMQ involvement           | NONE |
+| Auth, sessions, payments             | INDEPENDENT — no RabbitMQ dependency                               | NONE |
+
+The synchronous-fail-loud cases (DMCA cache invalidation, transcode
+queue) are the ones that compound if the outage drags. Most user
+flows degrade gracefully.
+
+## First moves
+
+1. **Confirm RabbitMQ is actually down**, not "unreachable from one
+   host" :
+   ```bash
+   curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
+     | jq '.cluster_name, .object_totals'
+   ```
+2. **Confirm what changed.** If a deploy fired in the last 30 min,
+   suspect the deploy. Check `journalctl -u veza-backend-api -n 200`
+   for `amqp` errors with timestamps after the deploy.
+3. **Check the queues didn't fill the disk** (most common bring-down
+   in development) :
+   ```bash
+   ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'
+   ```
+
+## RabbitMQ instance is down
+
+```bash
+# State on the RabbitMQ host :
+ssh rabbitmq.lxd sudo systemctl status rabbitmq-server
+
+# Logs (Erlang verbosity, grep for ERROR/CRASH) :
+ssh rabbitmq.lxd sudo journalctl -u rabbitmq-server -n 500 \
+  | grep -E 'ERROR|CRASH|disk_alarm|memory_alarm'
+```
+
+Common causes :
+
+- **Disk alarm.** `/var/lib/rabbitmq` filled — RabbitMQ pauses producers
+  when free space drops below `disk_free_limit`. The backend's amqp
+  client surfaces this as "blocked". Fix : grow the disk or, as a last
+  resort, empty a queue with `rabbitmqctl purge_queue <queue>` (every
+  message in that queue is lost).
+- **Memory alarm.** RSS over `vm_memory_high_watermark` × system mem.
+  Same effect (producers blocked). Fix : add memory or unblock by
+  draining a slow consumer.
+- **Process crashed.** Erlang OOM, segfault. `sudo systemctl restart
+  rabbitmq-server` ; the queues survive (durable=true on every queue
+  we declare).
+- **Cluster split-brain.** v1.0 is single-node, so this can't happen
+  yet. Listed for the v1.1 multi-node config.
+
+## Backend can't reach RabbitMQ
+
+Network or DNS issue, not RabbitMQ's fault.
+
+```bash
+# From the API container :
+nc -zv rabbitmq.lxd 5672
+
+# DNS :
+getent hosts rabbitmq.lxd
+
+# AMQP credentials :
+docker exec veza_backend_api env | grep AMQP_URL
+```
+
+Likely culprits : Incus bridge restart, password rotation didn't
+propagate to the API container's env, security-group change.
+
+## Mitigation while RabbitMQ is down
+
+The backend already handles publish failures gracefully :
+
+- `internal/eventbus/rabbitmq.go` retries with exponential backoff up
+  to 30s, then drops to "degraded mode" (publish returns immediately
+  with a logged warning, the API call succeeds, the side-effect is
+  lost).
+- Workers in `internal/workers/` have `WithRetry()` middleware that
+  republishes failed deliveries up to 5 times before dead-lettering.
+
+If recovery is going to take > 10 min, set
+`EVENTBUS_DEGRADED_LOG_LEVEL=error` (default `warn`) so the
+fail-fast logs land in Sentry and operators can audit which messages
+were dropped.
+
+**Do NOT** restart the backend to clear the AMQP connection pool ;
+the reconnect logic (`go.uber.org/zap`-logged in eventbus.go:142)
+handles it once RabbitMQ is back.
+
+## Recovery
+
+Once RabbitMQ is back up :
+
+1. Verify connectivity from each backend instance :
+   ```bash
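+   docker exec veza_backend_api sh -c 'printf "AMQP\000\000\011\001" | nc -w1 rabbitmq.lxd 5672 | head -c 4 | wc -c'
+   ```
+   Should print `4`. A broker that accepts the AMQP 0-9-1 header replies
+   with a binary `Connection.Start` frame rather than the text `AMQP`,
+   so count the reply bytes instead of matching them ; `0` means nothing
+   answered on the port.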
+2. Watch the queue depth on the management UI :
+   `http://rabbitmq.lxd:15672/#/queues`. Expect `transcode_jobs`,
+   `distribution_outbox`, `dmca_propagation`, `search_index_updates`
+   to drain over the next 5-15 min as the workers catch up.
+3. If a queue is stuck > 30 min after recovery, the worker for it is
+   wedged — restart that specific worker container :
+   ```bash
+   docker compose -f docker-compose.prod.yml restart worker-<name>
+   ```
+
+## Audit after the outage
+
+1. Sentry filter `tag:eventbus.status=degraded` between outage start
+   and end — gives you the count and shape of dropped events.
+2. For each dropped DMCA event, manually trigger the cache flush :
+   ```bash
+   curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
+     https://api.veza.fr/api/v1/admin/cache/dmca/flush
+   ```
+3. For each dropped transcode job, requeue from the `tracks` table :
+   ```bash
+   psql "$DATABASE_URL" -c "
+     INSERT INTO transcode_jobs (track_id, status, attempts, created_at)
+     SELECT id, 'pending', 0, NOW() FROM tracks
+     WHERE created_at BETWEEN '<outage_start>' AND '<outage_end>'
+       AND hls_status IS NULL;
+   "
+   ```
+
+## Postmortem trigger
+
+Any RabbitMQ outage > 10 min triggers a postmortem. The non-user-facing
+nature makes this less urgent than Redis or Postgres, but the
+silent-failure modes (dropped DMCA propagation, missing transcodes)
+warrant a write-up so we know what slipped through.
+
+## Future-proofing
+
+- v1.1 will move to a 3-node RabbitMQ cluster behind a load balancer
+  for HA. This runbook will then split into "single-node down" (the
+  cluster keeps serving) and "cluster split-brain" (rare, but the
+  recovery path is different).
+- Worker idempotency keys are documented in `docs/api/eventbus.md` —
+  any new worker MUST honour them so a replay during recovery doesn't
+  double-charge / double-distribute / double-takedown.
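
The idempotency requirement in the last bullet of the runbook can be illustrated with a minimal sketch. It assumes a worker that records each processed key as a marker file and skips any replay. The key format (`dmca-track-42`), the `handle_once` helper, the `takedown` handler and the temporary state directory are all hypothetical ; the real keys are defined in `docs/api/eventbus.md`, which is not part of this diff.

```bash
#!/usr/bin/env bash
# Sketch only: a dedupe guard for replayed messages. Names and key format
# are hypothetical, not taken from the veza codebase.
set -eu

# One marker file per processed idempotency key (hypothetical location).
STATE_DIR=$(mktemp -d)

# handle_once KEY HANDLER [ARGS...]
# Runs HANDLER only the first time KEY is seen; later deliveries are skipped.
handle_once() {
  key=$1; shift
  if [ -e "$STATE_DIR/$key" ]; then
    echo "skip $key (already processed)"
    return 0
  fi
  "$@"                     # perform the side effect
  : > "$STATE_DIR/$key"    # record the key only after the handler succeeded
}

# Hypothetical handler standing in for a real side effect.
takedown() { echo "takedown track $1"; }

handle_once "dmca-track-42" takedown 42   # prints: takedown track 42
handle_once "dmca-track-42" takedown 42   # prints: skip dmca-track-42 (already processed)
```

Recording the key after the handler returns means a crash mid-handler leads to a retry rather than a silently lost event, which is why the handler itself still has to tolerate being run twice.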
--- a/scripts/security/game-day-driver.sh
+++ b/scripts/security/game-day-driver.sh
@ -16,18 +16,26 @@
 #   E : test_rabbitmq_outage.sh     — stop RabbitMQ 60s, backend stays up
 #
 # Usage :
-#   bash scripts/security/game-day-driver.sh           # run all scenarios
-#   SKIP=DE bash scripts/security/game-day-driver.sh   # skip scenarios D + E
-#   ONLY=A bash scripts/security/game-day-driver.sh    # only run scenario A
+#   bash scripts/security/game-day-driver.sh                                 # all scenarios on staging (default)
+#   SKIP=DE bash scripts/security/game-day-driver.sh                         # skip D + E
+#   ONLY=A bash scripts/security/game-day-driver.sh                          # only A
+#   INVENTORY=prod CONFIRM_PROD=1 bash scripts/security/game-day-driver.sh   # prod (gated)
 #
 # Required env (passed through to the underlying smoke tests) :
 #   REDIS_PASS / SENTINEL_PASS for scenario C
 #   MINIO_ROOT_USER / MINIO_ROOT_PASSWORD for scenario D
 #
+# v1.0.10 polish — production gating :
+#   INVENTORY=prod must be paired with CONFIRM_PROD=1 or the script
+#   refuses to run, so a stale shell-history line can't accidentally
+#   kill prod Postgres on a Monday morning. The driver also runs a
+#   backup-freshness pre-flight when targeting prod (most recent
+#   pgBackRest backup must be < 24 h old).
+#
 # Exit codes :
 #   0  — every selected scenario passed
 #   1  — at least one scenario failed
-#   2  — runner pre-flight failed (script missing, etc.)
+#   2  — runner pre-flight failed (script missing, prod safety guard tripped, stale backup, etc.)
 set -euo pipefail

 REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
@ -41,6 +49,9 @@ mkdir -p "$LOGS_DIR"

 ONLY=${ONLY:-}
 SKIP=${SKIP:-}
+INVENTORY=${INVENTORY:-staging}
+CONFIRM_PROD=${CONFIRM_PROD:-0}
+SKIP_BACKUP_FRESHNESS=${SKIP_BACKUP_FRESHNESS:-0}

 log()  { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*" | tee -a "$SESSION_LOG" >&2; }
 fail() { log "FAIL: $*"; exit "${2:-2}"; }
@ -68,6 +79,101 @@ want() {
  return 0
 }

+# v1.0.10 polish : prod safety gate. INVENTORY=prod requires
+# CONFIRM_PROD=1 + an interactive type-the-phrase confirm. An unset
+# INVENTORY defaults to staging so a forgotten env-var doesn't matter ;
+# any unrecognised value aborts.
+case "$INVENTORY" in
+  staging|stg|dev|local) ;;
+  prod|production)
+    if [ "$CONFIRM_PROD" != "1" ]; then
+      cat >&2 <<EOF
+
+================================================================
+ABORTING — INVENTORY=prod without CONFIRM_PROD=1
+================================================================
+
+This script will kill production services. Each scenario triggers a
+real outage in the chosen inventory : Postgres primary kill, HAProxy
+backend stop, Redis master kill, MinIO node loss, RabbitMQ stop.
+
+To run on production, you must :
+
+  1. Announce a maintenance window 24 h ahead (status page +
+     #engineering channel).
+  2. Set PagerDuty to maintenance mode for the affected services.
+  3. Confirm pgBackRest's last backup is < 24 h old (this script
+     auto-checks if you don't pass SKIP_BACKUP_FRESHNESS=1).
+  4. Re-invoke with :
+
+       INVENTORY=prod CONFIRM_PROD=1 \\
+         bash scripts/security/game-day-driver.sh
+
+The driver will then ask for one more interactive confirmation
+(type the word KILL-PROD) before the first scenario fires.
+================================================================
+EOF
+      exit 2
+    fi
+
+    # Backup-freshness pre-flight : refuse to run if the most recent
+    # pgBackRest backup (any type) is > 24 h old. Recovery from a stale backup
+    # can extend an outage from minutes to hours, so the cost of
+    # postponing the game day is much less than the cost of compounded
+    # data loss if scenario A fails to recover and we have to restore
+    # from yesterday-but-one.
+    if [ "$SKIP_BACKUP_FRESHNESS" != "1" ]; then
+      if command -v pgbackrest >/dev/null 2>&1; then
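+        # `|| true` stops a pgbackrest failure from tripping pipefail, which
+        # would make `|| echo 0` append a second "0" and break the arithmetic below.
+        last_backup_ts=$( { pgbackrest --stanza=veza info --output=json 2>/dev/null || true; } \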
+          | python3 -c "
+import json, sys
+try:
+    data = json.load(sys.stdin)
+    backups = data[0]['backup'] if data else []
+    if not backups: print(0); sys.exit(0)
+    print(max(b['timestamp']['stop'] for b in backups))
+except Exception:
+    print(0)
+" 2>/dev/null || echo 0)
+        now_ts=$(date +%s)
+        age_seconds=$(( now_ts - last_backup_ts ))
+        if [ "$last_backup_ts" -eq 0 ]; then
+          fail "pgBackRest backup-freshness check failed : could not parse 'pgbackrest info'. Set SKIP_BACKUP_FRESHNESS=1 to override (only after manually verifying a recent backup exists)." 2
+        fi
+        if [ "$age_seconds" -gt 86400 ]; then
+          age_hours=$(( age_seconds / 3600 ))
+          fail "pgBackRest most recent backup is ${age_hours}h old (threshold 24h). Run a backup before the game day, or set SKIP_BACKUP_FRESHNESS=1 if you've validated freshness another way." 2
+        fi
+        log "pre-flight : pgBackRest most recent backup is $(( age_seconds / 3600 ))h $(( (age_seconds % 3600) / 60 ))m old (< 24h threshold) — OK"
+      else
+        log "WARN : pgbackrest CLI not on \$PATH ; skipping backup-freshness check. Set SKIP_BACKUP_FRESHNESS=1 to silence this warning if intentional."
+      fi
+    fi
+
+    # Final type-the-word confirm. Everything above can be set in env
+    # by mistake ; this last step requires a human at the keyboard.
+    cat >&2 <<EOF
+
+================================================================
+PROD GAME DAY — final confirmation
+================================================================
+
+  inventory : prod
+  scenarios : ${SCENARIOS[*]}${ONLY:+   (filtered by ONLY=$ONLY)}${SKIP:+   (filtered by SKIP=$SKIP)}
+  session   : $SESSION_LOG
+
+Each scenario triggers a real outage. Type the literal phrase
+KILL-PROD (any other input aborts) to proceed :
+EOF
+    read -r confirm_phrase || confirm_phrase=""
+    if [ "$confirm_phrase" != "KILL-PROD" ]; then
+      fail "operator did not confirm KILL-PROD ($confirm_phrase) — aborting" 2
+    fi
+    ;;
+  *)
+    fail "INVENTORY=$INVENTORY not recognised : must be one of staging|stg|dev|local|prod|production" 2
+    ;;
+esac
+
 # Pre-flight : every selected scenario script must exist + be executable.
 for s in "${SCENARIOS[@]}"; do
  if want "$s"; then
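
The backup-freshness pre-flight in the hunk above can be exercised without a real pgBackRest install. The sketch below pulls the decision into a standalone function, assuming the same conventions as the driver : a timestamp of `0` means "could not parse", and the threshold is 86400 seconds. The function name `backup_freshness` and its three-word output are illustrative and do not exist in the script.

```bash
#!/usr/bin/env bash
# Sketch only: the age decision from the pre-flight, isolated for testing.
set -eu

# backup_freshness LAST_TS NOW_TS [MAX_AGE_SECONDS]
# Prints "unknown", "stale" or "fresh". LAST_TS=0 mirrors the driver's
# sentinel for an unparseable `pgbackrest info` result.
backup_freshness() {
  last_ts=$1 now_ts=$2 max_age=${3:-86400}
  if [ "$last_ts" -eq 0 ]; then
    echo unknown
  elif [ $(( now_ts - last_ts )) -gt "$max_age" ]; then
    echo stale
  else
    echo fresh
  fi
}

now=1700000000   # fixed clock so the example is reproducible
backup_freshness 0 "$now"                     # prints: unknown
backup_freshness $(( now - 90000 )) "$now"    # prints: stale (25 h old)
backup_freshness $(( now - 3600 )) "$now"     # prints: fresh (1 h old)
```

Checking the sentinel before computing the age keeps an unparseable result from ever being reported as a very old backup.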
@ -83,6 +189,7 @@ declare -A SCENARIO_DURATION

 log "================================================================"
 log "Game day session : $SESSION_DATE"
+log "Inventory        : $INVENTORY"
 log "Session log      : $SESSION_LOG"
 log "Scenarios run    : ${SCENARIOS[*]}"
 [ -n "$ONLY" ] && log "ONLY filter      : $ONLY"