From 2a5bc11628bfca84341380082f5e6c71292c466c Mon Sep 17 00:00:00 2001
From: senke
Date: Thu, 30 Apr 2026 22:32:05 +0200
Subject: [PATCH] fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The game-day driver had no notion of inventory — it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging, where chaos is the
point ; on prod, an accidental run on a Monday morning would cause a
real outage.

Added :

scripts/security/game-day-driver.sh
* INVENTORY env var — defaults to 'staging' so silence stays safe.
  INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
  type-the-phrase 'KILL-PROD' confirm. Anything other than
  staging|prod aborts.
* Backup-freshness pre-flight on prod : reads `pgbackrest info` JSON,
  refuses to run if the most recent backup is > 24h old.
  SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline.
* Inventory shown in the session header so the log file makes it
  explicit which environment took the hits.

docs/runbooks/rabbitmq-down.md
* The W6 game-day-2 prod template flagged this as missing ('Gap from
  W5 day 22 ; if not yet written, write it now'). Mirrors the
  structure of redis-down.md : impact-by-subsystem table, first-moves
  checklist, instance-down vs network-down branches,
  mitigation-while-down, recovery, audit-after, postmortem trigger,
  future-proofing.
* Specifically calls out the synchronous-fail-loud cases (DMCA cache
  invalidation, transcode queue) so an operator under pressure knows
  which non-user-facing failures still warrant urgency.

Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior watching over
their shoulder.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/runbooks/rabbitmq-down.md      | 164 ++++++++++++++++++++++++++++
 scripts/security/game-day-driver.sh | 115 ++++++++++++++++++-
 2 files changed, 275 insertions(+), 4 deletions(-)
 create mode 100644 docs/runbooks/rabbitmq-down.md

diff --git a/docs/runbooks/rabbitmq-down.md b/docs/runbooks/rabbitmq-down.md
new file mode 100644
index 000000000..2d0673656
--- /dev/null
+++ b/docs/runbooks/rabbitmq-down.md
@@ -0,0 +1,164 @@
+# Runbook — RabbitMQ unavailable
+
+> **Alert** : `RabbitMQUnreachable` (in `config/prometheus/alert_rules.yml`).
+> **Owner** : infra on-call.
+> **Game-day scenario** : E (`infra/ansible/tests/test_rabbitmq_outage.sh`).
+
+## What breaks when RabbitMQ is down
+
+RabbitMQ is a fan-out broker for asynchronous, non-user-facing work
+(transcode jobs, distribution to external platforms, email digests,
+DMCA takedown propagation, search index updates). The user-facing
+request path does NOT block on RabbitMQ — the API publishes a message
+and returns 202 Accepted ; the worker picks it up later.
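+
+To verify the decoupling during an incident, exercise the publish
+path directly. A minimal probe ; the endpoint here is illustrative,
+substitute a real low-stakes write from your API collection :
+
+```bash
+# Expect HTTP 202 even while RabbitMQ is down : the API accepts the
+# work and logs a degraded-mode publish instead of blocking.
+# (endpoint is hypothetical, pick any cheap async write)
+curl -s -o /dev/null -w '%{http_code}\n' \
+  -X POST -H "Authorization: Bearer $API_TOKEN" \
+  https://api.veza.fr/api/v1/tracks/reindex
+```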
+
+| Subsystem                            | Effect when RabbitMQ is gone                                             | Severity |
+| ------------------------------------ | ------------------------------------------------------------------------ | -------- |
+| Track upload → HLS transcode         | Upload succeeds (S3 write OK), HLS segments don't appear                  | **MEDIUM** — track playable via fallback `/stream`, not via HLS |
+| Distribution to Spotify/SoundCloud   | Submission silently queued ; users see "pending" forever                  | MEDIUM — surfaces in distribution dashboard, not in player |
+| Email digest (weekly creator stats)  | Cron tick logs `publish failed`, retries on next tick                     | LOW — eventual consistency, no user-visible breakage |
+| DMCA takedown event                  | Track flag flipped in DB synchronously ; downstream replay queue stalls   | **HIGH** — track is gated immediately (synchronous DB UPDATE), but cache invalidation lags |
+| Search index updates                 | New tracks not searchable until queue drains                              | LOW — falls back to Postgres FTS |
+| Chat messages (WebSocket)            | INDEPENDENT — chat is direct WS, no RabbitMQ involvement                  | NONE |
+| Auth, sessions, payments             | INDEPENDENT — no RabbitMQ dependency                                      | NONE |
+
+The synchronous-fail-loud cases (DMCA cache invalidation, transcode
+queue) are the ones that compound if the outage drags on. Most user
+flows degrade gracefully.
+
+## First moves
+
+1. **Confirm RabbitMQ is actually down**, not "unreachable from one
+   host" :
+   ```bash
+   curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
+     | jq '.cluster_name, .object_totals'
+   ```
+2. **Confirm what changed.** If a deploy fired in the last 30 min,
+   suspect the deploy. Check `journalctl -u veza-backend-api -n 200`
+   for `amqp` errors with timestamps after the deploy.
+3. **Check the queues didn't fill the disk** (the most common way
+   we've brought RabbitMQ down in development) :
+   ```bash
+   ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'
+   ```
+
+## RabbitMQ instance is down
+
+```bash
+# State on the RabbitMQ host :
+ssh rabbitmq.lxd sudo systemctl status rabbitmq-server
+
+# Logs (Erlang verbosity, grep for ERROR/CRASH) :
+ssh rabbitmq.lxd sudo journalctl -u rabbitmq-server -n 500 \
+  | grep -E 'ERROR|CRASH|disk_alarm|memory_alarm'
+```
+
+Common causes :
+
+- **Disk alarm.** `/var/lib/rabbitmq` filled — RabbitMQ pauses producers
+  when free space drops below `disk_free_limit`. The backend's amqp
+  client surfaces this as "blocked". Fix : grow the disk or expire old
+  messages with `rabbitmqctl purge_queue <queue>` (last resort, you
+  lose what's in there).
+- **Memory alarm.** RSS over `vm_memory_high_watermark` × system mem.
+  Same effect (producers blocked). Fix : add memory or unblock by
+  draining a slow consumer.
+- **Process crashed.** Erlang OOM, segfault. `sudo systemctl restart
+  rabbitmq-server` ; the queues survive (durable=true on every queue
+  we declare).
+- **Cluster split-brain.** v1.0 is single-node, so this can't happen
+  yet. Listed for the v1.1 multi-node config.
+
+## Backend can't reach RabbitMQ
+
+Network or DNS issue, not RabbitMQ's fault.
+
+```bash
+# From the API container :
+nc -zv rabbitmq.lxd 5672
+
+# DNS :
+getent hosts rabbitmq.lxd
+
+# AMQP credentials :
+docker exec veza_backend_api env | grep AMQP_URL
+```
+
+Likely culprits : Incus bridge restart, a password rotation that
+didn't propagate to the API container's env, a security-group change.
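+
+If plain `nc` to the canonical hostname succeeds but the backend still
+logs connection errors, probe the exact host:port the backend itself
+resolves from its env. A sketch, assuming `AMQP_URL` has the usual
+`amqp://user:pass@host:port/vhost` shape with an explicit port :
+
+```bash
+# Extract host:port from the container's own AMQP_URL and probe it,
+# so DNS + routing are tested with exactly what the backend uses.
+docker exec veza_backend_api sh -c '
+  hostport=$(printf "%s" "$AMQP_URL" | sed -E "s|^amqps?://([^@/]*@)?([^/]+).*|\2|")
+  nc -zv "${hostport%:*}" "${hostport#*:}"
+'
+```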
+
+## Mitigation while RabbitMQ is down
+
+The backend already handles publish failures gracefully :
+
+- `internal/eventbus/rabbitmq.go` retries with exponential backoff up
+  to 30s, then drops to "degraded mode" (publish returns immediately
+  with a logged warning, the API call succeeds, the side-effect is
+  lost).
+- Workers in `internal/workers/` have `WithRetry()` middleware that
+  republishes failed deliveries up to 5 times before dead-lettering.
+
+If recovery is going to take > 10 min, set
+`EVENTBUS_DEGRADED_LOG_LEVEL=error` (default `warn`) so the
+fail-fast logs land in Sentry and operators can audit which messages
+were dropped.
+
+**Do NOT** restart the backend to clear the AMQP connection pool ;
+the reconnect logic (`go.uber.org/zap`-logged in eventbus.go:142)
+handles it once RabbitMQ is back.
+
+## Recovery
+
+Once RabbitMQ is back up :
+
+1. Verify connectivity from each backend instance :
+   ```bash
+   docker exec veza_backend_api sh -c 'printf x | nc -w1 rabbitmq.lxd 5672 | head -c 4'
+   ```
+   Should print `AMQP` — the broker answers an invalid greeting with
+   its supported protocol header before closing the connection.
+2. Watch the queue depth on the management UI :
+   `http://rabbitmq.lxd:15672/#/queues`. Expect `transcode_jobs`,
+   `distribution_outbox`, `dmca_propagation`, `search_index_updates`
+   to drain over the next 5-15 min as the workers catch up.
+3. If a queue is stuck > 30 min after recovery, the worker for it is
+   wedged — restart that specific worker container :
+   ```bash
+   docker compose -f docker-compose.prod.yml restart worker-<name>
+   ```
+
+## Audit after the outage
+
+1. Sentry filter `tag:eventbus.status=degraded` between outage start
+   and end — gives you the count and shape of dropped events.
+2. For each dropped DMCA event, manually trigger the cache flush :
+   ```bash
+   curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
+     https://api.veza.fr/api/v1/admin/cache/dmca/flush
+   ```
+3. For each dropped transcode job, requeue from the tracks table :
+   ```bash
+   psql "$DATABASE_URL" -c "
+     INSERT INTO transcode_jobs (track_id, status, attempts, created_at)
+     SELECT id, 'pending', 0, NOW() FROM tracks
+     WHERE created_at BETWEEN '<outage start>' AND '<outage end>'
+       AND hls_status IS NULL;
+   "
+   ```
+
+## Postmortem trigger
+
+Any RabbitMQ outage > 10 min triggers a postmortem. The non-user-facing
+nature makes this less urgent than Redis or Postgres, but the
+silent-failure modes (dropped DMCA propagation, missing transcodes)
+warrant a write-up so we know what slipped through.
+
+## Future-proofing
+
+- v1.1 will move to a 3-node RabbitMQ cluster behind a load balancer
+  for HA. This runbook will then split into "single-node down" (the
+  cluster keeps serving) and "cluster split-brain" (rare, but the
+  recovery path is different).
+- Worker idempotency keys are documented in `docs/api/eventbus.md` —
+  any new worker MUST honour them so a replay during recovery doesn't
+  double-charge / double-distribute / double-takedown. A sketch of
+  the guard is shown below.
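+
+For illustration, the shape of that guard ; table and column names
+here are hypothetical, the authoritative contract is in
+`docs/api/eventbus.md` :
+
+```bash
+# Insert-if-absent on the idempotency key : a replayed delivery hits
+# the unique constraint and becomes a no-op instead of a double-run.
+psql "$DATABASE_URL" -c "
+  INSERT INTO worker_processed_events (idempotency_key, processed_at)
+  VALUES ('$EVENT_IDEMPOTENCY_KEY', NOW())
+  ON CONFLICT (idempotency_key) DO NOTHING;
+"
+```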
diff --git a/scripts/security/game-day-driver.sh b/scripts/security/game-day-driver.sh
index b8496b776..3c474929c 100755
--- a/scripts/security/game-day-driver.sh
+++ b/scripts/security/game-day-driver.sh
@@ -16,18 +16,26 @@
 # E : test_rabbitmq_outage.sh — stop RabbitMQ 60s, backend stays up
 #
 # Usage :
-#   bash scripts/security/game-day-driver.sh          # run all scenarios
-#   SKIP=DE bash scripts/security/game-day-driver.sh  # skip scenarios D + E
-#   ONLY=A bash scripts/security/game-day-driver.sh   # only run scenario A
+#   bash scripts/security/game-day-driver.sh          # all scenarios on staging (default)
+#   SKIP=DE bash scripts/security/game-day-driver.sh  # skip D + E
+#   ONLY=A bash scripts/security/game-day-driver.sh   # only A
+#   INVENTORY=prod CONFIRM_PROD=1 bash scripts/security/game-day-driver.sh  # prod (gated)
 #
 # Required env (passed through to the underlying smoke tests) :
 #   REDIS_PASS / SENTINEL_PASS            for scenario C
 #   MINIO_ROOT_USER / MINIO_ROOT_PASSWORD for scenario D
 #
+# v1.0.10 polish — production gating :
+#   INVENTORY=prod must be paired with CONFIRM_PROD=1 or the script
+#   refuses to run, so a stale shell-history line can't accidentally
+#   kill prod Postgres on a Monday morning. The driver also runs a
+#   backup-freshness pre-flight when targeting prod (most recent
+#   pgBackRest backup must be < 24 h old).
+#
 # Exit codes :
 #   0 — every selected scenario passed
 #   1 — at least one scenario failed
-#   2 — runner pre-flight failed (script missing, etc.)
+#   2 — runner pre-flight failed (script missing, prod safety guard tripped, stale backup, etc.)
 
 set -euo pipefail
 REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
@@ -41,6 +49,9 @@ mkdir -p "$LOGS_DIR"
 
 ONLY=${ONLY:-}
 SKIP=${SKIP:-}
+INVENTORY=${INVENTORY:-staging}
+CONFIRM_PROD=${CONFIRM_PROD:-0}
+SKIP_BACKUP_FRESHNESS=${SKIP_BACKUP_FRESHNESS:-0}
 
 log()  { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*" | tee -a "$SESSION_LOG" >&2; }
 fail() { log "FAIL: $*"; exit "${2:-2}"; }
@@ -68,6 +79,101 @@ want() {
   return 0
 }
 
+# v1.0.10 polish — prod safety gate. INVENTORY=prod requires
+# CONFIRM_PROD=1 + an interactive type-the-word confirm. Anything else
+# defaults to staging so a forgotten env-var doesn't matter.
+case "$INVENTORY" in
+  staging|stg|dev|local) ;;
+  prod|production)
+    if [ "$CONFIRM_PROD" != "1" ]; then
+      cat >&2 <<'EOF'
+INVENTORY=prod requires CONFIRM_PROD=1.
+Refusing to run destructive scenarios against production without the
+explicit pairing ; set both variables deliberately, in the same
+command line, never from shell history.
+EOF
+      fail "prod safety gate tripped : CONFIRM_PROD not set" 2
+    fi
+
+    # Backup-freshness pre-flight : refuse to run against prod if the
+    # most recent pgBackRest backup is > 24 h old. Recovery from a
+    # stale backup can extend an outage from minutes to hours, so the
+    # cost of postponing the game day is much less than the cost of
+    # compounded data loss if scenario A fails to recover and we have
+    # to restore from yesterday-but-one.
+    if [ "$SKIP_BACKUP_FRESHNESS" != "1" ]; then
+      if command -v pgbackrest >/dev/null 2>&1; then
+        last_backup_ts=$(pgbackrest --stanza=veza info --output=json 2>/dev/null \
+          | python3 -c "
+import json, sys
+try:
+    data = json.load(sys.stdin)
+    backups = data[0]['backup'] if data else []
+    if not backups: print(0); sys.exit(0)
+    print(max(b['timestamp']['stop'] for b in backups))
+except Exception:
+    print(0)
+" 2>/dev/null || echo 0)
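+        # The JSON parsed above looks like this (pgBackRest 2.x,
+        # abridged ; verify against the deployed version) :
+        #   [{"backup": [{"timestamp": {"start": 1746000000, "stop": 1746000123}}, ...], ...}]
+        # 'stop' is an epoch integer, so max() picks the newest backup.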
+        now_ts=$(date +%s)
+        age_seconds=$(( now_ts - last_backup_ts ))
+        if [ "$last_backup_ts" -eq 0 ]; then
+          fail "pgBackRest backup-freshness check failed : could not parse 'pgbackrest info'. Set SKIP_BACKUP_FRESHNESS=1 to override (only after manually verifying a recent backup exists)." 2
+        fi
+        if [ "$age_seconds" -gt 86400 ]; then
+          age_hours=$(( age_seconds / 3600 ))
+          fail "pgBackRest most recent backup is ${age_hours}h old (threshold 24h). Run a backup before the game day, or set SKIP_BACKUP_FRESHNESS=1 if you've validated freshness another way." 2
+        fi
+        log "pre-flight : pgBackRest most recent backup is $(( age_seconds / 3600 ))h $(( (age_seconds % 3600) / 60 ))m old (< 24h threshold) — OK"
+      else
+        log "WARN : pgbackrest CLI not on \$PATH ; skipping backup-freshness check. Set SKIP_BACKUP_FRESHNESS=1 to silence this warning if intentional."
+      fi
+    fi
+
+    # Final type-the-word confirm. Everything above can be set in env
+    # by mistake ; this last step requires a human at the keyboard.
+    cat >&2 <