The game-day driver had no notion of inventory — it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging where chaos is
the point; on prod, an accidental run on a Monday morning would
cause a real outage.
Added:
scripts/security/game-day-driver.sh
* INVENTORY env var — defaults to 'staging' so silence stays
safe. INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
type-the-phrase 'KILL-PROD' confirm. Anything other than
staging|prod aborts.
* Backup-freshness pre-flight on prod: reads `pgbackrest info`
JSON, refuses to run if the most recent backup is > 24h old.
SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline. (Both
guards are sketched below.)
* Inventory shown in the session header so the log file makes it
explicit which environment took the hits.
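The guard logic, roughly — a minimal sketch of what the script does, not the script itself; shell variable names and the pgbackrest JSON field paths are assumptions:

#!/usr/bin/env bash
set -euo pipefail

INVENTORY="${INVENTORY:-staging}"    # silence defaults to staging

case "$INVENTORY" in
  staging) ;;                        # no extra confirmation needed
  prod)
    [ "${CONFIRM_PROD:-0}" = "1" ] \
      || { echo "refusing: set CONFIRM_PROD=1 to target prod" >&2; exit 1; }
    read -r -p "Type KILL-PROD to run destructive scenarios against prod: " phrase
    [ "$phrase" = "KILL-PROD" ] \
      || { echo "confirmation phrase mismatch, aborting" >&2; exit 1; }
    if [ "${SKIP_BACKUP_FRESHNESS:-0}" != "1" ]; then
      # Backup-freshness pre-flight: newest pgbackrest backup must be < 24h old.
      newest=$(pgbackrest info --output=json | jq '[.[].backup[].timestamp.stop] | max // 0')
      age=$(( $(date +%s) - newest ))
      [ "$age" -le $(( 24 * 3600 )) ] \
        || { echo "last backup is ${age}s old (> 24h), aborting" >&2; exit 1; }
    fi
    ;;
  *) echo "unknown INVENTORY '$INVENTORY' (expected staging|prod), aborting" >&2; exit 1 ;;
esac

echo "=== game-day session | inventory=$INVENTORY | started $(date -Is) ==="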
docs/runbooks/rabbitmq-down.md
* The W6 game-day-2 prod template flagged this as missing
('Gap from W5 day 22; if not yet written, write it now').
Mirrors the structure of redis-down.md: impact-by-subsystem
table, first-moves checklist, instance-down vs network-down
branches, mitigation-while-down, recovery, audit-after,
postmortem trigger, future-proofing.
* Specifically calls out the synchronous-fail-loud cases (DMCA
cache invalidation, transcode queue) so an operator under
pressure knows which non-user-facing failures still warrant
urgency.
Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior looking over
their shoulder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runbook — RabbitMQ unavailable
Alert: RabbitMQUnreachable (in config/prometheus/alert_rules.yml).
Owner: infra on-call.
Game-day scenario: E (infra/ansible/tests/test_rabbitmq_outage.sh).
What breaks when RabbitMQ is down
RabbitMQ is a fan-out broker for asynchronous, non-user-facing work (transcode jobs, distribution to external platforms, email digests, DMCA takedown propagation, search index updates). The user-facing request path does NOT block on RabbitMQ — the API publishes a message and returns 202 Accepted; the worker picks it up later.
| Subsystem | Effect when RabbitMQ is gone | Severity |
|---|---|---|
| Track upload → HLS transcode | Upload succeeds (S3 write OK), HLS segments don't appear | MEDIUM — track playable via fallback /stream, not via HLS |
| Distribution to Spotify/SoundCloud | Submission silently queued; users see "pending" forever | MEDIUM — surfaces in distribution dashboard, not in player |
| Email digest (weekly creator stats) | Cron tick logs the failed publish, retries on next tick | LOW — eventual consistency, no user-visible breakage |
| DMCA takedown event | Track flag flipped in DB synchronously; downstream replay queue stalls | HIGH — track is gated immediately (synchronous DB UPDATE), but cache invalidation lags |
| Search index updates | New tracks not searchable until queue drains | LOW — falls back to Postgres FTS |
| Chat messages (WebSocket) | INDEPENDENT — chat is direct WS, no RabbitMQ involvement | NONE |
| Auth, sessions, payments | INDEPENDENT — no RabbitMQ dependency | NONE |
The synchronous-fail-loud cases (DMCA cache invalidation, transcode queue) are the ones that compound if the outage drags. Most user flows degrade gracefully.
First moves
- Confirm RabbitMQ is actually down, not "unreachable from one
  host":
  curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
    | jq '.cluster_name, .object_totals'
- Confirm what changed. If a deploy fired in the last 30 min,
  suspect the deploy. Check journalctl -u veza-backend-api -n 200
  for amqp errors with timestamps after the deploy.
- Check the queues didn't fill the disk (most common bring-down
  in development):
  ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'
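The three checks above, bundled into one triage pass for convenience — same hosts and env vars as used elsewhere in this runbook; adjust if your inventory differs:

set -o pipefail
echo "--- management API overview ---"
curl -sf -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
  | jq '.cluster_name, .object_totals' \
  || echo "management API unreachable or credentials rejected"
echo "--- recent amqp errors in the API logs ---"
journalctl -u veza-backend-api -n 200 --no-pager | grep -iE 'amqp|rabbit' | tail -n 20
echo "--- disk on the RabbitMQ host ---"
ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'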
RabbitMQ instance is down
# State on the RabbitMQ host:
ssh rabbitmq.lxd sudo systemctl status rabbitmq-server
# Logs (Erlang verbosity, grep for ERROR/CRASH):
ssh rabbitmq.lxd sudo journalctl -u rabbitmq-server -n 500 \
| grep -E 'ERROR|CRASH|disk_alarm|memory_alarm'
Common causes:
- Disk alarm. /var/lib/rabbitmq filled — RabbitMQ pauses producers when
  free space drops below disk_free_limit. The backend's amqp client
  surfaces this as "blocked". Fix: grow the disk or expire old messages
  with rabbitmqctl purge_queue <queue> (last resort, you lose what's in
  there).
- Memory alarm. RSS over vm_memory_high_watermark × system mem. Same
  effect (producers blocked). Fix: add memory or unblock by draining a
  slow consumer.
- Process crashed. Erlang OOM, segfault. sudo systemctl restart
  rabbitmq-server; the queues survive (durable=true on every queue we
  declare).
- Cluster split-brain. v1.0 is single-node, so this can't happen yet.
  Listed for the v1.1 multi-node config.
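Two broker-side checks that cover the first two causes directly (rabbitmq-diagnostics and rabbitmqctl ship with rabbitmq-server; output formats vary slightly across versions):

# Any active disk/memory alarms?
ssh rabbitmq.lxd 'sudo rabbitmq-diagnostics alarms'
# Which queues hold the backlog (largest first)?
ssh rabbitmq.lxd 'sudo rabbitmqctl -q list_queues name messages consumers' | sort -k2 -rn | head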
Backend can't reach RabbitMQ
Network or DNS issue, not RabbitMQ's fault.
# From the API container:
nc -zv rabbitmq.lxd 5672
# DNS:
getent hosts rabbitmq.lxd
# AMQP credentials:
docker exec veza_backend_api env | grep AMQP_URL
Likely culprits: Incus bridge restart, password rotation didn't propagate to the API container's env, security-group change.
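To tell a credential problem apart from a network problem, check whether the credentials the container actually holds are still accepted by the broker — a hypothetical helper, assuming AMQP_URL has the usual amqp://user:pass@host:5672/vhost shape and that the same user exists in the management plane:

AMQP_URL=$(docker exec veza_backend_api printenv AMQP_URL)
user=$(echo "$AMQP_URL" | sed -E 's#^amqps?://([^:]+):.*#\1#')
pass=$(echo "$AMQP_URL" | sed -E 's#^amqps?://[^:]+:([^@]+)@.*#\1#')
curl -s -o /dev/null -w '%{http_code}\n' -u "$user:$pass" http://rabbitmq.lxd:15672/api/whoami
# 200 → credentials are fine, look at DNS/bridge; 401 → the rotation didn't propagate.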
Mitigation while RabbitMQ is down
The backend already handles publish failures gracefully:
- internal/eventbus/rabbitmq.go retries with exponential backoff up to
  30s, then drops to "degraded mode" (publish returns immediately with
  a logged warning, the API call succeeds, the side-effect is lost).
- Workers in internal/workers/ have WithRetry() middleware that
  republishes failed deliveries up to 5 times before dead-lettering.
If recovery is going to take > 10 min, set
EVENTBUS_DEGRADED_LOG_LEVEL=error (default warn) so the
fail-fast logs land in Sentry and operators can audit which messages
were dropped.
Do NOT restart the backend to clear the AMQP connection pool;
the reconnect logic (go.uber.org/zap-logged in eventbus.go:142)
handles it once RabbitMQ is back.
Recovery
Once RabbitMQ is back up:
- Verify connectivity from each backend instance:
  docker exec veza_backend_api sh -c 'echo -e "AMQP\x00\x00\x09\x01" | nc -w1 rabbitmq.lxd 5672 | head -c 4'
  Should return AMQP.
- Watch the queue depth on the management UI:
  http://rabbitmq.lxd:15672/#/queues. Expect transcode_jobs,
  distribution_outbox, dmca_propagation, search_index_updates to
  drain over the next 5-15 min as the workers catch up.
- If a queue is stuck > 30 min after recovery, the worker for it is
  wedged — restart that specific worker container:
  docker compose -f docker-compose.prod.yml restart worker-<name>
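To watch the drain from a terminal instead of the UI, the management API exposes the same numbers — a convenience sketch over the queue names listed above:

while true; do
  curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/queues \
    | jq -r '.[]
        | select(.name | IN("transcode_jobs","distribution_outbox","dmca_propagation","search_index_updates"))
        | "\(.name)\t\(.messages)"'
  echo "---"
  sleep 10
done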
Audit after the outage
- Sentry filter tag:eventbus.status=degraded between outage start and
  end — gives you the count and shape of dropped events.
- For each dropped DMCA event, manually trigger the cache flush:
  curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
    https://api.veza.fr/api/v1/admin/cache/dmca/flush
- For each dropped transcode job, requeue from the tracks table:
  psql "$DATABASE_URL" -c "
    INSERT INTO transcode_jobs (track_id, status, attempts, created_at)
    SELECT id, 'pending', 0, NOW()
    FROM tracks
    WHERE created_at BETWEEN '<outage_start>' AND '<outage_end>'
      AND hls_status IS NULL;
  "
Postmortem trigger
Any RabbitMQ outage > 10 min triggers a postmortem. The non-user-facing nature makes this less urgent than Redis or Postgres, but the silent-failure modes (dropped DMCA propagation, missing transcodes) warrant a write-up so we know what slipped through.
Future-proofing
- v1.1 will move to a 3-node RabbitMQ cluster behind a load balancer for HA. This runbook will then split into "single-node down" (the cluster keeps serving) and "cluster split-brain" (rare, but the recovery path is different).
- Worker idempotency keys are documented in
docs/api/eventbus.md — any new worker MUST honour them so a replay during recovery doesn't double-charge / double-distribute / double-takedown.