# Runbook — RabbitMQ unavailable

> **Alert**: `RabbitMQUnreachable` (in `config/prometheus/alert_rules.yml`).
> **Owner**: infra on-call.
> **Game-day scenario**: E (`infra/ansible/tests/test_rabbitmq_outage.sh`).

## What breaks when RabbitMQ is down

RabbitMQ is a fan-out broker for asynchronous, non-user-facing work (transcode jobs, distribution to external platforms, email digests, DMCA takedown propagation, search index updates). The user-facing request path does NOT block on RabbitMQ — the API publishes a message and returns 202 Accepted; the worker picks it up later.

| Subsystem | Effect when RabbitMQ is gone | Severity |
| --- | --- | --- |
| Track upload → HLS transcode | Upload succeeds (S3 write OK), HLS segments don't appear | **MEDIUM** — track playable via fallback `/stream`, not via HLS |
| Distribution to Spotify/SoundCloud | Submission silently queued; users see "pending" forever | MEDIUM — surfaces in distribution dashboard, not in player |
| Email digest (weekly creator stats) | Cron tick logs `publish failed`, retries on next tick | LOW — eventual consistency, no user-visible breakage |
| DMCA takedown event | Track flag flipped in DB synchronously; downstream replay queue stalls | **HIGH** — track is gated immediately (synchronous DB UPDATE), but cache invalidation lags |
| Search index updates | New tracks not searchable until queue drains | LOW — falls back to Postgres FTS |
| Chat messages (WebSocket) | INDEPENDENT — chat is direct WS, no RabbitMQ involvement | NONE |
| Auth, sessions, payments | INDEPENDENT — no RabbitMQ dependency | NONE |

The silent-failure cases (DMCA cache invalidation, transcode queue) are the ones that compound if the outage drags on. Most user flows degrade gracefully.

## First moves

1. **Confirm RabbitMQ is actually down**, not "unreachable from one host":

   ```bash
   curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
     | jq '.cluster_name, .object_totals'
   ```

2. **Confirm what changed.** If a deploy fired in the last 30 min, suspect the deploy. Check `journalctl -u veza-backend-api -n 200` for `amqp` errors with timestamps after the deploy.

3. **Check the queues didn't fill the disk** (the most common bring-down in development):

   ```bash
   ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'
   ```

## RabbitMQ instance is down

```bash
# State on the RabbitMQ host:
ssh rabbitmq.lxd sudo systemctl status rabbitmq-server

# Logs (Erlang verbosity, grep for ERROR/CRASH):
ssh rabbitmq.lxd sudo journalctl -u rabbitmq-server -n 500 \
  | grep -E 'ERROR|CRASH|disk_alarm|memory_alarm'
```

Common causes (a quick alarm check is sketched after this list):

- **Disk alarm.** `/var/lib/rabbitmq` filled — RabbitMQ pauses producers when free space drops below `disk_free_limit`. The backend's amqp client surfaces this as "blocked". Fix: grow the disk, or expire old messages with `rabbitmqctl purge_queue <queue-name>` (last resort, you lose what's in there).
- **Memory alarm.** RSS over `vm_memory_high_watermark` × system mem. Same effect (producers blocked). Fix: add memory, or unblock by draining a slow consumer.
- **Process crashed.** Erlang OOM, segfault. `sudo systemctl restart rabbitmq-server`; the queues survive (durable=true on every queue we declare).
- **Cluster split-brain.** v1.0 is single-node, so this can't happen yet. Listed for the v1.1 multi-node config.
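If the management plugin is still answering, the node alarm flags distinguish the first two causes without grepping the Erlang log. A minimal check, assuming the same `$RMQ_USER`/`$RMQ_PASS` credentials as the overview call in "First moves"; `mem_alarm` and `disk_free_alarm` are standard fields of the management API's `/api/nodes` payload:

```bash
# Which alarm fired? Requires the management plugin to be up and reachable;
# prints per-node alarm flags plus the raw disk/memory figures behind them.
curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/nodes \
  | jq '.[] | {name, mem_alarm, disk_free_alarm, disk_free, mem_used}'
```

If both flags are `false` and the process is up, skip straight to the network checks in the next section.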
## Backend can't reach RabbitMQ

Network or DNS issue, not RabbitMQ's fault.

```bash
# From the API container:
nc -zv rabbitmq.lxd 5672

# DNS:
getent hosts rabbitmq.lxd

# AMQP credentials:
docker exec veza_backend_api env | grep AMQP_URL
```

Likely culprits: an Incus bridge restart, a password rotation that didn't propagate to the API container's env, a security-group change.

## Mitigation while RabbitMQ is down

The backend already handles publish failures gracefully:

- `internal/eventbus/rabbitmq.go` retries with exponential backoff up to 30s, then drops to "degraded mode" (publish returns immediately with a logged warning, the API call succeeds, the side-effect is lost).
- Workers in `internal/workers/` have `WithRetry()` middleware that republishes failed deliveries up to 5 times before dead-lettering.

If recovery is going to take > 10 min, set `EVENTBUS_DEGRADED_LOG_LEVEL=error` (default `warn`) so the fail-fast logs land in Sentry and operators can audit which messages were dropped.

**Do NOT** restart the backend to clear the AMQP connection pool; the reconnect logic (`go.uber.org/zap`-logged in eventbus.go:142) handles it once RabbitMQ is back.

## Recovery

Once RabbitMQ is back up:

1. Verify connectivity from each backend instance:

   ```bash
   docker exec veza_backend_api sh -c 'echo -e "AMQP\x00\x00\x09\x01" | nc -w1 rabbitmq.lxd 5672 | head -c 4'
   ```

   Should return `AMQP`.

2. Watch the queue depth on the management UI: `http://rabbitmq.lxd:15672/#/queues`. Expect `transcode_jobs`, `distribution_outbox`, `dmca_propagation`, `search_index_updates` to drain over the next 5-15 min as the workers catch up.

3. If a queue is still stuck > 30 min after recovery, the worker for it is wedged — restart that specific worker container:

   ```bash
   docker compose -f docker-compose.prod.yml restart worker-<name>
   ```

## Audit after the outage

1. Sentry filter `tag:eventbus.status=degraded` between outage start and end — gives you the count and shape of dropped events.

2. For each dropped DMCA event, manually trigger the cache flush:

   ```bash
   curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
     https://api.veza.fr/api/v1/admin/cache/dmca/flush
   ```

3. For each dropped transcode job, requeue from the tracks table:

   ```bash
   psql "$DATABASE_URL" -c "
     INSERT INTO transcode_jobs (track_id, status, attempts, created_at)
     SELECT id, 'pending', 0, NOW()
     FROM tracks
     WHERE created_at BETWEEN '<outage start>' AND '<outage end>'
       AND hls_status IS NULL;
   "
   ```

## Postmortem trigger

Any RabbitMQ outage > 10 min triggers a postmortem. The non-user-facing nature makes this less urgent than Redis or Postgres, but the silent-failure modes (dropped DMCA propagation, missing transcodes) warrant a write-up so we know what slipped through.

## Future-proofing

- v1.1 will move to a 3-node RabbitMQ cluster behind a load balancer for HA. This runbook will then split into "single-node down" (the cluster keeps serving) and "cluster split-brain" (rare, but the recovery path is different).
- Worker idempotency keys are documented in `docs/api/eventbus.md` — any new worker MUST honour them so a replay during recovery doesn't double-charge / double-distribute / double-takedown. A quick duplicate check for the transcode case is sketched below.
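There is no broker-side dedup in v1.0, so after a manual requeue it is worth confirming nothing was enqueued twice. A minimal sketch, reusing the `transcode_jobs` columns from the audit step above (the table and column names are taken from that snippet, not from a verified schema):

```bash
# Hypothetical post-requeue duplicate check: tracks with more than one pending
# transcode job. Uses the same transcode_jobs columns as the requeue snippet
# in "Audit after the outage"; adjust if the real schema differs.
psql "$DATABASE_URL" -c "
  SELECT track_id, COUNT(*) AS pending_jobs
  FROM transcode_jobs
  WHERE status = 'pending'
  GROUP BY track_id
  HAVING COUNT(*) > 1;
"
```

Anything this turns up is exactly the double-work the idempotency keys are meant to prevent.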