# Runbook — RabbitMQ unavailable

> **Alert**: `RabbitMQUnreachable` (in `config/prometheus/alert_rules.yml`).
> **Owner**: infra on-call.
> **Game-day scenario**: E (`infra/ansible/tests/test_rabbitmq_outage.sh`).

## What breaks when RabbitMQ is down

RabbitMQ is a fan-out broker for asynchronous, non-user-facing work
(transcode jobs, distribution to external platforms, email digests,
DMCA takedown propagation, search index updates). The user-facing
request path does NOT block on RabbitMQ — the API publishes a message
and returns 202 Accepted; the worker picks it up later.

| Subsystem | Effect when RabbitMQ is gone | Severity |
| --- | --- | --- |
| Track upload → HLS transcode | Upload succeeds (S3 write OK), HLS segments don't appear | **MEDIUM** — track playable via fallback `/stream`, not via HLS |
| Distribution to Spotify/SoundCloud | Submission silently queued; users see "pending" forever | MEDIUM — surfaces in distribution dashboard, not in player |
| Email digest (weekly creator stats) | Cron tick logs `publish failed`, retries on next tick | LOW — eventual consistency, no user-visible breakage |
| DMCA takedown event | Track flag flipped in DB synchronously; downstream replay queue stalls | **HIGH** — track is gated immediately (synchronous DB UPDATE), but cache invalidation lags |
| Search index updates | New tracks not searchable until queue drains | LOW — falls back to Postgres FTS |
| Chat messages (WebSocket) | INDEPENDENT — chat is direct WS, no RabbitMQ involvement | NONE |
| Auth, sessions, payments | INDEPENDENT — no RabbitMQ dependency | NONE |

The cases that compound if the outage drags are DMCA cache
invalidation and the transcode queue — the synchronous parts succeed
while the queued side-effects silently pile up or drop. Most user
flows degrade gracefully.

## First moves

1. **Confirm RabbitMQ is actually down**, not "unreachable from one
   host":

   ```bash
   curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
     | jq '.cluster_name, .object_totals'
   ```

2. **Confirm what changed.** If a deploy fired in the last 30 min,
   suspect the deploy. Check `journalctl -u veza-backend-api -n 200`
   for `amqp` errors with timestamps after the deploy.

3. **Check the queues didn't fill the disk** (most common bring-down
   in development):

   ```bash
   ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'
   ```

## RabbitMQ instance is down

```bash
# State on the RabbitMQ host:
ssh rabbitmq.lxd sudo systemctl status rabbitmq-server

# Logs (Erlang verbosity — grep for ERROR/CRASH and resource alarms):
ssh rabbitmq.lxd sudo journalctl -u rabbitmq-server -n 500 \
  | grep -E 'ERROR|CRASH|disk_alarm|memory_alarm'
```

Common causes:

- **Disk alarm.** `/var/lib/rabbitmq` filled — RabbitMQ pauses producers
  when free space drops below `disk_free_limit`. The backend's amqp
  client surfaces this as "blocked". Fix: grow the disk or expire old
  messages with `rabbitmqctl purge_queue <queue>` (last resort — you
  lose what's in there).
- **Memory alarm.** RSS over `vm_memory_high_watermark` × system mem.
  Same effect (producers blocked). Fix: add memory or unblock by
  draining a slow consumer.
- **Process crashed.** Erlang OOM, segfault. `sudo systemctl restart
  rabbitmq-server`; the queues survive (durable=true on every queue
  we declare).
- **Cluster split-brain.** v1.0 is single-node, so this can't happen
  yet. Listed for the v1.1 multi-node config.

## Backend can't reach RabbitMQ

Network or DNS issue, not RabbitMQ's fault.

```bash
# From the API container:
nc -zv rabbitmq.lxd 5672

# DNS:
getent hosts rabbitmq.lxd

# AMQP credentials:
docker exec veza_backend_api env | grep AMQP_URL
```

Likely culprits: an Incus bridge restart, a password rotation that
didn't propagate to the API container's env, a security-group change.

## Mitigation while RabbitMQ is down

The backend already handles publish failures gracefully:

- `internal/eventbus/rabbitmq.go` retries with exponential backoff up
  to 30s, then drops to "degraded mode" (publish returns immediately
  with a logged warning, the API call succeeds, the side-effect is
  lost).
- Workers in `internal/workers/` have `WithRetry()` middleware that
  republishes failed deliveries up to 5 times before dead-lettering.

If recovery is going to take > 10 min, set
`EVENTBUS_DEGRADED_LOG_LEVEL=error` (default `warn`) so the
fail-fast logs land in Sentry and operators can audit which messages
were dropped.

**Do NOT** restart the backend to clear the AMQP connection pool;
the reconnect logic (`go.uber.org/zap`-logged in eventbus.go:142)
handles it once RabbitMQ is back.

## Recovery

Once RabbitMQ is back up:

1. Verify connectivity from each backend instance:

   ```bash
   docker exec veza_backend_api sh -c 'printf "AMQP\000\000\011\001" | nc -w1 rabbitmq.lxd 5672 | head -c 4'
   ```

   Should return `AMQP`.

2. Watch the queue depth in the management UI:
   `http://rabbitmq.lxd:15672/#/queues`. Expect `transcode_jobs`,
   `distribution_outbox`, `dmca_propagation`, `search_index_updates`
   to drain over the next 5-15 min as the workers catch up.

3. If a queue is still stuck > 30 min after recovery, the worker for
   it is wedged — restart that specific worker container:

   ```bash
   docker compose -f docker-compose.prod.yml restart worker-<name>
   ```

## Audit after the outage

1. Sentry filter `tag:eventbus.status=degraded` between outage start
   and end — gives you the count and shape of dropped events.

2. For each dropped DMCA event, manually trigger the cache flush:

   ```bash
   curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
     https://api.veza.fr/api/v1/admin/cache/dmca/flush
   ```

3. For each dropped transcode job, requeue from the `tracks` table:

   ```bash
   psql "$DATABASE_URL" -c "
     INSERT INTO transcode_jobs (track_id, status, attempts, created_at)
     SELECT id, 'pending', 0, NOW() FROM tracks
     WHERE created_at BETWEEN '<outage_start>' AND '<outage_end>'
       AND hls_status IS NULL;
   "
   ```

## Postmortem trigger

Any RabbitMQ outage > 10 min triggers a postmortem. The non-user-facing
nature makes this less urgent than Redis or Postgres, but the
silent-failure modes (dropped DMCA propagation, missing transcodes)
warrant a write-up so we know what slipped through.

## Future-proofing

- v1.1 will move to a 3-node RabbitMQ cluster behind a load balancer
  for HA. This runbook will then split into "single-node down" (the
  cluster keeps serving) and "cluster split-brain" (rare, but the
  recovery path is different).
- Worker idempotency keys are documented in `docs/api/eventbus.md` —
  any new worker MUST honour them so a replay during recovery doesn't
  double-charge / double-distribute / double-takedown.
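
In sketch form, an idempotent consumer looks like this (the key scheme and in-memory store are illustrative — the actual contract lives in `docs/api/eventbus.md`):

```go
package main

import (
	"fmt"
	"sync"
)

// seenStore remembers already-processed idempotency keys. A real
// implementation would persist them (e.g. behind a Postgres unique
// index), not hold them in memory.
type seenStore struct {
	mu   sync.Mutex
	seen map[string]bool
}

// MarkOnce returns true the first time a key is seen, false on replays.
func (s *seenStore) MarkOnce(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[key] {
		return false
	}
	s.seen[key] = true
	return true
}

// handleDelivery runs the side-effect only for first-time keys, so a
// replay during recovery can't double-distribute.
func handleDelivery(store *seenStore, key string, sideEffect func()) {
	if !store.MarkOnce(key) {
		return // duplicate delivery: ack and drop
	}
	sideEffect()
}

func main() {
	store := &seenStore{seen: map[string]bool{}}
	count := 0
	for i := 0; i < 3; i++ { // same message delivered three times
		handleDelivery(store, "distribute:track-123", func() { count++ })
	}
	fmt.Println(count) // prints: 1
}
```

With a persistent store, the natural shape is an INSERT guarded by a unique index on the key: the duplicate insert fails, the worker acks and drops.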