veza/docs/runbooks/rabbitmq-down.md
senke 2a5bc11628 fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook
The game-day driver had no notion of inventory — it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging where chaos is
the point; on prod, an accidental run on a Monday morning would
cost a real outage.

Added:

  scripts/security/game-day-driver.sh
    * INVENTORY env var — defaults to 'staging' so silence stays
      safe. INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
      type-the-phrase 'KILL-PROD' confirm. Anything other than
      staging|prod aborts.
    * Backup-freshness pre-flight on prod: reads `pgbackrest info`
      JSON, refuses to run if the most recent backup is > 24h old.
      SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline.
    * Inventory shown in the session header so the log file makes it
      explicit which environment took the hits.
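
  A minimal sketch of the guard shape (illustrative only; the variable
  handling and the jq path are assumptions, not the script's exact code):

      INVENTORY="${INVENTORY:-staging}"   # silence stays safe
      case "$INVENTORY" in
        staging) ;;
        prod)
          [ "${CONFIRM_PROD:-0}" = "1" ] || { echo "set CONFIRM_PROD=1"; exit 1; }
          read -r -p "Type KILL-PROD to continue: " phrase
          [ "$phrase" = "KILL-PROD" ] || exit 1
          if [ "${SKIP_BACKUP_FRESHNESS:-0}" != "1" ]; then
            stop=$(pgbackrest info --output=json | jq '.[0].backup[-1].timestamp.stop')
            age=$(( $(date +%s) - stop ))
            [ "$age" -le 86400 ] || { echo "newest backup ${age}s old (>24h)"; exit 1; }
          fi ;;
        *) echo "unknown INVENTORY '$INVENTORY'" >&2; exit 1 ;;
      esac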

  docs/runbooks/rabbitmq-down.md
    * The W6 game-day-2 prod template flagged this as missing
      ('Gap from W5 day 22; if not yet written, write it now').
      Mirrors the structure of redis-down.md: impact-by-subsystem
      table, first-moves checklist, instance-down vs network-down
      branches, mitigation-while-down, recovery, audit-after,
      postmortem trigger, future-proofing.
    * Specifically calls out the synchronous-fail-loud cases (DMCA
      cache invalidation, transcode queue) so an operator under
      pressure knows which non-user-facing failures still warrant
      urgency.

Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior watching over
their shoulder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:32:05 +02:00


# Runbook — RabbitMQ unavailable
> **Alert**: `RabbitMQUnreachable` (in `config/prometheus/alert_rules.yml`).
> **Owner**: infra on-call.
> **Game-day scenario**: E (`infra/ansible/tests/test_rabbitmq_outage.sh`).
## What breaks when RabbitMQ is down
RabbitMQ is a fan-out broker for asynchronous, non-user-facing work
(transcode jobs, distribution to external platforms, email digests,
DMCA takedown propagation, search index updates). The user-facing
request path does NOT block on RabbitMQ — the API publishes a message
and returns 202 Accepted; the worker picks it up later.
| Subsystem | Effect when RabbitMQ is gone | Severity |
| ------------------------------------ | ------------------------------------------------------------------ | -------- |
| Track upload → HLS transcode | Upload succeeds (S3 write OK), HLS segments don't appear | **MEDIUM** — track playable via fallback `/stream`, not via HLS |
| Distribution to Spotify/SoundCloud | Submission silently queued; users see "pending" forever | MEDIUM — surfaces in distribution dashboard, not in player |
| Email digest (weekly creator stats) | Cron tick logs `publish failed`, retries on next tick | LOW — eventual consistency, no user-visible breakage |
| DMCA takedown event | Track flag flipped in DB synchronously; downstream replay queue stalls | **HIGH** — track is gated immediately (synchronous DB UPDATE), but cache invalidation lags |
| Search index updates | New tracks not searchable until queue drains | LOW — falls back to Postgres FTS |
| Chat messages (WebSocket) | INDEPENDENT — chat is direct WS, no RabbitMQ involvement | NONE |
| Auth, sessions, payments | INDEPENDENT — no RabbitMQ dependency | NONE |
The synchronous-fail-loud cases (DMCA cache invalidation, transcode
queue) are the ones that compound if the outage drags. Most user
flows degrade gracefully.
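To spot-check the HIGH row live, compare the DB flag against the cached
public response. A hedged sketch: the `takedown_at` column and the public
stream URL are assumptions; adjust to the actual schema and route:
```bash
# DB view: the gate flips synchronously, so this should already be set.
psql "$DATABASE_URL" -c "SELECT id, takedown_at FROM tracks WHERE id = <track_id>;"
# Cached public view: may still return 200 until the invalidation
# event is replayed (that lag is the HIGH-severity part).
curl -s -o /dev/null -w '%{http_code}\n' \
  https://api.veza.fr/api/v1/tracks/<track_id>/stream
```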
## First moves
1. **Confirm RabbitMQ is actually down**, not "unreachable from one
host":
```bash
curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
| jq '.cluster_name, .object_totals'
```
2. **Confirm what changed.** If a deploy fired in the last 30 min,
suspect the deploy. Check `journalctl -u veza-backend-api -n 200`
for `amqp` errors with timestamps after the deploy.
3. **Check the queues didn't fill the disk** (most common bring-down
in development):
```bash
ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'
```
## RabbitMQ instance is down
```bash
# State on the RabbitMQ host:
ssh rabbitmq.lxd sudo systemctl status rabbitmq-server
# Logs (Erlang verbosity, grep for ERROR/CRASH):
ssh rabbitmq.lxd sudo journalctl -u rabbitmq-server -n 500 \
| grep -E 'ERROR|CRASH|disk_alarm|memory_alarm'
```
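For a quicker pass/fail than reading Erlang logs, the stock
`rabbitmq-diagnostics` checks (ship with RabbitMQ 3.8+) cover the usual
suspects:
```bash
ssh rabbitmq.lxd sudo rabbitmq-diagnostics check_running       # app actually booted
ssh rabbitmq.lxd sudo rabbitmq-diagnostics check_local_alarms  # disk/memory alarms firing?
ssh rabbitmq.lxd sudo rabbitmq-diagnostics check_port_connectivity  # listeners reachable
```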
Common causes:
- **Disk alarm.** `/var/lib/rabbitmq` filled — RabbitMQ pauses producers
when free space drops below `disk_free_limit`. The backend's amqp
client surfaces this as "blocked". Fix: grow the disk or expire old
messages with `rabbitmqctl purge_queue <queue>` (last resort, you
lose what's in there; the sketch after this list finds the offender
first).
- **Memory alarm.** RSS over `vm_memory_high_watermark` × system mem.
Same effect (producers blocked). Fix: add memory or unblock by
draining a slow consumer.
- **Process crashed.** Erlang OOM, segfault. `sudo systemctl restart
rabbitmq-server`; the queues survive (durable=true on every queue
we declare).
- **Cluster split-brain.** v1.0 is single-node, so this can't happen
yet. Listed for the v1.1 multi-node config.
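Before purging anything, see which queues actually hold the bytes
(`message_bytes` is a stock `rabbitmqctl list_queues` info item):
```bash
ssh rabbitmq.lxd sudo rabbitmqctl list_queues name messages message_bytes \
  | sort -t$'\t' -k3 -rn | head
```
Purge only the top offender, and only once you're sure its contents are
re-derivable.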
## Backend can't reach RabbitMQ
Network or DNS issue, not RabbitMQ's fault.
```bash
# From the API container:
nc -zv rabbitmq.lxd 5672
# DNS:
getent hosts rabbitmq.lxd
# AMQP credentials:
docker exec veza_backend_api env | grep AMQP_URL
```
Likely culprits: an Incus bridge restart, a password rotation that didn't
propagate to the API container's env, or a security-group change.
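To tell a bridge/DNS problem from a container-only one, run the same
probe from the Incus host, outside the container (the `incus` call
assumes you're on the host that owns the bridge):
```bash
nc -zv rabbitmq.lxd 5672   # host-to-broker: works here but not in the
                           # container? suspect Docker networking, not Incus
incus network list         # bridge present and up?
```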
## Mitigation while RabbitMQ is down
The backend already handles publish failures gracefully:
- `internal/eventbus/rabbitmq.go` retries with exponential backoff up
to 30s, then drops to "degraded mode" (publish returns immediately
with a logged warning, the API call succeeds, the side-effect is
lost).
- Workers in `internal/workers/` have `WithRetry()` middleware that
republishes failed deliveries up to 5 times before dead-lettering.
If recovery is going to take > 10 min, set
`EVENTBUS_DEGRADED_LOG_LEVEL=error` (default `warn`) so the
fail-fast logs land in Sentry and operators can audit which messages
were dropped.
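To size the loss while the outage is still running, count the
degraded-mode warnings from the host (the exact log line shape is an
assumption; grep for whatever eventbus.go actually emits):
```bash
docker logs veza_backend_api --since 30m 2>&1 | grep -ci 'degraded'
```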
**Do NOT** restart the backend to clear the AMQP connection pool; the
reconnect logic in eventbus.go:142 (logged via `go.uber.org/zap`)
handles it once RabbitMQ is back.
## Recovery
Once RabbitMQ is back up:
1. Verify connectivity from each backend instance (see the loop sketch
after this list):
```bash
docker exec veza_backend_api sh -c 'echo -e "AMQP\x00\x00\x09\x01" | nc -w1 rabbitmq.lxd 5672 | head -c 4'
```
Should return `AMQP`.
2. Watch the queue depth on the management UI:
`http://rabbitmq.lxd:15672/#/queues` (CLI equivalent in the sketch
after this list). Expect `transcode_jobs`, `distribution_outbox`,
`dmca_propagation`, `search_index_updates` to drain over the next
5-15 min as the workers catch up.
3. If a queue is stuck > 30 min after recovery, the worker for it is
wedged (consumers == 0 in the sketch below) — restart that specific
worker container:
```bash
docker compose -f docker-compose.prod.yml restart worker-<name>
```
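Two helpers for the steps above. The `grep backend` container-name
filter is an assumption about the compose naming; the `rabbitmqctl`
invocations are stock CLI:
```bash
# Step 1 across every backend container in one pass:
for c in $(docker ps --format '{{.Names}}' | grep backend); do
  printf '%s: ' "$c"
  docker exec "$c" sh -c 'echo -e "AMQP\x00\x00\x09\x01" | nc -w1 rabbitmq.lxd 5672 | head -c 4'
  echo
done

# Steps 2-3 without the management UI: depth and consumer count per queue.
# messages > 0 with consumers == 0 points at a wedged worker.
watch -n 10 "ssh rabbitmq.lxd sudo rabbitmqctl list_queues name messages consumers \
  | grep -E 'transcode_jobs|distribution_outbox|dmca_propagation|search_index_updates'"
```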
## Audit after the outage
1. Sentry filter `tag:eventbus.status=degraded` between outage start
and end — gives you the count and shape of dropped events.
2. For each dropped DMCA event, manually trigger the cache flush:
```bash
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
https://api.veza.fr/api/v1/admin/cache/dmca/flush
```
3. For each dropped transcode job, requeue from the tracks table (size
it first with the count query after this list):
```bash
psql "$DATABASE_URL" -c "
INSERT INTO transcode_jobs (track_id, status, attempts, created_at)
SELECT id, 'pending', 0, NOW() FROM tracks
WHERE created_at BETWEEN '<outage_start>' AND '<outage_end>'
AND hls_status IS NULL;
"
```
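Before running the INSERT above, size the requeue with the same
predicate so an off-by-one on the window doesn't flood the transcoders:
```bash
psql "$DATABASE_URL" -c "
SELECT count(*) FROM tracks
WHERE created_at BETWEEN '<outage_start>' AND '<outage_end>'
AND hls_status IS NULL;
"
```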
## Postmortem trigger
Any RabbitMQ outage > 10 min triggers a postmortem. The non-user-facing
nature makes this less urgent than Redis or Postgres, but the
silent-failure modes (dropped DMCA propagation, missing transcodes)
warrant a write-up so we know what slipped through.
## Future-proofing
- v1.1 will move to a 3-node RabbitMQ cluster behind a load balancer
for HA. This runbook will then split into "single-node down" (the
cluster keeps serving) and "cluster split-brain" (rare, but the
recovery path is different).
- Worker idempotency keys are documented in `docs/api/eventbus.md`;
any new worker MUST honour them so a replay during recovery doesn't
double-charge / double-distribute / double-takedown.