veza/docs/runbooks/rabbitmq-down.md
senke 2a5bc11628 fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook
The game-day driver had no notion of inventory — it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging where chaos is
the point; on prod, an accidental run on a Monday morning would
cost a real outage.

Added:

  scripts/security/game-day-driver.sh
    * INVENTORY env var — defaults to 'staging' so silence stays
      safe. INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
      type-the-phrase 'KILL-PROD' confirm. Anything other than
      staging|prod aborts. (Guard shape sketched after this list.)
    * Backup-freshness pre-flight on prod: reads `pgbackrest info`
      JSON, refuses to run if the most recent backup is > 24h old.
      SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline.
    * Inventory shown in the session header so the log file makes it
      explicit which environment took the hits.

  docs/runbooks/rabbitmq-down.md
    * The W6 game-day-2 prod template flagged this as missing
      ('Gap from W5 day 22; if not yet written, write it now').
      Mirrors the structure of redis-down.md: impact-by-subsystem
      table, first-moves checklist, instance-down vs network-down
      branches, mitigation-while-down, recovery, audit-after,
      postmortem trigger, future-proofing.
    * Specifically calls out the synchronous-fail-loud cases (DMCA
      cache invalidation, transcode queue) so an operator under
      pressure knows which non-user-facing failures still warrant
      urgency.
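
  Guard shape, sketched for reviewers — not the script verbatim. Only
  INVENTORY, CONFIRM_PROD, SKIP_BACKUP_FRESHNESS, the 'KILL-PROD'
  phrase and the pgbackrest read come from the change above; the rest
  is illustrative:

    INVENTORY="${INVENTORY:-staging}"        # silence stays safe
    case "$INVENTORY" in
      staging) ;;
      prod)
        [ "${CONFIRM_PROD:-}" = 1 ] || { echo "prod needs CONFIRM_PROD=1" >&2; exit 1; }
        read -r -p "Type KILL-PROD to proceed: " phrase
        [ "$phrase" = KILL-PROD ] || exit 1
        if [ "${SKIP_BACKUP_FRESHNESS:-}" != 1 ]; then
          last=$(pgbackrest info --output=json | jq '.[0].backup[-1].timestamp.stop')
          age=$(( $(date +%s) - last ))
          [ "$age" -le 86400 ] || { echo "newest backup is ${age}s old" >&2; exit 1; }
        fi
        ;;
      *) echo "unknown INVENTORY: $INVENTORY" >&2; exit 1 ;;
    esac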

Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior looking over
their shoulder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:32:05 +02:00


Runbook — RabbitMQ unavailable

Alert: RabbitMQUnreachable (in config/prometheus/alert_rules.yml). Owner: infra on-call. Game-day scenario: E (infra/ansible/tests/test_rabbitmq_outage.sh).

What breaks when RabbitMQ is down

RabbitMQ is a fan-out broker for asynchronous, non-user-facing work (transcode jobs, distribution to external platforms, email digests, DMCA takedown propagation, search index updates). The user-facing request path does NOT block on RabbitMQ — the API publishes a message and returns 202 Accepted; the worker picks it up later.
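
A quick shell check of that contract during an outage — a sketch only; the upload route and $TOKEN are illustrative, verify the real path against the API docs:

curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $TOKEN" \
  -F 'file=@sample.mp3' https://api.veza.fr/api/v1/tracks
# expect 202 even mid-outage — the publish failure is absorbed (see mitigation below)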

| Subsystem | Effect when RabbitMQ is gone | Severity |
| --- | --- | --- |
| Track upload → HLS transcode | Upload succeeds (S3 write OK), HLS segments don't appear | MEDIUM — track playable via fallback /stream, not via HLS |
| Distribution to Spotify/SoundCloud | Submission silently queued; users see "pending" forever | MEDIUM — surfaces in distribution dashboard, not in player |
| Email digest (weekly creator stats) | Cron tick logs publish failed, retries on next tick | LOW — eventual consistency, no user-visible breakage |
| DMCA takedown event | Track flag flipped in DB synchronously; downstream replay queue stalls | HIGH — track is gated immediately (synchronous DB UPDATE), but cache invalidation lags |
| Search index updates | New tracks not searchable until queue drains | LOW — falls back to Postgres FTS |
| Chat messages (WebSocket) | INDEPENDENT — chat is direct WS, no RabbitMQ involvement | NONE |
| Auth, sessions, payments | INDEPENDENT — no RabbitMQ dependency | NONE |

The synchronous-fail-loud cases (DMCA cache invalidation, transcode queue) are the ones that compound if the outage drags. Most user flows degrade gracefully.

First moves

  1. Confirm RabbitMQ is actually down, not "unreachable from one host":
    curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
      | jq '.cluster_name, .object_totals'
    
  2. Confirm what changed. If a deploy fired in the last 30 min, suspect the deploy. Check journalctl -u veza-backend-api -n 200 for amqp errors with timestamps after the deploy.
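    For example (the grep pattern is illustrative, run on the API host):
    journalctl -u veza-backend-api -n 200 --no-pager | grep -iE 'amqp|eventbus'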
  3. Check the queues didn't fill the disk (most common bring-down in development):
    ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'
    

RabbitMQ instance is down

# State on the RabbitMQ host:
ssh rabbitmq.lxd sudo systemctl status rabbitmq-server

# Logs (Erlang verbosity, grep for ERROR/CRASH):
ssh rabbitmq.lxd sudo journalctl -u rabbitmq-server -n 500 \
  | grep -E 'ERROR|CRASH|disk_alarm|memory_alarm'

Common causes:

  • Disk alarm. /var/lib/rabbitmq filled — RabbitMQ pauses producers when free space drops below disk_free_limit. The backend's amqp client surfaces this as "blocked". Fix: grow the disk or expire old messages with rabbitmqctl purge_queue <queue> (last resort, you lose what's in there). See the triage sketch after this list.
  • Memory alarm. RSS over vm_memory_high_watermark × system mem. Same effect (producers blocked). Fix: add memory or unblock by draining a slow consumer.
  • Process crashed. Erlang OOM, segfault. sudo systemctl restart rabbitmq-server; the queues survive (durable=true on every queue we declare).
  • Cluster split-brain. v1.0 is single-node, so this can't happen yet. Listed for the v1.1 multi-node config.
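
A triage sketch for the first two causes — rabbitmq-diagnostics and rabbitmqctl are stock RabbitMQ tooling; the purge example reuses a queue name from this runbook and is commented out on purpose:

# any disk/memory alarm currently firing?
ssh rabbitmq.lxd 'sudo rabbitmq-diagnostics check_local_alarms'

# which queues hold the backlog?
ssh rabbitmq.lxd 'sudo rabbitmqctl list_queues name messages' | sort -k2 -n | tail -5

# last resort, destructive — you lose the queue's contents:
# ssh rabbitmq.lxd 'sudo rabbitmqctl purge_queue transcode_jobs'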

Backend can't reach RabbitMQ

Network or DNS issue, not RabbitMQ's fault.

# From the API container:
nc -zv rabbitmq.lxd 5672

# DNS:
getent hosts rabbitmq.lxd

# AMQP credentials:
docker exec veza_backend_api env | grep AMQP_URL

Likely culprits: an Incus bridge restart, a password rotation that didn't propagate to the API container's env, a security-group change.
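
If the bridge is the suspect, check it from the Incus side first (the bridge name incusbr0 is an assumption — list networks to find yours):

incus network list                 # is the bridge present and managed?
incus network show incusbr0        # addresses / state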

Mitigation while RabbitMQ is down

The backend already handles publish failures gracefully:

  • internal/eventbus/rabbitmq.go retries with exponential backoff up to 30s, then drops to "degraded mode" (publish returns immediately with a logged warning, the API call succeeds, the side-effect is lost).
  • Workers in internal/workers/ have WithRetry() middleware that republishes failed deliveries up to 5 times before dead-lettering.
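
To watch the drop rate live, grep the degraded-mode warnings (the log string here is an assumption — match it against what internal/eventbus/rabbitmq.go actually emits):

docker logs --since 5m veza_backend_api 2>&1 | grep -ci 'degraded'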

If recovery is going to take > 10 min, set EVENTBUS_DEGRADED_LOG_LEVEL=error (default warn) so the fail-fast logs land in Sentry and operators can audit which messages were dropped.

Do NOT restart the backend to clear the AMQP connection pool; the reconnect logic in eventbus.go:142 (logged via go.uber.org/zap) handles it once RabbitMQ is back.

Recovery

Once RabbitMQ is back up:

  1. Verify connectivity from each backend instance:
    docker exec veza_backend_api sh -c 'echo -e "AMQP\x00\x00\x09\x01" | nc -w1 rabbitmq.lxd 5672 | head -c 4'
    
    Should return AMQP.
  2. Watch the queue depth on the management UI: http://rabbitmq.lxd:15672/#/queues. Expect transcode_jobs, distribution_outbox, dmca_propagation, search_index_updates to drain over the next 5-15 min as the workers catch up.
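    The same numbers without the UI, via the management HTTP API (same credentials as the overview check in "First moves"):
    curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/queues \
      | jq -r '.[] | "\(.name)\t\(.messages)"' | sort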
  3. If a queue is stuck > 30 min after recovery, the worker for it is wedged — restart that specific worker container:
    docker compose -f docker-compose.prod.yml restart worker-<name>
    

Audit after the outage

  1. Sentry filter tag:eventbus.status=degraded between outage start and end — gives you the count and shape of dropped events.
  2. For each dropped DMCA event, manually trigger the cache flush:
    curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
      https://api.veza.fr/api/v1/admin/cache/dmca/flush
    
  3. For each dropped transcode job, requeue from the tracks table:
    psql "$DATABASE_URL" -c "
      INSERT INTO transcode_jobs (track_id, status, attempts, created_at)
      SELECT id, 'pending', 0, NOW() FROM tracks
      WHERE created_at BETWEEN '<outage_start>' AND '<outage_end>'
        AND hls_status IS NULL;
    "
    

Postmortem trigger

Any RabbitMQ outage > 10 min triggers a postmortem. The non-user-facing nature makes this less urgent than Redis or Postgres, but the silent-failure modes (dropped DMCA propagation, missing transcodes) warrant a write-up so we know what slipped through.

Future-proofing

  • v1.1 will move to a 3-node RabbitMQ cluster behind a load balancer for HA. This runbook will then split into "single-node down" (the cluster keeps serving) and "cluster split-brain" (rare, but the recovery path is different).
  • Worker idempotency keys are documented in docs/api/eventbus.md — any new worker MUST honour them so a replay during recovery doesn't double-charge / double-distribute / double-takedown.