# Runbook — RabbitMQ unavailable
> **Alert**: `RabbitMQUnreachable` (in `config/prometheus/alert_rules.yml`).
> **Owner**: infra on-call.
> **Game-day scenario**: E (`infra/ansible/tests/test_rabbitmq_outage.sh`).
## What breaks when RabbitMQ is down
RabbitMQ is a fan-out broker for asynchronous, non-user-facing work
(transcode jobs, distribution to external platforms, email digests,
DMCA takedown propagation, search index updates). The user-facing
request path does NOT block on RabbitMQ — the API publishes a message
and returns 202 Accepted; the worker picks it up later.
| Subsystem | Effect when RabbitMQ is gone | Severity |
| ------------------------------------ | ------------------------------------------------------------------ | -------- |
| Track upload → HLS transcode | Upload succeeds (S3 write OK), HLS segments don't appear | **MEDIUM** — track playable via fallback `/stream`, not via HLS |
| Distribution to Spotify/SoundCloud | Submission silently queued; users see "pending" forever | MEDIUM — surfaces in distribution dashboard, not in player |
| Email digest (weekly creator stats) | Cron tick logs `publish failed`, retries on next tick | LOW — eventual consistency, no user-visible breakage |
| DMCA takedown event | Track flag flipped in DB synchronously; downstream replay queue stalls | **HIGH** — track is gated immediately (synchronous DB UPDATE), but cache invalidation lags |
| Search index updates | New tracks not searchable until queue drains | LOW — falls back to Postgres FTS |
| Chat messages (WebSocket) | INDEPENDENT — chat is direct WS, no RabbitMQ involvement | NONE |
| Auth, sessions, payments | INDEPENDENT — no RabbitMQ dependency | NONE |
The synchronous-fail-loud cases (DMCA cache invalidation, transcode
queue) are the ones that compound if the outage drags. Most user
flows degrade gracefully.
## First moves
1. **Confirm RabbitMQ is actually down**, not "unreachable from one
   host":
```bash
curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
| jq '.cluster_name, .object_totals'
```
2. **Confirm what changed.** If a deploy fired in the last 30 min,
   suspect the deploy. Check `journalctl -u veza-backend-api -n 200`
   for `amqp` errors with timestamps after the deploy (filter sketch
   after this list).
3. **Check the queues didn't fill the disk** (the most common cause of
   a bring-down in development):
```bash
ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'
```
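For step 2, a minimal filter sketch — `<deploy_time>` is a placeholder,
and the `amqp|eventbus` pattern is an assumption about what the backend
logs; widen it if nothing matches:
```bash
# Backend errors mentioning the event bus since the suspect deploy.
journalctl -u veza-backend-api --since '<deploy_time>' --no-pager \
  | grep -iE 'amqp|eventbus'
```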
## RabbitMQ instance is down
```bash
# State on the RabbitMQ host:
ssh rabbitmq.lxd sudo systemctl status rabbitmq-server
# Logs (Erlang verbosity, grep for ERROR/CRASH):
ssh rabbitmq.lxd sudo journalctl -u rabbitmq-server -n 500 \
| grep -E 'ERROR|CRASH|disk_alarm|memory_alarm'
```
Common causes:
- **Disk alarm.** `/var/lib/rabbitmq` filled — RabbitMQ pauses producers
  when free space drops below `disk_free_limit`. The backend's amqp
  client surfaces this as "blocked". Fix: grow the disk or expire old
  messages with `rabbitmqctl purge_queue <queue>` (last resort, you
  lose what's in there). See the alarm check below.
- **Memory alarm.** RSS over `vm_memory_high_watermark` × system mem.
  Same effect (producers blocked). Fix: add memory or unblock by
  draining a slow consumer.
- **Process crashed.** Erlang OOM, segfault. `sudo systemctl restart
  rabbitmq-server`; the queues survive (durable=true on every queue
  we declare).
- **Cluster split-brain.** v1.0 is single-node, so this can't happen
  yet. Listed for the v1.1 multi-node config.
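To see at a glance whether a resource alarm is in effect and which
queues hold the backlog (a sketch; `rabbitmq-diagnostics alarms` is
available on modern RabbitMQ releases):
```bash
# Resource alarms currently in effect (disk, memory).
ssh rabbitmq.lxd sudo rabbitmq-diagnostics -q alarms
# Queues sorted by backlog, largest first.
ssh rabbitmq.lxd 'sudo rabbitmqctl -q list_queues name messages | sort -k2 -rn | head -n 10'
```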
## Backend can't reach RabbitMQ
Network or DNS issue, not RabbitMQ's fault.
```bash
# From the API container :
nc -zv rabbitmq.lxd 5672
# DNS :
getent hosts rabbitmq.lxd
# AMQP credentials :
docker exec veza_backend_api env | grep AMQP_URL
```
Likely culprits: an Incus bridge restart, a password rotation that
didn't propagate to the API container's env, a security-group change.
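If the bridge is the suspect, a quick look from the Incus host (the
`incusbr0` name is an assumption — substitute the actual bridge):
```bash
# Confirm the bridge exists and is up.
incus network list
ip -br link show incusbr0
```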
## Mitigation while RabbitMQ is down
The backend already handles publish failures gracefully:
- `internal/eventbus/rabbitmq.go` retries with exponential backoff up
to 30s, then drops to "degraded mode" (publish returns immediately
with a logged warning, the API call succeeds, the side-effect is
lost).
- Workers in `internal/workers/` have `WithRetry()` middleware that
republishes failed deliveries up to 5 times before dead-lettering.
If recovery is going to take > 10 min, set
`EVENTBUS_DEGRADED_LOG_LEVEL=error` (default `warn`) so the
fail-fast logs land in Sentry and operators can audit which messages
were dropped.
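For a rough running count of dropped side-effects during the outage
(the `degraded` substring is an assumption — match whatever
`eventbus.go` actually logs):
```bash
# Count degraded-mode publish warnings from the last half hour.
docker logs veza_backend_api --since 30m 2>&1 | grep -ci 'degraded'
```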
**Do NOT** restart the backend to clear the AMQP connection pool;
the reconnect logic (`go.uber.org/zap`-logged in eventbus.go:142)
handles it once RabbitMQ is back.
## Recovery
Once RabbitMQ is back up :
1. Verify connectivity from each backend instance:
   ```bash
   # Send a deliberately stale protocol header: the broker rejects it by
   # echoing the header it supports, so the probe reads back "AMQP".
   docker exec veza_backend_api sh -c \
     "printf 'AMQP\0\0\0\0' | nc -w1 rabbitmq.lxd 5672 | head -c 4"
   ```
   Should print `AMQP`. (A valid `AMQP\x00\x00\x09\x01` header would get
   a binary `Connection.Start` frame back, and `echo -e` isn't portable
   under `sh`.)
2. Watch the queue depth on the management UI:
   `http://rabbitmq.lxd:15672/#/queues`. Expect `transcode_jobs`,
   `distribution_outbox`, `dmca_propagation`, `search_index_updates`
   to drain over the next 5-15 min as the workers catch up (CLI
   alternative after this list).
3. If a queue is stuck > 30 min after recovery, the worker for it is
   wedged — restart that specific worker container:
```bash
docker compose -f docker-compose.prod.yml restart worker-<name>
```
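For step 2, when the management UI is unreachable, the same depths are
available from the CLI:
```bash
# Refresh queue depths every 5s until the backlog drains.
watch -n 5 "ssh rabbitmq.lxd 'sudo rabbitmqctl -q list_queues name messages messages_unacknowledged'"
```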
## Audit after the outage
1. Sentry filter `tag:eventbus.status=degraded` between outage start
and end — gives you the count and shape of dropped events.
2. For each dropped DMCA event, manually trigger the cache flush:
```bash
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
https://api.veza.fr/api/v1/admin/cache/dmca/flush
```
3. For each dropped transcode job, requeue from the tracks table:
```bash
psql "$DATABASE_URL" -c "
INSERT INTO transcode_jobs (track_id, status, attempts, created_at)
SELECT id, 'pending', 0, NOW() FROM tracks
WHERE created_at BETWEEN '<outage_start>' AND '<outage_end>'
AND hls_status IS NULL;
"
```
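A sanity check after the requeue — pending counts should roughly match
the number of tracks uploaded during the window (assumes the
`transcode_jobs` schema used above):
```bash
# Job counts by status; 'pending' should drain as workers pick them up.
psql "$DATABASE_URL" -c "SELECT status, count(*) FROM transcode_jobs GROUP BY status;"
```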
## Postmortem trigger
Any RabbitMQ outage > 10 min triggers a postmortem. The non-user-facing
nature makes this less urgent than Redis or Postgres, but the
silent-failure modes (dropped DMCA propagation, missing transcodes)
warrant a write-up so we know what slipped through.
## Future-proofing
- v1.1 will move to a 3-node RabbitMQ cluster behind a load balancer
for HA. This runbook will then split into "single-node down" (the
cluster keeps serving) and "cluster split-brain" (rare, but the
recovery path is different).
- Worker idempotency keys are documented in `docs/api/eventbus.md`;
  any new worker MUST honour them so a replay during recovery doesn't
  double-charge / double-distribute / double-takedown.