# Runbook — Graceful Degradation
> **Owner** : platform engineering.
> **Purpose** : describe what happens when each backing service is
> down, so an operator can set expectations during an outage and a
> developer knows where the safety nets are.
The principle : **the user-facing request path should keep responding
even when secondary services degrade.** Hard failures (login, write
operations) trade for partial functionality (read-only, cached
responses, queued mutations) wherever the trade is reversible.
## Quick lookup — what breaks if X is down
| Backing service | User-visible impact | Severity | Sub-runbook |
| --------------- | -------------------------------------------- | ---------- | ------------------- |
| Postgres (primary) | All write operations + most reads fail (5xx) | **SEV-1** | `db-failover.md` |
| Postgres (replica) | Read-only routes slower (fall back to primary) | SEV-3 | `db-failover.md` |
| Redis (master) | Sessions invalidated, rate-limit goes in-memory | **SEV-1** | `redis-down.md` |
| Redis Sentinel | Failover detection broken, but Redis serves | SEV-2 | `redis-down.md` |
| RabbitMQ | Async jobs queue (transcode, distribution, digest) | SEV-2 | `rabbitmq-down.md` |
| MinIO / S3 | Track upload + signed-URL playback fail | **SEV-1** | (TODO v1.1) |
| Hyperswitch | Checkout fails, refund webhook stalls | SEV-2 | `payment-success-slo-burn.md` |
| Stream server | HLS transcode requests pile up, segment 404s | SEV-2 | `rabbitmq-down.md` |
| ClamAV | Track upload returns 503 (CLAMAV_REQUIRED=true) | SEV-2 | (no dedicated) |
| Coturn | WebRTC 1:1 calls fail behind symmetric NAT | SEV-3 | (no dedicated) |
| Elasticsearch | (orphan in v1.0 — search uses Postgres FTS) | SEV-3 | n/a |
| OpenSMTPD | Email digest + verification mails queue | SEV-3 | (no dedicated) |
## Postgres degradations
### Primary down (sync + async writes)
The API depends on Postgres for every persistent operation. The
backend handler middleware ([`internal/middleware/db_health.go`])
short-circuits incoming requests with 503 when the connection pool
can't acquire a connection within 1 s. This protects from the
"requests pile up while connecting" failure mode that bleeds memory.
What still works while Postgres is down :
- The `/api/v1/health` endpoint responds 200 (it doesn't touch DB).
- The `/api/v1/health/deep` endpoint responds 503 with the failed
component listed (the canary the status page reads).
- Static assets (frontend SPA) still serve from the HAProxy cache.
- WebSocket connections that don't read the DB stay open ; in
practice that's almost none.
What fails immediately :
- Login / refresh / register : 503.
- Any read or write on `/api/v1/*`.
- Cached reads in Redis stay readable but every cache miss falls
through to the DB and 5xx's there.
Recovery path : pg_auto_failover promotes the standby (RTO < 60 s
when sync replication holds). Frontend retries on 503 with backoff,
so users see ~1 min of "service unavailable" then the app comes
back. See `db-failover.md` for the operational steps.
### Replica down (read replica, optional)
When a read replica is configured (`READ_DATABASE_URL`), the
[`internal/database`] package routes read-only queries to it.
[`TrackService.forRead()`] is the canonical example.
If the replica is unreachable, GORM logs a connection error and the
forRead() helper falls back to the primary. User-visible impact :
none, beyond the latency uptick from the primary picking up read
load. Replica downtime is SEV-3 : the cluster keeps serving, but it
should still be investigated within a business day to restore
read scaling.
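A sketch of the fallback shape, assuming GORM v2 ; the struct fields and the 500 ms liveness ping are assumptions, only the fall-back-to-primary behaviour is documented above :
```go
package tracks

import (
	"context"
	"time"

	"gorm.io/gorm"
)

// TrackService fields are trimmed to what this sketch needs.
type TrackService struct {
	primaryDB *gorm.DB // always configured
	readDB    *gorm.DB // nil when READ_DATABASE_URL is unset
}

// forRead prefers the replica but silently falls back to the primary when
// the replica does not answer a quick ping.
func (s *TrackService) forRead(ctx context.Context) *gorm.DB {
	if s.readDB != nil {
		pingCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
		defer cancel()
		if sqlDB, err := s.readDB.DB(); err == nil && sqlDB.PingContext(pingCtx) == nil {
			return s.readDB.WithContext(ctx)
		}
		// FAIL-SOFT: replica unreachable; route the read to the primary and
		// accept the latency uptick.
	}
	return s.primaryDB.WithContext(ctx)
}
```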
## Redis degradations
Redis is multi-purpose ; impact differs by callsite.
### Master down
| Subsystem | Effect when Redis is gone | Severity |
| --------------------------------- | --------------------------------------------------- | -------- |
| Session storage / refresh tokens | Login / refresh fail ; users log out | **HIGH** |
| Rate limiter (`UserRateLimiter`) | Falls back to in-memory per-pod limits (less coverage but doesn't fail-open in prod) | MEDIUM |
| JWT revocation | Revoked tokens accepted again until the access TTL expires | **SECURITY** (silent failure) |
| Cache (track lookups, feed pages) | Cache miss every read, falls back to Postgres | LOW |
| RabbitMQ-fronted queues | Independent ; Redis only holds metrics for these | NONE |
The middleware doesn't 503 the whole API when Redis is down ; that
would be too restrictive given the cache-miss-only impact on most
routes. Operators should expect a latency uptick (every cache miss
now hits Postgres) but not full unavailability.
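The cache behaviour is the FAIL-SOFT pattern in miniature : treat any Redis error like a miss and let Postgres answer. A hedged sketch with go-redis (names and TTL are illustrative) :
```go
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// GetOrLoad treats a Redis error exactly like a cache miss, so the request
// path never 5xx's just because Redis is gone; the loader (Postgres) is the
// source of truth either way.
func GetOrLoad(ctx context.Context, rdb *redis.Client, key string,
	load func(context.Context) (string, error)) (string, error) {

	val, err := rdb.Get(ctx, key).Result()
	if err == nil {
		return val, nil // cache hit
	}
	// err == redis.Nil is a plain miss; any other error means Redis itself
	// is unhealthy. FAIL-SOFT either way: fall through to the loader instead
	// of surfacing a 5xx. (Warning log for the non-Nil case elided here.)
	val, err = load(ctx)
	if err != nil {
		return "", err
	}
	// Best-effort repopulation; the error is ignored if Redis is still down.
	_ = rdb.Set(ctx, key, val, 10*time.Minute).Err()
	return val, nil
}
```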
Recovery is via Redis Sentinel HA (W3 day 11). When Sentinel
promotes a replica, sessions persist (replication lag < 200 ms in
practice).
### Sentinel quorum lost
Sentinel runs on 3 nodes, quorum=2. If two Sentinel nodes are
unreachable, automatic failover stops working but the master keeps
serving. SEV-2 : the cluster still answers, but a master failure
during this window is not auto-recoverable.
Mitigation : restart the Sentinel nodes one at a time. The master
keeps replicating to the replica throughout. See `redis-down.md`.
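For context, the reason a promotion is transparent to the API : clients are wired through Sentinel and discover the new master themselves. A sketch with go-redis ; the master name and addresses are placeholders, not the production values :
```go
package cache

import "github.com/redis/go-redis/v9"

// NewSessionClient wires the session store through Sentinel: the client asks
// the Sentinels who the current master is, so a promotion is picked up
// without a config change or a restart.
func NewSessionClient() *redis.Client {
	return redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName: "veza-master",
		SentinelAddrs: []string{
			"sentinel-1:26379",
			"sentinel-2:26379",
			"sentinel-3:26379",
		},
		// If quorum is lost (the scenario above), this client keeps talking
		// to the current master; only automatic re-routing on failure is gone.
	})
}
```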
## RabbitMQ degradations
The detailed runbook lives at `rabbitmq-down.md`. Summary : the
user-facing request path doesn't block on RabbitMQ. The backend
publishes a message and returns 202 ; the worker picks it up later.
When RabbitMQ is down :
- Track upload succeeds (S3 write OK), but HLS transcode doesn't
fire ; track stays in `processing` until RabbitMQ recovers.
Playback falls back to direct `/stream` (MP3 range requests).
- Distribution submissions queue silently ; they resurface in the
distribution dashboard as "pending" until the queue drains.
- Email digests miss a tick or two.
- DMCA cache invalidation lags ; the synchronous DB UPDATE that
gates playback is unaffected.
The `internal/eventbus/rabbitmq.go` client retries with exponential
backoff for up to 30 s, then falls into "degraded mode" : publish
returns immediately with a logged warning, the API call succeeds,
and the side-effect is dropped. The dropped events are queryable
via the Sentry filter `tag:eventbus.status=degraded`.
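A sketch of that retry-then-degrade behaviour ; the struct, backoff values and log fields are assumptions, and the real client also handles reconnects and confirms :
```go
package eventbus

import (
	"context"
	"log/slog"
	"sync/atomic"
	"time"
)

// Bus sketches the retry-then-degrade shape described above, nothing more.
type Bus struct {
	publish  func(ctx context.Context, key string, body []byte) error
	degraded atomic.Bool
}

// Publish never returns an error to the caller: the HTTP handler has already
// committed to its 202 by the time this runs.
func (b *Bus) Publish(ctx context.Context, key string, body []byte) {
	if b.degraded.Load() {
		// FAIL-SOFT: degraded mode; the event is dropped and this warning is
		// what the Sentry filter `eventbus.status=degraded` picks up.
		slog.Warn("eventbus degraded, event dropped",
			"routing_key", key, "eventbus.status", "degraded")
		return
	}
	deadline := time.Now().Add(30 * time.Second)
	for backoff := 250 * time.Millisecond; ; backoff *= 2 {
		if err := b.publish(ctx, key, body); err == nil {
			return
		}
		if time.Now().After(deadline) {
			b.degraded.Store(true) // a reconnect loop elsewhere clears this flag
			slog.Warn("eventbus degraded, event dropped",
				"routing_key", key, "eventbus.status", "degraded")
			return
		}
		time.Sleep(backoff)
	}
}
```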
## MinIO / S3 degradations
When `TRACK_STORAGE_BACKEND=s3` (prod default per the v1.0.10
compose fix) and MinIO is down :
- Track upload returns 5xx (the multipart write fails).
- Direct `/stream` returns 502 (the API tries to presign a missing
object).
- HLS playback : segments already on the CDN edge cache keep
serving for ~7 days (segments are content-addressed, the
`Cache-Control: public, max-age=86400, immutable` directive lets
edges keep them past origin downtime).
- Playlists, comments, metadata : unaffected (DB only).
The MinIO distributed cluster (4 nodes, EC:2) tolerates 2 drives
offline. The `MinIODriveOffline` alert fires at 1 drive ; the
`MinIONodesUnreachable` alert pages the on-call at 2 nodes ; that's
the threshold where the next failure causes data unavailability.
Mitigation while down : there's no fallback storage. Communicate the
outage on the status page, focus on restoring MinIO. Tracks
uploaded during the outage are not retryable from the client side ;
the upload session is lost.
## Hyperswitch degradations
When Hyperswitch is unreachable :
- Checkout : the order is created in `pending_payment` state, but
the redirect to the Hyperswitch UI fails. User sees "payment
unavailable" ; their cart is preserved.
- Refund webhook : pending refunds stay in `pending` state indefinitely
until Hyperswitch is back. Operators can manually flip refunds
via admin actions if the outage drags > 24 h.
- Real-money flows : nothing recoverable client-side. Status page
must call this out as SEV-2.
## Stream server degradations
The Rust stream server handles HLS transcoding + segment serving.
When it's down or saturated :
- Existing HLS streams keep serving from the CDN edge cache (see
MinIO §). New streams that need transcoding stall in
`processing`.
- Direct `/stream` (MP3 range requests on the API itself, no stream
server involvement) keeps working — that's the v1.0 fallback
path for any track HLS hasn't materialised for yet.
- The user-visible symptom is "this track won't play" on a fresh
upload. Older tracks that have HLS segments cached at the edge
are unaffected.
## ClamAV degradations
`ENABLE_CLAMAV=true` + `CLAMAV_REQUIRED=true` (prod default) means
upload requests block until ClamAV scans the file. If ClamAV is
unreachable, uploads return 503. SEV-2 — uploads are the highest-
value user action ; users lose work.
Operators can flip `CLAMAV_REQUIRED=false` as an emergency escape
hatch (uploads then go through unscanned). That's a *security*
trade — ClamAV was added explicitly to stop infected file
distribution. Document the timeframe in the incident postmortem
and flip back as soon as ClamAV is back.
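A sketch of the gate with both flags honoured ; handler shape, names and the error message are assumptions, only the 503-when-required behaviour is documented above :
```go
package upload

import (
	"errors"
	"net/http"
	"os"
)

// ErrScannerUnavailable is what the (assumed) ClamAV client returns when the
// daemon cannot be reached, as opposed to a positive "file is infected" hit.
var ErrScannerUnavailable = errors.New("clamav unreachable")

// gateUpload scans when enabled and only lets an unscanned file through when
// the operator has flipped the escape hatch.
func gateUpload(w http.ResponseWriter, data []byte, scan func([]byte) error) bool {
	if os.Getenv("ENABLE_CLAMAV") != "true" {
		return true // scanning disabled outright (not the prod default)
	}
	err := scan(data)
	switch {
	case err == nil:
		return true // clean file
	case errors.Is(err, ErrScannerUnavailable) && os.Getenv("CLAMAV_REQUIRED") != "true":
		// FAIL-SOFT escape hatch: accept the unscanned upload; record the
		// window in the postmortem and re-enable the requirement after.
		return true
	default:
		// FAIL-LOUD (prod default, CLAMAV_REQUIRED=true): 503 while the
		// scanner is unreachable. (Handling of a positive match is elided.)
		http.Error(w, "upload scanning unavailable", http.StatusServiceUnavailable)
		return false
	}
}
```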
## Coturn degradations
Coturn is the TURN relay for WebRTC 1:1 calls (per the v1.0.10
compose addition). Without coturn :
- Calls between two peers on the same NAT segment work (peer-to-peer
hole-punching).
- Calls between two peers behind symmetric NAT (corporate
firewalls, mobile CGNAT) fail silently after ~30 s with
`iceConnectionState=failed`.
- The frontend's `useWebRTC().nat.hasTurn` flag is false ; the
CallButton tooltip warns the user up-front. They see the
warning, the call attempt still happens, the failure is
visible.
This is SEV-3 — only some users are impacted, and those users are
warned before they hit the failure. Restoring coturn fixes new
calls instantly (the SPA refetches `/api/v1/config/webrtc` per
session).
## Elasticsearch — orphan in v1.0
The compose files still declare Elasticsearch but the search code
path uses Postgres FTS. ES being down has zero user impact in v1.0.
Leaving ES in compose is intentional — v1.1 will switch search back
to ES once the index is large enough to make Postgres FTS slow.
If ES is consuming resources during an outage of something else,
stop the container — it won't break anything.
## Health and observability surfaces
The `/api/v1/health/deep` endpoint reports the up/down state of each
dependency. Use it as the canary for any incident triage :
```bash
curl -s https://api.veza.fr/api/v1/health/deep | jq .
```
Sample response shape :
```json
{
  "status": "ok",
  "checks": {
    "db": "ok",
    "redis": "ok",
    "rabbitmq": "ok",
    "s3": "ok",
    "disk": "ok",
    "stream_server": "ok"
  },
  "version": "v1.0.10",
  "uptime_seconds": 12345
}
```
A `degraded` status with the list of failing components keeps the
status page accurate without operators having to ssh into anything.
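A sketch of how such an aggregator can be built ; only the response shape above is authoritative, the handler below assumes per-dependency check functions and elides `version` / `uptime_seconds` :
```go
package health

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

// Check probes one dependency; DeepHandler aggregates the results into the
// status/checks shape shown above.
type Check func(ctx context.Context) error

func DeepHandler(checks map[string]Check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		status := "ok"
		results := make(map[string]string, len(checks))
		for name, check := range checks {
			if err := check(ctx); err != nil {
				results[name] = err.Error()
				status = "degraded" // any failing dependency degrades the surface
			} else {
				results[name] = "ok"
			}
		}

		code := http.StatusOK
		if status != "ok" {
			code = http.StatusServiceUnavailable // what the status page keys on
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(map[string]any{
			"status": status,
			"checks": results,
		})
	}
}
```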
## Adding a new degradation mode
When introducing a new backing service or feature flag :
1. Document the failure mode in this file (which subsystem, what
degrades, what severity).
2. If the service is critical, add a row to `/api/v1/health/deep`.
3. If it has an alert rule, link the runbook in the alert
annotation (per `config/prometheus/alert_rules.yml` convention).
4. Decide whether the failure should fail-loud (return 5xx) or
fail-soft (degrade gracefully). Document the choice in code with
a `// FAIL-SOFT: …` or `// FAIL-LOUD: …` comment so the next
maintainer doesn't second-guess.
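As an illustration of step 4, a hypothetical fallback with the comment convention ; the Elasticsearch-to-Postgres-FTS degradation mirrors the v1.0 behaviour described earlier, the types are made up :
```go
package search

import "context"

// Client stands in for a hypothetical v1.1 Elasticsearch wrapper; it exists
// here only to show the comment convention at a call site.
type Client interface {
	Query(ctx context.Context, q string) ([]string, error)
}

// FindTracks records the degradation decision where it happens, so the next
// maintainer doesn't second-guess it.
func FindTracks(ctx context.Context, es Client,
	pgFTS func(context.Context, string) ([]string, error), q string) ([]string, error) {

	res, err := es.Query(ctx, q)
	if err != nil {
		// FAIL-SOFT: Elasticsearch is optional; degrade to Postgres FTS (the
		// v1.0 behaviour) instead of surfacing a 5xx to the user.
		return pgFTS(ctx, q)
	}
	return res, nil
}
```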