# Runbook — Graceful Degradation

> **Owner**: platform engineering.
> **Purpose**: describe what happens when each backing service is
> down, so an operator can set expectations during an outage and a
> developer knows where the safety nets are.

The principle: **the user-facing request path should keep responding even when secondary services degrade.** Hard failures (login, write operations) trade for partial functionality (read-only, cached responses, queued mutations) wherever the trade is reversible.

## Quick lookup — what breaks if X is down

| Backing service | User-visible impact | Severity | Sub-runbook |
| --------------- | -------------------------------------------- | ---------- | ------------------- |
| Postgres (primary) | All write operations + most reads fail (5xx) | **SEV-1** | `db-failover.md` |
| Postgres (replica) | Read-only routes slower (fall back to primary) | SEV-3 | `db-failover.md` |
| Redis (master) | Sessions invalidated, rate limit goes in-memory | **SEV-1** | `redis-down.md` |
| Redis Sentinel | Failover detection broken, but Redis serves | SEV-2 | `redis-down.md` |
| RabbitMQ | Async jobs queue (transcode, distribution, digest) | SEV-2 | `rabbitmq-down.md` |
| MinIO / S3 | Track upload + signed-URL playback fail | **SEV-1** | (TODO v1.1) |
| Hyperswitch | Checkout fails, refund webhook stalls | SEV-2 | `payment-success-slo-burn.md` |
| Stream server | HLS transcode requests pile up, segment 404s | SEV-2 | `rabbitmq-down.md` |
| ClamAV | Track upload returns 503 (`CLAMAV_REQUIRED=true`) | SEV-2 | (no dedicated) |
| Coturn | WebRTC 1:1 calls fail behind symmetric NAT | SEV-3 | (no dedicated) |
| Elasticsearch | (orphan in v1.0 — search uses Postgres FTS) | SEV-3 | n/a |
| OpenSMTPD | Email digest + verification mails queue | SEV-3 | (no dedicated) |

## Postgres degradations

### Primary down (sync + async writes)

The API depends on Postgres for every persistent operation.
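During this failure mode, the first triage step is usually confirming that `db` is the failing check in `/api/v1/health/deep` (see the observability section). A minimal offline sketch of the filter — the payload below is illustrative; against a live stack you would pipe `curl -s https://api.veza.fr/api/v1/health/deep` into the same `jq`:

```shell
# Illustrative payload; a real one comes from the deep health endpoint.
payload='{"status":"degraded","checks":{"db":"down","redis":"ok","rabbitmq":"ok","s3":"ok"}}'

# List every check that is not "ok" — during a primary outage this prints "db".
failing=$(printf '%s' "$payload" | jq -r '.checks | to_entries[] | select(.value != "ok") | .key')
echo "failing checks: $failing"
```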
The backend handler middleware ([`internal/middleware/db_health.go`]) short-circuits incoming requests with 503 when the connection pool can't acquire a connection within 1 s. This protects against the "requests pile up while connecting" failure mode that bleeds memory.

What still works while Postgres is down:

- The `/api/v1/health` endpoint responds 200 (it doesn't touch the DB).
- The `/api/v1/health/deep` endpoint responds 503 with the failed component listed (the canary the status page reads).
- Static assets (frontend SPA) still serve from the HAProxy cache.
- WebSocket connections that don't read the DB stay open; in practice that's almost none.

What fails immediately:

- Login / refresh / register: 503.
- Any read or write on `/api/v1/*`.
- Cached reads in Redis stay readable, but every cache miss falls through to the DB and 5xx's there.

Recovery path: pg_auto_failover promotes the standby (RTO < 60 s when sync replication holds). The frontend retries on 503 with backoff, so users see ~1 min of "service unavailable" and then the app comes back. See `db-failover.md` for the operational steps.

### Replica down (read replica, optional)

When a read replica is configured (`READ_DATABASE_URL`), the [`internal/database`] package routes read-only queries to it. [`TrackService.forRead()`] is the canonical example. If the replica is unreachable, GORM logs a connection error and the `forRead()` helper falls back to the primary.

User-visible impact: none, beyond the latency uptick from the primary picking up the read load.

Replica downtime is SEV-3 — the cluster keeps serving — but should still be investigated within a business day to restore read scaling.

## Redis degradations

Redis is multi-purpose; impact differs by callsite.
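For first-pass triage, a reachability probe that summarises the expected degradation helps. This is a hedged sketch, not project tooling: `REDIS_HOST` is an assumption about the deployment, and the probe degrades cleanly when `redis-cli` itself is unavailable:

```shell
# Probe Redis and summarise the expected blast radius on failure.
# REDIS_HOST is a deployment assumption; default to localhost for a dev stack.
if redis-cli -h "${REDIS_HOST:-localhost}" ping >/dev/null 2>&1; then
  state="up"
else
  state="down - expect logouts, in-memory rate limits, cold cache"
fi
echo "redis: $state"
```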
### Master down

| Subsystem | Effect when Redis is gone | Severity |
| --------------------------------- | --------------------------------------------------- | -------- |
| Session storage / refresh tokens | Login / refresh fail — users log out | **HIGH** |
| Rate limiter (`UserRateLimiter`) | Falls back to in-memory per-pod limits (less coverage, but doesn't fail open in prod) | MEDIUM |
| JWT revocation | Revoked tokens accepted again until the access TTL expires | **SECURITY** — silent failure |
| Cache (track lookups, feed pages) | Cache miss on every read, falls back to Postgres | LOW |
| RabbitMQ-fronted queues | Independent — Redis only stores metrics for these | NONE |

The middleware doesn't 503 the whole API when Redis is down — that would be too restrictive given the cache-miss-only impact on most routes. Operators should expect a latency uptick (cold cache, so every read hits Postgres) but not full unavailability.

Recovery is via Redis Sentinel HA (W3 day 11). When Sentinel promotes a replica, sessions persist (replication lag < 200 ms in practice).

### Sentinel quorum lost

Sentinel runs on 3 nodes with quorum=2. If two Sentinel nodes are unreachable, automatic failover stops working, but the master keeps serving. SEV-2 — the cluster still answers, but a master failure during this window is not auto-recoverable.

Mitigation: restart the Sentinel nodes one at a time. The master keeps replicating to the replica throughout. See `redis-down.md`.

## RabbitMQ degradations

The detailed runbook lives at `rabbitmq-down.md`. Summary: the user-facing request path doesn't block on RabbitMQ. The backend publishes a message and returns 202; the worker picks it up later.

When RabbitMQ is down:

- Track upload succeeds (the S3 write is fine), but the HLS transcode doesn't fire; the track stays in `processing` until RabbitMQ recovers. Playback falls back to direct `/stream` (MP3 range requests).
- Distribution submissions queue silently; they resurface in the distribution dashboard as "pending" until drained.
- Email digests miss a tick or two.
- DMCA cache invalidation lags; the synchronous DB UPDATE that gates playback is unaffected.

The `internal/eventbus/rabbitmq.go` client retries with exponential backoff up to 30 s, then falls into "degraded mode" — publish returns immediately with a logged warning, the API call succeeds, and the side effect is dropped. The dropped events are queryable via the Sentry filter `tag:eventbus.status=degraded`.

## MinIO / S3 degradations

When `TRACK_STORAGE_BACKEND=s3` (the prod default per the v1.0.10 compose fix) and MinIO is down:

- Track upload returns 5xx (the multipart write fails).
- Direct `/stream` returns 502 (the API tries to presign a missing object).
- HLS playback: segments already on the CDN edge cache keep serving for ~7 days (segments are content-addressed; the `Cache-Control: public, max-age=86400, immutable` directive lets edges keep them past origin downtime).
- Playlists, comments, metadata: unaffected (DB only).

The MinIO distributed cluster (4 nodes, EC:2) tolerates 2 drives offline. The `MinIODriveOffline` alert fires at 1 drive; the `MinIONodesUnreachable` alert pages on-call at 2 nodes — that's the threshold where the next failure causes data unavailability.

Mitigation while down: there's no fallback storage. Communicate the outage on the status page and focus on restoring MinIO. Tracks uploaded during the outage are not retryable from the client side — the upload session is lost.

## Hyperswitch degradations

When Hyperswitch is unreachable:

- Checkout: the order is created in `pending_payment` state, but the redirect to the Hyperswitch UI fails. The user sees "payment unavailable"; their cart is preserved.
- Refund webhook: pending refunds stay in `pending` state indefinitely until Hyperswitch is back. Operators can manually flip refunds via admin actions if the outage drags past 24 h.
- Real-money flows: nothing is recoverable client-side. The status page must call this out as SEV-2.
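The backoff-then-degrade behaviour of the eventbus client (RabbitMQ section above) can be sketched offline. The doubling delays below are an assumption — the source only states "exponential backoff up to 30 s", not the exact schedule:

```shell
# Hypothetical retry schedule: double the delay each attempt until the
# cumulative wait passes 30 s, then give up and enter degraded mode.
delay=1
total=0
schedule=""
while [ "$total" -lt 30 ]; do
  schedule="$schedule $delay"
  total=$((total + delay))
  delay=$((delay * 2))
done
echo "retry delays (s):$schedule -> degraded mode"
```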
## Stream server degradations

The Rust stream server handles HLS transcoding + segment serving. When it's down or saturated:

- Existing HLS streams keep serving from the CDN edge cache (see the MinIO section). New streams that need transcoding stall in `processing`.
- Direct `/stream` (MP3 range requests served by the API itself, no stream-server involvement) keeps working — that's the v1.0 fallback path for any track HLS hasn't materialised for yet.
- The user-visible symptom is "this track won't play" on a fresh upload. Older tracks that have HLS segments cached at the edge are unaffected.

## ClamAV degradations

`ENABLE_CLAMAV=true` + `CLAMAV_REQUIRED=true` (the prod default) means upload requests block until ClamAV scans the file. If ClamAV is unreachable, uploads return 503. SEV-2 — uploads are the highest-value user action; users lose work.

Operators can flip `CLAMAV_REQUIRED=false` as an emergency escape hatch (uploads then go through unscanned). That's a *security* trade — ClamAV was added explicitly to stop infected file distribution. Document the timeframe in the incident postmortem and flip back as soon as ClamAV is back.

## Coturn degradations

Coturn provides TURN relay for WebRTC 1:1 calls (per the v1.0.10 compose addition). Without coturn:

- Calls between two peers on the same NAT segment work (peer-to-peer hole punching).
- Calls between two peers behind symmetric NAT (corporate firewalls, mobile CGNAT) fail after ~30 s with `iceConnectionState=failed`.
- The frontend's `useWebRTC().nat.hasTurn` flag is false; the CallButton tooltip warns the user up-front. They see the warning, the call attempt still happens, and the failure is visible.

This is SEV-3 — only some users are impacted, and those users are warned before they hit the failure. Restoring coturn fixes new calls instantly (the SPA refetches `/api/v1/config/webrtc` per session).

## Elasticsearch — orphan in v1.0

The compose files still declare Elasticsearch, but the search code path uses Postgres FTS. ES being down has zero user impact in v1.0.
Leaving ES in compose is intentional — v1.1 will switch search back to ES once the index is large enough to make Postgres FTS slow. If ES is consuming resources during an outage of something else, stop the container — it won't break anything.

## Health and observability surfaces

The `/api/v1/health/deep` endpoint reports the up/down state of each dependency. Use it as the canary for any incident triage:

```bash
curl -s https://api.veza.fr/api/v1/health/deep | jq .
```

Sample response shape:

```json
{
  "status": "ok",
  "checks": {
    "db": "ok",
    "redis": "ok",
    "rabbitmq": "ok",
    "s3": "ok",
    "disk": "ok",
    "stream_server": "ok"
  },
  "version": "v1.0.10",
  "uptime_seconds": 12345
}
```

A `degraded` status with the list of failing components keeps the status page accurate without operators having to ssh into anything.

## Adding a new degradation mode

When introducing a new backing service or feature flag:

1. Document the failure mode in this file (which subsystem, what degrades, what severity).
2. If the service is critical, add a check for it to `/api/v1/health/deep`.
3. If it has an alert rule, link the runbook in the alert annotation (per the `config/prometheus/alert_rules.yml` convention).
4. Decide whether the failure should fail loud (return 5xx) or fail soft (degrade gracefully). Document the choice in code with a `// FAIL-SOFT: …` or `// FAIL-LOUD: …` comment so the next maintainer doesn't second-guess.
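Step 4's comment convention makes the decisions greppable. A self-contained demo — the sample file below is a stand-in for the real `internal/` tree:

```shell
# Create a stand-in source tree with one documented decision, then audit it.
tmp=$(mktemp -d)
cat > "$tmp/cache.go" <<'EOF'
// FAIL-SOFT: on Redis loss, cache reads fall through to Postgres; never 5xx.
EOF

# Against the real repo: grep -rnE "FAIL-(SOFT|LOUD):" internal/
hits=$(grep -rhE "FAIL-(SOFT|LOUD):" "$tmp" | wc -l | tr -d ' ')
echo "documented failure-mode decisions: $hits"
rm -rf "$tmp"
```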