# Runbook — Graceful Degradation
> **Owner** : platform engineering.
> **Purpose** : describe what happens when each backing service is
> down, so an operator can set expectations during an outage and a
> developer knows where the safety nets are.
The principle : **the user-facing request path should keep responding
even when secondary services degrade.** Hard failures (login, write
operations) trade for partial functionality (read-only, cached
responses, queued mutations) wherever the trade is reversible.
## Quick lookup — what breaks if X is down
| Backing service | User-visible impact | Severity | Sub-runbook |
| --------------- | -------------------------------------------- | ---------- | ------------------- |
| Postgres (primary) | All write operations + most reads fail (5xx) | **SEV-1** | `db-failover.md` |
| Postgres (replica) | Read-only routes slower (fall back to primary) | SEV-3 | `db-failover.md` |
| Redis (master) | Sessions invalidated, rate-limit goes in-memory | **SEV-1** | `redis-down.md` |
| Redis Sentinel | Failover detection broken, but Redis serves | SEV-2 | `redis-down.md` |
| RabbitMQ | Async jobs queue (transcode, distribution, digest) | SEV-2 | `rabbitmq-down.md` |
| MinIO / S3 | Track upload + signed-URL playback fail | **SEV-1** | (TODO v1.1) |
| Hyperswitch | Checkout fails, refund webhook stalls | SEV-2 | `payment-success-slo-burn.md` |
| Stream server | HLS transcode requests pile up, segment 404s | SEV-2 | `rabbitmq-down.md` |
| ClamAV | Track upload returns 503 (CLAMAV_REQUIRED=true) | SEV-2 | (no dedicated) |
| Coturn | WebRTC 1:1 calls fail behind symmetric NAT | SEV-3 | (no dedicated) |
| Elasticsearch | (orphan in v1.0 — search uses Postgres FTS) | SEV-3 | n/a |
| OpenSMTPD | Email digest + verification mails queue | SEV-3 | (no dedicated) |
## Postgres degradations
### Primary down (sync + async writes)
The API depends on Postgres for every persistent operation. The
backend handler middleware ([`internal/middleware/db_health.go`])
short-circuits incoming requests with 503 when the connection pool
can't acquire a connection within 1 s. This protects from the
"requests pile up while connecting" failure mode that bleeds memory.
What still works while Postgres is down :
- The `/api/v1/health` endpoint responds 200 (it doesn't touch DB).
- The `/api/v1/health/deep` endpoint responds 503 with the failed
component listed (the canary the status page reads).
- Static assets (frontend SPA) still serve from the HAProxy cache.
- WebSocket connections that don't read the DB stay open ; in
practice that's almost none.
What fails immediately :
- Login / refresh / register : 503.
- Any read or write on `/api/v1/*`.
- Cached reads in Redis stay readable but every cache miss falls
through to the DB and 5xx's there.
Recovery path : pg_auto_failover promotes the standby (RTO < 60 s
when sync replication holds). Frontend retries on 503 with backoff,
so users see ~1 min of "service unavailable" then the app comes
back. See `db-failover.md` for the operational steps.
### Replica down (read replica, optional)
When a read replica is configured (`READ_DATABASE_URL`), the
[`internal/database`] package routes read-only queries to it.
[`TrackService.forRead()`] is the canonical example.
If the replica is unreachable, GORM logs a connection error and the
forRead() helper falls back to the primary. User-visible impact :
none, beyond the latency uptick from the primary picking up read
load. Replica downtime is SEV-3 : the cluster keeps serving, but it
should still be investigated within a business day to restore
read scaling.
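A sketch of the fallback shape, assuming GORM v2 ; the struct fields and the 500 ms liveness ping are assumptions, only the fall-back-to-primary behaviour is documented above :
```go
package tracks

import (
	"context"
	"time"

	"gorm.io/gorm"
)

// TrackService fields are trimmed to what this sketch needs.
type TrackService struct {
	primaryDB *gorm.DB // always configured
	readDB    *gorm.DB // nil when READ_DATABASE_URL is unset
}

// forRead prefers the replica but silently falls back to the primary when
// the replica does not answer a quick ping.
func (s *TrackService) forRead(ctx context.Context) *gorm.DB {
	if s.readDB != nil {
		pingCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
		defer cancel()
		if sqlDB, err := s.readDB.DB(); err == nil && sqlDB.PingContext(pingCtx) == nil {
			return s.readDB.WithContext(ctx)
		}
		// FAIL-SOFT: replica unreachable; route the read to the primary and
		// accept the latency uptick.
	}
	return s.primaryDB.WithContext(ctx)
}
```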
## Redis degradations
Redis is multi-purpose ; impact differs by callsite.
### Master down
| Subsystem | Effect when Redis is gone | Severity |
| --------------------------------- | --------------------------------------------------- | -------- |
| Session storage / refresh tokens | Login / refresh fail ; users log out | **HIGH** |
| Rate limiter (`UserRateLimiter`) | Falls back to in-memory per-pod limits (less coverage but doesn't fail-open in prod) | MEDIUM |
| JWT revocation | Revoked tokens accepted again until the access TTL expires | **SECURITY** (silent failure) |
| Cache (track lookups, feed pages) | Cache miss every read, falls back to Postgres | LOW |
| RabbitMQ-fronted queues | Independent ; Redis only holds metrics for these | NONE |
The middleware doesn't 503 the whole API when Redis is down ; that
would be too restrictive given the cache-miss-only impact on most
routes. Operators should expect a latency uptick (every cache miss
now hits Postgres) but not full unavailability.
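The cache behaviour is the FAIL-SOFT pattern in miniature : treat any Redis error like a miss and let Postgres answer. A hedged sketch with go-redis (names and TTL are illustrative) :
```go
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// GetOrLoad treats a Redis error exactly like a cache miss, so the request
// path never 5xx's just because Redis is gone; the loader (Postgres) is the
// source of truth either way.
func GetOrLoad(ctx context.Context, rdb *redis.Client, key string,
	load func(context.Context) (string, error)) (string, error) {

	val, err := rdb.Get(ctx, key).Result()
	if err == nil {
		return val, nil // cache hit
	}
	// err == redis.Nil is a plain miss; any other error means Redis itself
	// is unhealthy. FAIL-SOFT either way: fall through to the loader instead
	// of surfacing a 5xx. (Warning log for the non-Nil case elided here.)
	val, err = load(ctx)
	if err != nil {
		return "", err
	}
	// Best-effort repopulation; the error is ignored if Redis is still down.
	_ = rdb.Set(ctx, key, val, 10*time.Minute).Err()
	return val, nil
}
```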
Recovery is via Redis Sentinel HA (W3 day 11). When Sentinel
promotes a replica, sessions persist (replication lag < 200 ms in
practice).
### Sentinel quorum lost
Sentinel runs on 3 nodes, quorum=2. If two Sentinel nodes are
unreachable, automatic failover stops working but the master keeps
serving. SEV-2 : the cluster still answers, but a master failure
during this window is not auto-recoverable.
Mitigation : restart the Sentinel nodes one at a time. The master
keeps replicating to the replica throughout. See `redis-down.md`.
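For context, the reason a promotion is transparent to the API : clients are wired through Sentinel and discover the new master themselves. A sketch with go-redis ; the master name and addresses are placeholders, not the production values :
```go
package cache

import "github.com/redis/go-redis/v9"

// NewSessionClient wires the session store through Sentinel: the client asks
// the Sentinels who the current master is, so a promotion is picked up
// without a config change or a restart.
func NewSessionClient() *redis.Client {
	return redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName: "veza-master",
		SentinelAddrs: []string{
			"sentinel-1:26379",
			"sentinel-2:26379",
			"sentinel-3:26379",
		},
		// If quorum is lost (the scenario above), this client keeps talking
		// to the current master; only automatic re-routing on failure is gone.
	})
}
```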
## RabbitMQ degradations
The detailed runbook lives at `rabbitmq-down.md`. Summary : the
user-facing request path doesn't block on RabbitMQ. The backend
publishes a message and returns 202 ; the worker picks it up later.
When RabbitMQ is down :
- Track upload succeeds (S3 write OK), but HLS transcode doesn't
fire ; track stays in `processing` until RabbitMQ recovers.
Playback falls back to direct `/stream` (MP3 range requests).
- Distribution submissions queue silently ; they resurface in the
distribution dashboard as "pending" until the queue drains.
- Email digests miss a tick or two.
- DMCA cache invalidation lags ; the synchronous DB UPDATE that
gates playback is unaffected.
The `internal/eventbus/rabbitmq.go` client retries with exponential
backoff for up to 30 s, then falls into "degraded mode" : publish
returns immediately with a logged warning, the API call succeeds,
and the side-effect is dropped. The dropped events are queryable
via the Sentry filter `tag:eventbus.status=degraded`.
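A sketch of that retry-then-degrade behaviour ; the struct, backoff values and log fields are assumptions, and the real client also handles reconnects and confirms :
```go
package eventbus

import (
	"context"
	"log/slog"
	"sync/atomic"
	"time"
)

// Bus sketches the retry-then-degrade shape described above, nothing more.
type Bus struct {
	publish  func(ctx context.Context, key string, body []byte) error
	degraded atomic.Bool
}

// Publish never returns an error to the caller: the HTTP handler has already
// committed to its 202 by the time this runs.
func (b *Bus) Publish(ctx context.Context, key string, body []byte) {
	if b.degraded.Load() {
		// FAIL-SOFT: degraded mode; the event is dropped and this warning is
		// what the Sentry filter `eventbus.status=degraded` picks up.
		slog.Warn("eventbus degraded, event dropped",
			"routing_key", key, "eventbus.status", "degraded")
		return
	}
	deadline := time.Now().Add(30 * time.Second)
	for backoff := 250 * time.Millisecond; ; backoff *= 2 {
		if err := b.publish(ctx, key, body); err == nil {
			return
		}
		if time.Now().After(deadline) {
			b.degraded.Store(true) // a reconnect loop elsewhere clears this flag
			slog.Warn("eventbus degraded, event dropped",
				"routing_key", key, "eventbus.status", "degraded")
			return
		}
		time.Sleep(backoff)
	}
}
```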
## MinIO / S3 degradations
When `TRACK_STORAGE_BACKEND=s3` (prod default per the v1.0.10
compose fix) and MinIO is down :
- Track upload returns 5xx (the multipart write fails).
- Direct `/stream` returns 502 (the API tries to presign a missing
object).
- HLS playback : segments already on the CDN edge cache keep
serving for ~7 days (segments are content-addressed, the
`Cache-Control: public, max-age=86400, immutable` directive lets
edges keep them past origin downtime).
- Playlists, comments, metadata : unaffected (DB only).
The MinIO distributed cluster (4 nodes, EC:2) tolerates 2 drives
offline. The `MinIODriveOffline` alert fires at 1 drive ; the
`MinIONodesUnreachable` alert pages the on-call at 2 nodes ; that's
the threshold where the next failure causes data unavailability.
Mitigation while down : there's no fallback storage. Communicate the
outage on the status page, focus on restoring MinIO. Tracks
uploaded during the outage are not retryable from the client side ;
the upload session is lost.
## Hyperswitch degradations
When Hyperswitch is unreachable :
- Checkout : the order is created in `pending_payment` state, but
the redirect to the Hyperswitch UI fails. User sees "payment
unavailable" ; their cart is preserved.
- Refund webhook : pending refunds stay in `pending` state indefinitely
until Hyperswitch is back. Operators can manually flip refunds
via admin actions if the outage drags > 24 h.
- Real-money flows : nothing recoverable client-side. Status page
must call this out as SEV-2.
## Stream server degradations
The Rust stream server handles HLS transcoding + segment serving.
When it's down or saturated :
- Existing HLS streams keep serving from the CDN edge cache (see
MinIO §). New streams that need transcoding stall in
`processing`.
- Direct `/stream` (MP3 range requests on the API itself, no stream
server involvement) keeps working — that's the v1.0 fallback
path for any track HLS hasn't materialised for yet.
- The user-visible symptom is "this track won't play" on a fresh
upload. Older tracks that have HLS segments cached at the edge
are unaffected.
## ClamAV degradations
`ENABLE_CLAMAV=true` + `CLAMAV_REQUIRED=true` (prod default) means
upload requests block until ClamAV scans the file. If ClamAV is
unreachable, uploads return 503. SEV-2 — uploads are the highest-
value user action ; users lose work.
Operators can flip `CLAMAV_REQUIRED=false` as an emergency escape
hatch (uploads then go through unscanned). That's a *security*
trade — ClamAV was added explicitly to stop infected file
distribution. Document the timeframe in the incident postmortem
and flip back as soon as ClamAV is back.
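A sketch of the gate with both flags honoured ; handler shape, names and the error message are assumptions, only the 503-when-required behaviour is documented above :
```go
package upload

import (
	"errors"
	"net/http"
	"os"
)

// ErrScannerUnavailable is what the (assumed) ClamAV client returns when the
// daemon cannot be reached, as opposed to a positive "file is infected" hit.
var ErrScannerUnavailable = errors.New("clamav unreachable")

// gateUpload scans when enabled and only lets an unscanned file through when
// the operator has flipped the escape hatch.
func gateUpload(w http.ResponseWriter, data []byte, scan func([]byte) error) bool {
	if os.Getenv("ENABLE_CLAMAV") != "true" {
		return true // scanning disabled outright (not the prod default)
	}
	err := scan(data)
	switch {
	case err == nil:
		return true // clean file
	case errors.Is(err, ErrScannerUnavailable) && os.Getenv("CLAMAV_REQUIRED") != "true":
		// FAIL-SOFT escape hatch: accept the unscanned upload; record the
		// window in the postmortem and re-enable the requirement after.
		return true
	default:
		// FAIL-LOUD (prod default, CLAMAV_REQUIRED=true): 503 while the
		// scanner is unreachable. (Handling of a positive match is elided.)
		http.Error(w, "upload scanning unavailable", http.StatusServiceUnavailable)
		return false
	}
}
```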
## Coturn degradations
Coturn is the TURN relay for WebRTC 1:1 calls (per the v1.0.10
compose addition). Without coturn :
- Calls between two peers on the same NAT segment work (peer-to-peer
hole-punching).
- Calls between two peers behind symmetric NAT (corporate
firewalls, mobile CGNAT) fail silently after ~30 s with
`iceConnectionState=failed`.
- The frontend's `useWebRTC().nat.hasTurn` flag is false ; the
CallButton tooltip warns the user up-front. They see the
warning, the call attempt still happens, the failure is
visible.
This is SEV-3 — only some users are impacted, and those users are
warned before they hit the failure. Restoring coturn fixes new
calls instantly (the SPA refetches `/api/v1/config/webrtc` per
session).
## Elasticsearch — orphan in v1.0
The compose files still declare Elasticsearch but the search code
path uses Postgres FTS. ES being down has zero user impact in v1.0.
Leaving ES in compose is intentional — v1.1 will switch search back
to ES once the index is large enough to make Postgres FTS slow.
If ES is consuming resources during an outage of something else,
stop the container — it won't break anything.
## Health and observability surfaces
The `/api/v1/health/deep` endpoint reports the up/down state of each
dependency. Use it as the canary for any incident triage :
```bash
curl -s https://api.veza.fr/api/v1/health/deep | jq .
```
Sample response shape :
```json
{
  "status": "ok",
  "checks": {
    "db": "ok",
    "redis": "ok",
    "rabbitmq": "ok",
    "s3": "ok",
    "disk": "ok",
    "stream_server": "ok"
  },
  "version": "v1.0.10",
  "uptime_seconds": 12345
}
```
A `degraded` status with the list of failing components keeps the
status page accurate without operators having to ssh into anything.
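A sketch of how such an aggregator can be built ; only the response shape above is authoritative, the handler below assumes per-dependency check functions and elides `version` / `uptime_seconds` :
```go
package health

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

// Check probes one dependency; DeepHandler aggregates the results into the
// status/checks shape shown above.
type Check func(ctx context.Context) error

func DeepHandler(checks map[string]Check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		status := "ok"
		results := make(map[string]string, len(checks))
		for name, check := range checks {
			if err := check(ctx); err != nil {
				results[name] = err.Error()
				status = "degraded" // any failing dependency degrades the surface
			} else {
				results[name] = "ok"
			}
		}

		code := http.StatusOK
		if status != "ok" {
			code = http.StatusServiceUnavailable // what the status page keys on
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(map[string]any{
			"status": status,
			"checks": results,
		})
	}
}
```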
## Adding a new degradation mode
When introducing a new backing service or feature flag :
1. Document the failure mode in this file (which subsystem, what
degrades, what severity).
2. If the service is critical, add a row to `/api/v1/health/deep`.
3. If it has an alert rule, link the runbook in the alert
annotation (per `config/prometheus/alert_rules.yml` convention).
4. Decide whether the failure should fail-loud (return 5xx) or
fail-soft (degrade gracefully). Document the choice in code with
a `// FAIL-SOFT: …` or `// FAIL-LOUD: …` comment so the next
maintainer doesn't second-guess.
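As an illustration of step 4, a hypothetical fallback with the comment convention ; the Elasticsearch-to-Postgres-FTS degradation mirrors the v1.0 behaviour described earlier, the types are made up :
```go
package search

import "context"

// Client stands in for a hypothetical v1.1 Elasticsearch wrapper; it exists
// here only to show the comment convention at a call site.
type Client interface {
	Query(ctx context.Context, q string) ([]string, error)
}

// FindTracks records the degradation decision where it happens, so the next
// maintainer doesn't second-guess it.
func FindTracks(ctx context.Context, es Client,
	pgFTS func(context.Context, string) ([]string, error), q string) ([]string, error) {

	res, err := es.Query(ctx, q)
	if err != nil {
		// FAIL-SOFT: Elasticsearch is optional; degrade to Postgres FTS (the
		// v1.0 behaviour) instead of surfacing a 5xx to the user.
		return pgFTS(ctx, q)
	}
	return res, nil
}
```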