Runbook — Graceful Degradation

Owner : platform engineering. Purpose : describe what happens when each backing service is down, so an operator can set expectations during an outage and a developer knows where the safety nets are.

The principle : the user-facing request path should keep responding even when secondary services degrade. Hard failures (login, write operations) trade for partial functionality (read-only, cached responses, queued mutations) wherever the trade is reversible.

Quick lookup — what breaks if X is down

| Backing service | User-visible impact | Severity | Sub-runbook |
|---|---|---|---|
| Postgres (primary) | All write operations + most reads fail (5xx) | SEV-1 | db-failover.md |
| Postgres (replica) | Read-only routes slower (fall back to primary) | SEV-3 | db-failover.md |
| Redis (master) | Sessions invalidated, rate-limit goes in-memory | SEV-1 | redis-down.md |
| Redis Sentinel | Failover detection broken, but Redis serves | SEV-2 | redis-down.md |
| RabbitMQ | Async jobs queue (transcode, distribution, digest) | SEV-2 | rabbitmq-down.md |
| MinIO / S3 | Track upload + signed-URL playback fail | SEV-1 | (TODO v1.1) |
| Hyperswitch | Checkout fails, refund webhook stalls | SEV-2 | payment-success-slo-burn.md |
| Stream server | HLS transcode requests pile up, segment 404s | SEV-2 | rabbitmq-down.md |
| ClamAV | Track upload returns 503 (CLAMAV_REQUIRED=true) | SEV-2 | (no dedicated) |
| Coturn | WebRTC 1:1 calls fail behind symmetric NAT | SEV-3 | (no dedicated) |
| Elasticsearch | (orphan in v1.0 — search uses Postgres FTS) | SEV-3 | n/a |
| OpenSMTPD | Email digest + verification mails queue | SEV-3 | (no dedicated) |

Postgres degradations

Primary down (sync + async writes)

The API depends on Postgres for every persistent operation. The backend handler middleware ([internal/middleware/db_health.go]) short-circuits incoming requests with 503 when the connection pool can't acquire a connection within 1 s. This protects against the "requests pile up while connecting" failure mode that bleeds memory.
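
A minimal sketch of what that guard can look like, assuming a plain net/http middleware over a *sql.DB pool (illustrative only ; the real db_health.go wiring may differ) :

```go
// Hypothetical sketch of the DB-health guard (FAIL-LOUD): reject with 503
// instead of letting requests queue on a dead pool and bleed memory.
package middleware

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

// DBHealth returns a middleware that refuses requests when the pool
// cannot confirm a connection within the budget (e.g. 1 second).
func DBHealth(db *sql.DB, budget time.Duration) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			ctx, cancel := context.WithTimeout(r.Context(), budget)
			defer cancel()
			if err := db.PingContext(ctx); err != nil {
				http.Error(w, "database unavailable", http.StatusServiceUnavailable)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}
```

The shallow /api/v1/health route sits outside this guard, which is why it keeps answering 200 below.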

What still works while Postgres is down :

  • The /api/v1/health endpoint responds 200 (it doesn't touch DB).
  • The /api/v1/health/deep endpoint responds 503 with the failed component listed (the canary the status page reads).
  • Static assets (frontend SPA) still serve from the HAProxy cache.
  • WebSocket connections that don't read the DB stay open ; in practice that's almost none.

What fails immediately :

  • Login / refresh / register : 503.
  • Any read or write on /api/v1/*.
  • Cached reads in Redis stay readable, but every cache miss falls through to the DB and returns a 5xx there.

Recovery path : pg_auto_failover promotes the standby (RTO < 60 s when sync replication holds). Frontend retries on 503 with backoff, so users see ~1 min of "service unavailable" then the app comes back. See db-failover.md for the operational steps.

Replica down (read replica, optional)

When a read replica is configured (READ_DATABASE_URL), the [internal/database] package routes read-only queries to it. [TrackService.forRead()] is the canonical example.

If the replica is unreachable, GORM logs a connection error and the forRead() helper falls back to the primary. User-visible impact : none, beyond the latency uptick from the primary picking up read load. Replica downtime is SEV-3 — the cluster keeps serving — but should still be investigated within a business day to restore read scaling.
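
A sketch of that fallback shape, assuming GORM (names are illustrative, and a real implementation would likely cache the replica's health rather than ping it on every call) :

```go
// Illustrative shape of a forRead() fallback (FAIL-SOFT), not the actual
// internal/database code.
package database

import (
	"context"
	"time"

	"gorm.io/gorm"
)

type DB struct {
	Primary *gorm.DB
	Replica *gorm.DB // nil when READ_DATABASE_URL is unset
}

// ForRead prefers the read replica and silently falls back to the primary
// when the replica is missing or unreachable, so read-only queries never
// fail just because read scaling is gone.
func (d *DB) ForRead(ctx context.Context) *gorm.DB {
	if d.Replica == nil {
		return d.Primary.WithContext(ctx)
	}
	checkCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()
	sqlDB, err := d.Replica.DB()
	if err != nil || sqlDB.PingContext(checkCtx) != nil {
		return d.Primary.WithContext(ctx) // FAIL-SOFT: degrade to primary
	}
	return d.Replica.WithContext(ctx)
}
```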

Redis degradations

Redis is multi-purpose ; impact differs by callsite.

Master down

| Subsystem | Effect when Redis is gone | Severity |
|---|---|---|
| Session storage / refresh tokens | Login / refresh fail — users log out | HIGH |
| Rate limiter (UserRateLimiter) | Falls back to in-memory per-pod limits (less coverage but doesn't fail-open in prod) | MEDIUM |
| JWT revocation | Revoked tokens accepted again until access TTL | SECURITY — silent failure |
| Cache (track lookups, feed pages) | Cache miss every read, falls back to Postgres | LOW |
| RabbitMQ-fronted queues | Independent — Redis is just metrics for these | NONE |

The middleware doesn't 503 the whole API when Redis is down — that would be too restrictive given the cache-miss-only impact on most routes. Operators should expect a latency uptick (every cache miss now hits Postgres) but not full unavailability.
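
The FAIL-SOFT cache path looks roughly like this, assuming go-redis v9 and invented names :

```go
// Sketch of the FAIL-SOFT cache read: any Redis error is treated as a
// cache miss and the request falls through to Postgres instead of 5xx'ing.
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

type TrackCache struct {
	rdb  *redis.Client
	load func(ctx context.Context, id string) ([]byte, error) // Postgres loader
}

func (c *TrackCache) Get(ctx context.Context, id string) ([]byte, error) {
	key := "track:" + id
	if val, err := c.rdb.Get(ctx, key).Bytes(); err == nil {
		return val, nil // cache hit
	}
	// Cache miss or Redis down: same path, only slower.
	val, err := c.load(ctx, id)
	if err != nil {
		return nil, err
	}
	// Best-effort write-back; ignore the result if Redis is still down.
	c.rdb.Set(ctx, key, val, 10*time.Minute)
	return val, nil
}
```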

Recovery is via Redis Sentinel HA (W3 day 11). When Sentinel promotes a replica, sessions persist (replication lag < 200 ms in practice).

Sentinel quorum lost

Sentinel runs on 3 nodes with quorum=2. If two Sentinel nodes are unreachable, automatic failover stops working but the master keeps serving. SEV-2 — the cluster still answers, but a master failure during this window is not auto-recoverable.

Mitigation : restart the Sentinel nodes one at a time. The master keeps replicating to the replica throughout. See redis-down.md.

RabbitMQ degradations

The detailed runbook lives at rabbitmq-down.md. Summary : the user-facing request path doesn't block on RabbitMQ. The backend publishes a message and returns 202 ; the worker picks it up later.

When RabbitMQ is down :

  • Track upload succeeds (S3 write OK), but HLS transcode doesn't fire ; track stays in processing until RabbitMQ recovers. Playback falls back to direct /stream (MP3 range requests).
  • Distribution submissions queue silently ; they resurface in the distribution dashboard as "pending" until the backlog drains.
  • Email digests miss a tick or two.
  • DMCA cache invalidation lags ; the synchronous DB UPDATE that gates playback is unaffected.

The internal/eventbus/rabbitmq.go client retries with exponential backoff up to 30 s, then falls into "degraded mode" — publish returns immediately with a logged warning, the API call succeeds, the side-effect is dropped. The dropped events are queryable via Sentry filter tag:eventbus.status=degraded.
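
The retry-then-degrade behaviour, sketched with invented names (the real client wraps an AMQP connection and differs in detail) :

```go
// Illustrative shape of the retry-then-degrade behaviour (FAIL-SOFT).
package eventbus

import (
	"context"
	"log/slog"
	"sync/atomic"
	"time"
)

type Bus struct {
	publish   func(ctx context.Context, key string, body []byte) error // real AMQP publish
	reconnect func() error
	degraded  atomic.Bool
}

// connectLoop retries the broker with exponential backoff; after ~30 s of
// failures it flips the bus into degraded mode so request handlers stop waiting.
func (b *Bus) connectLoop() {
	backoff := 500 * time.Millisecond
	start := time.Now()
	for b.reconnect() != nil {
		if time.Since(start) > 30*time.Second {
			b.degraded.Store(true)
		}
		time.Sleep(backoff)
		if backoff < 8*time.Second {
			backoff *= 2
		}
	}
	b.degraded.Store(false)
}

// Publish never fails the calling API handler: in degraded mode the event
// is logged and dropped (the Sentry tag mentioned above).
func (b *Bus) Publish(ctx context.Context, key string, body []byte) error {
	if b.degraded.Load() {
		slog.Warn("eventbus degraded, dropping event", "routing_key", key)
		return nil
	}
	if err := b.publish(ctx, key, body); err != nil {
		slog.Warn("eventbus publish failed, dropping event", "routing_key", key, "err", err)
	}
	return nil
}
```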

MinIO / S3 degradations

When TRACK_STORAGE_BACKEND=s3 (prod default per the v1.0.10 compose fix) and MinIO is down :

  • Track upload returns 5xx (the multipart write fails).
  • Direct /stream returns 502 (the API tries to presign a missing object).
  • HLS playback : segments already on the CDN edge cache keep serving for ~7 days (segments are content-addressed, and the Cache-Control: public, max-age=86400, immutable directive lets edges keep them past origin downtime ; see the sketch after this list).
  • Playlists, comments, metadata : unaffected (DB only).
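
The caching that keeps playback alive is just a response header on segment delivery. A minimal Go illustration of the directive (actual segment delivery goes through the Rust stream server and MinIO, so the handler and path mapping here are assumptions) :

```go
// Sketch of the caching directive only; handler name and path mapping invented.
package stream

import "net/http"

func serveSegment(w http.ResponseWriter, r *http.Request) {
	// Segments are content-addressed, so the bytes never change: edges may
	// keep serving them even when the origin is unreachable.
	w.Header().Set("Cache-Control", "public, max-age=86400, immutable")
	w.Header().Set("Content-Type", "video/mp2t")
	http.ServeFile(w, r, segmentPathFromURL(r.URL.Path))
}

// segmentPathFromURL maps the request path to a local copy of the object;
// stubbed for the sketch.
func segmentPathFromURL(p string) string { return "/var/cache/hls" + p }
```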

The MinIO distributed cluster (4 nodes, EC:2) tolerates 2 drives offline. The MinIODriveOffline alert fires at 1 drive ; the MinIONodesUnreachable alert pages the on-call at 2 nodes — that's the threshold where the next failure causes data unavailability.

Mitigation while down : there's no fallback storage. Communicate the outage on the status page, focus on restoring MinIO. Tracks uploaded during the outage are not retryable from the client side — the upload session is lost.

Hyperswitch degradations

When Hyperswitch is unreachable :

  • Checkout : the order is created in pending_payment state, but the redirect to the Hyperswitch UI fails. The user sees "payment unavailable" ; their cart is preserved (sketched after this list).
  • Refund webhook : pending refunds stay in pending state indefinitely until Hyperswitch is back. Operators can manually flip refunds via admin actions if the outage drags > 24 h.
  • Real-money flows : nothing recoverable client-side. Status page must call this out as SEV-2.
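
A sketch of the checkout ordering that makes this degradation possible : the order row is written before the gateway call, so a Hyperswitch failure only loses the redirect, not the cart. Service and type names are invented :

```go
// Sketch only; names invented for the example.
package checkout

import (
	"context"
	"errors"
)

var ErrPaymentUnavailable = errors.New("payment unavailable")

type OrderStore interface {
	CreatePending(ctx context.Context, cartID string) (orderID string, err error)
}

type PaymentGateway interface {
	CreateSession(ctx context.Context, orderID string) (redirectURL string, err error)
}

type Service struct {
	orders   OrderStore
	payments PaymentGateway // Hyperswitch client behind an interface
}

func (s *Service) Checkout(ctx context.Context, cartID string) (string, error) {
	orderID, err := s.orders.CreatePending(ctx, cartID) // pending_payment state
	if err != nil {
		return "", err // FAIL-LOUD: a DB failure has no graceful answer here
	}
	url, err := s.payments.CreateSession(ctx, orderID)
	if err != nil {
		// FAIL-SOFT: order stays pending_payment, cart is preserved,
		// the frontend renders "payment unavailable".
		return "", ErrPaymentUnavailable
	}
	return url, nil
}
```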

Stream server degradations

The Rust stream server handles HLS transcoding + segment serving. When it's down or saturated :

  • Existing HLS streams keep serving from the CDN edge cache (see the MinIO / S3 section above). New streams that need transcoding stall in processing.
  • Direct /stream (MP3 range requests on the API itself, no stream server involvement) keeps working — that's the v1.0 fallback path for any track HLS hasn't materialised for yet (see the sketch after this list).
  • The user-visible symptom is "this track won't play" on a fresh upload. Older tracks that have HLS segments cached at the edge are unaffected.
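
A sketch of that playback-source decision (types and URL shapes are illustrative, not the actual API routes) :

```go
// Sketch of the playback-source decision.
package playback

import "fmt"

type Track struct {
	ID       string
	HLSReady bool // set once the stream server has produced the playlist
}

// SourceURL never blocks on the stream server: a track whose HLS output is
// missing (transcode queued, stream server down) plays over the direct
// MP3 range endpoint instead.
func SourceURL(t Track) string {
	if t.HLSReady {
		return fmt.Sprintf("/hls/%s/index.m3u8", t.ID)
	}
	return fmt.Sprintf("/api/v1/tracks/%s/stream", t.ID) // v1.0 fallback path
}
```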

ClamAV degradations

ENABLE_CLAMAV=true + CLAMAV_REQUIRED=true (prod default) means upload requests block until ClamAV scans the file. If ClamAV is unreachable, uploads return 503. SEV-2 — uploads are the highest-value user action ; users lose work.

Operators can flip CLAMAV_REQUIRED=false as an emergency escape hatch (uploads then go through unscanned). That's a security trade — ClamAV was added explicitly to stop infected file distribution. Document the timeframe in the incident postmortem and flip the flag back as soon as ClamAV recovers.
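
A sketch of the scan gate and the escape hatch, with invented names and simplified env handling (the real handler differs) :

```go
// Sketch of the upload scan gate and the CLAMAV_REQUIRED escape hatch.
package upload

import (
	"context"
	"errors"
	"os"
)

var (
	ErrScannerUnavailable = errors.New("virus scanning unavailable") // mapped to 503
	ErrInfected           = errors.New("file rejected by scanner")
)

type Scanner interface {
	Scan(ctx context.Context, path string) (clean bool, err error)
}

func scanGate(ctx context.Context, s Scanner, path string) error {
	clean, err := s.Scan(ctx, path)
	if err != nil {
		if os.Getenv("CLAMAV_REQUIRED") == "true" {
			return ErrScannerUnavailable // FAIL-LOUD: upload returns 503
		}
		return nil // FAIL-SOFT (emergency escape hatch): accept unscanned
	}
	if !clean {
		return ErrInfected
	}
	return nil
}
```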

Coturn degradations

Coturn provides the TURN relay for WebRTC 1:1 calls (per the v1.0.10 compose addition). Without coturn :

  • Calls between two peers on the same NAT segment work (peer-to-peer hole-punching).
  • Calls between two peers behind symmetric NAT (corporate firewalls, mobile CGNAT) fail silently after ~30 s with iceConnectionState=failed.
  • The frontend's useWebRTC().nat.hasTurn flag is false ; the CallButton tooltip warns the user up-front. They see the warning, the call attempt still happens, the failure is visible.

This is SEV-3 — only some users are impacted, and those users are warned before they hit the failure. Restoring coturn fixes new calls instantly (the SPA refetches /api/v1/config/webrtc per session).

Elasticsearch — orphan in v1.0

The compose files still declare Elasticsearch but the search code path uses Postgres FTS. ES being down has zero user impact in v1.0. Leaving ES in compose is intentional — v1.1 will switch search back to ES once the index is large enough to make Postgres FTS slow.

If ES is consuming resources during an outage of something else, stop the container — it won't break anything.

Health and observability surfaces

The /api/v1/health/deep endpoint reports the up/down state of each dependency. Use it as the canary for any incident triage :

```bash
curl -s https://api.veza.fr/api/v1/health/deep | jq .
```

Sample response shape :

```json
{
  "status": "ok",
  "checks": {
    "db": "ok",
    "redis": "ok",
    "rabbitmq": "ok",
    "s3": "ok",
    "disk": "ok",
    "stream_server": "ok"
  },
  "version": "v1.0.10",
  "uptime_seconds": 12345
}
```

A degraded status with the list of failing components keeps the status page accurate without operators having to ssh into anything.
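
A sketch of how such an aggregation can be built ; only the response shape above is taken from the real endpoint, the check names and wiring here are assumptions :

```go
// Sketch of a /health/deep aggregator: runs every check, reports per-component
// status, and returns 503 as soon as any check fails.
package health

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

type Check func(ctx context.Context) error

func DeepHandler(checks map[string]Check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		status, results := "ok", map[string]string{}
		for name, check := range checks {
			if err := check(ctx); err != nil {
				results[name] = err.Error()
				status = "degraded"
			} else {
				results[name] = "ok"
			}
		}
		code := http.StatusOK
		if status != "ok" {
			code = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(map[string]any{
			"status": status,
			"checks": results,
		})
	}
}
```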

Adding a new degradation mode

When introducing a new backing service or feature flag :

  1. Document the failure mode in this file (which subsystem, what degrades, what severity).
  2. If the service is critical, add a check to /api/v1/health/deep.
  3. If it has an alert rule, link the runbook in the alert annotation (per config/prometheus/alert_rules.yml convention).
  4. Decide whether the failure should fail-loud (return 5xx) or fail-soft (degrade gracefully). Document the choice in code with a // FAIL-SOFT: … or // FAIL-LOUD: … comment so the next maintainer doesn't second-guess (see the example below).
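
An illustrative example of the step-4 convention, on an invented read path, so a grep for FAIL-SOFT / FAIL-LOUD finds every deliberate degradation decision :

```go
// Illustrative only: names and the query are made up for the example.
package example

import (
	"context"
	"database/sql"
	"errors"
)

var ErrUnavailable = errors.New("service unavailable") // mapped to 503

func readWithCache(ctx context.Context, db *sql.DB, get func(string) (string, error), key string) (string, error) {
	// FAIL-LOUD: without a healthy DB this request cannot be answered
	// correctly, so surface a 503 instead of a partial response.
	if err := db.PingContext(ctx); err != nil {
		return "", ErrUnavailable
	}

	// FAIL-SOFT: a cache error is treated as a miss; the read falls
	// through to the database (see the Redis section above).
	if v, err := get(key); err == nil {
		return v, nil
	}
	var v string
	err := db.QueryRowContext(ctx, "SELECT value FROM kv WHERE key = $1", key).Scan(&v)
	return v, err
}
```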