Both files were ~15-25 lines of bullet points — fine as a
placeholder, useless under stress at 03:00 when the on-call has
never seen Veza misbehave before. Expanded both to the same depth as
db-failover.md / redis-down.md / rabbitmq-down.md so the on-call has
an actual runbook to follow.
INCIDENT_RESPONSE.md (15 → 208 lines)
* "First 5 minutes" triage: ack → annotation → 3 dashboards →
failure-class matrix → declare-if-stuck. Aligns with what an
on-call actually does when paged.
* Severity ladder (SEV-1/2/3) with response-time and
communication norms — replaces the implicit "everything is
SEV-1" the bullet points suggested.
* "Capture evidence before mitigating" block with the four exact
commands (docker logs, pg_stat_activity, redis bigkeys, RMQ
queues) the postmortem will want.
* Mitigation patterns per failure class (API down, DB down,
storage failure, webhook failure, DDoS, performance), each
pointing at the deep-dive runbook for the specific recipe.
* "After mitigation": status page, comm pattern, postmortem
schedule by severity, runbook update policy.
* Tools section with the bookmark-able URLs (Grafana, Tempo,
Sentry, status page, HAProxy stats, pg_auto_failover monitor,
RabbitMQ console, MinIO console).
GRACEFUL_DEGRADATION.md (25 → 261 lines)
* Quick-lookup matrix of every backing service × user-visible
impact × severity × deep-dive runbook. Lets the on-call read
one row instead of paging through six docs.
* Per-service section detailing what still works and what fails:
Postgres primary/replica, Redis master/Sentinel, RabbitMQ,
MinIO/S3, Hyperswitch, Stream server, ClamAV, Coturn,
Elasticsearch (called out as the v1.0 orphan it is).
* `/api/v1/health/deep` documented as the canary surface, with a
sample response shape so operators know what `degraded` looks
like before they see it.
* "Adding a new degradation mode" section with the 4-step recipe
(this file, /health/deep, alert annotation, FAIL-SOFT/FAIL-LOUD
code comment) so future maintainers keep the docs in sync as
the surface evolves.
These two files now match the depth of the alert-specific runbooks;
no more "open the runbook, find 15 lines, panic" path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runbook — Graceful Degradation
Owner: platform engineering. Purpose: describe what happens when each backing service is down, so an operator can set expectations during an outage and a developer knows where the safety nets are.
The principle: the user-facing request path should keep responding even when secondary services degrade. Hard failures (login, write operations) trade for partial functionality (read-only, cached responses, queued mutations) wherever the trade is reversible.
Quick lookup — what breaks if X is down
| Backing service | User-visible impact | Severity | Sub-runbook |
|---|---|---|---|
| Postgres (primary) | All write operations + most reads fail (5xx) | SEV-1 | db-failover.md |
| Postgres (replica) | Read-only routes slower (fall back to primary) | SEV-3 | db-failover.md |
| Redis (master) | Sessions invalidated, rate-limit goes in-memory | SEV-1 | redis-down.md |
| Redis Sentinel | Failover detection broken, but Redis serves | SEV-2 | redis-down.md |
| RabbitMQ | Async jobs (transcode, distribution, digest) queue up until recovery | SEV-2 | rabbitmq-down.md |
| MinIO / S3 | Track upload + signed-URL playback fail | SEV-1 | (TODO v1.1) |
| Hyperswitch | Checkout fails, refund webhook stalls | SEV-2 | payment-success-slo-burn.md |
| Stream server | HLS transcode requests pile up, segment 404s | SEV-2 | rabbitmq-down.md |
| ClamAV | Track upload returns 503 (CLAMAV_REQUIRED=true) | SEV-2 | (no dedicated) |
| Coturn | WebRTC 1:1 calls fail behind symmetric NAT | SEV-3 | (no dedicated) |
| Elasticsearch | (orphan in v1.0 — search uses Postgres FTS) | SEV-3 | n/a |
| OpenSMTPD | Email digest + verification mails queue | SEV-3 | (no dedicated) |
Postgres degradations
Primary down (sync + async writes)
The API depends on Postgres for every persistent operation. The
backend handler middleware (`internal/middleware/db_health.go`)
short-circuits incoming requests with 503 when the connection pool
can't acquire a connection within 1 s. This protects against the
"requests pile up while connecting" failure mode that bleeds memory.
What still works while Postgres is down:
- The `/api/v1/health` endpoint responds 200 (it doesn't touch the DB).
- The `/api/v1/health/deep` endpoint responds 503 with the failed component listed (the canary the status page reads).
- Static assets (frontend SPA) still serve from the HAProxy cache.
- WebSocket connections that don't read the DB stay open; in practice that's almost none.
What fails immediately:
- Login / refresh / register: 503.
- Any read or write on `/api/v1/*`.
- Cached reads in Redis stay readable, but every cache miss falls through to the DB and 5xx's there.
Recovery path : pg_auto_failover promotes the standby (RTO < 60 s
when sync replication holds). Frontend retries on 503 with backoff,
so users see ~1 min of "service unavailable" then the app comes
back. See db-failover.md for the operational steps.
Replica down (read replica, optional)
When a read replica is configured (`READ_DATABASE_URL`), the
`internal/database` package routes read-only queries to it.
`TrackService.forRead()` is the canonical example.
If the replica is unreachable, GORM logs a connection error and the `forRead()` helper falls back to the primary. User-visible impact: none, beyond the latency uptick from the primary picking up the read load. Replica downtime is SEV-3 — the cluster keeps serving — but it should still be investigated within a business day to restore read scaling.
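A minimal Python sketch of that fallback pattern (the real helper is Go; `for_read` and the query callables here are hypothetical stand-ins):

```python
def for_read(replica_query, primary_query):
    """Route a read to the replica; on a connection error, fall back to
    the primary so the caller never sees the failure."""
    def run(sql):
        try:
            return replica_query(sql)
        except ConnectionError as err:
            # Replica down is SEV-3: log it and absorb the extra primary load.
            print(f"replica unavailable, falling back to primary: {err}")
            return primary_query(sql)
    return run

def broken_replica(sql):
    raise ConnectionError("replica unreachable")

query = for_read(broken_replica, lambda sql: ["row-from-primary"])
print(query("SELECT id FROM tracks"))  # ['row-from-primary']
```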
Redis degradations
Redis is multi-purpose; impact differs by callsite.
Master down
| Subsystem | Effect when Redis is gone | Severity |
|---|---|---|
| Session storage / refresh tokens | Login / refresh fail — users are logged out | HIGH |
| Rate limiter (UserRateLimiter) | Falls back to in-memory per-pod limits (less coverage but doesn't fail open in prod) | MEDIUM |
| JWT revocation | Revoked tokens are accepted again until the access TTL expires | SECURITY — silent failure |
| Cache (track lookups, feed pages) | Cache miss on every read, falls back to Postgres | LOW |
| RabbitMQ-fronted queues | Independent — Redis only carries metrics for these | NONE |
The middleware doesn't 503 the whole API when Redis is down — that would be too blunt given that most routes only suffer cache misses. Operators should expect a latency uptick (every cache miss now hits Postgres) but not full unavailability.
Recovery is via Redis Sentinel HA (W3 day 11). When Sentinel promotes a replica, sessions persist (replication lag < 200 ms in practice).
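The rate-limiter fallback from the table above can be sketched like this (Python for illustration; the real limiter is Go, and the key scheme and limits here are invented):

```python
import time
from collections import defaultdict

class UserRateLimiter:
    """Fixed-window limiter backed by Redis that degrades to in-memory
    per-pod counters when Redis is unreachable. Per-pod limits cover
    less of the fleet, but the limiter never fails open."""

    def __init__(self, redis_incr, limit=100, window=60):
        self.redis_incr = redis_incr   # callable (key) -> count; raises on outage
        self.limit = limit
        self.window = window
        self.local = defaultdict(int)  # fallback counters, this pod only

    def allow(self, user_id):
        key = f"rl:{user_id}:{int(time.time() // self.window)}"
        try:
            count = self.redis_incr(key)  # normal path: shared counter
        except ConnectionError:
            self.local[key] += 1          # degraded: count locally
            count = self.local[key]
        return count <= self.limit

def redis_down(key):
    raise ConnectionError("redis master down")

limiter = UserRateLimiter(redis_down, limit=2)
print([limiter.allow("u1") for _ in range(3)])  # [True, True, False]
```

The design choice worth copying is the fail-closed fallback: when the shared counter is gone, each pod enforces the full limit locally rather than waving everything through.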
Sentinel quorum lost
Sentinel runs on 3 nodes with quorum=2. If two Sentinel nodes are unreachable, automatic failover stops working, but the master keeps serving. SEV-2 — the cluster still answers, but a master failure during this window is not auto-recoverable.
Mitigation: restart the Sentinel nodes one at a time. The master
keeps replicating to the replica throughout. See redis-down.md.
RabbitMQ degradations
The detailed runbook lives at rabbitmq-down.md. Summary: the
user-facing request path doesn't block on RabbitMQ. The backend
publishes a message and returns 202; the worker picks it up later.
When RabbitMQ is down:
- Track upload succeeds (the S3 write is fine), but HLS transcode doesn't fire; the track stays in `processing` until RabbitMQ recovers. Playback falls back to direct `/stream` (MP3 range requests).
- Distribution submissions queue silently; they resurface in the distribution dashboard as "pending" until drained.
- Email digests miss a tick or two.
- DMCA cache invalidation lags; the synchronous DB UPDATE that gates playback is unaffected.
The `internal/eventbus/rabbitmq.go` client retries with exponential
backoff for up to 30 s, then falls into "degraded mode" — publish
returns immediately with a logged warning, the API call succeeds,
and the side-effect is dropped. The dropped events are queryable in
Sentry via the filter `tag:eventbus.status=degraded`.
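The retry-then-degrade behaviour can be sketched as follows (Python for illustration; class and field names are invented, and only the ~30 s budget plus the drop-with-warning semantics come from the text above):

```python
import time

class EventBus:
    """Publish with exponential backoff; once the retry budget is spent,
    enter degraded mode: publish() returns immediately, the event is
    dropped with a warning, and the API call still succeeds."""

    def __init__(self, send, max_wait=30.0, sleep=time.sleep):
        self.send = send          # callable(event); raises while the broker is down
        self.max_wait = max_wait
        self.sleep = sleep        # injectable so the sketch is testable
        self.degraded = False
        self.dropped = 0

    def publish(self, event):
        if self.degraded:
            self.dropped += 1
            print(f"WARN eventbus degraded, dropped {event!r}")
            return
        delay, waited = 0.5, 0.0
        while True:
            try:
                self.send(event)
                return
            except ConnectionError:
                if waited >= self.max_wait:
                    self.degraded = True
                    self.dropped += 1
                    print(f"WARN eventbus degraded, dropped {event!r}")
                    return
                self.sleep(delay)
                waited += delay
                delay = min(delay * 2, self.max_wait)  # exponential backoff

def broker_down(event):
    raise ConnectionError("rabbitmq unreachable")

bus = EventBus(broker_down, max_wait=2.0, sleep=lambda s: None)
bus.publish({"type": "track.transcode"})
bus.publish({"type": "track.transcode"})
print(bus.degraded, bus.dropped)  # True 2
```

The trade is explicit: availability of the request path is bought by dropping the side-effect, which is why the dropped events must stay queryable for later replay or reconciliation.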
MinIO / S3 degradations
When `TRACK_STORAGE_BACKEND=s3` (the prod default per the v1.0.10
compose fix) and MinIO is down:
- Track upload returns 5xx (the multipart write fails).
- Direct `/stream` returns 502 (the API tries to presign a missing object).
- HLS playback: segments already on the CDN edge cache keep serving for ~7 days (segments are content-addressed; the `Cache-Control: public, max-age=86400, immutable` directive lets edges keep serving them past origin downtime).
- Playlists, comments, metadata: unaffected (DB only).
The MinIO distributed cluster (4 nodes, EC:2) tolerates 2 drives
offline. The `MinIODriveOffline` alert fires at 1 drive; the
`MinIONodesUnreachable` alert pages the on-call at 2 nodes — that's
the threshold where the next failure causes data unavailability.
Mitigation while down: there's no fallback storage. Communicate the outage on the status page and focus on restoring MinIO. Tracks uploaded during the outage are not retryable from the client side — the upload session is lost.
Hyperswitch degradations
When Hyperswitch is unreachable:
- Checkout: the order is created in `pending_payment` state, but the redirect to the Hyperswitch UI fails. The user sees "payment unavailable"; their cart is preserved.
- Refund webhook: pending refunds stay in `pending` state indefinitely until Hyperswitch is back. Operators can manually flip refunds via admin actions if the outage drags on > 24 h.
- Real-money flows: nothing is recoverable client-side. The status page must call this out as SEV-2.
Stream server degradations
The Rust stream server handles HLS transcoding + segment serving. When it's down or saturated:
- Existing HLS streams keep serving from the CDN edge cache (see the MinIO section). New streams that need transcoding stall in `processing`.
- Direct `/stream` (MP3 range requests served by the API itself, no stream server involvement) keeps working — that's the v1.0 fallback path for any track HLS hasn't materialised yet.
- The user-visible symptom is "this track won't play" on a fresh upload. Older tracks with HLS segments cached at the edge are unaffected.
ClamAV degradations
`ENABLE_CLAMAV=true` + `CLAMAV_REQUIRED=true` (the prod default)
means upload requests block until ClamAV scans the file. If ClamAV
is unreachable, uploads return 503. SEV-2 — uploads are the
highest-value user action; users lose work.
Operators can flip `CLAMAV_REQUIRED=false` as an emergency escape
hatch (uploads then go through unscanned). That's a security
trade-off — ClamAV was added explicitly to stop distribution of
infected files. Document the timeframe in the incident postmortem
and flip the flag back as soon as ClamAV recovers.
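The `CLAMAV_REQUIRED` decision tree, as an illustrative Python sketch (function names and status bodies are invented; it also shows where the FAIL-LOUD / FAIL-SOFT code comments land):

```python
def gate_upload(scan, required=True):
    """Decide what happens to an upload when the scanner is unreachable.
    required=True (prod default): 503, so no unscanned file is ever
    distributed. required=False: emergency escape hatch, the file is
    stored unscanned."""
    try:
        verdict = scan()  # callable; raises ConnectionError when clamd is down
    except ConnectionError:
        if required:
            # FAIL-LOUD: uploads block rather than skip scanning
            return 503, "virus scanner unavailable"
        # FAIL-SOFT: escape hatch only; note the unscanned window in the postmortem
        return 201, "stored unscanned"
    if verdict == "infected":
        return 422, "infected file rejected"
    return 201, "stored"

def clamav_down():
    raise ConnectionError("clamd unreachable")

print(gate_upload(clamav_down, required=True))   # (503, 'virus scanner unavailable')
print(gate_upload(clamav_down, required=False))  # (201, 'stored unscanned')
```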
Coturn degradations
WebRTC 1:1 calls depend on coturn (per the v1.0.10 compose addition). Without coturn:
- Calls between two peers on the same NAT segment work (peer-to-peer hole-punching).
- Calls between two peers behind symmetric NAT (corporate firewalls, mobile CGNAT) fail silently after ~30 s with `iceConnectionState=failed`.
- The frontend's `useWebRTC().nat.hasTurn` flag is false; the CallButton tooltip warns the user up-front. They see the warning, the call attempt still happens, and the failure is visible.
This is SEV-3 — only some users are impacted, and those users are
warned before they hit the failure. Restoring coturn fixes new
calls instantly (the SPA refetches `/api/v1/config/webrtc` per
session).
Elasticsearch — orphan in v1.0
The compose files still declare Elasticsearch but the search code path uses Postgres FTS. ES being down has zero user impact in v1.0. Leaving ES in compose is intentional — v1.1 will switch search back to ES once the index is large enough to make Postgres FTS slow.
If ES is consuming resources during an outage of something else, stop the container — it won't break anything.
Health and observability surfaces
The `/api/v1/health/deep` endpoint reports the up/down state of each
dependency. Use it as the canary for any incident triage:

```shell
curl -s https://api.veza.fr/api/v1/health/deep | jq .
```
Sample response shape:

```json
{
  "status": "ok",
  "checks": {
    "db": "ok",
    "redis": "ok",
    "rabbitmq": "ok",
    "s3": "ok",
    "disk": "ok",
    "stream_server": "ok"
  },
  "version": "v1.0.10",
  "uptime_seconds": 12345
}
```
A `degraded` status with the list of failing components keeps the
status page accurate without operators having to ssh into anything.
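The aggregation the endpoint performs can be sketched like so (Python for illustration; the `failing` field is an assumption, since the source only says the degraded response lists the failing components):

```python
def deep_health(checks, version="v1.0.10", uptime=0):
    """Fold per-dependency probe results into the /health/deep response
    shape: 200/"ok" when everything passes, 503/"degraded" with the
    failing components listed otherwise."""
    failing = sorted(name for name, ok in checks.items() if not ok)
    body = {
        "status": "ok" if not failing else "degraded",
        "checks": {name: "ok" if ok else "down" for name, ok in checks.items()},
        "version": version,
        "uptime_seconds": uptime,
    }
    if failing:
        body["failing"] = failing  # assumed field name: what the status page reads
    return (200 if not failing else 503), body

code, body = deep_health({"db": True, "redis": False, "rabbitmq": True})
print(code, body["status"], body["failing"])  # 503 degraded ['redis']
```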
Adding a new degradation mode
When introducing a new backing service or feature flag:
- Document the failure mode in this file (which subsystem, what degrades, what severity).
- If the service is critical, add a check to `/api/v1/health/deep`.
- If it has an alert rule, link the runbook in the alert annotation (per the `config/prometheus/alert_rules.yml` convention).
- Decide whether the failure should fail loud (return 5xx) or fail soft (degrade gracefully). Document the choice in code with a `// FAIL-SOFT: …` or `// FAIL-LOUD: …` comment so the next maintainer doesn't second-guess it.