# Runbook — Graceful Degradation

> **Owner** : platform engineering.
> **Purpose** : describe what happens when each backing service is
> down, so an operator can set expectations during an outage and a
> developer knows where the safety nets are.

The principle : **the user-facing request path should keep responding
even when secondary services degrade.** Hard failures (login, write
operations) trade for partial functionality (read-only, cached
responses, queued mutations) wherever the trade is reversible.

## Quick lookup — what breaks if X is down

| Backing service | User-visible impact | Severity | Sub-runbook |
| --------------- | -------------------------------------------- | ---------- | ------------------- |
| Postgres (primary) | All write operations + most reads fail (5xx) | **SEV-1** | `db-failover.md` |
| Postgres (replica) | Read-only routes slower (fall back to primary) | SEV-3 | `db-failover.md` |
| Redis (master) | Sessions invalidated, rate-limit goes in-memory | **SEV-1** | `redis-down.md` |
| Redis Sentinel | Failover detection broken, but Redis serves | SEV-2 | `redis-down.md` |
| RabbitMQ | Async jobs queue (transcode, distribution, digest) | SEV-2 | `rabbitmq-down.md` |
| MinIO / S3 | Track upload + signed-URL playback fail | **SEV-1** | (TODO v1.1) |
| Hyperswitch | Checkout fails, refund webhook stalls | SEV-2 | `payment-success-slo-burn.md` |
| Stream server | HLS transcode requests pile up, segment 404s | SEV-2 | `rabbitmq-down.md` |
| ClamAV | Track upload returns 503 (CLAMAV_REQUIRED=true) | SEV-2 | (no dedicated) |
| Coturn | WebRTC 1:1 calls fail behind symmetric NAT | SEV-3 | (no dedicated) |
| Elasticsearch | (orphan in v1.0 — search uses Postgres FTS) | SEV-3 | n/a |
| OpenSMTPD | Email digest + verification mails queue | SEV-3 | (no dedicated) |

## Postgres degradations

### Primary down (sync + async writes)

The API depends on Postgres for every persistent operation. The
backend handler middleware ([`internal/middleware/db_health.go`])
short-circuits incoming requests with 503 when the connection pool
can't acquire a connection within 1 s. This protects against the
"requests pile up while connecting" failure mode that bleeds memory.
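
A minimal sketch of that short-circuit, assuming a plain `net/http` +
`database/sql` stack — names and details are illustrative ; the real
`internal/middleware/db_health.go` is authoritative :

```go
package middleware

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

// DBHealth wraps a handler and fails fast with 503 when the pool cannot
// hand out a connection within 1 s, instead of letting requests queue
// while connecting.
func DBHealth(db *sql.DB, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 1*time.Second)
		defer cancel()

		conn, err := db.Conn(ctx) // blocks until a pooled connection frees up or the timeout fires
		if err != nil {
			// FAIL-LOUD: Postgres is a hard dependency for /api/v1/*.
			http.Error(w, "database unavailable", http.StatusServiceUnavailable)
			return
		}
		conn.Close() // return the probe connection to the pool immediately

		next.ServeHTTP(w, r)
	})
}
```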

What still works while Postgres is down :

- The `/api/v1/health` endpoint responds 200 (it doesn't touch DB).
- The `/api/v1/health/deep` endpoint responds 503 with the failed
  component listed (the canary the status page reads).
- Static assets (frontend SPA) still serve from the HAProxy cache.
- WebSocket connections that don't read the DB stay open ; in
  practice that's almost none.

What fails immediately :

- Login / refresh / register : 503.
- Any read or write on `/api/v1/*`.
- Cached reads in Redis stay readable but every cache miss falls
  through to the DB and 5xx's there.

Recovery path : pg_auto_failover promotes the standby (RTO < 60 s
when sync replication holds). Frontend retries on 503 with backoff,
so users see ~1 min of "service unavailable" then the app comes
back. See `db-failover.md` for the operational steps.

### Replica down (read replica, optional)

When a read replica is configured (`READ_DATABASE_URL`), the
[`internal/database`] package routes read-only queries to it.
[`TrackService.forRead()`] is the canonical example.

If the replica is unreachable, GORM logs a connection error and the
forRead() helper falls back to the primary. User-visible impact :
none, beyond the latency uptick from the primary picking up read
load. Replica downtime is SEV-3 — the cluster keeps serving — but
should still be investigated within a business day to restore
read scaling.
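
A sketch of that fallback under GORM, assuming separate handles for the
two pools — names are illustrative ; the actual `internal/database`
package is authoritative, and a real implementation would memoise the
health probe rather than ping on every call :

```go
package database

import "gorm.io/gorm"

type DB struct {
	Primary *gorm.DB
	Replica *gorm.DB // nil when READ_DATABASE_URL is unset
}

// ForRead returns the handle read-only queries should use.
// FAIL-SOFT: a dead replica degrades to "the primary absorbs read load",
// never to a user-visible error.
func (d *DB) ForRead() *gorm.DB {
	if d.Replica == nil {
		return d.Primary
	}
	sqlDB, err := d.Replica.DB()
	if err != nil || sqlDB.Ping() != nil {
		return d.Primary // replica unreachable — fall back silently
	}
	return d.Replica
}
```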

## Redis degradations

Redis is multi-purpose ; impact differs by callsite.

### Master down

| Subsystem | Effect when Redis is gone | Severity |
| --------------------------------- | --------------------------------------------------- | -------- |
| Session storage / refresh tokens | Login / refresh fail — users are logged out | **HIGH** |
| Rate limiter (`UserRateLimiter`) | Falls back to in-memory per-pod limits (less coverage but doesn't fail-open in prod) | MEDIUM |
| JWT revocation | Revoked tokens accepted again until access TTL | **SECURITY** — silent failure |
| Cache (track lookups, feed pages) | Cache miss on every read, falls back to Postgres | LOW |
| RabbitMQ-fronted queues | Independent — Redis is just metrics for these | NONE |

The middleware doesn't 503 the whole API when Redis is down — that
would be too restrictive given the cache-miss-only impact on most
routes. Operators should expect a latency uptick (every cache miss
hits Postgres) but not full unavailability.
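
The fail-soft pattern behind that behaviour, as an illustrative sketch
(struct, key names and TTL are assumptions, not the actual Veza code) :
a Redis error is treated exactly like a cache miss, so the read falls
through to Postgres and still succeeds, just slower.

```go
package service

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

type Track struct {
	ID    string
	Title string
}

type TrackService struct {
	redis           *redis.Client
	loadTrackFromDB func(ctx context.Context, id string) (*Track, error) // Postgres read path, stubbed here
}

// GetTrack is FAIL-SOFT on Redis: any cache error is handled as a miss.
func (s *TrackService) GetTrack(ctx context.Context, id string) (*Track, error) {
	if raw, err := s.redis.Get(ctx, "track:"+id).Result(); err == nil {
		var t Track
		if json.Unmarshal([]byte(raw), &t) == nil {
			return &t, nil
		}
	}
	t, err := s.loadTrackFromDB(ctx, id)
	if err != nil {
		return nil, err
	}
	// Best-effort repopulation ; the error is ignored if Redis is still down.
	if buf, err := json.Marshal(t); err == nil {
		s.redis.Set(ctx, "track:"+id, buf, 10*time.Minute)
	}
	return t, nil
}
```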

Recovery is via Redis Sentinel HA (W3 day 11). When Sentinel
promotes a replica, sessions persist (replication lag < 200 ms in
practice).

### Sentinel quorum lost

Sentinel runs on 3 nodes with quorum=2. If two Sentinel nodes are
unreachable, automatic failover stops working but the master keeps
serving. SEV-2 — the cluster still answers, but a master failure
during this window is not auto-recoverable.

Mitigation : restart the Sentinel nodes one at a time. The master
keeps replicating to the replica throughout. See `redis-down.md`.

## RabbitMQ degradations

The detailed runbook lives at `rabbitmq-down.md`. Summary : the
user-facing request path doesn't block on RabbitMQ. The backend
publishes a message and returns 202 ; the worker picks it up later.

When RabbitMQ is down :

- Track upload succeeds (S3 write OK), but HLS transcode doesn't
  fire ; track stays in `processing` until RabbitMQ recovers.
  Playback falls back to direct `/stream` (MP3 range requests).
- Distribution submissions queue silently ; they resurface in the
  distribution dashboard as "pending" until drained.
- Email digests miss a tick or two.
- DMCA cache invalidation lags ; the synchronous DB UPDATE that
  gates playback is unaffected.

The `internal/eventbus/rabbitmq.go` client retries with exponential
backoff up to 30 s, then falls into "degraded mode" — publish
returns immediately with a logged warning, the API call succeeds,
and the side-effect is dropped. The dropped events are queryable via
the Sentry filter `tag:eventbus.status=degraded`.
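
An illustrative sketch of that degraded mode — the real
`internal/eventbus/rabbitmq.go` differs in detail, and the AMQP publish
itself is stubbed out here :

```go
package eventbus

import (
	"log"
	"sync/atomic"
	"time"
)

type Bus struct {
	degraded  atomic.Bool
	publishFn func(routingKey string, body []byte) error // real AMQP publish, stubbed here
}

// Publish is FAIL-SOFT: once the retry budget is spent, the side-effect
// is dropped so the user-facing API call still succeeds.
func (b *Bus) Publish(routingKey string, body []byte) error {
	if b.degraded.Load() {
		log.Printf("eventbus degraded: dropping event %s", routingKey)
		return nil // caller proceeds ; the drop is visible via the Sentry tag
	}

	backoff := time.Second
	deadline := time.Now().Add(30 * time.Second)
	for {
		if err := b.publishFn(routingKey, body); err == nil {
			return nil
		} else if time.Now().Add(backoff).After(deadline) {
			b.degraded.Store(true)
			log.Printf("eventbus entering degraded mode: %v", err)
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff capped by the 30 s budget
	}
}
```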

## MinIO / S3 degradations

When `TRACK_STORAGE_BACKEND=s3` (prod default per the v1.0.10
compose fix) and MinIO is down :

- Track upload returns 5xx (the multipart write fails).
- Direct `/stream` returns 502 (the API tries to presign a missing
  object).
- HLS playback : segments already on the CDN edge cache keep
  serving for ~7 days (segments are content-addressed, and the
  `Cache-Control: public, max-age=86400, immutable` directive lets
  edges keep them past origin downtime — see the sketch below).
- Playlists, comments, metadata : unaffected (DB only).
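
The sketch below shows only the caching mechanism — the real segment
path lives in the Rust stream server and the CDN configuration, so the
handler here is purely illustrative :

```go
package hls

import "net/http"

// Segments are content-addressed : a given URL never points at new
// bytes, so marking the response immutable lets the CDN edge keep
// serving it even while MinIO (the origin) is down.
func serveSegment(w http.ResponseWriter, _ *http.Request, segment []byte) {
	w.Header().Set("Cache-Control", "public, max-age=86400, immutable")
	w.Write(segment)
}
```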

The MinIO distributed cluster (4 nodes, EC:2) tolerates 2 drives
offline. The `MinIODriveOffline` alert fires at 1 drive ; the
`MinIONodesUnreachable` alert pages the on-call at 2 nodes — that's
the threshold where the next failure causes data unavailability.

Mitigation while down : there's no fallback storage. Communicate the
outage on the status page and focus on restoring MinIO. Tracks
uploaded during the outage are not retryable from the client side
— the upload session is lost.

## Hyperswitch degradations

When Hyperswitch is unreachable :

- Checkout : the order is created in `pending_payment` state, but
  the redirect to the Hyperswitch UI fails. The user sees "payment
  unavailable" ; their cart is preserved (see the sketch below).
- Refund webhook : pending refunds stay in `pending` state
  indefinitely until Hyperswitch is back. Operators can manually
  flip refunds via admin actions if the outage drags on > 24 h.
- Real-money flows : nothing recoverable client-side. The status
  page must call this out as SEV-2.
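
A hypothetical sketch of that checkout behaviour — service, store and
field names are assumptions, not the actual Veza code :

```go
package checkout

import (
	"context"
	"errors"
)

var ErrPaymentUnavailable = errors.New("payment unavailable, cart preserved")

type Cart struct{ Items []string }

type Order struct {
	ID          string
	State       string // "pending_payment" until Hyperswitch confirms
	RedirectURL string
}

type PaymentSession struct{ URL string }

type orderStore interface {
	Create(ctx context.Context, cart Cart, state string) (*Order, error)
}

type paymentGateway interface {
	CreatePaymentSession(ctx context.Context, o *Order) (*PaymentSession, error)
}

type CheckoutService struct {
	orders      orderStore
	hyperswitch paymentGateway
}

// StartCheckout writes the order first ; if the Hyperswitch call fails,
// the order stays in pending_payment and the caller gets a retryable
// error instead of a lost cart (FAIL-SOFT for the gateway, FAIL-LOUD
// for the DB).
func (s *CheckoutService) StartCheckout(ctx context.Context, cart Cart) (*Order, error) {
	order, err := s.orders.Create(ctx, cart, "pending_payment")
	if err != nil {
		return nil, err
	}
	session, err := s.hyperswitch.CreatePaymentSession(ctx, order)
	if err != nil {
		return order, ErrPaymentUnavailable
	}
	order.RedirectURL = session.URL
	return order, nil
}
```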

## Stream server degradations

The Rust stream server handles HLS transcoding + segment serving.
When it's down or saturated :

- Existing HLS streams keep serving from the CDN edge cache (see
  the MinIO / S3 section). New streams that need transcoding stall
  in `processing`.
- Direct `/stream` (MP3 range requests served by the API itself, no
  stream server involvement) keeps working — that's the v1.0
  fallback path for any track HLS hasn't materialised for yet.
- The user-visible symptom is "this track won't play" on a fresh
  upload. Older tracks that have HLS segments cached at the edge
  are unaffected.

## ClamAV degradations

`ENABLE_CLAMAV=true` + `CLAMAV_REQUIRED=true` (prod default) means
upload requests block until ClamAV scans the file. If ClamAV is
unreachable, uploads return 503. SEV-2 — uploads are the
highest-value user action ; users lose work.

Operators can flip `CLAMAV_REQUIRED=false` as an emergency escape
hatch (uploads then go through unscanned). That's a *security*
trade-off — ClamAV was added explicitly to stop infected file
distribution. Document the timeframe in the incident postmortem
and flip the flag back as soon as ClamAV is back.
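
An illustrative sketch of the gate — the real upload handler differs,
only the `CLAMAV_REQUIRED` decision is the point :

```go
package upload

import (
	"context"
	"errors"
	"os"
)

var ErrScannerUnavailable = errors.New("clamav unreachable")

type scanner interface {
	Scan(ctx context.Context, path string) (clean bool, err error)
}

// scanOrGate decides what an unreachable scanner means for the upload.
func scanOrGate(ctx context.Context, av scanner, path string) error {
	clean, err := av.Scan(ctx, path)
	if err != nil {
		if os.Getenv("CLAMAV_REQUIRED") == "true" {
			// FAIL-LOUD: the handler maps this to a 503 and the upload is rejected.
			return ErrScannerUnavailable
		}
		// FAIL-SOFT escape hatch: the file goes through unscanned ; log it loudly.
		return nil
	}
	if !clean {
		return errors.New("file rejected: malware detected")
	}
	return nil
}
```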

## Coturn degradations

Coturn provides the TURN relay for WebRTC 1:1 calls (added in the
v1.0.10 compose). Without coturn :

- Calls between two peers on the same NAT segment work (peer-to-peer
  hole-punching).
- Calls between two peers behind symmetric NAT (corporate
  firewalls, mobile CGNAT) fail silently after ~30 s with
  `iceConnectionState=failed`.
- The frontend's `useWebRTC().nat.hasTurn` flag is false ; the
  CallButton tooltip warns the user up-front. They see the
  warning, the call attempt still happens, and the failure is
  visible.

This is SEV-3 — only some users are impacted, and those users are
warned before they hit the failure. Restoring coturn fixes new
calls instantly (the SPA refetches `/api/v1/config/webrtc` per
session).

## Elasticsearch — orphan in v1.0

The compose files still declare Elasticsearch but the search code
path uses Postgres FTS. ES being down has zero user impact in v1.0.
Leaving ES in compose is intentional — v1.1 will switch search back
to ES once the index is large enough to make Postgres FTS slow.

If ES is consuming resources during an outage of something else,
stop the container — it won't break anything.

## Health and observability surfaces

The `/api/v1/health/deep` endpoint reports the up/down state of each
dependency. Use it as the canary for any incident triage :

```bash
curl -s https://api.veza.fr/api/v1/health/deep | jq .
```

Sample response shape :

```json
{
  "status": "ok",
  "checks": {
    "db": "ok",
    "redis": "ok",
    "rabbitmq": "ok",
    "s3": "ok",
    "disk": "ok",
    "stream_server": "ok"
  },
  "version": "v1.0.10",
  "uptime_seconds": 12345
}
```

A `degraded` status with the list of failing components keeps the
status page accurate without operators having to ssh into anything.

## Adding a new degradation mode

When introducing a new backing service or feature flag :

1. Document the failure mode in this file (which subsystem, what
   degrades, what severity).
2. If the service is critical, add a check to `/api/v1/health/deep`.
3. If it has an alert rule, link the runbook in the alert
   annotation (per the `config/prometheus/alert_rules.yml` convention).
4. Decide whether the failure should fail-loud (return 5xx) or
   fail-soft (degrade gracefully). Document the choice in code with
   a `// FAIL-SOFT: …` or `// FAIL-LOUD: …` comment (as sketched
   below) so the next maintainer doesn't second-guess.
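
A minimal illustration of the step-4 convention — the call sites below
are hypothetical, the comments are the convention :

```go
package example

import (
	"context"
	"database/sql"
	"log"
	"net/http"
	"time"

	"github.com/redis/go-redis/v9"
)

func warmCache(ctx context.Context, cache *redis.Client, key string, payload []byte) {
	// FAIL-SOFT: a cold cache only costs latency ; never bubble Redis
	// errors into the user-facing response.
	if err := cache.Set(ctx, key, payload, 10*time.Minute).Err(); err != nil {
		log.Printf("cache write skipped: %v", err)
	}
}

func requireDB(ctx context.Context, db *sql.DB, w http.ResponseWriter) bool {
	// FAIL-LOUD: without Postgres there is nothing meaningful to return ;
	// surface the 503 and let db-failover.md take over.
	if err := db.PingContext(ctx); err != nil {
		http.Error(w, "database unavailable", http.StatusServiceUnavailable)
		return false
	}
	return true
}
```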