veza/docs/runbooks/INCIDENT_RESPONSE.md
senke 7d92820a9c docs(runbooks): expand INCIDENT_RESPONSE + GRACEFUL_DEGRADATION stubs
Both files were ~15-25 lines of bullet points — fine as a
placeholder, useless under stress at 03:00 when the on-call has
never seen Veza misbehave before. Expanded both to the same depth as
db-failover.md / redis-down.md / rabbitmq-down.md so the on-call has
an actual runbook to follow.

INCIDENT_RESPONSE.md (15 → 208 lines)
  * "First 5 minutes" triage : ack → annotation → 3 dashboards →
    failure-class matrix → declare-if-stuck. Aligns with what an
    on-call actually does when paged.
  * Severity ladder (SEV-1/2/3) with response-time and
    communication norms — replaces the implicit "everything is
    SEV-1" the bullet points suggested.
  * "Capture evidence before mitigating" block with the four exact
    commands (docker logs, pg_stat_activity, redis bigkeys, RMQ
    queues) the postmortem will want.
  * Mitigation patterns per failure class (API down, DB down,
    storage failure, webhook failure, DDoS, performance), each
    pointing at the deep-dive runbook for the specific recipe.
  * "After mitigation" : status page, comm pattern, postmortem
    schedule by severity, runbook update policy.
  * Tools section with the bookmark-able URLs (Grafana, Tempo,
    Sentry, status page, HAProxy stats, pg_auto_failover monitor,
    RabbitMQ console, MinIO console).

GRACEFUL_DEGRADATION.md (25 → 261 lines)
  * Quick-lookup matrix of every backing service × user-visible
    impact × severity × deep-dive runbook. Lets the on-call read
    one row instead of paging through six docs.
  * Per-service section detailing what still works and what fails :
    Postgres primary/replica, Redis master/Sentinel, RabbitMQ,
    MinIO/S3, Hyperswitch, Stream server, ClamAV, Coturn,
    Elasticsearch (called out as the v1.0 orphan it is).
  * `/api/v1/health/deep` documented as the canary surface, with a
    sample response shape so operators know what `degraded` looks
    like before they see it.
  * "Adding a new degradation mode" section with the 4-step recipe
    (this file, /health/deep, alert annotation, FAIL-SOFT/FAIL-LOUD
    code comment) so future maintainers keep the docs in sync as
    the surface evolves.

These two files now match the depth of the alert-specific runbooks ;
no more "open the runbook, find 15 lines, panic" path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:13:55 +02:00


# Runbook — Incident Response
> **Owner** : whoever holds the on-call pager.
> **Companion docs** : [GRACEFUL_DEGRADATION.md](GRACEFUL_DEGRADATION.md),
> [ROLLBACK.md](ROLLBACK.md), [DEPLOYMENT.md](DEPLOYMENT.md).
> **When in doubt** : preserve evidence first, mitigate second, fix
> third. Do NOT restart a service before capturing its logs.
The runbooks under `docs/runbooks/` are organised by failure mode
(`db-failover.md`, `redis-down.md`, `rabbitmq-down.md`, `disk-full.md`,
`api-availability-slo-burn.md`, `api-latency-slo-burn.md`,
`payment-success-slo-burn.md`, `cert-expiring-soon.md`,
`SECRET_ROTATION.md`). This file is the **entry point** : how to
triage an unknown alert, who to wake up, and which sub-runbook to
open next.
## The first 5 minutes
When you're paged and don't know what's happening yet :
1. **Acknowledge the alert** so a second pager doesn't fire on the
same incident. PagerDuty / Alertmanager → ack.
2. **Read the alert annotation.** Most alert rules carry
`runbook_url` pointing at the right sub-runbook (verify via
`config/prometheus/alert_rules.yml` ; a shell lookup is sketched after this
list). If the alert has a runbook,
open it now and follow that doc instead of this one.
3. **Glance at three dashboards in this exact order :**
- **Veza API Overview** (Grafana) → 5xx rate, p95 latency, RPS
- **Status page** (https://status.veza.fr) → which components are
red according to synthetic + SLO burn-rate alerts
- **Sentry → veza-backend project** → P1/P2 issues in the last
30 min, sorted by frequency
4. **Identify the failure class** from this matrix :
| Symptom | Likely cause | Open runbook |
| ---------------------------------------- | ---------------------- | ------------------------------------- |
| API 5xx > 5% for >5 min | DB or Redis down | `db-failover.md` or `redis-down.md` |
| API p95 > 2s, low CPU | DB slow query / Redis cache miss | `redis-down.md` (cache cold) or DB pgBadger |
| Login fails, sessions invalid | Redis down | `redis-down.md` |
| Tracks won't upload but API responds | MinIO / S3 down | (no dedicated runbook ; see "Storage failure" below) |
| HLS playback stalls, browser shows 404 | Stream server / transcode worker | `rabbitmq-down.md` if jobs piling |
| Payment webhook backlog growing | Hyperswitch issue | `payment-success-slo-burn.md` |
| "Nothing's broken but everyone's mad" | Graceful-degraded mode | `GRACEFUL_DEGRADATION.md` |
| TLS cert expired | LE renewal failed | `cert-expiring-soon.md` |
| Disk full alert | Logs / pgBackRest | `disk-full.md` |
5. **If you can't classify in 5 min,** declare an incident in
`#incidents` Slack with severity SEV-2 (assume worse-than-average
until proven otherwise), invite the on-call lead, and continue
investigating with backup.
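If the page arrives without a usable link for step 2, the annotation can be
pulled straight out of the rules file. A minimal lookup, assuming the alert
name below is swapped for the `alertname` the pager actually shows :
```bash
# Print the runbook_url annotation for the firing alert.
# "VezaAPIHighErrorRate" is a stand-in name ; use the real one from the page.
grep -A8 'alert: VezaAPIHighErrorRate' config/prometheus/alert_rules.yml | grep runbook_url
```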
## Severity ladder
| Level | Definition | Response time | Communication |
| ----- | --------------------------------------------------------------- | ------------- | ---------------------------------------------- |
| SEV-1 | User-facing data loss OR > 50% of users impacted > 10 min | 15 min | Status page banner + #incidents + email blast |
| SEV-2 | User-facing degradation, single-component outage, recoverable | 30 min | Status page component yellow + #incidents |
| SEV-3 | Internal / observability degradation, no user impact | next business day | #engineering ticket |
Default to SEV-2 on first page. Promote to SEV-1 if you confirm data
loss or wide impact ; demote to SEV-3 if you confirm no user impact.
## Capture evidence before mitigating
Restarting a process loses the in-memory state that diagnoses the
incident. Before the first mitigation action :
```bash
# Backend API logs (last 500 lines)
docker logs --tail 500 veza_backend_api_blue > /tmp/incident-$(date +%Y%m%d-%H%M%S)-api.log
# Stream server logs
docker logs --tail 500 veza_stream_server_blue > /tmp/incident-$(date +%Y%m%d-%H%M%S)-stream.log
# DB stats snapshot
psql "$DATABASE_URL" -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';" \
> /tmp/incident-$(date +%Y%m%d-%H%M%S)-pg-activity.txt
# Redis ops snapshot
redis-cli -u "$REDIS_URL" --bigkeys > /tmp/incident-$(date +%Y%m%d-%H%M%S)-redis-keys.txt
# RabbitMQ queue depths
curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/queues \
| jq '.[] | {name, messages, consumers}' > /tmp/incident-$(date +%Y%m%d-%H%M%S)-rmq-queues.json
```
Drop those files into `s3://veza-incidents/<date>/` after the
incident closes — postmortem will reference them.
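A minimal way to do that drop, assuming an `mc` alias named `veza` already
points at the cluster serving `veza-incidents` (adjust to whatever S3 tooling
the host actually has) :
```bash
# Ship everything captured today into the incident bucket.
mc cp /tmp/incident-$(date +%Y%m%d)-* veza/veza-incidents/$(date +%Y-%m-%d)/
```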
## Mitigation patterns
### API down (`/health` 5xx or unreachable)
1. **Confirm scope** : single instance or both (blue + green) ? (A quick
scope check is sketched at the end of this section.)
- One down : HAProxy auto-routes to the healthy one ; symptom
should be transient as the failed instance reboots.
- Both down : full outage. Status page → SEV-1.
2. **Logs first.** Check `docker logs veza_backend_api_blue` and
green for panic / fatal traces.
3. **DB connection pool exhausted ?** Look for "too many connections"
or "connection refused : 5432" in the logs. → `db-failover.md`.
4. **Restart only after capturing logs.**
```bash
docker compose -f docker-compose.prod.yml restart backend-api-blue
```
Wait 30s for healthcheck before rotating green.
5. **Rollback image if regression suspected** (deploy in last 1 h
correlates with incident start). → [ROLLBACK.md](ROLLBACK.md).
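For step 1's scope check, a minimal sketch ; the `/health` port and the
availability of `curl` inside the images are assumptions, so fall back to
HAProxy's stats page if they don't hold :
```bash
# Container-level view first : a failing healthcheck shows up in STATUS.
docker ps --filter name=veza_backend_api --format '{{.Names}}\t{{.Status}}'
# Then hit each instance's /health directly (port 8080 is an assumption).
docker exec veza_backend_api_blue  curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/health
docker exec veza_backend_api_green curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/health
```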
### DB down (`/health/deep` shows `db: false`)
This is the highest-severity pattern — most user flows fail.
1. Open `db-failover.md` and follow it. That runbook has the full
pg_auto_failover recipe ; this entry-point doc shouldn't duplicate it.
2. While failover runs, expect 503 on every API call. The frontend
has a "service unavailable" splash and will retry on its own.
3. Watch `pg_auto_failover` cluster state via the monitor URL listed
in the deep-dive runbook. Promotion takes < 60s if the standby is
healthy.
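A minimal way to watch that promotion from a shell, assuming you have the
monitor connection string from the postgres_ha README at hand (the env var
name below is only illustration) :
```bash
# Refreshes the node/state table every 5s ; the standby should report
# "primary" within the ~60s window mentioned above.
watch -n 5 pg_autoctl show state --monitor "$PG_AUTOFAILOVER_MONITOR_URI"
```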
### Storage failure (MinIO / S3)
No dedicated runbook yet (TODO v1.1). Triage :
1. **MinIO drives offline ?** Alert `MinIODriveOffline` fires per
`config/prometheus/alert_rules.yml:90`. EC:2 tolerates 2 drive
losses ; investigate within the hour.
2. **MinIO node unreachable ?** Alert `MinIONodesUnreachable` fires
when 2 nodes are gone ; that's a SEV-1, redundancy is exhausted.
3. **CDN dropping requests ?** Phase-1 cache is local Nginx. SSH the
cache container, `curl -I` the origin to verify. Bypass the CDN
by swapping the API's `CDN_BASE_URL` env to empty if needed.
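A quick origin-vs-cache split, assuming MinIO's standard liveness endpoint
and that the host/port below match the real origin :
```bash
# 200 here means the origin is fine and the problem is the cache layer ;
# anything else points at MinIO itself.
curl -sI http://minio.veza.fr:9000/minio/health/live | head -1
```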
### Webhook failures (Hyperswitch / Stream server)
1. `payment-success-slo-burn.md` for payment webhooks specifically.
2. For stream server transcode webhooks (`/internal/jobs/transcode`),
check `STREAM_SERVER_INTERNAL_API_KEY` matches between the API
and stream server containers (a hash comparison is sketched after this
list). Mismatch surfaces as 401 in the API logs.
3. Don't resend webhooks blindly : most are idempotent, but some
(DMCA propagation, distribution outbox) aren't fully so. Verify
the worker's idempotency-key handling before bulk-resending.
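For step 2, a sketch that compares the key without printing the secret,
assuming both images ship `sh` and `sha256sum` :
```bash
# The two hashes must be identical ; a mismatch explains the 401s.
for c in veza_backend_api_blue veza_stream_server_blue; do
  printf '%s  ' "$c"
  docker exec "$c" sh -c 'printf "%s" "$STREAM_SERVER_INTERNAL_API_KEY" | sha256sum'
done
```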
### DDoS / abusive traffic
The default rate-limiter caps usually kick in before anything user-visible
degrades, but a determined attacker can still saturate connections.
1. **Identify the abusive pattern.** Grafana → Veza API Overview →
"Top IP by request count last 5 min" panel. If a single IP sits at 10×
the median, it's likely a probe (a log-side fallback is sketched after
this list).
2. **Block at the edge** if you have one (Cloudflare, fail2ban on
the host) ; the rate limiter middleware blocks per request, not
per connection, so a flood of legitimate-looking requests can
still saturate goroutines.
3. **Tighten the rate limiter** by env override if needed :
```bash
# Set RATE_LIMIT_PER_MINUTE=10 in the backend-api-blue environment
# (docker-compose.prod.yml or its env_file), then recreate the service :
docker compose -f docker-compose.prod.yml up -d backend-api-blue
```
(default 60/min). Revert the override once the flood subsides.
4. **Scale horizontally** only if the traffic is legitimate.
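The log-side fallback mentioned in step 1, for when Grafana itself is
struggling ; the JSON field name is an assumption, match it to the
structured-log schema :
```bash
# Rough top-talkers over the last 5 minutes, straight from the API logs.
docker logs --since 5m veza_backend_api_blue 2>&1 \
  | grep -oE '"remote_ip":"[^"]*"' | sort | uniq -c | sort -rn | head
```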
### Performance degradation (high p95 latency, low CPU)
Classic symptom of a slow downstream :
1. **Check the p99 latency per route in Grafana** :
`histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))`.
If one route is the outlier, that's the lead.
2. **DB queries.** Run `pgBadger` on the last 30 min of slow-query
logs : `pgbadger /var/log/postgresql/postgresql-slow.log -o /tmp/pgbadger-incident.html`.
3. **Cache cold ?** After Redis restarts, the next 5-15 min will
hit Postgres harder. Expected, transient.
4. **GORM N+1 ?** The most common code-side latency cause. Look in
the structured logs for repeated identical SELECTs from a single
request_id.
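A sketch for the N+1 hunt in step 4, assuming JSON structured logs with
`request_id` and `sql` fields (rename to whatever the schema actually uses) :
```bash
# One request_id with dozens of identical SELECTs is the N+1 signature.
docker logs --since 30m veza_backend_api_blue 2>&1 \
  | jq -rR 'fromjson? | select(.request_id? and .sql?) | "\(.request_id)\t\(.sql)"' \
  | sort | uniq -c | sort -rn | head -20
```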
## After mitigation
1. **Update the status page** : flip components back to green.
2. **Comm in #incidents** : "RESOLVED. Cause: …. Mitigated by: ….
Postmortem incoming."
3. **Schedule the postmortem** within 2 business days for SEV-1,
1 week for SEV-2.
4. **File a postmortem doc** under `docs/postmortems/<date>-<slug>.md`
with sections : timeline, root cause, what worked, what didn't,
action items, runbook updates needed (a scaffold is sketched after
this list).
5. **Update this runbook** if the incident exposed a class of
failure not yet covered above. The matrix in §"The first 5
minutes" is the doc consumers hit first keep it accurate.
## Tools you'll want fast access to
- Grafana : https://grafana.veza.fr (bookmark the "Veza API Overview"
dashboard URL directly)
- Tempo trace search : same Grafana, "Explore" → "Tempo" data source.
Useful when a latency spike has no obvious DB / Redis cause.
- Sentry : https://sentry.io/veza/veza-backend
- Status page : https://status.veza.fr (admin URL behind SSO)
- HAProxy stats : http://haproxy.veza.fr:8404/stats (prod LB)
- pg_auto_failover monitor : see `infra/ansible/roles/postgres_ha/`
README for the exact host:port (rotates when monitor fails over)
- RabbitMQ management : http://rabbitmq.lxd:15672 (basic auth)
- MinIO console : http://minio.veza.fr:9001 (admin SSO)
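Most of these surfaces also answer without a browser, which matters when
Grafana itself is inside the blast radius. For example, HAProxy's stats page
has a standard CSV view (same URL as above, `;csv` suffix) :
```bash
# Proxy, server and up/down status for every backend, machine-readable.
curl -s 'http://haproxy.veza.fr:8404/stats;csv' | cut -d, -f1,2,18 | column -s, -t
```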
If any of those URLs is broken at incident time, that's the first
postmortem action item ; operators need bookmarks they can trust.