Both files were ~15-25 lines of bullet points — fine as placeholders,
useless under stress at 03:00 when the on-call has
never seen Veza misbehave before. Expanded both to the same depth as
db-failover.md / redis-down.md / rabbitmq-down.md so the on-call has
an actual runbook to follow.
INCIDENT_RESPONSE.md (15 → 208 lines)
* "First 5 minutes" triage : ack → annotation → 3 dashboards →
failure-class matrix → declare-if-stuck. Aligns with what an
on-call actually does when paged.
* Severity ladder (SEV-1/2/3) with response-time and
communication norms — replaces the implicit "everything is
SEV-1" the bullet points suggested.
* "Capture evidence before mitigating" block with the four exact
commands (docker logs, pg_stat_activity, redis bigkeys, RMQ
queues) the postmortem will want.
* Mitigation patterns per failure class (API down, DB down,
storage failure, webhook failure, DDoS, performance), each
pointing at the deep-dive runbook for the specific recipe.
* "After mitigation" : status page, comm pattern, postmortem
schedule by severity, runbook update policy.
* Tools section with the bookmark-able URLs (Grafana, Tempo,
Sentry, status page, HAProxy stats, pg_auto_failover monitor,
RabbitMQ console, MinIO console).
GRACEFUL_DEGRADATION.md (25 → 261 lines)
* Quick-lookup matrix of every backing service × user-visible
impact × severity × deep-dive runbook. Lets the on-call read
one row instead of paging through six docs.
* Per-service section detailing what still works and what fails :
Postgres primary/replica, Redis master/Sentinel, RabbitMQ,
MinIO/S3, Hyperswitch, Stream server, ClamAV, Coturn,
Elasticsearch (called out as the v1.0 orphan it is).
* `/api/v1/health/deep` documented as the canary surface, with a
sample response shape so operators know what `degraded` looks
like before they see it.
* "Adding a new degradation mode" section with the 4-step recipe
(this file, /health/deep, alert annotation, FAIL-SOFT/FAIL-LOUD
code comment) so future maintainers keep the docs in sync as
the surface evolves.
These two files now match the depth of the alert-specific runbooks ;
no more "open the runbook, find 15 lines, panic" path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Runbook — Incident Response

> **Owner** : whoever holds the on-call pager.
> **Companion docs** : [GRACEFUL_DEGRADATION.md](GRACEFUL_DEGRADATION.md),
> [ROLLBACK.md](ROLLBACK.md), [DEPLOYMENT.md](DEPLOYMENT.md).
> **When in doubt** : preserve evidence first, mitigate second, fix
> third. Do NOT restart a service before capturing its logs.

The runbooks under `docs/runbooks/` are organised by failure mode
(`db-failover.md`, `redis-down.md`, `rabbitmq-down.md`, `disk-full.md`,
`api-availability-slo-burn.md`, `api-latency-slo-burn.md`,
`payment-success-slo-burn.md`, `cert-expiring-soon.md`,
`SECRET_ROTATION.md`). This file is the **entry point** : how to
triage an unknown alert, who to wake up, and which sub-runbook to
open next.

## The first 5 minutes

When you're paged and don't know what's happening yet :

1. **Acknowledge the alert** so a second pager doesn't fire on the
   same incident. PagerDuty / Alertmanager → ack.
2. **Read the alert annotation.** Most alert rules carry a
   `runbook_url` pointing at the right sub-runbook (verify via
   `config/prometheus/alert_rules.yml` — see the lookup sketch after
   this list). If the alert has a runbook, open it now and follow
   that doc instead of this one.
3. **Glance at three dashboards, in this exact order :**
   - **Veza API Overview** (Grafana) → 5xx rate, p95 latency, RPS
   - **Status page** (https://status.veza.fr) → which components are
     red according to synthetic + SLO burn-rate alerts
   - **Sentry → veza-backend project** → P1/P2 issues in the last
     30 min, sorted by frequency
4. **Identify the failure class** from this matrix :

   | Symptom | Likely cause | Open runbook |
   | --------------------------------------- | -------------------------------- | ---------------------------------------------------- |
   | API 5xx > 5% for > 5 min | DB or Redis down | `db-failover.md` or `redis-down.md` |
   | API p95 > 2 s, low CPU | DB slow query / Redis cache miss | `redis-down.md` (cache cold) or DB pgBadger |
   | Login fails, sessions invalid | Redis down | `redis-down.md` |
   | Tracks won't upload but API responds | MinIO / S3 down | (no dedicated runbook ; see "Storage failure" below) |
   | HLS playback stalls, browser shows 404 | Stream server / transcode worker | `rabbitmq-down.md` if jobs are piling up |
   | Payment webhook backlog growing | Hyperswitch issue | `payment-success-slo-burn.md` |
   | "Nothing's broken but everyone's mad" | Graceful-degraded mode | `GRACEFUL_DEGRADATION.md` |
   | TLS cert expired | LE renewal failed | `cert-expiring-soon.md` |
   | Disk full alert | Logs / pgBackRest | `disk-full.md` |

5. **If you can't classify within 5 min,** declare a SEV-2 incident
   in `#incidents` Slack (assume worse-than-average until proven
   otherwise), invite the on-call lead, and continue investigating
   with backup.
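
If the page arrives without a runbook link, step 2's lookup can be
done by hand. A minimal sketch, assuming the repo is checked out
wherever you're shelled in (the alert name below is a placeholder) :

```bash
# Pull the runbook_url annotation for a given alert out of the rules file.
ALERT="APIHighErrorRate"   # placeholder; use the name from the page
grep -A 10 "alert: $ALERT" config/prometheus/alert_rules.yml \
  | grep runbook_url
```
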
## Severity ladder

| Level | Definition | Response time | Communication |
| ----- | --------------------------------------------------------------- | ----------------- | ---------------------------------------------- |
| SEV-1 | User-facing data loss OR > 50% of users impacted > 10 min | 15 min | Status page banner + #incidents + email blast |
| SEV-2 | User-facing degradation, single-component outage, recoverable | 30 min | Status page component yellow + #incidents |
| SEV-3 | Internal / observability degradation, no user impact | next business day | #engineering ticket |

Default to SEV-2 on first page. Promote to SEV-1 if you confirm data
loss or wide impact ; demote to SEV-3 if you confirm no user impact.

## Capture evidence before mitigating

Restarting a process loses the in-memory state you'll need to
diagnose the incident. Before the first mitigation action :

```bash
# Backend API logs (last 500 lines)
docker logs --tail 500 veza_backend_api_blue > /tmp/incident-$(date +%Y%m%d-%H%M%S)-api.log

# Stream server logs
docker logs --tail 500 veza_stream_server_blue > /tmp/incident-$(date +%Y%m%d-%H%M%S)-stream.log

# DB stats snapshot
psql "$DATABASE_URL" -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';" \
  > /tmp/incident-$(date +%Y%m%d-%H%M%S)-pg-activity.txt

# Redis ops snapshot
redis-cli -u "$REDIS_URL" --bigkeys > /tmp/incident-$(date +%Y%m%d-%H%M%S)-redis-keys.txt

# RabbitMQ queue depths
curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/queues \
  | jq '.[] | {name, messages, consumers}' > /tmp/incident-$(date +%Y%m%d-%H%M%S)-rmq-queues.json
```

Drop those files into `s3://veza-incidents/<date>/` after the
incident closes — the postmortem will reference them.
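
A minimal upload sketch, assuming the MinIO client (`mc`) is
configured with a `veza` alias that can reach the incident bucket
(the alias name is an assumption) :

```bash
# Push the captured evidence into the incident bucket for the postmortem.
DATE=$(date +%Y-%m-%d)
mc cp /tmp/incident-*.log /tmp/incident-*.txt /tmp/incident-*.json \
  "veza/veza-incidents/$DATE/"
```
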
## Mitigation patterns

### API down (`/health` 5xx or unreachable)

1. **Confirm scope** : one instance down or both (blue + green) ?
   See the health-check sketch after this list.
   - One down : HAProxy auto-routes to the healthy instance ; the
     symptom should be transient while the failed instance reboots.
   - Both down : full outage. Status page → SEV-1.
2. **Logs first.** Check `docker logs veza_backend_api_blue` and
   green for panic / fatal traces.
3. **DB connection pool exhausted ?** Look for "too many connections"
   or "connection refused : 5432" in the logs. → `db-failover.md`.
4. **Restart only after capturing logs.**
   ```bash
   docker compose -f docker-compose.prod.yml restart backend-api-blue
   ```
   Wait 30 s for the healthcheck to pass before rotating green.
5. **Roll back the image if a regression is suspected** (a deploy in
   the last hour that correlates with incident start). → [ROLLBACK.md](ROLLBACK.md).
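
A scope-check sketch for step 1. The blue container name is the one
used elsewhere in this runbook ; the green name follows it by
analogy, and the internal port (8080) plus the presence of curl in
the image are assumptions :

```bash
# Probe each instance's /health from inside its container.
for c in veza_backend_api_blue veza_backend_api_green; do
  printf '%s: ' "$c"
  docker exec "$c" curl -fsS http://localhost:8080/health || echo "DOWN"
done
```
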
### DB down (`/health/deep` shows `db: false`)

This is the highest-severity pattern — most user flows fail.

1. → `db-failover.md`. That runbook has the full pg_auto_failover
   recipe ; this entry-point doc shouldn't duplicate it.
2. While failover runs, expect 503 on every API call. The frontend
   shows a "service unavailable" splash and will retry on its own.
3. Watch the `pg_auto_failover` cluster state via the monitor URL
   listed in the deep-dive runbook (a state-check sketch follows
   this list). Promotion takes < 60 s if the standby is healthy.
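
A state-check sketch, assuming SSH access to the monitor node ; the
hostname and `--pgdata` path below are placeholders, the real ones
are in the deep-dive runbook :

```bash
# Show the reported/assigned state of every node in the formation.
ssh pg-monitor.lxd \
  sudo -u postgres pg_autoctl show state --pgdata /var/lib/postgresql/monitor
```
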
### Storage failure (MinIO / S3)

No dedicated runbook yet (TODO v1.1). Triage :

1. **MinIO drives offline ?** Alert `MinIODriveOffline` fires per
   `config/prometheus/alert_rules.yml:90`. EC:2 tolerates two drive
   losses ; investigate within the hour. (A cluster-health sketch
   follows this list.)
2. **MinIO node unreachable ?** Alert `MinIONodesUnreachable` fires
   when ≥ 2 nodes are gone — that's a SEV-1 : redundancy is exhausted.
3. **CDN dropping requests ?** The phase-1 cache is local Nginx. SSH
   into the cache container and `curl -I` the origin to verify.
   Bypass the CDN by setting the API's `CDN_BASE_URL` env to empty
   if needed.
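
A cluster-health sketch, assuming the MinIO client (`mc`) has an
alias for this cluster (the `veza` alias is an assumption, and the
JSON shape can vary across MinIO versions) :

```bash
# Per-node summary: online/offline servers and drive counts.
mc admin info veza

# Drive-level detail, filtered to anything not reporting "ok".
mc admin info veza --json \
  | jq '.info.servers[].drives[] | select(.state != "ok")'
```
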
### Webhook failures (Hyperswitch / Stream server)

1. `payment-success-slo-burn.md` covers payment webhooks specifically.
2. For stream-server transcode webhooks (`/internal/jobs/transcode`),
   check that `STREAM_SERVER_INTERNAL_API_KEY` matches between the
   API and stream-server containers (a comparison sketch follows
   this list). A mismatch surfaces as 401s in the API logs.
3. Don't resend webhooks blindly — most are idempotent, but some
   (DMCA propagation, distribution outbox) aren't fully so. Verify
   the worker's idempotency-key handling before bulk-resending.
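
A comparison sketch for step 2 ; the container names come from the
evidence-capture block above, and `printenv` being present in both
images is an assumption :

```bash
# Compare the shared secret across the two containers without
# printing it: hash each value and eyeball the digests.
for c in veza_backend_api_blue veza_stream_server_blue; do
  printf '%s: ' "$c"
  docker exec "$c" printenv STREAM_SERVER_INTERNAL_API_KEY \
    | sha256sum | cut -d' ' -f1
done
```
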
### DDoS / abusive traffic

The default rate-limiter caps trip before degradation, but a
determined attacker can still saturate connections.

1. **Identify the abusive pattern.** Grafana → Veza API Overview →
   "Top IP by request count last 5 min" panel. If a single IP sits
   at ≥ 10× the median, it's likely a probe.
2. **Block at the edge** if you have one (Cloudflare, fail2ban on
   the host) — the rate-limiter middleware blocks per request, not
   per connection, so a flood of legitimate-looking requests can
   still saturate goroutines. (A host-level block sketch follows
   this list.)
3. **Tighten the rate limiter** via env override if needed
   (default 60/min) :
   ```bash
   # Assumes docker-compose.prod.yml interpolates the variable, e.g.
   #   RATE_LIMIT_PER_MINUTE: ${RATE_LIMIT_PER_MINUTE:-60}
   # Verify this before relying on it. `up -d` recreates the
   # container so the new value takes effect.
   RATE_LIMIT_PER_MINUTE=10 \
     docker compose -f docker-compose.prod.yml up -d backend-api-blue
   ```
4. **Scale horizontally** only if the traffic is legitimate.
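
A host-level block sketch for step 2, assuming iptables is available
on the LB host (the IP is a placeholder) :

```bash
# Drop all traffic from the offending IP at the host firewall.
BAD_IP=203.0.113.42   # placeholder; take it from the Grafana panel
sudo iptables -I INPUT -s "$BAD_IP" -j DROP

# Confirm the rule landed at the top of the chain.
sudo iptables -L INPUT -n --line-numbers | head -5
```
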
### Performance degradation (high p95 latency, low CPU)

Classic symptom of a slow downstream :

1. **Check `histogram_quantile(0.99, http_request_duration_seconds_bucket)`**
   by route in Grafana. If one route is the outlier, that's the lead.
2. **DB queries.** Run pgBadger on the last 30 min of slow-query logs :
   `pgbadger /var/log/postgresql/postgresql-slow.log -o /tmp/pgbadger-incident.html`.
3. **Cache cold ?** After a Redis restart, the next 5-15 min will
   hit Postgres harder. Expected and transient.
4. **GORM N+1 ?** The most common code-side latency cause. Look in
   the structured logs for repeated identical SELECTs from a single
   request_id (a grep sketch follows this list).
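
A grep sketch for step 4, assuming the API emits JSON structured
logs with `request_id` and `sql` fields (both field names are
assumptions) :

```bash
# Count identical SQL statements per request over the last 15 min;
# a high count for one (request_id, sql) pair suggests an N+1 loop.
docker logs --since 15m veza_backend_api_blue 2>&1 \
  | jq -Rr 'fromjson? | select(.sql != null) | "\(.request_id)\t\(.sql)"' \
  | sort | uniq -c | sort -rn | head -20
```
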
## After mitigation

1. **Update the status page** — flip components back to green.
2. **Comm in #incidents** : "RESOLVED. Cause : …. Mitigated by : ….
   Postmortem incoming."
3. **Schedule the postmortem** : within 2 business days for SEV-1,
   within 1 week for SEV-2.
4. **File a postmortem doc** under `docs/postmortems/<date>-<slug>.md`
   with sections : timeline, root cause, what worked, what didn't,
   action items, runbook updates needed. (A scaffold sketch follows
   this list.)
5. **Update this runbook** if the incident exposed a class of
   failure not yet covered above. The matrix in §"The first 5
   minutes" is what consumers of this doc hit first — keep it
   accurate.
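
A scaffold sketch for step 4 ; the section list comes straight from
this runbook, the heredoc is just one way to lay it down :

```bash
# Create a postmortem skeleton with the expected sections.
SLUG="api-outage"   # placeholder; pick a short incident slug
F="docs/postmortems/$(date +%Y-%m-%d)-$SLUG.md"
mkdir -p docs/postmortems
cat > "$F" <<'EOF'
# Postmortem — <title>

## Timeline
## Root cause
## What worked
## What didn't
## Action items
## Runbook updates needed
EOF
echo "created $F"
```
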
## Tools you'll want fast access to

- Grafana : https://grafana.veza.fr (bookmark the "Veza API Overview"
  dashboard URL directly)
- Tempo trace search : same Grafana, "Explore" → "Tempo" data source.
  Useful when a latency spike has no obvious DB / Redis cause.
- Sentry : https://sentry.io/veza/veza-backend
- Status page : https://status.veza.fr (admin URL behind SSO)
- HAProxy stats : http://haproxy.veza.fr:8404/stats (prod LB)
- pg_auto_failover monitor : see the `infra/ansible/roles/postgres_ha/`
  README for the exact host:port (it rotates when the monitor fails over)
- RabbitMQ management : http://rabbitmq.lxd:15672 (basic auth)
- MinIO console : http://minio.veza.fr:9001 (admin SSO)

If any of those URLs is broken at incident time, that's the first
postmortem action item — operators need bookmarks they can trust.