veza/docs/runbooks/INCIDENT_RESPONSE.md
senke 7d92820a9c docs(runbooks): expand INCIDENT_RESPONSE + GRACEFUL_DEGRADATION stubs
Both files were ~15-25 lines of bullet points — fine as a
placeholder, useless under stress at 03:00 when the on-call has
never seen Veza misbehave before. Expanded both to the same depth as
db-failover.md / redis-down.md / rabbitmq-down.md so the on-call has
an actual runbook to follow.

INCIDENT_RESPONSE.md (15 → 208 lines)
  * "First 5 minutes" triage : ack → annotation → 3 dashboards →
    failure-class matrix → declare-if-stuck. Aligns with what an
    on-call actually does when paged.
  * Severity ladder (SEV-1/2/3) with response-time and
    communication norms — replaces the implicit "everything is
    SEV-1" the bullet points suggested.
  * "Capture evidence before mitigating" block with the four exact
    commands (docker logs, pg_stat_activity, redis bigkeys, RMQ
    queues) the postmortem will want.
  * Mitigation patterns per failure class (API down, DB down,
    storage failure, webhook failure, DDoS, performance), each
    pointing at the deep-dive runbook for the specific recipe.
  * "After mitigation" : status page, comm pattern, postmortem
    schedule by severity, runbook update policy.
  * Tools section with the bookmark-able URLs (Grafana, Tempo,
    Sentry, status page, HAProxy stats, pg_auto_failover monitor,
    RabbitMQ console, MinIO console).

GRACEFUL_DEGRADATION.md (25 → 261 lines)
  * Quick-lookup matrix of every backing service × user-visible
    impact × severity × deep-dive runbook. Lets the on-call read
    one row instead of paging through six docs.
  * Per-service section detailing what still works and what fails :
    Postgres primary/replica, Redis master/Sentinel, RabbitMQ,
    MinIO/S3, Hyperswitch, Stream server, ClamAV, Coturn,
    Elasticsearch (called out as the v1.0 orphan it is).
  * `/api/v1/health/deep` documented as the canary surface, with a
    sample response shape so operators know what `degraded` looks
    like before they see it.
  * "Adding a new degradation mode" section with the 4-step recipe
    (this file, /health/deep, alert annotation, FAIL-SOFT/FAIL-LOUD
    code comment) so future maintainers keep the docs in sync as
    the surface evolves.

These two files now match the depth of the alert-specific runbooks ;
no more "open the runbook, find 15 lines, panic" path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:13:55 +02:00


Runbook — Incident Response

Owner : whoever holds the on-call pager. Companion docs : GRACEFUL_DEGRADATION.md, ROLLBACK.md, DEPLOYMENT.md. When in doubt : preserve evidence first, mitigate second, fix third. Do NOT restart a service before capturing its logs.

The runbooks under docs/runbooks/ are organised by failure mode (db-failover.md, redis-down.md, rabbitmq-down.md, disk-full.md, api-availability-slo-burn.md, api-latency-slo-burn.md, payment-success-slo-burn.md, cert-expiring-soon.md, SECRET_ROTATION.md). This file is the entry point : how to triage an unknown alert, who to wake up, and which sub-runbook to open next.

The first 5 minutes

When you're paged and don't know what's happening yet :

  1. Acknowledge the alert so the escalation policy doesn't page a second responder for the same incident. PagerDuty / Alertmanager → ack.

  2. Read the alert annotation. Most alert rules carry a runbook_url annotation pointing at the right sub-runbook (verify via config/prometheus/alert_rules.yml ; a grep sketch follows this list). If the alert has a runbook, open it now and follow that doc instead of this one.

  3. Glance at three dashboards in this exact order :

    • Veza API Overview (Grafana) → 5xx rate, p95 latency, RPS
    • Status page (https://status.veza.fr) → which components are red according to synthetic + SLO burn-rate alerts
    • Sentry → veza-backend project → P1/P2 issues in the last 30 min, sorted by frequency
  4. Identify the failure class from this matrix :

    | Symptom | Likely cause | Open runbook |
    |---|---|---|
    | API 5xx > 5% for > 5 min | DB or Redis down | db-failover.md or redis-down.md |
    | API p95 > 2 s, low CPU | DB slow query / Redis cache miss | redis-down.md (cache cold) or pgBadger on the DB |
    | Login fails, sessions invalid | Redis down | redis-down.md |
    | Tracks won't upload but API responds | MinIO / S3 down | no dedicated runbook ; see "Storage failure" below |
    | HLS playback stalls, browser shows 404 | Stream server / transcode worker | rabbitmq-down.md if jobs are piling up |
    | Payment webhook backlog growing | Hyperswitch issue | payment-success-slo-burn.md |
    | "Nothing's broken but everyone's mad" | Graceful-degraded mode | GRACEFUL_DEGRADATION.md |
    | TLS cert expired | LE renewal failed | cert-expiring-soon.md |
    | Disk full alert | Logs / pgBackRest | disk-full.md |
  5. If you can't classify within 5 min, declare an incident in #incidents Slack at severity SEV-2 (assume worse-than-average until proven otherwise), invite the on-call lead, and continue investigating with backup on the line.
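If the alert page lost its annotation, you can pull runbook_url straight from the rules file ; a minimal grep sketch, assuming the annotation sits within a few lines of the alert name in config/prometheus/alert_rules.yml (the alert name here is a hypothetical example) :

# Which runbook does this alert point at ? Replace APIHighErrorRate with the alert that paged.
grep -A8 'alert: APIHighErrorRate' config/prometheus/alert_rules.yml | grep runbook_url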

Severity ladder

| Level | Definition | Response time | Communication |
|---|---|---|---|
| SEV-1 | User-facing data loss OR > 50% of users impacted for > 10 min | 15 min | Status page banner + #incidents + email blast |
| SEV-2 | User-facing degradation, single-component outage, recoverable | 30 min | Status page component yellow + #incidents |
| SEV-3 | Internal / observability degradation, no user impact | Next business day | #engineering ticket |

Default to SEV-2 on first page. Promote to SEV-1 if you confirm data loss or wide impact ; demote to SEV-3 if you confirm no user impact.

Capture evidence before mitigating

Restarting a process loses the in-memory state that diagnoses the incident. Before the first mitigation action :

# Backend API logs (last 500 lines)
docker logs --tail 500 veza_backend_api_blue > /tmp/incident-$(date +%Y%m%d-%H%M%S)-api.log

# Stream server logs
docker logs --tail 500 veza_stream_server_blue > /tmp/incident-$(date +%Y%m%d-%H%M%S)-stream.log

# DB stats snapshot
psql "$DATABASE_URL" -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';" \
  > /tmp/incident-$(date +%Y%m%d-%H%M%S)-pg-activity.txt

# Redis ops snapshot
redis-cli -u "$REDIS_URL" --bigkeys > /tmp/incident-$(date +%Y%m%d-%H%M%S)-redis-keys.txt

# RabbitMQ queue depths
curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/queues \
  | jq '.[] | {name, messages, consumers}' > /tmp/incident-$(date +%Y%m%d-%H%M%S)-rmq-queues.json

Drop those files into s3://veza-incidents/<date>/ after the incident closes — postmortem will reference them.
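A sketch of that upload with the MinIO client ; the veza alias is an assumption (run mc alias set veza … first), and aws s3 cp against the same bucket works if that's what's installed :

# Ship the evidence bundle once the incident closes
mc cp /tmp/incident-* veza/veza-incidents/$(date +%Y-%m-%d)/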

Mitigation patterns

API down (/health 5xx or unreachable)

  1. Confirm scope : single instance or both (blue + green) ? A quick HAProxy check is sketched after this list.
    • One down : HAProxy auto-routes to the healthy one ; symptom should be transient as the failed instance reboots.
    • Both down : full outage. Status page → SEV-1.
  2. Logs first. Check docker logs veza_backend_api_blue and green for panic / fatal traces.
  3. DB connection pool exhausted ? Look for "too many connections" or "connection refused : 5432" in the logs. → db-failover.md.
  4. Restart only after capturing logs.
    docker compose -f docker-compose.prod.yml restart backend-api-blue
    
    Wait 30s for healthcheck before rotating green.
  5. Roll back the image if a regression is suspected (a deploy in the last 1 h correlating with incident start is the tell). → ROLLBACK.md.
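For step 1, ask HAProxy which API backends it considers up ; a sketch against the stats endpoint listed in the Tools section, where the host, port and backend naming are assumptions to adjust :

# Per-server status as HAProxy sees it (CSV field 1 = proxy, 2 = server, 18 = status)
curl -s 'http://haproxy.lxd:8404/stats;csv' | awk -F, '/api/ {print $1, $2, $18}'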

DB down (/health/deep shows db: false)

This is the highest-severity pattern — most user flows fail.

  1. db-failover.md. That runbook has the full pg_auto_failover recipe ; this entry-point doc shouldn't duplicate it.
  2. While failover runs, expect 503 on every API call. The frontend has a "service unavailable" splash and will retry on its own.
  3. Watch pg_auto_failover cluster state via the monitor URL listed in the deep-dive runbook. Promotion takes < 60s if the standby is healthy.
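pg_auto_failover's own CLI shows the same thing from a shell on the monitor node ; a minimal sketch, assuming pg_autoctl is on PATH there and PGDATA points at the monitor's data directory :

# Cluster state refreshed every 2s ; watch the standby reach primary
watch -n 2 pg_autoctl show state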

Storage failure (MinIO / S3)

No dedicated runbook yet (TODO v1.1). Triage :

  1. MinIO drives offline ? Alert MinIODriveOffline fires per config/prometheus/alert_rules.yml:90. EC:2 tolerates 2 drive losses ; investigate within the hour (a drive-status sketch follows this list).
  2. MinIO node unreachable ? Alert MinIONodesUnreachable fires when ≥ 2 nodes are gone — that's a SEV-1, redundancy exhausted.
  3. CDN dropping requests ? Phase-1 cache is local Nginx. Exec into the cache container and curl -I the origin to verify. Bypass the CDN by setting the API's CDN_BASE_URL env to empty if needed.
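For items 1-2, mc admin info summarises drive and node status from the CLI ; a sketch assuming an mc alias named veza is configured for the cluster :

# Online/offline drives and nodes at a glance
mc admin info veza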

Webhook failures (Hyperswitch / Stream server)

  1. payment-success-slo-burn.md for payment webhooks specifically.
  2. For stream server transcode webhooks (/internal/jobs/transcode), check STREAM_SERVER_INTERNAL_API_KEY matches between the API and stream server containers. Mismatch surfaces as 401 in the API logs. A hash-comparison sketch follows this list.
  3. Don't resend webhooks blindly — most are idempotent but some (DMCA propagation, distribution outbox) aren't fully so. Verify the worker's idempotency-key handling before bulk-resending.
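A way to do the step-2 comparison without pasting the secret anywhere ; container names as in the evidence-capture block above, and the sketch assumes both sides use the same variable name :

# Hash the key on each side ; identical hashes mean the 401 is something else
docker exec veza_backend_api_blue printenv STREAM_SERVER_INTERNAL_API_KEY | sha256sum
docker exec veza_stream_server_blue printenv STREAM_SERVER_INTERNAL_API_KEY | sha256sum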

DDoS / abusive traffic

The default rate-limiter caps kick in before the service degrades, but a determined attacker can still saturate connections.

  1. Identify the abusive pattern. Grafana → Veza API Overview → "Top IP by request count last 5 min" panel. If a single IP is ≥ 10× the median, it's likely a probe (a log-based fallback is sketched after this list).
  2. Block at the edge if you have one (Cloudflare, fail2ban on the host) — the rate limiter middleware blocks per request, not per connection, so a flood of legitimate-looking requests can still saturate goroutines.
  3. Tighten the rate limiter by env override if needed (default 60/min). Set the variable where the service reads its environment (docker-compose.prod.yml or its .env) and recreate the container so the running process picks it up :
    # after setting RATE_LIMIT_PER_MINUTE=10 in the service environment
    docker compose -f docker-compose.prod.yml up -d --force-recreate backend-api-blue
  4. Scale horizontally only if the traffic is legitimate.
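If Grafana is unreachable, the step-1 signal can be pulled from the structured logs instead ; a sketch that assumes a client_ip field in the JSON log lines (the field name is an assumption, check one log line first) :

# Top client IPs in the last 5 minutes
docker logs --since 5m veza_backend_api_blue 2>&1 \
  | jq -Rr 'fromjson? | .client_ip // empty' \
  | sort | uniq -c | sort -rn | head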

Performance degradation (high p95 latency, low CPU)

Classic symptom of a slow downstream :

  1. Check p99 latency by route in Grafana : histogram_quantile over a rate() of http_request_duration_seconds_bucket, not over the raw counter (the full query is sketched after this list). If one route is the outlier, that's the lead.
  2. DB queries. Run pgBadger on the last 30 min of slow-query logs : pgbadger /var/log/postgresql/postgresql-slow.log -o /tmp/pgbadger-incident.html.
  3. Cache cold ? After Redis restarts, the next 5-15 min will hit Postgres harder. Expected, transient.
  4. GORM N+1 ? The most common code-side latency cause. Look in the structured logs for repeated identical SELECTs from a single request_id.
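The full per-route query from step 1, runnable against the Prometheus HTTP API ; the prometheus.lxd:9090 host and the route label name are assumptions :

# p99 per route over the last 5 minutes
curl -s 'http://prometheus.lxd:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))' \
  | jq '.data.result[] | {route: .metric.route, p99: .value[1]}'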

After mitigation

  1. Update the status page — flip components back to green.
  2. Comm in #incidents : "RESOLVED. Cause: …. Mitigated by: …. Postmortem incoming."
  3. Schedule the postmortem within 2 business days for SEV-1, 1 week for SEV-2.
  4. File a postmortem doc under docs/postmortems/<date>-<slug>.md with sections : timeline, root cause, what worked, what didn't, action items, runbook updates needed (a skeleton is sketched after this list).
  5. Update this runbook if the incident exposed a class of failure not yet covered above. The matrix in §"The first 5 minutes" is the doc consumers hit first — keep it accurate.
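A sketch to stamp out the item-4 skeleton (the slug value is a hypothetical example ; name it after the incident) :

# New postmortem doc with the required sections
slug=api-5xx-spike
mkdir -p docs/postmortems
cat > "docs/postmortems/$(date +%Y-%m-%d)-${slug}.md" <<EOF
# Postmortem : $(date +%Y-%m-%d) ${slug}

## Timeline
## Root cause
## What worked
## What didn't
## Action items
## Runbook updates needed
EOF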

Tools you'll want fast access to

  • Grafana → Veza API Overview dashboard
  • Tempo → distributed traces
  • Sentry → veza-backend project
  • Status page → https://status.veza.fr
  • HAProxy stats
  • pg_auto_failover monitor
  • RabbitMQ management console → http://rabbitmq.lxd:15672
  • MinIO console

If any of those URLs is broken at incident time, that's the first postmortem action item — operators need bookmarks they can trust.