# Runbook — Incident Response

> **Owner**: whoever holds the on-call pager.
> **Companion docs**: [GRACEFUL_DEGRADATION.md](GRACEFUL_DEGRADATION.md),
> [ROLLBACK.md](ROLLBACK.md), [DEPLOYMENT.md](DEPLOYMENT.md).
> **When in doubt**: preserve evidence first, mitigate second, fix third.
> Do NOT restart a service before capturing its logs.

The runbooks under `docs/runbooks/` are organised by failure mode
(`db-failover.md`, `redis-down.md`, `rabbitmq-down.md`, `disk-full.md`,
`api-availability-slo-burn.md`, `api-latency-slo-burn.md`,
`payment-success-slo-burn.md`, `cert-expiring-soon.md`, `SECRET_ROTATION.md`).
This file is the **entry point**: how to triage an unknown alert, who to wake
up, and which sub-runbook to open next.

## The first 5 minutes

When you're paged and don't know what's happening yet:

1. **Acknowledge the alert** so a second pager doesn't fire on the same
   incident. PagerDuty / Alertmanager → ack.
2. **Read the alert annotation.** Most alert rules carry a `runbook_url`
   pointing at the right sub-runbook (verify via
   `config/prometheus/alert_rules.yml`). If the alert has a runbook, open it
   now and follow that doc instead of this one.
3. **Glance at three dashboards, in this exact order:**
   - **Veza API Overview** (Grafana) → 5xx rate, p95 latency, RPS
   - **Status page** (https://status.veza.fr) → which components are red
     according to synthetic checks + SLO burn-rate alerts
   - **Sentry → veza-backend project** → P1/P2 issues in the last 30 min,
     sorted by frequency
4. **Identify the failure class** from this matrix:

   | Symptom                                 | Likely cause                     | Open runbook                                         |
   | --------------------------------------- | -------------------------------- | ---------------------------------------------------- |
   | API 5xx > 5% for > 5 min                | DB or Redis down                 | `db-failover.md` or `redis-down.md`                  |
   | API p95 > 2 s, low CPU                  | DB slow query / Redis cache miss | `redis-down.md` (cache cold) or DB pgBadger          |
   | Login fails, sessions invalid           | Redis down                       | `redis-down.md`                                      |
   | Tracks won't upload but API responds    | MinIO / S3 down                  | (no dedicated runbook; see "Storage failure" below)  |
   | HLS playback stalls, browser shows 404  | Stream server / transcode worker | `rabbitmq-down.md` if jobs are piling up             |
   | Payment webhook backlog growing         | Hyperswitch issue                | `payment-success-slo-burn.md`                        |
   | "Nothing's broken but everyone's mad"   | Graceful-degraded mode           | `GRACEFUL_DEGRADATION.md`                            |
   | TLS cert expired                        | Let's Encrypt renewal failed     | `cert-expiring-soon.md`                              |
   | Disk full alert                         | Logs / pgBackRest                | `disk-full.md`                                       |

5. **If you can't classify within 5 min,** declare an incident in the
   `#incidents` Slack channel at severity SEV-2 (assume worse-than-average
   until proven otherwise), invite the on-call lead, and keep investigating
   with backup.

## Severity ladder

| Level | Definition                                                     | Response time     | Communication                                  |
| ----- | -------------------------------------------------------------- | ----------------- | ---------------------------------------------- |
| SEV-1 | User-facing data loss OR > 50% of users impacted for > 10 min  | 15 min            | Status page banner + #incidents + email blast  |
| SEV-2 | User-facing degradation, single-component outage, recoverable  | 30 min            | Status page component yellow + #incidents      |
| SEV-3 | Internal / observability degradation, no user impact           | Next business day | #engineering ticket                            |

Default to SEV-2 on the first page. Promote to SEV-1 if you confirm data loss
or wide impact; demote to SEV-3 if you confirm there is no user impact.
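One quick way to sanity-check the "> 50% of users impacted" bar is to query
Prometheus directly instead of eyeballing a dashboard. A minimal sketch,
assuming Prometheus answers at `prometheus.veza.fr:9090` (a hypothetical host;
this doc only pins Grafana's URL) and that the API exports an
`http_requests_total` counter with a `status` label (unverified; the only
metric this doc names is `http_request_duration_seconds_bucket`):

```bash
# Hedged sketch: overall 5xx ratio over the last 10 min, straight from Prometheus.
# ASSUMPTIONS: Prometheus host and the http_requests_total metric name / status
# label are guesses — verify both against config/prometheus/ before trusting.
PROM="http://prometheus.veza.fr:9090"
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[10m])) / sum(rate(http_requests_total[10m]))' \
  | jq -r '.data.result[0].value[1]'
# A value above 0.5 sustained for 10 min maps to the SEV-1 row above;
# treat it as a proxy for "% of users impacted", not proof.
```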
## Capture evidence before mitigating

Restarting a process loses the in-memory state that diagnoses the incident.
Before the first mitigation action:

```bash
# Backend API logs (last 500 lines)
docker logs --tail 500 veza_backend_api_blue > /tmp/incident-$(date +%Y%m%d-%H%M%S)-api.log

# Stream server logs
docker logs --tail 500 veza_stream_server_blue > /tmp/incident-$(date +%Y%m%d-%H%M%S)-stream.log

# DB stats snapshot
psql "$DATABASE_URL" -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';" \
  > /tmp/incident-$(date +%Y%m%d-%H%M%S)-pg-activity.txt

# Redis ops snapshot
redis-cli -u "$REDIS_URL" --bigkeys > /tmp/incident-$(date +%Y%m%d-%H%M%S)-redis-keys.txt

# RabbitMQ queue depths
curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/queues \
  | jq '.[] | {name, messages, consumers}' > /tmp/incident-$(date +%Y%m%d-%H%M%S)-rmq-queues.json
```

Drop those files into `s3://veza-incidents//` after the incident closes — the
postmortem will reference them.

## Mitigation patterns

### API down (`/health` returns 5xx or is unreachable)

1. **Confirm scope**: single instance or both (blue + green)?
   - One down: HAProxy auto-routes to the healthy one; the symptom should be
     transient while the failed instance reboots.
   - Both down: full outage. Status page → SEV-1.
2. **Logs first.** Check `docker logs veza_backend_api_blue` (and green) for
   panic / fatal traces.
3. **DB connection pool exhausted?** Look for "too many connections" or
   "connection refused: 5432" in the logs. → `db-failover.md`.
4. **Restart only after capturing logs.**

   ```bash
   docker compose -f docker-compose.prod.yml restart backend-api-blue
   ```

   Wait 30 s for the healthcheck before rotating green.
5. **Roll back the image if a regression is suspected** (a deploy in the last
   hour correlating with incident start). → [ROLLBACK.md](ROLLBACK.md).

### DB down (`/health/deep` shows `db: false`)

This is the highest-severity pattern — most user flows fail.

1. → `db-failover.md`. That runbook has the full pg_auto_failover recipe;
   this entry-point doc shouldn't duplicate it.
2. While failover runs, expect 503 on every API call. The frontend shows a
   "service unavailable" splash and retries on its own.
3. Watch the `pg_auto_failover` cluster state via the monitor URL listed in
   the deep-dive runbook. Promotion takes < 60 s if the standby is healthy.

### Storage failure (MinIO / S3)

No dedicated runbook yet (TODO v1.1). Triage:

1. **MinIO drives offline?** Alert `MinIODriveOffline` fires per
   `config/prometheus/alert_rules.yml:90`. EC:2 tolerates 2 drive losses;
   investigate within the hour.
2. **MinIO node unreachable?** Alert `MinIONodesUnreachable` fires when ≥ 2
   nodes are gone — that's a SEV-1, redundancy exhausted.
3. **CDN dropping requests?** The phase-1 cache is local Nginx. SSH into the
   cache container and `curl -I` the origin to verify. Bypass the CDN by
   setting the API's `CDN_BASE_URL` env var to empty if needed.

### Webhook failures (Hyperswitch / stream server)

1. `payment-success-slo-burn.md` for payment webhooks specifically.
2. For stream server transcode webhooks (`/internal/jobs/transcode`), check
   that `STREAM_SERVER_INTERNAL_API_KEY` matches between the API and stream
   server containers (a digest comparison is sketched after this list). A
   mismatch surfaces as 401s in the API logs.
3. Don't resend webhooks blindly — most handlers are idempotent, but some
   (DMCA propagation, distribution outbox) aren't fully so. Verify the
   worker's idempotency-key handling before bulk-resending.
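Comparing digests lets you verify the shared key without printing the secret
into the terminal. A minimal sketch, assuming the stream server's compose
service is named `stream-server-blue` (a guess mirroring the
`backend-api-blue` naming used above; verify in `docker-compose.prod.yml`):

```bash
# Hedged sketch: compare STREAM_SERVER_INTERNAL_API_KEY on both sides by digest
# so the secret never lands in the terminal or shell history.
# ASSUMPTION: the stream server's compose service name is stream-server-blue.
docker compose -f docker-compose.prod.yml exec backend-api-blue \
  sh -c 'printf %s "$STREAM_SERVER_INTERNAL_API_KEY" | sha256sum'
docker compose -f docker-compose.prod.yml exec stream-server-blue \
  sh -c 'printf %s "$STREAM_SERVER_INTERNAL_API_KEY" | sha256sum'
# Identical digests: the 401s have another cause. Different: re-sync the key
# (see SECRET_ROTATION.md).
```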
### DDoS / abusive traffic

The default rate-limiter caps are hit before degradation sets in, but a
determined attacker can still saturate connections.

1. **Identify the abusive pattern.** Grafana → Veza API Overview → "Top IP by
   request count last 5 min" panel. If a single IP sits at ≥ 10× the median,
   it's likely a probe.
2. **Block at the edge** if you have one (Cloudflare, fail2ban on the host) —
   the rate-limiter middleware blocks per request, not per connection, so a
   flood of legitimate-looking requests can still saturate goroutines.
3. **Tighten the rate limiter** by env override if needed:

   ```bash
   docker compose -f docker-compose.prod.yml exec backend-api-blue \
     env RATE_LIMIT_PER_MINUTE=10 ./bin/api &
   ```

   (The default is 60/min.) Restart after the change.
4. **Scale horizontally** only if the traffic is legitimate.

### Performance degradation (high p95 latency, low CPU)

The classic symptom of a slow downstream:

1. **Check `histogram_quantile(0.99, http_request_duration_seconds_bucket)`**
   by route in Grafana. If one route is the outlier, that's the lead.
2. **DB queries.** Run `pgBadger` on the last 30 min of slow-query logs:
   `pgbadger /var/log/postgresql/postgresql-slow.log -o /tmp/pgbadger-incident.html`.
3. **Cache cold?** After a Redis restart, the next 5-15 min hit Postgres
   harder. Expected, transient.
4. **GORM N+1?** The most common code-side latency cause. Look in the
   structured logs for repeated identical SELECTs from a single request_id.

## After mitigation

1. **Update the status page** — flip components back to green.
2. **Communicate in #incidents**: "RESOLVED. Cause: …. Mitigated by: ….
   Postmortem incoming."
3. **Schedule the postmortem** within 2 business days for SEV-1, within 1
   week for SEV-2.
4. **File a postmortem doc** under `docs/postmortems/-.md` with sections:
   timeline, root cause, what worked, what didn't, action items, runbook
   updates needed.
5. **Update this runbook** if the incident exposed a class of failure not yet
   covered above. The matrix in §"The first 5 minutes" is the first thing
   consumers of this doc hit — keep it accurate.

## Tools you'll want fast access to

- Grafana: https://grafana.veza.fr (bookmark the "Veza API Overview"
  dashboard URL directly)
- Tempo trace search: same Grafana, "Explore" → "Tempo" data source. Useful
  when a latency spike has no obvious DB / Redis cause.
- Sentry: https://sentry.io/veza/veza-backend
- Status page: https://status.veza.fr (admin URL behind SSO)
- HAProxy stats: http://haproxy.veza.fr:8404/stats (prod LB)
- pg_auto_failover monitor: see the `infra/ansible/roles/postgres_ha/` README
  for the exact host:port (it rotates when the monitor fails over)
- RabbitMQ management: http://rabbitmq.lxd:15672 (basic auth)
- MinIO console: http://minio.veza.fr:9001 (admin SSO)

If any of these URLs is broken at incident time, that's the first postmortem
action item — operators need bookmarks they can trust. A quick pre-shift
check is sketched below.
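A minimal sketch of that pre-shift check, using only the URLs listed above.
Endpoints behind basic auth or SSO will answer 401 or a redirect, which still
counts as alive; only timeouts (reported as `000`) and 5xx go on the fix list:

```bash
# Hedged sketch: verify every bookmark answers before your on-call shift.
# curl prints the HTTP status code; 000 means unreachable or timed out.
for url in \
  https://grafana.veza.fr \
  https://status.veza.fr \
  http://haproxy.veza.fr:8404/stats \
  http://rabbitmq.lxd:15672 \
  http://minio.veza.fr:9001
do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  printf '%-40s %s\n' "$url" "$code"
done
```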