
62 lines
3.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Runbook — API availability SLO burn

> **SLO**: 99.5% of GET requests on `/api/v1/*` return non-5xx (monthly window).
> **Alerts**: `APIAvailabilitySLOFastBurn` (page) · `APIAvailabilitySLOSlowBurn` (ticket)
> **Owner**: backend on-call.
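
The alerting rules themselves live in the Prometheus config, not here, but it helps to know their shape when reading the graphs. A hedged sketch of the fast-burn rule (group layout and exact thresholds assumed from the multi-window burn-rate methodology; the metric and labels are the ones used in the queries in this runbook):

```yaml
groups:
  - name: slo-api-availability
    rules:
      - alert: APIAvailabilitySLOFastBurn
        # Fast burn: both the 1h and the 5m window must exceed 14.4x the
        # 0.5% error budget, i.e. an error ratio above 7.2%.
        expr: |
          (
            sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[1h]))
            /
            sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[1h]))
          ) > (14.4 * 0.005)
          and
          (
            sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
            /
            sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
          ) > (14.4 * 0.005)
        labels:
          page: "true"
```

The short 5m window is what makes the alert clear quickly once the bleeding stops; the long 1h window is what keeps a brief blip from paging anyone.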
## What tripped this alert

The 5xx ratio on read endpoints is consuming the monthly error budget faster than steady state allows:

- **Fast burn** (`page=true`): 14.4× over 1h ⇒ entire monthly budget gone in ~2 days.
- **Slow burn** (`page=false`): 6× over 6h ⇒ entire budget gone in ~5 days.
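
The exhaustion figures are plain window arithmetic: at a constant burn rate B, a budget sized for W days lasts W/B days. A quick sketch (assumes a 30-day month):

```python
def days_to_exhaustion(burn_rate: float, window_days: float = 30.0) -> float:
    """At a constant burn rate, the whole error budget lasts window / burn_rate days."""
    return window_days / burn_rate

print(round(days_to_exhaustion(14.4), 1))  # fast burn: 2.1 days
print(round(days_to_exhaustion(6.0), 1))   # slow burn: 5.0 days
```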
## First moves (under 5 minutes)
1. **Confirm the alert is real**, not a metric-pipeline glitch:
```promql
# Live error rate on the GETs we measure:
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
/
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
```
Open Grafana → "Veza API Overview" dashboard, panel "Request rate by path".
2. **Identify the affected endpoint**. The fastest pivot:
```promql
topk(5, sum by (path, status) (
  rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m])
))
```
3. **Drop into traces**. Open the "Veza Service Map (Tempo)" dashboard and filter the slowest-spans table for the offending path. If the failures correlate with one downstream (Redis, Postgres, Hyperswitch), the trace will show it.
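
If you prefer scripting the pivot to clicking through the console, the Prometheus HTTP API (`GET /api/v1/query`) returns instant vectors as JSON. A sketch that ranks offenders from such a response (the inlined payload is a made-up example, not real data):

```python
import json

# Illustrative /api/v1/query response for the topk(5, ...) pivot above;
# a real call would hit http://<prometheus>/api/v1/query?query=<promql>.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"path": "/api/v1/feed", "status": "502"}, "value": [1714000000, "3.2"]},
      {"metric": {"path": "/api/v1/search", "status": "500"}, "value": [1714000000, "0.4"]}
    ]
  }
}
""")

def top_offenders(resp: dict) -> list[tuple[str, str, float]]:
    """Return (path, status, errors/sec) tuples sorted worst-first."""
    rows = [
        (r["metric"]["path"], r["metric"]["status"], float(r["value"][1]))
        for r in resp["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)

print(top_offenders(sample)[0])  # -> ('/api/v1/feed', '502', 3.2)
```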
## Common causes
| Symptom | Likely cause | Fix |
| -------------------------------------------- | ----------------------------------------------------------- | ---------------------------------------------------- |
| 5xx concentrated on `/feed`, `/library` | Postgres slow / connection pool exhausted | See `db-failover.md` — check `pg_auto_failover` state |
| 5xx concentrated on `/search`, `/tracks` | Postgres FTS index churn or autovacuum holding row locks | `SELECT pid, query FROM pg_stat_activity WHERE state='active' ORDER BY xact_start LIMIT 5;` |
| 5xx across all paths, sudden | Pod just restarted / migration broken / DB unreachable | `kubectl get pods -n veza` or `systemctl status veza-backend-api` |
| 5xx slowly climbing | Memory leak; container approaching OOMKill | `kubectl top pod -n veza` and bounce the leaking pod |
| 5xx confined to one instance | Single bad replica (config, certs, networking) | Drain that instance from the load balancer |
## If the page is real
1. **Page the secondary on-call** if the primary fix is going to take > 15 min.
2. **Update the status page** (`status.veza.fr`) with "Investigating elevated error rates."
3. **Post in #incident-response** with the alert link + first hypothesis.
## When to silence
- Confirmed degradation is part of an already-announced maintenance window: silence for the window's duration.
- Single-instance issue and that instance has been drained: silence for 1h.
- Otherwise, **do not silence** — let the alert keep firing until the burn rate drops below threshold naturally.
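
When silencing is justified, doing it from the CLI keeps it auditable. A hedged sketch with `amtool` (the Alertmanager URL and the single-matcher form are illustrative; adjust to your deployment):

```shell
# Silence the drained-replica case for 1h, with author and reason recorded.
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --author="$(whoami)" \
  --comment="bad replica drained from LB, see #incident-response" \
  --duration=1h \
  alertname=APIAvailabilitySLOFastBurn
```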
## Recovery verification
After mitigation, both burn-rate windows must drop below threshold for the alert to clear (1h and 5m for fast burn, 6h and 30m for slow burn). The 6h window means the slow-burn alert can stay green for hours after the issue is fixed — don't be surprised.
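
To verify clearance by hand rather than waiting, evaluate the burn condition over each window yourself. For fast burn, both results must be back under 14.4 × 0.5% = 7.2% (run once with `[1h]` as below, once with `[5m]`):

```promql
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[1h]))
/
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[1h]))
```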
## Postmortem trigger
A page-grade alert that fires for > 15 minutes triggers a postmortem doc (`docs/postmortems/YYYY-MM-DD-<slug>.md`). Include the timeline, the trace IDs, and the metric query screenshots.