# Runbook — API availability SLO burn

> **SLO**: 99.5% of GET requests on `/api/v1/*` return non-5xx (monthly window).

> **Alerts**: `APIAvailabilitySLOFastBurn` (page) · `APIAvailabilitySLOSlowBurn` (ticket)

> **Owner**: backend on-call.
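
The two alerts above follow the standard multiwindow, multi-burn-rate pattern. A sketch of what the fast-burn rule might look like (the exact expression, windows, and labels are assumptions, check the actual rule file):

```yaml
# Sketch only — assumes the metric name and thresholds used elsewhere in this runbook.
groups:
  - name: veza-api-slo
    rules:
      - alert: APIAvailabilitySLOFastBurn
        expr: |
          (
            sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[1h]))
              /
            sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[1h]))
          ) > (14.4 * 0.005)
          and
          (
            sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
              /
            sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
          ) > (14.4 * 0.005)
        labels:
          page: "true"
```

The short 5m window is what lets the alert clear quickly once the bleeding stops, while the 1h window keeps it from flapping on brief spikes.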

## What tripped the alert

The 5xx ratio on read endpoints is consuming the monthly error budget faster than the steady-state rate allows:

- **Fast burn** (`page=true`): 14.4× over 1h ⇒ entire monthly budget gone in ~2 days.
- **Slow burn** (`page=false`): 6× over 6h ⇒ entire budget gone in 5 days.
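
As a sanity check on those figures: burn rate is just the observed error ratio divided by the error budget (1 − SLO), and the budget lasts (window ÷ burn rate). A quick sketch with made-up numbers:

```bash
# 72 5xx out of 1000 GETs = 7.2% errors; the budget is 0.5% → 14.4x burn
errors=72; total=1000; slo=0.995
awk -v e="$errors" -v t="$total" -v s="$slo" \
    'BEGIN { printf "burn rate: %.1fx\n", (e/t) / (1 - s) }'
# → burn rate: 14.4x
```

At 14.4× a 30-day budget lasts 30 / 14.4 ≈ 2 days; at 6× it lasts 30 / 6 = 5 days.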

## First moves (under 5 minutes)

1. **Confirm the alert is real**, not a metric-pipeline glitch:

```promql
# Live error rate on the GETs we measure:
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
/
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
```

Open Grafana → "Veza API Overview" dashboard, panel "Request rate by path".

2. **Identify the affected endpoint**. The fastest pivot:

```promql
topk(5, sum by (path, status) (
  rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m])
))
```

3. **Drop into traces**. Open the "Veza Service Map (Tempo)" dashboard and filter the slowest-spans table for the offending path. If the failures correlate with one downstream (Redis, Postgres, Hyperswitch), the trace will show it.
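
If you prefer querying Tempo directly over the dashboard, a TraceQL search along these lines can surface the failing spans (the attribute names depend on our instrumentation; `http.target` here is an assumption):

```traceql
{ span.http.status_code >= 500 && span.http.target =~ "/api/v1/.*" }
```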

## Common causes

| Symptom | Likely cause | Fix |
| -------------------------------------------- | ----------------------------------------------------------- | ---------------------------------------------------- |
| 5xx concentrated on `/feed`, `/library` | Postgres slow / connection pool exhausted | See `db-failover.md` — check `pg_auto_failover` state |
| 5xx concentrated on `/search`, `/tracks` | Postgres FTS index churn or autovacuum holding row locks | `SELECT pid, query FROM pg_stat_activity WHERE state='active' ORDER BY xact_start LIMIT 5;` |
| 5xx across all paths, sudden | Pod just restarted / migration broken / DB unreachable | `kubectl get pods -n veza` or `systemctl status veza-backend-api` |
| 5xx slowly climbing | Memory leak; container approaching OOMKill | `kubectl top pod -n veza` and bounce the leaking pod |
| 5xx confined to one instance | Single bad replica (config, certs, networking) | Drain that instance from the load balancer |

## If the page is real

1. **Page the secondary on-call** if the primary fix is going to take > 15 min.
2. **Update the status page** (`status.veza.fr`) with "Investigating elevated error rates."
3. **Post in #incident-response** with the alert link + first hypothesis.

## When to silence

- The degradation is a confirmed, already-announced maintenance window: silence for the window's duration.
- Single-instance issue and the instance has been drained: silence for 1h.
- Otherwise, **do not silence** — let the alert keep firing until the burn rate drops below threshold naturally.
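
When a silence is warranted, it can be set from the CLI with `amtool` (the Alertmanager URL and matcher below are assumptions; adjust to our setup):

```bash
amtool --alertmanager.url=http://alertmanager:9093 silence add \
  alertname=APIAvailabilitySLOFastBurn \
  --duration=1h \
  --comment="single bad replica drained, see #incident-response"
```

Always set `--comment` so the next on-call knows why the silence exists.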

## Recovery verification

After mitigation, both burn-rate windows must drop below threshold for the alert to clear (1h and 5m for fast burn, 6h and 30m for slow burn). The 6h window means the slow-burn alert can keep firing for hours after the issue is fixed — don't be surprised.
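
To watch the fast-burn condition clear, check each window ad hoc in Prometheus (same expression shape as the alert; the threshold is derived from the SLO above):

```promql
# Fast burn clears when this is below 14.4 * (1 - 0.995) = 0.072
# for BOTH the [1h] window and, swapping the range, the [5m] window:
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[1h]))
/
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[1h]))
```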

## Postmortem trigger

A page-grade alert that fires for > 15 minutes triggers a postmortem doc (`docs/postmortems/YYYY-MM-DD-<slug>.md`). Include the timeline, the trace IDs, and the metric query screenshots.
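
A minimal sketch for scaffolding that doc (the slug and section names here are placeholders, not a mandated template):

```bash
slug="api-availability-burn"                 # placeholder — pick a descriptive slug
f="docs/postmortems/$(date +%F)-${slug}.md"  # date +%F yields the YYYY-MM-DD the path expects
mkdir -p docs/postmortems
printf '# Postmortem: %s\n\n## Timeline\n\n## Trace IDs\n\n## Metric queries\n' "$slug" > "$f"
echo "created $f"
```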