Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology):
* SLO_API_AVAILABILITY: 99.5% on read (GET) endpoints
* SLO_API_LATENCY: 99% writes p95 < 500ms
* SLO_PAYMENT_SUCCESS: 99.5% on POST /api/v1/orders -> 2xx
Each SLO has two alerts:
* <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
* <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)
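For reference, a fast-burn pair in this methodology has roughly the shape below. This is a minimal, illustrative sketch only: the recording-rule names, the `severity` label, and the exact expressions are assumptions, and the authoritative rules live in config/prometheus/slo.yml.

```yaml
# Sketch of one multi-window fast-burn alert (availability SLO, 99.5% => 0.5% budget).
groups:
  - name: slo-api-availability-sketch
    rules:
      # 5xx ratio on measured GETs over the long (1h) window.
      - record: slo:api_availability:error_ratio_1h
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[1h]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[1h]))
      # Same ratio over the short (5m) window, so the alert clears quickly once fixed.
      - record: slo:api_availability:error_ratio_5m
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
      # Fast burn: 14.4x the allowed error rate on BOTH windows => 2% of budget in 1h => page.
      - alert: APIAvailabilitySLOFastBurn
        expr: |
          slo:api_availability:error_ratio_1h > (14.4 * 0.005)
          and
          slo:api_availability:error_ratio_5m > (14.4 * 0.005)
        labels:
          severity: page
        annotations:
          summary: "API availability SLO fast burn: 2% of monthly error budget consumed in 1h"
```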
- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack
  + PagerDuty) from ticket-oncall (Slack only); see the sketch after this list.
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists first moves (under 5 min) + common causes.
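A routing tree with that split could look like the sketch below. Receiver names, channel names, and matcher labels are illustrative assumptions; the actual tree is in config/alertmanager/routes.yml.

```yaml
# Illustrative Alertmanager routing sketch, not the real routes.yml.
route:
  receiver: ticket-oncall            # default: everything becomes a ticket (Slack only)
  routes:
    - matchers:
        - severity="page"            # page-grade burns go to Slack + PagerDuty
      receiver: page-oncall
receivers:
  - name: page-oncall
    slack_configs:
      - channel: "#oncall-pages"     # assumes a global slack_api_url is configured
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: ticket-oncall
    slack_configs:
      - channel: "#oncall-tickets"
```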
Acceptance (Day 10): promtool check rules green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Runbook — API availability SLO burn
SLO: 99.5% of GET requests on `/api/v1/*` return non-5xx (monthly window).
Alerts: `APIAvailabilitySLOFastBurn` (page) · `APIAvailabilitySLOSlowBurn` (ticket).
Owner: backend on-call.
## What tripped me
The 5xx ratio on read endpoints is consuming the monthly error budget faster than the steady-state rate allows:

- Fast burn (`page=true`): 14.4× over 1h ⇒ entire monthly budget gone in ~2 days (30 days / 14.4).
- Slow burn (`page=false`): 6× over 6h ⇒ entire budget gone in ~5 days (30 days / 6).
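To see the current burn rate as one number, divide the measured error ratio by the budget (0.5% for a 99.5% SLO). A minimal sketch using the same metric as the queries below; the fast-burn alert pages when this exceeds 14.4 on both the 1h and 5m windows:

```promql
# 1h error ratio / error budget (0.005) = burn-rate multiple
(
  sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[1h]))
  /
  sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[1h]))
) / 0.005
```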
## First moves (under 5 minutes)
- Confirm the alert is real, not a metric-pipeline glitch:

  ```promql
  # Live error rate on the GETs we measure
  sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
    /
  sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
  ```

  Open Grafana → "Veza API Overview" dashboard, panel "Request rate by path".

- Identify the affected endpoint. The fastest pivot:

  ```promql
  topk(5, sum by (path, status) (
    rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m])
  ))
  ```

- Drop into traces. Open the "Veza Service Map (Tempo)" dashboard and filter the slowest-spans table for the offending path. If the failures correlate with one downstream (Redis, Postgres, Hyperswitch), the trace will show it.
## Common causes
| Symptom | Likely cause | Fix |
|---|---|---|
| 5xx concentrated on /feed, /library | Postgres slow / connection pool exhausted | See db-failover.md; check pg_auto_failover state |
| 5xx concentrated on /search, /tracks | Postgres FTS index churn or autovacuum holding row locks | `SELECT pid, query FROM pg_stat_activity WHERE state='active' ORDER BY xact_start LIMIT 5;` |
| 5xx across all paths, sudden | Pod just restarted / migration broken / DB unreachable | `kubectl get pods -n veza` or `systemctl status veza-backend-api` |
| 5xx slowly climbing | Memory leak; container approaching OOMKill | `kubectl top pod -n veza` and bounce the leaking pod |
| 5xx confined to one instance | Single bad replica (config, certs, networking) | Drain that instance from the load balancer |
## If the page is real
- Page the secondary on-call if the primary fix is going to take > 15 min.
- Update the status page (status.veza.fr) with "Investigating elevated error rates."
- Post in #incident-response with the alert link + first hypothesis.
## When to silence
- Confirmed degradation matches a maintenance window that was already announced: silence for the duration of that window.
- Single-instance issue and that instance has been drained: silence for 1h.
- Otherwise, do not silence — let the alert keep firing until the burn rate drops below threshold naturally.
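If a silence is warranted, it can be placed from the CLI. A sketch, assuming amtool is pointed at our Alertmanager (the URL, author, and comment below are illustrative):

```bash
# Silence the fast-burn page for 1h after draining the bad instance.
amtool silence add \
  alertname="APIAvailabilitySLOFastBurn" \
  --duration="1h" \
  --author="$(whoami)" \
  --comment="single bad replica drained, see #incident-response" \
  --alertmanager.url="http://alertmanager.monitoring:9093"   # replace with the real endpoint
```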
## Recovery verification
After mitigation, both burn-rate windows must drop below threshold for the alert to clear (1h and 5m for fast burn, 6h and 30m for slow burn). The 6h window means the slow-burn alert can keep firing for hours after the issue is fixed; don't be surprised.
## Postmortem trigger
A page-grade alert that fires for > 15 minutes triggers a postmortem doc (docs/postmortems/YYYY-MM-DD-<slug>.md). Include the timeline, the trace IDs, and the metric query screenshots.