veza/docs/runbooks/api-availability-slo-burn.md
feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)
Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology):
  * SLO_API_AVAILABILITY  : 99.5% on read (GET) endpoints
  * SLO_API_LATENCY       : 99% writes p95 < 500ms
  * SLO_PAYMENT_SUCCESS   : 99.5% on POST /api/v1/orders -> 2xx

Each SLO has two alerts:
  * <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
  * <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)

- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack
  + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists the first moves for the first 5 minutes plus common causes.

Acceptance (Day 10): promtool check rules is green.

Runbook — API availability SLO burn

SLO: 99.5% of GET requests on /api/v1/* return non-5xx (monthly window).
Alerts: APIAvailabilitySLOFastBurn (page) · APIAvailabilitySLOSlowBurn (ticket).
Owner: backend on-call.

What tripped me

The 5xx ratio on read endpoints is consuming the monthly error budget faster than the steady-state rate allows (the firing conditions are sketched below):

  • Fast burn (page=true): 14.4× over 1h ⇒ the entire monthly budget would be gone in ~2 days.
  • Slow burn (page=false): 6× over 6h ⇒ the entire budget would be gone in ~5 days.
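
The fast-burn firing condition amounts to the shape below, written directly against the raw request metric as a minimal sketch; the deployed rules in config/prometheus/slo.yml go through recording rules, so the exact expressions differ. 0.005 is the error budget of a 99.5% SLO and 14.4 is the fast-burn multiplier.

    # Fast burn fires only when BOTH the 1h and 5m windows exceed 14.4x the budget:
    (
      sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[1h]))
      /
      sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[1h]))
    ) > (14.4 * 0.005)
    and
    (
      sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
      /
      sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
    ) > (14.4 * 0.005)

The slow burn has the same shape with 6 × 0.005 as the threshold, over the 6h and 30m windows.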

First moves (under 5 minutes)

  1. Confirm the alert is real, not a metric-pipeline glitch (a scrape-health check is sketched after this list):

    # Live error rate on the GETs we measure:
    sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
    /
    sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
    

    Open Grafana → "Veza API Overview" dashboard, panel "Request rate by path".

  2. Identify the affected endpoint. The fastest pivot:

    topk(5, sum by (path, status) (
      rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m])
    ))
    
  3. Drop into traces. Open the "Veza Service Map (Tempo)" dashboard and filter the slowest-spans table for the offending path. If the failures correlate with one downstream (Redis, Postgres, Hyperswitch), the trace will show it.
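
Step 1 assumes the metric pipeline itself is healthy. A quick scrape-health check, sketched with the standard Prometheus up series and the same job label as the queries above:

    # Every backend replica should report up == 1; a target at 0 points at a
    # scrape or pipeline problem rather than real 5xx traffic.
    up{job="veza-backend"}

    # If the ratio query returns no data at all, check whether the metric stopped arriving:
    absent(veza_gin_http_requests_total{job="veza-backend"})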

Common causes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 5xx concentrated on /feed, /library | Postgres slow / connection pool exhausted | See db-failover.md — check pg_auto_failover state |
| 5xx concentrated on /search, /tracks | Postgres FTS index churn or autovacuum holding row locks | `SELECT pid, query FROM pg_stat_activity WHERE state='active' ORDER BY xact_start LIMIT 5;` |
| 5xx across all paths, sudden | Pod just restarted / migration broken / DB unreachable | `kubectl get pods -n veza` or `systemctl status veza-backend-api` |
| 5xx slowly climbing | Memory leak; container approaching OOMKill | `kubectl top pod -n veza` and bounce the leaking pod |
| 5xx confined to one instance | Single bad replica (config, certs, networking) | Drain that instance from the load balancer (per-instance query below) |
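
To confirm the single-bad-replica row, break the 5xx rate down per instance; this is a sketch that assumes the standard instance target label is present on the scraped metric:

    # One replica dominating this sum is a strong signal to drain it from the load balancer:
    sum by (instance) (
      rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m])
    )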

If the page is real

  1. Page the secondary on-call if the primary fix is going to take > 15 min.
  2. Update the status page (status.veza.fr) with "Investigating elevated error rates."
  3. Post in #incident-response with the alert link + first hypothesis.

When to silence

  • The degradation is confirmed to be part of an already-announced maintenance window: silence for the window's duration.
  • Single-instance issue and that instance has been drained: silence for 1h.
  • Otherwise, do not silence — let the alert keep firing until the burn rate drops below threshold naturally.

Recovery verification

After mitigation, both burn-rate windows must drop below threshold for the alert to clear (1h and 5m for fast burn, 6h and 30m for slow burn). The 6h window means the slow-burn alert can stay green for hours after the issue is fixed — don't be surprised.
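
To watch the recovery, graph the slow-burn window directly; it should sink back under 6 × 0.005 = 0.03 as the fix takes hold (same metric and labels as the queries above):

    # 6h error ratio on the measured GETs; the slow-burn alert clears once this
    # window (and the 30m one) drop below 0.03.
    sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[6h]))
    /
    sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[6h]))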

Postmortem trigger

A page-grade alert that fires for > 15 minutes triggers a postmortem doc (docs/postmortems/YYYY-MM-DD-<slug>.md). Include the timeline, the trace IDs, and the metric query screenshots.