# Runbook — API availability SLO burn
> **SLO**: 99.5% of GET requests on `/api/v1/*` return non-5xx (monthly window).
> **Alerts**: `APIAvailabilitySLOFastBurn` (page) · `APIAvailabilitySLOSlowBurn` (ticket)
> **Owner**: backend on-call.
## What tripped me
The 5xx ratio on read endpoints is consuming the monthly error budget faster than the steady-state rate allows (the exact conditions are sketched below):
- **Fast burn** (`page=true`): 14.4× over 1h ⇒ the entire monthly budget gone in ~2 days.
- **Slow burn** (`page=false`): 6× over 6h ⇒ the entire budget gone in ~5 days.
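For reference, a minimal sketch of what the fast-burn condition evaluates, assuming the alert rule uses the same metric and labels as the queries below and the 1h/5m window pair from the recovery section; the actual rule files are authoritative:
```promql
# Fast burn fires only while BOTH windows exceed 14.4 × (1 − 0.995) = 0.072:
(
  sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[1h]))
  /
  sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[1h]))
) > (14.4 * (1 - 0.995))
and
(
  sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
  /
  sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
) > (14.4 * (1 - 0.995))
```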
## First moves (under 5 minutes)
1. **Confirm the alert is real**, not a metric-pipeline glitch:
```promql
# Live error rate on the GETs we measure:
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
/
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
```
Open Grafana → "Veza API Overview" dashboard, panel "Request rate by path".
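If the raw query and the dashboard disagree, rule out a scrape gap before chasing a real outage. A minimal check, assuming the same `job` label as above:
```promql
# Expect 1 for every backend target; 0 or an empty result points at the metric pipeline, not the API:
up{job="veza-backend"}
```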
2. **Identify the affected endpoint**. The fastest pivot:
```promql
topk(5, sum by (path, status) (
  rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m])
))
```
3. **Drop into traces**. Open the "Veza Service Map (Tempo)" dashboard and filter the slowest-spans table for the offending path. If the failures correlate with one downstream (Redis, Postgres, Hyperswitch), the trace will show it.
## Common causes
| Symptom | Likely cause | Fix |
| -------------------------------------------- | ----------------------------------------------------------- | ---------------------------------------------------- |
| 5xx concentrated on `/feed`, `/library` | Postgres slow / connection pool exhausted | See `db-failover.md` — check `pg_auto_failover` state |
| 5xx concentrated on `/search`, `/tracks` | Postgres FTS index churn or autovacuum holding row locks | `SELECT pid, query FROM pg_stat_activity WHERE state='active' ORDER BY xact_start LIMIT 5;` |
| 5xx across all paths, sudden | Pod just restarted / migration broken / DB unreachable | `kubectl get pods -n veza` or `systemctl status veza-backend-api` |
| 5xx slowly climbing | Memory leak; container approaching OOMKill | `kubectl top pod -n veza` and bounce the leaking pod |
| 5xx confined to one instance | Single bad replica (config, certs, networking) | Drain that instance from the load balancer |
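When the first two rows are the suspects, a quick look at Postgres connection pressure narrows it down. An illustrative query (run it on the primary; what counts as "too many" depends on the configured pool size):
```sql
-- Connections per state: a pile-up of 'active' or 'idle in transaction' suggests
-- pool exhaustion or lock contention rather than an application bug.
SELECT state, count(*) AS connections
FROM pg_stat_activity
GROUP BY state
ORDER BY connections DESC;
```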
## If the page is real
1. **Page the secondary on-call** if the primary fix is going to take > 15 min.
2. **Update the status page** (`status.veza.fr`) with "Investigating elevated error rates."
3. **Post in #incident-response** with the alert link + first hypothesis.
## When to silence
- The degradation is part of an already-announced maintenance window: silence for the window's duration.
- Single-instance issue and that instance has been drained: silence for 1h (see the `amtool` sketch below).
- Otherwise, **do not silence** — let the alert keep firing until the burn rate drops below threshold naturally.
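For the single-instance case, a minimal silence sketch, assuming `amtool` is pointed at the right Alertmanager and that the alert carries an `instance` label (the label name and placeholder value are assumptions; check the labels on the firing alert):
```bash
# 1h silence scoped to the drained replica; it expires on its own.
amtool silence add \
  alertname="APIAvailabilitySLOFastBurn" \
  instance="<drained-instance>" \
  --duration="1h" \
  --comment="replica drained from the LB, tracking in #incident-response"
```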
## Recovery verification
After mitigation, both burn-rate windows must drop below threshold for the alert to clear (1h and 5m for fast burn, 6h and 30m for slow burn). The 6h window means the slow-burn alert can take hours to clear after the issue is fixed — don't be surprised.
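To watch the slow-burn side recover directly rather than waiting on the alert, the same ratio can be checked over the long window. A sketch reusing the metric from step 1; the value must fall below 6 × (1 − 0.995) = 0.03 (repeat with `[30m]` for the short window):
```promql
# 6h error ratio: the window that keeps the slow-burn alert latched after a fix.
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[6h]))
/
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[6h]))
```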
## Postmortem trigger
A page-grade alert that fires for > 15 minutes triggers a postmortem doc (`docs/postmortems/YYYY-MM-DD-<slug>.md`). Include the timeline, the trace IDs, and the metric query screenshots.