veza/docs/runbooks/api-availability-slo-burn.md
feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)
Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology):
  * SLO_API_AVAILABILITY  : 99.5% on read (GET) endpoints
  * SLO_API_LATENCY       : 99% writes p95 < 500ms
  * SLO_PAYMENT_SUCCESS   : 99.5% on POST /api/v1/orders -> 2xx

Each SLO has two alerts:
  * <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
  * <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)

- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack
  + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists the first moves for the first 5 minutes plus common causes.

Acceptance (Day 10): promtool check rules is green.

Runbook — API availability SLO burn

SLO: 99.5% of GET requests on /api/v1/* return non-5xx (monthly window).
Alerts: APIAvailabilitySLOFastBurn (page) · APIAvailabilitySLOSlowBurn (ticket).
Owner: backend on-call.

What tripped me

The 5xx ratio on read endpoints is consuming the monthly error budget faster than the steady-state rate allows (the firing conditions are sketched below):

  • Fast burn (page=true): 14.4× over 1h ⇒ the entire monthly budget would be gone in ~2 days.
  • Slow burn (page=false): 6× over 6h ⇒ the entire budget would be gone in ~5 days.
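
The fast-burn firing condition amounts to the shape below, written directly against the raw request metric as a minimal sketch; the deployed rules in config/prometheus/slo.yml go through recording rules, so the exact expressions differ. 0.005 is the error budget of a 99.5% SLO and 14.4 is the fast-burn multiplier.

    # Fast burn fires only when BOTH the 1h and 5m windows exceed 14.4x the budget:
    (
      sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[1h]))
      /
      sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[1h]))
    ) > (14.4 * 0.005)
    and
    (
      sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
      /
      sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
    ) > (14.4 * 0.005)

The slow burn has the same shape with 6 × 0.005 as the threshold, over the 6h and 30m windows.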

First moves (under 5 minutes)

  1. Confirm the alert is real, not a metric-pipeline glitch (a scrape-health check is sketched after this list):

    # Live error rate on the GETs we measure:
    sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
    /
    sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
    

    Open Grafana → "Veza API Overview" dashboard, panel "Request rate by path".

  2. Identify the affected endpoint. The fastest pivot:

    topk(5, sum by (path, status) (
      rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m])
    ))
    
  3. Drop into traces. Open the "Veza Service Map (Tempo)" dashboard and filter the slowest-spans table for the offending path. If the failures correlate with one downstream (Redis, Postgres, Hyperswitch), the trace will show it.
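
Step 1 assumes the metric pipeline itself is healthy. A quick scrape-health check, sketched with the standard Prometheus up series and the same job label as the queries above:

    # Every backend replica should report up == 1; a target at 0 points at a
    # scrape or pipeline problem rather than real 5xx traffic.
    up{job="veza-backend"}

    # If the ratio query returns no data at all, check whether the metric stopped arriving:
    absent(veza_gin_http_requests_total{job="veza-backend"})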

Common causes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 5xx concentrated on /feed, /library | Postgres slow / connection pool exhausted | See db-failover.md — check pg_auto_failover state |
| 5xx concentrated on /search, /tracks | Postgres FTS index churn or autovacuum holding row locks | `SELECT pid, query FROM pg_stat_activity WHERE state='active' ORDER BY xact_start LIMIT 5;` |
| 5xx across all paths, sudden | Pod just restarted / migration broken / DB unreachable | `kubectl get pods -n veza` or `systemctl status veza-backend-api` |
| 5xx slowly climbing | Memory leak; container approaching OOMKill | `kubectl top pod -n veza` and bounce the leaking pod |
| 5xx confined to one instance | Single bad replica (config, certs, networking) | Drain that instance from the load balancer (per-instance query below) |
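
To confirm the single-bad-replica row, break the 5xx rate down per instance; this is a sketch that assumes the standard instance target label is present on the scraped metric:

    # One replica dominating this sum is a strong signal to drain it from the load balancer:
    sum by (instance) (
      rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m])
    )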

If the page is real

  1. Page the secondary on-call if the primary fix is going to take > 15 min.
  2. Update the status page (status.veza.fr) with "Investigating elevated error rates."
  3. Post in #incident-response with the alert link + first hypothesis.

When to silence

  • The degradation is confirmed to be part of an already-announced maintenance window: silence for the window's duration.
  • Single-instance issue and that instance has been drained: silence for 1h.
  • Otherwise, do not silence — let the alert keep firing until the burn rate drops below threshold naturally.

Recovery verification

After mitigation, both burn-rate windows must drop below threshold for the alert to clear (1h and 5m for fast burn, 6h and 30m for slow burn). The 6h window means the slow-burn alert can stay green for hours after the issue is fixed — don't be surprised.
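
To watch the recovery, graph the slow-burn window directly; it should sink back under 6 × 0.005 = 0.03 as the fix takes hold (same metric and labels as the queries above):

    # 6h error ratio on the measured GETs; the slow-burn alert clears once this
    # window (and the 30m one) drop below 0.03.
    sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[6h]))
    /
    sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[6h]))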

Postmortem trigger

A page-grade alert that fires for > 15 minutes triggers a postmortem doc (docs/postmortems/YYYY-MM-DD-<slug>.md). Include the timeline, the trace IDs, and the metric query screenshots.