Three SLOs with multi-window burn-rate alerts (Google SRE workbook methodology):
* SLO_API_AVAILABILITY: 99.5% on read (GET) endpoints
* SLO_API_LATENCY: 99% of writes p95 < 500ms
* SLO_PAYMENT_SUCCESS: 99.5% on POST /api/v1/orders -> 2xx
Each SLO has two alerts:
* <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
* <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)
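The FastBurn pattern above can be sketched as a Prometheus alerting rule. A hypothetical example for the latency SLO — the recording-rule names are assumptions, not the repo's actual slo.yml; with a 1% error budget, a 14.4× burn rate over both windows corresponds to 2% of a 30-day budget gone in 1h:

```yaml
- alert: APILatencySLOFastBurn
  expr: |
    slo:api_write_errors:ratio_rate1h > (14.4 * 0.01)
    and
    slo:api_write_errors:ratio_rate5m > (14.4 * 0.01)
  for: 2m
  labels:
    severity: page
```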
- config/prometheus/slo.yml: 12 recording rules + 6 alerts; `promtool check rules` => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md + db-failover, redis-down, disk-full, cert-expiring-soon: one stub per likely page. Each lists first moves under 5 min + common causes.
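The page/ticket split maps onto an Alertmanager routing tree along these lines (a sketch with assumed receiver names, not the repo's actual routes.yml):

```yaml
route:
  receiver: ticket-oncall            # default: Slack only
  routes:
    - matchers: [ 'severity = "page"' ]
      receiver: page-oncall          # Slack + PagerDuty
```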
Acceptance (Day 10): `promtool check rules` green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
47 lines
2.3 KiB
Markdown
# Runbook — API latency SLO burn

> **SLO**: 99% of write requests (POST/PUT/PATCH/DELETE) return in < 500ms p95 (monthly window).
> **Alerts**: `APILatencySLOFastBurn` (page) · `APILatencySLOSlowBurn` (ticket)
> **Owner**: backend on-call.
## What tripped me

Writes are taking longer than 500ms p95. The fast burn fires when > 14.4% of writes are slow over 1h: burning 2% of a 30-day budget in 1h is a 14.4× burn rate (0.02 × 720 h / 1 h), and 14.4 × the 1% error budget = 14.4% of requests.
## First moves (under 5 minutes)

1. **Identify the slow endpoints**:

   ```promql
   topk(5, histogram_quantile(0.95,
     sum by (path, le) (rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[5m]))
   ))
   ```

2. **Open the Tempo service-map dashboard** ("Veza Service Map (Tempo)") and check the slow-spans table for the same paths.
## Common causes

| Symptom | Likely cause | Pointer |
| --- | --- | --- |
| Slow on `/api/v1/orders` (POST) | Hyperswitch upstream latency | `payment-success-slo-burn.md` |
| Slow on `/api/v1/tracks` (POST) | S3 multipart pre-sign / commit latency | Check MinIO health |
| Slow across all writes | Postgres lock contention / autovacuum | `db-failover.md` §autovacuum |
| Slow only on one host | One bad replica (CPU starvation, disk) | Drain & investigate |
| Slow + DB pool exhausted in logs | A slow query holding the pool | `db-failover.md` §pool |
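For the "slow only on one host" case, a per-instance breakdown of the same histogram can confirm a single bad replica (label names assumed to match the metric used in the first-moves query):

```promql
histogram_quantile(0.95,
  sum by (instance, le) (rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[5m]))
)
```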
## Mitigation

- If Hyperswitch: nothing to do but wait + a status-page banner.
- If DB lock contention: find the blockers with `pg_blocking_pids()` and cancel the offender:

  ```sql
  -- Cancel the queries of backends that are blocking others
  SELECT DISTINCT pg_cancel_backend(blocker)
  FROM pg_stat_activity a,
       unnest(pg_blocking_pids(a.pid)) AS blocker;
  ```

- If a single bad replica: drain it from the LB and investigate offline.
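Before cancelling anything in the lock-contention case, it helps to see who is blocking whom. A read-only sketch using stock PostgreSQL views (nothing Veza-specific assumed):

```sql
-- List blocked backends, their blockers, and the blocked query
SELECT a.pid AS blocked_pid,
       pg_blocking_pids(a.pid) AS blocking_pids,
       left(a.query, 60) AS blocked_query
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0;
```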
## Recovery

The slow-burn alert can take up to 6h to clear after a fix. Don't silence it — let it ride down.
## Postmortem trigger

Same threshold as the availability runbook — a fast burn active for > 15 min triggers a postmortem.