# Runbook — API latency SLO burn

> **SLO**: 99% of write requests (POST/PUT/PATCH/DELETE) return in < 500ms p95 (monthly window).
> **Alerts**: `APILatencySLOFastBurn` (page) · `APILatencySLOSlowBurn` (ticket)
> **Owner**: backend on-call.

## What tripped me

Writes are taking longer than 500ms p95. The fast-burn alert fires when more than 14.4% of writes are slow over a 1h window (an example expression is sketched in the appendix at the end of this runbook).

## First moves (under 5 minutes)

1. **Identify the slow endpoints**:

   ```promql
   topk(5, histogram_quantile(0.95,
     sum by (path, le) (
       rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[5m])
     )
   ))
   ```

2. **Open the Tempo service-map dashboard** ("Veza Service Map (Tempo)") and check the slow-spans table for the same paths.

## Common causes

| Symptom | Likely cause | Pointer |
| --- | --- | --- |
| Slow on `/api/v1/orders` (POST) | Hyperswitch upstream latency | `payment-success-slo-burn.md` |
| Slow on `/api/v1/tracks` (POST) | S3 multipart pre-sign / commit latency | Check MinIO health |
| Slow across all writes | Postgres lock contention / autovacuum | `db-failover.md` §autovacuum |
| Slow only on one host | One bad replica (CPU starvation, disk) | Drain & investigate |
| Slow + DB pool exhausted in logs | A slow query holding the pool | `db-failover.md` §pool |

## Mitigation

- If Hyperswitch: nothing to do but wait; post a status-page banner.
- If DB lock contention: identify the blocker with `pg_blocking_pids()` (example query in the appendix below), then cancel the offender:

  ```sql
  -- Blunt fallback: cancels every active backend whose transaction has been
  -- running for more than 30 seconds. Prefer confirming the blocking pid
  -- first and cancelling only that one.
  SELECT pg_cancel_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'active'
    AND xact_start < now() - INTERVAL '30 seconds';
  ```

- If a single bad replica: drain it from the LB and investigate offline.

## Recovery

The slow-burn alert can take up to 6h to clear after a fix. Don't silence it — let it ride down.

## Postmortem trigger

Same threshold as the availability runbook: a fast burn that stays firing for more than 15 minutes triggers a postmortem.
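
## Appendix: example queries

A rough sketch of the fast-burn condition described above, assuming the 500ms threshold maps onto an existing `le="0.5"` histogram bucket and that the alert evaluates the raw series directly (the real rule may use recording rules and a second, shorter window):

```promql
# Fraction of write requests slower than 500ms over the last hour;
# 0.144 = 14.4x burn rate against the 1% monthly error budget.
(
  1 - (
    sum(rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE",le="0.5"}[1h]))
    /
    sum(rate(veza_gin_http_request_duration_seconds_count{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[1h]))
  )
) > 0.144
```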
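
To confirm the "slow only on one host" row of the table above, break the write p95 down per target (assumes the default Prometheus `instance` label; adjust if your scrape config relabels it):

```promql
# p95 write latency per backend instance; a single outlier points at a bad replica.
histogram_quantile(0.95,
  sum by (instance, le) (
    rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[5m])
  )
)
```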
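
For the lock-contention mitigation, a minimal blocker lookup using `pg_blocking_pids()` (PostgreSQL 9.6+). Run this before the broad cancel above so you can target only the offending pid:

```sql
-- List blocked sessions and the pids blocking them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```

Cancel the offending pid with `pg_cancel_backend(pid)`, escalating to `pg_terminate_backend(pid)` only if the cancel is ignored.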