Three SLOs with multi-window burn-rate alerts (Google SRE workbook methodology):
* SLO_API_AVAILABILITY: 99.5% on read (GET) endpoints
* SLO_API_LATENCY: 99% of writes p95 < 500ms
* SLO_PAYMENT_SUCCESS: 99.5% on POST /api/v1/orders -> 2xx
Each SLO has two alerts:
* <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
* <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)
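The FastBurn pattern above can be sketched as a Prometheus alerting rule. A hypothetical example for the latency SLO — the recording-rule names are assumptions, not the repo's actual slo.yml; with a 1% error budget, a 14.4× burn rate over both windows corresponds to 2% of a 30-day budget gone in 1h:

```yaml
- alert: APILatencySLOFastBurn
  expr: |
    slo:api_write_errors:ratio_rate1h > (14.4 * 0.01)
    and
    slo:api_write_errors:ratio_rate5m > (14.4 * 0.01)
  for: 2m
  labels:
    severity: page
```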
- config/prometheus/slo.yml: 12 recording rules + 6 alerts; `promtool check rules` => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md + db-failover, redis-down, disk-full, cert-expiring-soon: one stub per likely page. Each lists first moves under 5 min + common causes.
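The page/ticket split maps onto an Alertmanager routing tree along these lines (a sketch with assumed receiver names, not the repo's actual routes.yml):

```yaml
route:
  receiver: ticket-oncall            # default: Slack only
  routes:
    - matchers: [ 'severity = "page"' ]
      receiver: page-oncall          # Slack + PagerDuty
```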
Acceptance (Day 10): `promtool check rules` green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
47 lines
2.3 KiB
Markdown
# Runbook — API latency SLO burn

> **SLO**: 99% of write requests (POST/PUT/PATCH/DELETE) return in < 500ms p95 (monthly window).
> **Alerts**: `APILatencySLOFastBurn` (page) · `APILatencySLOSlowBurn` (ticket)
> **Owner**: backend on-call.
## What tripped me

Writes are taking longer than 500ms p95. The fast burn fires when > 14.4% of writes are slow over 1h: burning 2% of a 30-day budget in 1h is a 14.4× burn rate (0.02 × 720 h / 1 h), and 14.4 × the 1% error budget = 14.4% of requests.
## First moves (under 5 minutes)

1. **Identify the slow endpoints**:

   ```promql
   topk(5, histogram_quantile(0.95,
     sum by (path, le) (rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[5m]))
   ))
   ```

2. **Open the Tempo service-map dashboard** ("Veza Service Map (Tempo)") and check the slow-spans table for the same paths.
## Common causes

| Symptom | Likely cause | Pointer |
| --- | --- | --- |
| Slow on `/api/v1/orders` (POST) | Hyperswitch upstream latency | `payment-success-slo-burn.md` |
| Slow on `/api/v1/tracks` (POST) | S3 multipart pre-sign / commit latency | Check MinIO health |
| Slow across all writes | Postgres lock contention / autovacuum | `db-failover.md` §autovacuum |
| Slow only on one host | One bad replica (CPU starvation, disk) | Drain & investigate |
| Slow + DB pool exhausted in logs | A slow query holding the pool | `db-failover.md` §pool |
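For the "slow only on one host" case, a per-instance breakdown of the same histogram can confirm a single bad replica (label names assumed to match the metric used in the first-moves query):

```promql
histogram_quantile(0.95,
  sum by (instance, le) (rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[5m]))
)
```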
## Mitigation

- If Hyperswitch: nothing to do but wait + a status-page banner.
- If DB lock contention: find the blockers with `pg_blocking_pids()` and cancel the offender:

  ```sql
  -- Cancel the queries of backends that are blocking others
  SELECT DISTINCT pg_cancel_backend(blocker)
  FROM pg_stat_activity a,
       unnest(pg_blocking_pids(a.pid)) AS blocker;
  ```

- If a single bad replica: drain it from the LB and investigate offline.
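Before cancelling anything in the lock-contention case, it helps to see who is blocking whom. A read-only sketch using stock PostgreSQL views (nothing Veza-specific assumed):

```sql
-- List blocked backends, their blockers, and the blocked query
SELECT a.pid AS blocked_pid,
       pg_blocking_pids(a.pid) AS blocking_pids,
       left(a.query, 60) AS blocked_query
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0;
```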
## Recovery

The slow-burn alert can take up to 6h to clear after a fix. Don't silence it — let it ride down.
## Postmortem trigger

Same threshold as the availability runbook — a fast burn active for > 15 min triggers a postmortem.