Three SLOs with multi-window burn-rate alerts (Google SRE workbook methodology):
* SLO_API_AVAILABILITY: 99.5% on read (GET) endpoints
* SLO_API_LATENCY: 99% of writes, p95 < 500ms
* SLO_PAYMENT_SUCCESS: 99.5% of POST /api/v1/orders -> 2xx
Each SLO has two alerts:
* <name>SLOFastBurn: page-grade, 2% of budget burned in 1h (1h + 5m windows)
* <name>SLOSlowBurn: ticket-grade, 5% of budget burned in 6h (6h + 30m windows)
- config/prometheus/slo.yml: 12 recording rules + 6 alerts; `promtool check rules` => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  plus db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists first moves under 5 minutes plus common causes.
Acceptance (Day 10): `promtool check rules` green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runbook — Payment success SLO burn
SLO: 99.5% of POST /api/v1/orders requests return 2xx (monthly window).
Alerts: PaymentSuccessSLOFastBurn (page) · PaymentSuccessSLOSlowBurn (ticket)
Owner: payments on-call (rotates with backend on-call until v2.0).
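For orientation, a fast-burn rule pair consistent with the thresholds above might look like the sketch below. The recording-rule names and the 30-day SLO window are assumptions, not taken from the repo's slo.yml; with a 30-day (720h) window, burning 2% of budget in 1h is a burn rate of 0.02 × 720 / 1 = 14.4, so the alert threshold is 14.4 × the 0.5% error budget.

```yaml
# Sketch only — rule names, label values, and the 30-day window are assumptions.
groups:
  - name: slo-payment-success-sketch
    rules:
      - record: veza:payment_error_ratio:rate1h
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*",status!~"2.."}[1h]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[1h]))
      - record: veza:payment_error_ratio:rate5m
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*",status!~"2.."}[5m]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[5m]))
      # Multi-window: both the long (1h) and short (5m) windows must exceed
      # 14.4x the 0.5% error budget before paging.
      - alert: PaymentSuccessSLOFastBurn
        expr: >
          veza:payment_error_ratio:rate1h > (14.4 * 0.005)
          and
          veza:payment_error_ratio:rate5m > (14.4 * 0.005)
        labels:
          severity: page
```

The short window exists so the page clears quickly once the error rate recovers, instead of paging for a full hour on stale data.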
Why this is critical
A failing checkout means money lost (charged customer, no license issued) or money taken twice (double-submitted on retry). Worst-case fraud window is the time it takes to roll the upstream change. Treat fast-burn here like a Sev-1 incident.
First moves (under 5 minutes)
1. Confirm the alert:

```promql
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*",status!~"2.."}[5m]))
/
sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[5m]))
```

2. Pivot on status code:

```promql
sum by (status) (rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[5m]))
```

   - Spike in 502/503 → Hyperswitch unreachable. See "Hyperswitch outage" below.
   - Spike in 400 → marketplace validation failing. A new deploy regressed something; check recent commits to internal/core/marketplace/.
   - Spike in 500 → DB / connection / panic. Check logs for stack traces.
3. Trace pivot. Open "Veza Service Map (Tempo)" and filter payment.webhook for status=error on recent spans.
Hyperswitch outage
If Hyperswitch is the upstream culprit:

```shell
# Check Hyperswitch's own status:
curl -fsS https://api.hyperswitch.io/health

# Check the last successful webhook landing:
psql "$DATABASE_URL" -c "
SELECT id, hyperswitch_payment_id, status, payment_status, updated_at
FROM orders
WHERE updated_at > NOW() - INTERVAL '15 minutes'
ORDER BY updated_at DESC LIMIT 10;
"
```
If they're all stuck in payment_status=pending, Hyperswitch is silently dropping our webhooks. Engage their support and queue a manual reconciliation pass once they're back:
```shell
# Manual reconciliation script (still TODO; tracked in W4 day 17):
go run ./cmd/tools/reconcile_orders --since=15m
```
DB / pool exhaustion
If the failures are 500s and the API logs show `pq: too many connections` or `context deadline exceeded`:
- Check pgbouncer queue length:

```shell
psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
```

- If cl_waiting > 0 consistently, a slow query is holding pool slots; see db-failover.md for finding it.
- Last resort: restart the backend pod to drop in-flight requests (loses idempotency on retried requests; only do this if Hyperswitch is not in flight on those orders).
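Finding the offending slow query is covered in db-failover.md; as a quick sketch, it is a standard pg_stat_activity lookup (nothing repo-specific assumed):

```sql
-- Longest-running non-idle queries currently holding connections.
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start
LIMIT 10;
```

Anything with a runtime of minutes while cl_waiting > 0 in pgbouncer is a candidate for `pg_cancel_backend(pid)`.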
Recovery verification
After the fix:

```shell
# Orders created in the last 5 minutes should be `completed` or `pending` (not `failed`):
psql "$DATABASE_URL" -c "
SELECT status, COUNT(*) FROM orders
WHERE created_at > NOW() - INTERVAL '5 minutes'
GROUP BY status;
"
```
The slow-burn window (6h) takes hours to clear after recovery. Don't silence the alert; wait for the metric.
Reconciliation post-incident
Every fast-burn incident requires a reconciliation pass within 24h:
- Pull the list of orders with payment_status='pending' older than 30 minutes.
- For each, query Hyperswitch directly via GET /payments/{payment_id} and update.
- File a postmortem with the count of mismatches resolved.
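The pending-order pull in the first step can be sketched as the query below. Column names are taken from the earlier orders queries; whether the 30-minute cutoff keys off created_at or updated_at is an assumption.

```sql
-- Orders still pending after 30 minutes: candidates for manual reconciliation
-- against GET /payments/{payment_id} on the Hyperswitch side.
SELECT id, hyperswitch_payment_id, status, payment_status, created_at
FROM orders
WHERE payment_status = 'pending'
  AND created_at < NOW() - INTERVAL '30 minutes'
ORDER BY created_at;
```

The row count from this query is the denominator for the "mismatches resolved" figure in the postmortem.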