veza/docs/runbooks/payment-success-slo-burn.md
feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)
Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology):
  * SLO_API_AVAILABILITY  : 99.5% on read (GET) endpoints
  * SLO_API_LATENCY       : 99% writes p95 < 500ms
  * SLO_PAYMENT_SUCCESS   : 99.5% on POST /api/v1/orders -> 2xx

Each SLO has two alerts:
  * <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
  * <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)

- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack
  + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists first moves under 5min + common causes.

Acceptance (Day 10): promtool check rules comes back green.
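
Both acceptance checks can be replayed locally from the repo root. A minimal sketch; it assumes routes.yml is a complete Alertmanager config and that the routing tree splits on a severity=page label (verify against the actual matchers):

# Rules file (expected: "SUCCESS: 18 rules found"):
promtool check rules config/prometheus/slo.yml

# Routing tree: validate, then dry-run which receiver a page-grade alert lands on.
# severity=page is an assumption; use the labels routes.yml actually matches on.
amtool check-config config/alertmanager/routes.yml
amtool config routes test --config.file=config/alertmanager/routes.yml severity=page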

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Runbook — Payment success SLO burn

SLO: 99.5% of POST /api/v1/orders return 2xx (monthly window). Alerts: PaymentSuccessSLOFastBurn (page) · PaymentSuccessSLOSlowBurn (ticket). Owner: payments on-call (rotates with backend on-call until v2.0).
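
For orientation: per the SRE-workbook methodology cited in the commit, the fast burn pages when both a long and a short window exceed 14.4x the error budget (2% of a 30-day budget in 1h on a 0.5% budget gives 0.02 x 720 / 1 = 14.4). The rule names below are illustrative, not the actual names in config/prometheus/slo.yml:

# Hypothetical recording-rule names; check config/prometheus/slo.yml for the real ones.
# Fast burn: both windows above 14.4 x the 0.5% budget (= 7.2% error ratio).
(slo:payment_success:error_ratio_1h > 14.4 * 0.005)
and
(slo:payment_success:error_ratio_5m > 14.4 * 0.005)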

Why this is critical

A failing checkout means money lost (a charged customer with no license issued) or money taken twice (a double-submitted retry). The worst-case fraud window is the time it takes to roll back the upstream change. Treat a fast burn here like a Sev-1 incident.

First moves (under 5 minutes)

  1. Confirm the alert (a curl form of this query follows the list):

    sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*",status!~"2.."}[5m]))
    /
    sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[5m]))
    
  2. Pivot on status code:

    sum by (status) (rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[5m]))
    
    • Spike in 502/503 → Hyperswitch unreachable. See "Hyperswitch outage" below.
    • Spike in 400 → marketplace validation failing. A new deploy likely regressed something; check recent commits to internal/core/marketplace/.
    • Spike in 500 → DB / connection / panic. Check logs for stack traces.
  3. Pivot to traces: open "Veza Service Map (Tempo)" and filter recent payment.webhook spans for status=error.
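
If Grafana is unavailable, the confirmation query from step 1 can go straight to the Prometheus HTTP API. The PROM_URL value is an assumption; point it at the real Prometheus endpoint:

# Instant query via the HTTP API; prints the current 5m error ratio.
PROM_URL="http://prometheus:9090"   # assumption: adjust to the real endpoint
curl -fsS "$PROM_URL/api/v1/query" --data-urlencode \
  'query=sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*",status!~"2.."}[5m])) / sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[5m]))'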

Hyperswitch outage

If Hyperswitch is the upstream culprit:

# Check Hyperswitch's own status:
curl -fsS https://api.hyperswitch.io/health

# Check the last successful webhook landing:
psql "$DATABASE_URL" -c "
  SELECT id, hyperswitch_payment_id, status, payment_status, updated_at
  FROM orders
  WHERE updated_at > NOW() - INTERVAL '15 minutes'
  ORDER BY updated_at DESC LIMIT 10;
"

If they're all stuck in payment_status=pending, Hyperswitch is silently dropping our webhooks. Engage their support and queue a manual reconciliation pass once they're back:

# Manual reconciliation script (still TODO, tracked in W4 day 17):
go run ./cmd/tools/reconcile_orders --since=15m

DB / pool exhaustion

If the failures are 500s and the API logs show pq: too many connections or context deadline exceeded:

  1. Check the PgBouncer queue length:
    psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
    
  2. If cl_waiting > 0 consistently, a slow query is holding pool slots; see db-failover.md for finding it, or start from the pg_stat_activity sketch after this list.
  3. Last resort: restart the backend pod to drop in-flight requests (loses idempotency on retried requests; only do this if Hyperswitch is not in flight on those orders).
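
To find the offending query without leaving this page, pg_stat_activity works on any stock Postgres (a generic sketch, not a Veza-specific tool):

# Longest-running non-idle backends; the slot-holder is usually at the top.
psql "$DATABASE_URL" -c "
  SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY runtime DESC NULLS LAST
  LIMIT 10;
"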

Recovery verification

After the fix:

# Orders from the last 5 minutes should be `completed` or `pending` (not `failed`):
psql "$DATABASE_URL" -c "
  SELECT status, COUNT(*) FROM orders
  WHERE created_at > NOW() - INTERVAL '5 minutes'
  GROUP BY status;
"

The slow-burn window (6h) takes hours to clear after recovery. Don't silence the alert; wait for the metric to come back under threshold.
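
Concretely, "wait for the metric" means watching the 6h error ratio fall back under the slow-burn threshold (burn rate 6 x 0.5% budget = 3%). Same illustrative rule naming as above, not the real name in slo.yml:

# Hypothetical rule name; slow burn clears once the 6h ratio is back under 6 * 0.005 = 0.03.
slo:payment_success:error_ratio_6h < 6 * 0.005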

Post-incident reconciliation

Every fast-burn incident requires a reconciliation pass within 24h (a shell sketch of steps 1 and 2 follows the list):

  1. Pull the list of orders with payment_status='pending' older than 30 minutes.
  2. For each, query Hyperswitch directly via GET /payments/{payment_id} and update.
  3. File a postmortem with the count of mismatches resolved.
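
Until the W4 tooling lands, steps 1 and 2 can be done by hand. A sketch under assumptions: HYPERSWITCH_API_KEY is set, Hyperswitch authenticates with an api-key header, and the response carries payment_id/status fields; verify all three against the Hyperswitch docs before pasting:

# Step 1: pending orders older than 30 minutes (-At = bare, unaligned output).
psql "$DATABASE_URL" -At -c "
  SELECT hyperswitch_payment_id FROM orders
  WHERE payment_status = 'pending'
    AND updated_at < NOW() - INTERVAL '30 minutes';
" > /tmp/pending_ids

# Step 2: fetch the authoritative status for each (header and fields are assumptions).
while read -r id; do
  curl -fsS -H "api-key: $HYPERSWITCH_API_KEY" \
    "https://api.hyperswitch.io/payments/$id" | jq -r '[.payment_id, .status] | @tsv'
done < /tmp/pending_ids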