veza/docs/runbooks/api-latency-slo-burn.md
feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)
Three SLOs with multi-window burn-rate alerts (Google SRE workbook methodology):
  * SLO_API_AVAILABILITY  : 99.5% on read (GET) endpoints
  * SLO_API_LATENCY       : 99% writes p95 < 500ms
  * SLO_PAYMENT_SUCCESS   : 99.5% on POST /api/v1/orders -> 2xx

Each SLO has two alerts:
  * <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
  * <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)

- config/prometheus/slo.yml: 12 recording rules + 6 alerts; `promtool
  check rules` => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack
  + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists the first moves (under 5 min) plus common causes.

Acceptance (Day 10): promtool check rules green.


# Runbook — API latency SLO burn

**SLO:** 99% of write requests (POST/PUT/PATCH/DELETE) return in under 500 ms at the p95, over a monthly window.
**Alerts:** APILatencySLOFastBurn (page) · APILatencySLOSlowBurn (ticket)
**Owner:** backend on-call.

## What tripped me

Writes are taking longer than 500 ms at the p95. The fast burn fires when more than 14.4% of writes are slow, sustained over both the 1h and 5m windows: a 14.4× burn rate, i.e. 2% of the monthly error budget consumed in one hour.
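
A minimal sketch of the shape such a rule takes in config/prometheus/slo.yml, assuming recording rules that precompute the slow-write ratio per window (the `veza:api_write_slow:ratio_*` names are illustrative, not the actual ones):

```yaml
groups:
  - name: slo-api-latency
    rules:
      - alert: APILatencySLOFastBurn
        # 14.4x burn rate on both the long (1h) and short (5m) windows
        # = 2% of the 30-day error budget consumed in one hour.
        expr: |
          veza:api_write_slow:ratio_rate1h > (14.4 * 0.01)
          and
          veza:api_write_slow:ratio_rate5m > (14.4 * 0.01)
        for: 2m
        labels:
          severity: page
        annotations:
          runbook: docs/runbooks/api-latency-slo-burn.md
```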

## First moves (under 5 minutes)

1. Identify the slow endpoints (the sketch after this list shows the same data as a raw slow-write fraction):

   ```promql
   topk(5, histogram_quantile(0.95,
     sum by (path, le) (rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[5m]))
   ))
   ```

2. Open the Tempo service-map dashboard ("Veza Service Map (Tempo)") and check the slow-spans table for the same paths.
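
To read the raw SLI behind the alert (the slow-write fraction rather than the p95), the same histogram can be queried directly; this sketch assumes it exposes a 500 ms bucket (`le="0.5"`):

```promql
# Fraction of writes slower than 500ms over the last 5m; the fast burn
# fires when this sits above 14.4% on both the 1h and 5m windows.
1 - (
    sum(rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE",le="0.5"}[5m]))
  /
    sum(rate(veza_gin_http_request_duration_seconds_count{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[5m]))
)
```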

## Common causes

| Symptom | Likely cause | Pointer |
|---|---|---|
| Slow on /api/v1/orders (POST) | Hyperswitch upstream latency | payment-success-slo-burn.md |
| Slow on /api/v1/tracks (POST) | S3 multipart pre-sign / commit latency | Check MinIO health |
| Slow across all writes | Postgres lock contention / autovacuum | db-failover.md §autovacuum |
| Slow only on one host | One bad replica (CPU starvation, disk) | Drain & investigate |
| Slow + "DB pool exhausted" in logs | A slow query holding the pool | db-failover.md §pool |

## Mitigation

- If Hyperswitch: nothing to do but wait; put up a status-page banner.
- If DB lock contention: find the blockers with `pg_blocking_pids()` (see the sketch after this list), then cancel the offender:

  ```sql
  -- Cancel active transactions older than 30 seconds
  -- (guard added so the session running this doesn't cancel itself).
  SELECT pg_cancel_backend(pid) FROM pg_stat_activity
  WHERE state = 'active'
    AND pid <> pg_backend_pid()
    AND xact_start < now() - INTERVAL '30 seconds';
  ```

- If a single bad replica: drain it from the LB and investigate offline.
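
A sketch of that blocker lookup, assuming nothing beyond stock Postgres (`pg_blocking_pids()` exists since 9.6); columns are trimmed for readability:

```sql
-- Map each lock-waiting session to the session(s) blocking it.
SELECT blocked.pid                  AS blocked_pid,
       left(blocked.query, 60)     AS blocked_query,
       blocking.pid                AS blocking_pid,
       left(blocking.query, 60)    AS blocking_query,
       now() - blocking.xact_start AS blocking_xact_age
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
```

When this pinpoints a single blocking pid, cancelling that pid alone with `pg_cancel_backend(<pid>)` is safer than the blanket 30-second sweep.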

## Recovery

The slow-burn alert can take up to 6h to clear after a fix, since its 6h window has to drain back below threshold. Don't silence — let it ride down.
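
One way to watch the drain, assuming recording rules named as in the sketch under "What tripped me" (illustrative names, not the actual ones). The slow burn trips at a 6× burn rate, i.e. 6% of writes slow:

```promql
# Alert clears once the 6h slow-write ratio is back under the 6x-burn threshold.
veza:api_write_slow:ratio_rate6h < (6 * 0.01)
```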

## Postmortem trigger

Same threshold as the availability runbook — a fast burn paging for more than 15 min triggers a postmortem.