veza/config/alertmanager at dfc61e84084700a4744ddf021f29a7de7a5aa892 - senke/veza

senke c78bf1b765	2026-04-28 01:30:34 +02:00

feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)

Three SLOs with multi-window burn-rate alerts (Google SRE workbook methodology):
* SLO_API_AVAILABILITY: 99.5% on read (GET) endpoints
* SLO_API_LATENCY: 99% writes p95 < 500ms
* SLO_PAYMENT_SUCCESS: 99.5% on POST /api/v1/orders -> 2xx

Each SLO has two alerts:
* <name>SLOFastBurn: page-grade, 2% budget burned in 1h (1h+5m windows)
* <name>SLOSlowBurn: ticket-grade, 5% budget burned in 6h (6h+30m windows)

- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md + db-failover, redis-down, disk-full, cert-expiring-soon: one stub per likely page. Each lists first moves under 5min + common causes.

Acceptance (Day 10): promtool check rules green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alertmanager.yml	feat(monitoring): add Alertmanager with Slack notifications	2026-02-23 19:54:55 +01:00
ledger.yml	feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F	2026-04-18 03:40:14 +02:00
routes.yml	feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)	2026-04-28 01:30:34 +02:00
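
The burn-rate figures quoted in the routes.yml commit message ("2% budget burned in 1h", "5% budget burned in 6h") can be turned into alert thresholds with a little arithmetic. The sketch below is only an illustration of that arithmetic, not the contents of config/prometheus/slo.yml. It assumes a 30-day SLO period, which the commit message does not state, and the names `burn_rate`, `error_ratio_threshold`, and `SLO_PERIOD_HOURS` are hypothetical.

```python
# Sketch of multi-window burn-rate arithmetic (Google SRE workbook style).
# ASSUMPTION: the SLO period is 30 days; the commit message does not say so.
SLO_PERIOD_HOURS = 30 * 24


def burn_rate(budget_fraction: float, window_hours: float,
              period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Burn rate that consumes `budget_fraction` of the error budget
    within `window_hours`, relative to a steady burn over the period."""
    return budget_fraction * period_hours / window_hours


def error_ratio_threshold(slo_target: float, rate: float) -> float:
    """Error ratio at which the given burn rate is reached.
    The error budget is 1 - slo_target (e.g. 0.005 for a 99.5% SLO)."""
    return rate * (1.0 - slo_target)


if __name__ == "__main__":
    # Alert tiers taken from the commit message: (label, budget, long window in hours).
    tiers = [("FastBurn", 0.02, 1.0), ("SlowBurn", 0.05, 6.0)]
    # SLO targets taken from the commit message.
    slos = {
        "SLO_API_AVAILABILITY": 0.995,
        "SLO_API_LATENCY": 0.99,
        "SLO_PAYMENT_SUCCESS": 0.995,
    }
    for slo_name, target in slos.items():
        for label, budget, window in tiers:
            rate = burn_rate(budget, window)
            threshold = error_ratio_threshold(target, rate)
            print(f"{slo_name} {label}: burn rate {rate:.1f}, "
                  f"error ratio > {threshold:.4f}")
```

Under the 30-day assumption, the fast-burn tier works out to a burn rate of about 14.4 and the slow-burn tier to about 6. In a multi-window alert, the same threshold would be checked on both the long window (1h or 6h) and the short window (5m or 30m), so the alert stops firing soon after the burn stops.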