veza/config at d9753f85a686b6e13f82ef77346055e28e123f35 - senke/veza

senke bc19a8dd40	2026-04-18 03:40:14 +02:00
Some checks failed (push), each failing after 0s: Veza CI / Backend (Go), Veza CI / Frontend (Web), Veza CI / Rust (Stream Server), Security Scan / Secret Scanning (gitleaks), Veza CI / Notify on failure.

feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F

Five Prometheus gauges + reconciler metrics + Grafana dashboard + three alert rules. Closes axis-1 P1.8 and adds observability for item C's reconciler (user review: "F should include reconciler_* metrics, otherwise tag is blind on the worker we just shipped").

Gauges (veza_ledger_, sampled every 60s):
* orphan_refund_rows: THE canary. Pending refunds with empty hyperswitch_refund_id older than 5m = Phase 2 crash in RefundOrder. Alert: > 0 for 5m → page.
* stuck_orders_pending: order pending > 30m with non-empty payment_id. Alert: > 0 for 10m → page.
* stuck_refunds_pending: refund pending > 30m with hs_id.
* failed_transfers_at_max_retry: permanently_failed rows.
* reversal_pending_transfers: item B rows stuck > 30m.

Reconciler metrics (veza_reconciler_):
* actions_total{phase}: counter by phase.
* orphan_refunds_total: two-phase-bug canary.
* sweep_duration_seconds: exponential histogram.
* last_run_timestamp: alert: stale > 2h → page (worker dead).

Implementation notes:
* Sampler thresholds hardcoded to match reconciler defaults; intentional mismatch allowed (alerts fire while reconciler already working = correct behavior).
* Query error sets gauge to -1 (sentinel for "sampler broken").
* marketplace package routes through monitoring recorders so it doesn't import prometheus directly.
* Sampler runs regardless of Hyperswitch enablement; gauges default 0 when pipeline idle.
* Graceful shutdown wired in cmd/api/main.go.

Alert rules in config/alertmanager/ledger.yml with runbook pointers + detailed descriptions: each alert explains WHAT happened, WHY the reconciler may not resolve it, and WHERE to look first.

Grafana dashboard config/grafana/dashboards/ledger-health.json:
* top row = 5 stat panels (orphan first, color-coded red on > 0)
* middle row = trend timeseries + reconciler action rate by phase
* bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick + orphan cumulative

Tests: 6 cases, all green (sqlite :memory:):
* CountsStuckOrdersPending (includes the filter on non-empty payment_id)
* StuckOrdersZeroWhenAllCompleted
* CountsOrphanRefunds (THE canary)
* CountsStuckRefundsWithHsID (gauge-orthogonality check)
* CountsFailedAndReversalPendingTransfers
* ReconcilerRecorders (counter + gauge shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
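
The "query error sets gauge to -1" note in the commit message can be illustrated with a minimal Go sketch. This is not the repository's code: countQuery, sampleGauge, sentinelBroken and errQueryFailed are hypothetical names, and the real sampler presumably runs SQL and publishes through the monitoring recorders rather than returning a value.

```go
package main

import (
	"errors"
	"fmt"
)

// sentinelBroken mirrors the -1 "sampler broken" value named in the commit message.
const sentinelBroken = -1

// countQuery is a hypothetical stand-in for one sampler query
// (for example, counting orphan refund rows).
type countQuery func() (int64, error)

// errQueryFailed simulates a database error for the demo below.
var errQueryFailed = errors.New("query failed")

// sampleGauge runs one query and returns the value a gauge would be set to:
// the row count on success, or the -1 sentinel when the query itself fails.
func sampleGauge(q countQuery) float64 {
	n, err := q()
	if err != nil {
		return sentinelBroken
	}
	return float64(n)
}

func main() {
	idle := func() (int64, error) { return 0, nil }
	orphans := func() (int64, error) { return 2, nil }
	broken := func() (int64, error) { return 0, errQueryFailed }

	fmt.Println(sampleGauge(idle))    // prints 0
	fmt.Println(sampleGauge(orphans)) // prints 2
	fmt.Println(sampleGauge(broken))  // prints -1
}
```

Keeping the failure case distinct from 0 lets a dashboard tell "nothing is stuck" apart from "the sampler could not check".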
alertmanager	feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F	2026-04-18 03:40:14 +02:00
baremetal/apache	state-ownership: delete unused optimisticStoreUpdates.ts file	2026-01-15 19:26:53 +01:00
caddy	chore(cleanup): remove veza-chat-server directory and all operational references	2026-02-22 21:13:00 +01:00
docker	chore(infra): J6 — mark 3 dormant docker-compose files as deprecated	2026-04-15 12:58:39 +02:00
grafana	feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F	2026-04-18 03:40:14 +02:00
haproxy	feat(infra): blue-green deployment via HAProxy	2026-02-23 19:52:19 +01:00
incus	chore(cleanup): remove veza-chat-server directory and all operational references	2026-02-22 21:13:00 +01:00
prometheus	chore(release): v0.952 — Observe (Grafana v1-overview, Prometheus alert_rules_v1)	2026-03-02 19:08:55 +01:00
ssl	fix(infra): HAProxy HTTPS and stats security	2026-02-15 15:58:51 +01:00
env.example	v0.9.5	2026-03-06 10:02:53 +01:00
logging.toml	docs: add project documentation, logging config, status script	2026-03-18 11:36:36 +01:00
metrics.yaml	BASE: completing the initial repo state	2025-12-03 22:56:50 +01:00
prometheus.yml	feat(monitoring): add Alertmanager with Slack notifications	2026-02-23 19:54:55 +01:00