veza/config/grafana/dashboards/ledger-health.json
senke 94dfc80b73 feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F
Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").

Gauges (veza_ledger_, sampled every 60s):
  * orphan_refund_rows — THE canary. Pending refunds with empty
    hyperswitch_refund_id older than 5m = Phase 2 crash in
    RefundOrder. Alert: > 0 for 5m → page.
  * stuck_orders_pending — order pending > 30m with non-empty
    payment_id. Alert: > 0 for 10m → page.
  * stuck_refunds_pending — refund pending > 30m with non-empty
    hyperswitch_refund_id.
  * failed_transfers_at_max_retry — permanently_failed rows.
  * reversal_pending_transfers — item B rows stuck > 30m.
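
The orphan and stuck-refund definitions above can be sketched as two
predicates over refund rows. This is only an illustration: the Refund
struct, its field names, the "succeeded" status, and the use of a
creation timestamp for row age are all assumptions, and the real
sampler presumably runs SQL counts instead of scanning structs.

```go
package main

import (
	"fmt"
	"time"
)

// Refund is a hypothetical in-memory stand-in for a refund row.
type Refund struct {
	Status              string    // only "pending" rows are counted
	HyperswitchRefundID string    // empty until the PSP call has succeeded
	CreatedAt           time.Time // assumed basis for "older than"
}

// Thresholds taken from the gauge descriptions in the commit message.
const (
	orphanAge = 5 * time.Minute
	stuckAge  = 30 * time.Minute
)

// countOrphanRefunds counts pending refunds with no PSP id older than orphanAge.
func countOrphanRefunds(rows []Refund, now time.Time) int {
	n := 0
	for _, r := range rows {
		if r.Status == "pending" && r.HyperswitchRefundID == "" &&
			r.CreatedAt.Before(now.Add(-orphanAge)) {
			n++
		}
	}
	return n
}

// countStuckRefunds counts pending refunds that do have a PSP id and are
// older than stuckAge, so a row is never counted by both gauges.
func countStuckRefunds(rows []Refund, now time.Time) int {
	n := 0
	for _, r := range rows {
		if r.Status == "pending" && r.HyperswitchRefundID != "" &&
			r.CreatedAt.Before(now.Add(-stuckAge)) {
			n++
		}
	}
	return n
}

// demoNow and demoRows are illustrative fixtures, not data from the repo.
var demoNow = time.Date(2026, 4, 18, 3, 0, 0, 0, time.UTC)

func demoRows() []Refund {
	return []Refund{
		{"pending", "", demoNow.Add(-10 * time.Minute)},         // orphan
		{"pending", "", demoNow.Add(-1 * time.Minute)},          // too young
		{"pending", "ref_demo", demoNow.Add(-45 * time.Minute)}, // stuck
		{"succeeded", "", demoNow.Add(-time.Hour)},              // not pending
	}
}

func main() {
	fmt.Println("orphan:", countOrphanRefunds(demoRows(), demoNow))
	fmt.Println("stuck:", countStuckRefunds(demoRows(), demoNow))
}
```

Splitting on the PSP id is what keeps the two gauges orthogonal: a
pending refund lands in at most one of them.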

Reconciler metrics (veza_reconciler_):
  * actions_total{phase} — counter by phase.
  * orphan_refunds_total — two-phase-bug canary.
  * sweep_duration_seconds — exponential histogram.
  * last_run_timestamp — alert: stale > 2h → page (worker dead).
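
As a rough sketch of two ideas in this list: exponential histogram
buckets grow by a constant factor, and the staleness alert compares
"now minus last run" against 2h. The bucket parameters below (1s
start, factor 2, 6 buckets) are assumptions, since the commit does not
state them. The helper is a stdlib-only reimplementation of the same
idea as prometheus.ExponentialBuckets, not the code used here.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor each step. Invalid input yields nil.
func exponentialBuckets(start, factor float64, count int) []float64 {
	if count < 1 || start <= 0 || factor <= 1 {
		return nil
	}
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

// bucketIndex returns the index of the first bound >= v, or len(bounds)
// when v only fits in the implicit +Inf bucket.
func bucketIndex(bounds []float64, v float64) int {
	for i, b := range bounds {
		if v <= b {
			return i
		}
	}
	return len(bounds)
}

// staleAfterSeconds is the 2h threshold from the last_run_timestamp alert.
const staleAfterSeconds = 7200

// isStale mirrors the expression time() - last_run_timestamp > 2h,
// with both arguments given as Unix seconds.
func isStale(lastRunUnix, nowUnix float64) bool {
	return nowUnix-lastRunUnix > staleAfterSeconds
}

func main() {
	bounds := exponentialBuckets(1, 2, 6) // hypothetical parameters
	fmt.Println(bounds)
	fmt.Println(bucketIndex(bounds, 3))
	fmt.Println(isStale(0, 7201))
}
```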

Implementation notes:
  * Sampler thresholds are hardcoded to the reconciler defaults. If
    the reconciler is configured differently the two can diverge;
    this is intentional (an alert firing while the reconciler is
    already working is correct behavior).
  * Query error sets gauge to -1 (sentinel for "sampler broken").
  * marketplace package routes through monitoring recorders so it
    doesn't import prometheus directly.
  * Sampler runs regardless of Hyperswitch enablement; gauges
    default 0 when pipeline idle.
  * Graceful shutdown wired in cmd/api/main.go.
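
The -1 sentinel from the notes above could look roughly like this. The
function name and the error value are hypothetical; only the rule
"query error sets the gauge to -1" comes from the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// samplerBroken is the sentinel published when the count query fails.
const samplerBroken = -1

// errQueryFailed is an illustrative error, not one defined in the repo.
var errQueryFailed = errors.New("query failed")

// gaugeValue converts a count query result into the value to publish:
// the count on success, the sentinel on any error.
func gaugeValue(count int64, err error) float64 {
	if err != nil {
		return samplerBroken
	}
	return float64(count)
}

func main() {
	fmt.Println(gaugeValue(3, nil))            // healthy sample
	fmt.Println(gaugeValue(0, errQueryFailed)) // sampler broken
}
```

One consequence worth keeping in mind: an alert expression of the form
"> 0" does not match -1, and a Grafana threshold list whose base step
is green also renders -1 as green, so a broken sampler probably needs
its own "< 0" check to be noticed.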

Alert rules in config/alertmanager/ledger.yml with runbook
pointers + detailed descriptions — each alert explains WHAT
happened, WHY the reconciler may not resolve it, and WHERE to
look first.
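
The "for 5m" and "for 10m" holds on these rules mean the condition
must stay true across consecutive evaluations before anyone is paged.
A simplified sketch of that behavior, assuming one evaluation per 60s
sample (the type and function names are illustrative, and real
Prometheus rule evaluation has more states than this):

```go
package main

import (
	"fmt"
	"time"
)

// holdAlert fires only after its condition has been continuously true
// for at least hold.
type holdAlert struct {
	hold  time.Duration
	since time.Time // zero while the condition is false
}

// observe records one evaluation and reports whether the alert fires.
func (a *holdAlert) observe(now time.Time, cond bool) bool {
	if !cond {
		a.since = time.Time{} // any false sample resets the hold
		return false
	}
	if a.since.IsZero() {
		a.since = now
	}
	return now.Sub(a.since) >= a.hold
}

// firstFiring replays one gauge sample per minute against a "> 0"
// condition and returns the index of the first firing evaluation,
// or -1 if the alert never fires.
func firstFiring(holdMinutes int, samples []float64) int {
	a := holdAlert{hold: time.Duration(holdMinutes) * time.Minute}
	start := time.Date(2026, 4, 18, 0, 0, 0, 0, time.UTC)
	for i, v := range samples {
		if a.observe(start.Add(time.Duration(i)*time.Minute), v > 0) {
			return i
		}
	}
	return -1
}

func main() {
	// The zero at index 3 resets the hold, so firing is delayed.
	fmt.Println(firstFiring(5, []float64{0, 1, 1, 0, 1, 1, 1, 1, 1, 1}))
}
```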

Grafana dashboard config/grafana/dashboards/ledger-health.json —
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick
+ orphan cumulative.
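
A quick way to sanity-check a layout like this is to verify that no
panel leaves Grafana's 24-column grid or overlaps another panel. The
sketch below uses gridPos values copied from the dashboard JSON; the
checker itself is an illustration and not part of the commit.

```go
package main

import "fmt"

// gridPos mirrors the Grafana gridPos object (x, y, w, h in grid units).
type gridPos struct{ X, Y, W, H int }

// gridColumns is the width of the Grafana dashboard grid.
const gridColumns = 24

// overlaps reports whether two rectangles share any grid cell.
func overlaps(a, b gridPos) bool {
	return a.X < b.X+b.W && b.X < a.X+a.W &&
		a.Y < b.Y+b.H && b.Y < a.Y+a.H
}

// checkLayout returns an error for the first panel that leaves the
// grid or overlaps an earlier panel. Panels are numbered from 1.
func checkLayout(panels []gridPos) error {
	for i, p := range panels {
		if p.X < 0 || p.W <= 0 || p.H <= 0 || p.X+p.W > gridColumns {
			return fmt.Errorf("panel %d leaves the %d-column grid", i+1, gridColumns)
		}
		for j := 0; j < i; j++ {
			if overlaps(p, panels[j]) {
				return fmt.Errorf("panel %d overlaps panel %d", i+1, j+1)
			}
		}
	}
	return nil
}

// dashboardLayout lists the gridPos of panels 1..10 in ledger-health.json.
func dashboardLayout() []gridPos {
	return []gridPos{
		{0, 0, 5, 5}, {5, 0, 5, 5}, {10, 0, 5, 5}, {15, 0, 5, 5}, {20, 0, 4, 5},
		{0, 5, 12, 8}, {12, 5, 12, 8},
		{0, 13, 8, 6}, {8, 13, 8, 6}, {16, 13, 8, 6},
	}
}

func main() {
	if err := checkLayout(dashboardLayout()); err != nil {
		fmt.Println("layout problem:", err)
		return
	}
	fmt.Println("layout ok")
}
```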

Tests — 6 cases, all green (sqlite :memory:):
  * CountsStuckOrdersPending (includes the filter on
    non-empty payment_id)
  * StuckOrdersZeroWhenAllCompleted
  * CountsOrphanRefunds (THE canary)
  * CountsStuckRefundsWithHsID (gauge-orthogonality check)
  * CountsFailedAndReversalPendingTransfers
  * ReconcilerRecorders (counter + gauge shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 03:40:14 +02:00

{
"title": "VEZA Ledger Health (v1.0.7)",
"description": "Five stuck-state gauges + reconciler action rate. The top row tells you 'is money stuck right now?', the bottom row tells you 'is the reconciler keeping up?'. Paired with alert rules in config/alertmanager/ledger.yml — orphan_refund_rows > 0 for 5m pages ops.",
"tags": ["veza", "ledger", "money-movement", "v1.0.7"],
"timezone": "browser",
"refresh": "1m",
"schemaVersion": 39,
"version": 1,
"panels": [
{
"id": 1,
"title": "Orphan refund rows (PAGE if > 0 for 5m)",
"type": "stat",
"gridPos": {"x": 0, "y": 0, "w": 5, "h": 5},
"targets": [{"expr": "veza_ledger_orphan_refund_rows", "refId": "A"}],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "red", "value": 1}
]
}
}
}
},
{
"id": 2,
"title": "Stuck orders (pending > 30m)",
"type": "stat",
"gridPos": {"x": 5, "y": 0, "w": 5, "h": 5},
"targets": [{"expr": "veza_ledger_stuck_orders_pending", "refId": "A"}],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
}
}
}
},
{
"id": 3,
"title": "Stuck refunds (pending w/ PSP id > 30m)",
"type": "stat",
"gridPos": {"x": 10, "y": 0, "w": 5, "h": 5},
"targets": [{"expr": "veza_ledger_stuck_refunds_pending", "refId": "A"}]
},
{
"id": 4,
"title": "Reversal pending transfers (> 30m)",
"type": "stat",
"gridPos": {"x": 15, "y": 0, "w": 5, "h": 5},
"targets": [{"expr": "veza_ledger_reversal_pending_transfers", "refId": "A"}]
},
{
"id": 5,
"title": "Permanently-failed transfers",
"type": "stat",
"gridPos": {"x": 20, "y": 0, "w": 4, "h": 5},
"targets": [{"expr": "veza_ledger_failed_transfers_at_max_retry", "refId": "A"}]
},
{
"id": 6,
"title": "Stuck-state trends (last 6h)",
"type": "timeseries",
"gridPos": {"x": 0, "y": 5, "w": 12, "h": 8},
"targets": [
{"expr": "veza_ledger_stuck_orders_pending", "refId": "A", "legendFormat": "stuck orders"},
{"expr": "veza_ledger_stuck_refunds_pending", "refId": "B", "legendFormat": "stuck refunds"},
{"expr": "veza_ledger_orphan_refund_rows", "refId": "C", "legendFormat": "orphan refunds"},
{"expr": "veza_ledger_reversal_pending_transfers", "refId": "D", "legendFormat": "reversal pending"},
{"expr": "veza_ledger_failed_transfers_at_max_retry", "refId": "E", "legendFormat": "permanently failed transfers"}
]
},
{
"id": 7,
"title": "Reconciler actions (rate, by phase)",
"type": "timeseries",
"gridPos": {"x": 12, "y": 5, "w": 12, "h": 8},
"description": "Actions the reconciler took per phase. A healthy system shows occasional stuck_orders + stuck_refunds spikes (PSP hiccups) and near-zero orphan_refunds (Phase 2 crashes should be rare).",
"targets": [
{"expr": "sum by (phase) (rate(veza_reconciler_actions_total[5m]))", "refId": "A", "legendFormat": "{{phase}}"}
]
},
{
"id": 8,
"title": "Reconciler sweep duration",
"type": "timeseries",
"gridPos": {"x": 0, "y": 13, "w": 8, "h": 6},
"targets": [
{"expr": "histogram_quantile(0.5, sum by (le) (rate(veza_reconciler_sweep_duration_seconds_bucket[5m])))", "refId": "A", "legendFormat": "p50"},
{"expr": "histogram_quantile(0.95, sum by (le) (rate(veza_reconciler_sweep_duration_seconds_bucket[5m])))", "refId": "B", "legendFormat": "p95"},
{"expr": "histogram_quantile(0.99, sum by (le) (rate(veza_reconciler_sweep_duration_seconds_bucket[5m])))", "refId": "C", "legendFormat": "p99"}
]
},
{
"id": 9,
"title": "Seconds since last reconciler tick",
"type": "stat",
"gridPos": {"x": 8, "y": 13, "w": 8, "h": 6},
"description": "Alerts fire if this exceeds 2h (2 * RECONCILE_INTERVAL default). Sustained growth = worker is dead.",
"targets": [
{"expr": "time() - veza_reconciler_last_run_timestamp", "refId": "A"}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 3600},
{"color": "red", "value": 7200}
]
}
}
}
},
{
"id": 10,
"title": "Orphan refunds auto-failed (cumulative)",
"type": "timeseries",
"gridPos": {"x": 16, "y": 13, "w": 8, "h": 6},
"description": "Each increment = a Phase 2 crash the reconciler caught. Non-zero rate = investigate root cause.",
"targets": [
{"expr": "veza_reconciler_orphan_refunds_total", "refId": "A", "legendFormat": "total"}
]
}
]
}