Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").

Gauges (veza_ledger_, sampled every 60s):
* orphan_refund_rows — THE canary. Pending refunds with empty
hyperswitch_refund_id older than 5m = Phase 2 crash in
RefundOrder. Alert: > 0 for 5m → page.
* stuck_orders_pending — order pending > 30m with non-empty
payment_id. Alert: > 0 for 10m → page.
* stuck_refunds_pending — refund pending > 30m with non-empty
hyperswitch_refund_id.
* failed_transfers_at_max_retry — permanently_failed rows.
* reversal_pending_transfers — item B rows stuck > 30m.
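
The orphan predicate above can be sketched as a pure function. This is a minimal illustration, not the actual sampler: the `refundRow` type, its field names, the `"pending"` status string, and the helper names are all hypothetical. Only the three conditions (pending, empty hyperswitch_refund_id, older than 5m) come from the gauge description.

```go
package main

import (
	"fmt"
	"time"
)

// orphanAge is the 5m age threshold from the gauge description.
const orphanAge = 5 * time.Minute

// refundRow is a hypothetical, simplified view of a refund row.
type refundRow struct {
	Status              string
	HyperswitchRefundID string
	Age                 time.Duration // time since the row was created
}

// isOrphanRefund reports whether a row matches the orphan predicate:
// pending, no PSP refund id yet, and older than orphanAge.
func isOrphanRefund(r refundRow) bool {
	return r.Status == "pending" && r.HyperswitchRefundID == "" && r.Age > orphanAge
}

// countOrphans returns the value the orphan gauge would take for these rows.
func countOrphans(rows []refundRow) int {
	n := 0
	for _, r := range rows {
		if isOrphanRefund(r) {
			n++
		}
	}
	return n
}

func main() {
	rows := []refundRow{
		// Pending, no PSP id, old enough: counted as an orphan.
		{Status: "pending", Age: 10 * time.Minute},
		// Has a PSP id: belongs to the stuck-refunds gauge, not this one.
		{Status: "pending", HyperswitchRefundID: "ref_1", Age: 40 * time.Minute},
		// No PSP id yet, but younger than the threshold: not counted.
		{Status: "pending", Age: time.Minute},
	}
	fmt.Println(countOrphans(rows))
}
```

Because a row with a PSP id is never an orphan, the orphan and stuck-refund gauges stay orthogonal in this sketch.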

Reconciler metrics (veza_reconciler_):
* actions_total{phase} — counter by phase.
* orphan_refunds_total — two-phase-bug canary.
* sweep_duration_seconds — exponential histogram.
* last_run_timestamp — alert: stale > 2h → page (worker dead).
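
The staleness check on last_run_timestamp amounts to simple arithmetic on Unix timestamps, mirroring the dashboard expression `time() - veza_reconciler_last_run_timestamp`. A minimal sketch follows; the function and constant names are hypothetical, and the real alert is evaluated in the alert rules rather than in Go.

```go
package main

import "fmt"

// staleAfterSeconds is the 2h staleness threshold from the alert description.
const staleAfterSeconds = 2 * 60 * 60

// secondsSinceLastRun mirrors the expression
// time() - veza_reconciler_last_run_timestamp.
func secondsSinceLastRun(nowUnix, lastRunUnix float64) float64 {
	return nowUnix - lastRunUnix
}

// workerLooksDead reports whether the last run is older than the threshold.
func workerLooksDead(nowUnix, lastRunUnix float64) bool {
	return secondsSinceLastRun(nowUnix, lastRunUnix) > staleAfterSeconds
}

func main() {
	now := 1_700_010_000.0 // arbitrary example timestamp
	fmt.Println(workerLooksDead(now, now-3600))   // last run 1h ago
	fmt.Println(workerLooksDead(now, now-3*3600)) // last run 3h ago
}
```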

Implementation notes:
* Sampler thresholds are hardcoded to the reconciler's defaults.
If the reconciler is configured differently the two can diverge;
that is intentional (an alert firing while the reconciler is
already working on the row is correct behavior).
* Query error sets gauge to -1 (sentinel for "sampler broken").
* marketplace package routes through monitoring recorders so it
doesn't import prometheus directly.
* Sampler runs regardless of Hyperswitch enablement; gauges
default 0 when pipeline idle.
* Graceful shutdown wired in cmd/api/main.go.
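
The sentinel, recorder indirection, and shutdown notes above can be sketched together. This is an illustration under stated assumptions, not the shipped code: `gaugeRecorder`, `countQuery`, `sampleOnce`, `runSampler`, and `mapRecorder` are hypothetical names, and the query is reduced to a plain closure.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// gaugeRecorder is a hypothetical stand-in for a monitoring recorder, so the
// calling package never imports prometheus itself.
type gaugeRecorder interface {
	SetGauge(name string, value float64)
}

// countQuery is a hypothetical query closure returning a row count.
type countQuery func() (int64, error)

// errDB is an example error used to simulate a failing query.
var errDB = errors.New("db down")

// sampleOnce runs one query and records the result. On error it records the
// -1 sentinel so "sampler broken" is distinguishable from "0 rows".
func sampleOnce(rec gaugeRecorder, name string, q countQuery) {
	n, err := q()
	if err != nil {
		rec.SetGauge(name, -1)
		return
	}
	rec.SetGauge(name, float64(n))
}

// runSampler samples once immediately, then on every tick, and returns when
// ctx is cancelled (which is what a graceful shutdown would trigger).
func runSampler(ctx context.Context, every time.Duration, rec gaugeRecorder, name string, q countQuery) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		sampleOnce(rec, name, q)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// mapRecorder is an in-memory recorder used only for this example.
type mapRecorder map[string]float64

func (m mapRecorder) SetGauge(name string, value float64) { m[name] = value }

func main() {
	rec := mapRecorder{}
	sampleOnce(rec, "stuck_orders_pending", func() (int64, error) { return 3, nil })
	sampleOnce(rec, "orphan_refund_rows", func() (int64, error) { return 0, errDB })
	fmt.Println(rec["stuck_orders_pending"], rec["orphan_refund_rows"])

	// With an already-cancelled context the loop samples once and returns.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	runSampler(ctx, time.Minute, rec, "stuck_orders_pending", func() (int64, error) { return 0, nil })
	fmt.Println(rec["stuck_orders_pending"])
}
```

With a sentinel like this, alert expressions written as `> 0` do not fire on a broken sampler, so the -1 value would need its own check if sampler failures should page.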

Alert rules in config/alertmanager/ledger.yml with runbook
pointers + detailed descriptions — each alert explains WHAT
happened, WHY the reconciler may not resolve it, and WHERE to
look first.

Grafana dashboard config/grafana/dashboards/ledger-health.json —
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick
+ orphan cumulative.

Tests — 6 cases, all green (sqlite :memory:):
* CountsStuckOrdersPending (includes the filter on
non-empty payment_id)
* StuckOrdersZeroWhenAllCompleted
* CountsOrphanRefunds (THE canary)
* CountsStuckRefundsWithHsID (gauge-orthogonality check)
* CountsFailedAndReversalPendingTransfers
* ReconcilerRecorders (counter + gauge shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
136 lines
5.1 KiB
JSON
{
  "title": "VEZA Ledger Health (v1.0.7)",
  "description": "Five stuck-state gauges + reconciler action rate. The top row tells you 'is money stuck right now?', the bottom row tells you 'is the reconciler keeping up?'. Paired with alert rules in config/alertmanager/ledger.yml — orphan_refund_rows > 0 for 5m pages ops.",
  "tags": ["veza", "ledger", "money-movement", "v1.0.7"],
  "timezone": "browser",
  "refresh": "1m",
  "schemaVersion": 39,
  "version": 1,
  "panels": [
    {
      "id": 1,
      "title": "Orphan refund rows (PAGE if > 0 for 5m)",
      "type": "stat",
      "gridPos": {"x": 0, "y": 0, "w": 5, "h": 5},
      "targets": [{"expr": "veza_ledger_orphan_refund_rows", "refId": "A"}],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "red", "value": 1}
            ]
          }
        }
      }
    },
    {
      "id": 2,
      "title": "Stuck orders (pending > 30m)",
      "type": "stat",
      "gridPos": {"x": 5, "y": 0, "w": 5, "h": 5},
      "targets": [{"expr": "veza_ledger_stuck_orders_pending", "refId": "A"}],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 1},
              {"color": "red", "value": 5}
            ]
          }
        }
      }
    },
    {
      "id": 3,
      "title": "Stuck refunds (pending w/ PSP id > 30m)",
      "type": "stat",
      "gridPos": {"x": 10, "y": 0, "w": 5, "h": 5},
      "targets": [{"expr": "veza_ledger_stuck_refunds_pending", "refId": "A"}]
    },
    {
      "id": 4,
      "title": "Reversal pending transfers (> 30m)",
      "type": "stat",
      "gridPos": {"x": 15, "y": 0, "w": 5, "h": 5},
      "targets": [{"expr": "veza_ledger_reversal_pending_transfers", "refId": "A"}]
    },
    {
      "id": 5,
      "title": "Permanently-failed transfers",
      "type": "stat",
      "gridPos": {"x": 20, "y": 0, "w": 4, "h": 5},
      "targets": [{"expr": "veza_ledger_failed_transfers_at_max_retry", "refId": "A"}]
    },
    {
      "id": 6,
      "title": "Stuck-state trends (last 6h)",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 5, "w": 12, "h": 8},
      "targets": [
        {"expr": "veza_ledger_stuck_orders_pending", "refId": "A", "legendFormat": "stuck orders"},
        {"expr": "veza_ledger_stuck_refunds_pending", "refId": "B", "legendFormat": "stuck refunds"},
        {"expr": "veza_ledger_orphan_refund_rows", "refId": "C", "legendFormat": "orphan refunds"},
        {"expr": "veza_ledger_reversal_pending_transfers", "refId": "D", "legendFormat": "reversal pending"},
        {"expr": "veza_ledger_failed_transfers_at_max_retry", "refId": "E", "legendFormat": "permanently failed transfers"}
      ]
    },
    {
      "id": 7,
      "title": "Reconciler actions (rate, by phase)",
      "type": "timeseries",
      "gridPos": {"x": 12, "y": 5, "w": 12, "h": 8},
      "description": "Actions the reconciler took per phase. A healthy system shows occasional stuck_orders + stuck_refunds spikes (PSP hiccups) and near-zero orphan_refunds (Phase 2 crashes should be rare).",
      "targets": [
        {"expr": "rate(veza_reconciler_actions_total[5m])", "refId": "A", "legendFormat": "{{phase}}"}
      ]
    },
    {
      "id": 8,
      "title": "Reconciler sweep duration",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 13, "w": 8, "h": 6},
      "targets": [
        {"expr": "histogram_quantile(0.5, rate(veza_reconciler_sweep_duration_seconds_bucket[5m]))", "refId": "A", "legendFormat": "p50"},
        {"expr": "histogram_quantile(0.95, rate(veza_reconciler_sweep_duration_seconds_bucket[5m]))", "refId": "B", "legendFormat": "p95"},
        {"expr": "histogram_quantile(0.99, rate(veza_reconciler_sweep_duration_seconds_bucket[5m]))", "refId": "C", "legendFormat": "p99"}
      ]
    },
    {
      "id": 9,
      "title": "Seconds since last reconciler tick",
      "type": "stat",
      "gridPos": {"x": 8, "y": 13, "w": 8, "h": 6},
      "description": "Alerts fire if this exceeds 2h (2 * RECONCILE_INTERVAL default). Sustained growth = worker is dead.",
      "targets": [
        {"expr": "time() - veza_reconciler_last_run_timestamp", "refId": "A"}
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 3600},
              {"color": "red", "value": 7200}
            ]
          }
        }
      }
    },
    {
      "id": 10,
      "title": "Orphan refunds auto-failed (cumulative)",
      "type": "timeseries",
      "gridPos": {"x": 16, "y": 13, "w": 8, "h": 6},
      "description": "Each increment = a Phase 2 crash the reconciler caught. Non-zero rate = investigate root cause.",
      "targets": [
        {"expr": "veza_reconciler_orphan_refunds_total", "refId": "A", "legendFormat": "total"}
      ]
    }
  ]
}