Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").
Gauges (veza_ledger_, sampled every 60s):
* orphan_refund_rows — THE canary. Pending refunds with empty
hyperswitch_refund_id older than 5m = Phase 2 crash in
RefundOrder. Alert: > 0 for 5m → page.
* stuck_orders_pending — order pending > 30m with non-empty
payment_id. Alert: > 0 for 10m → page.
* stuck_refunds_pending — refund pending > 30m with hs_id.
* failed_transfers_at_max_retry — permanently_failed rows.
* reversal_pending_transfers — item B rows stuck > 30m.
Reconciler metrics (veza_reconciler_):
* actions_total{phase} — counter by phase.
* orphan_refunds_total — two-phase-bug canary.
* sweep_duration_seconds — exponential histogram.
* last_run_timestamp — alert: stale > 2h → page (worker dead).
Implementation notes:
* Sampler thresholds are hardcoded to the reconciler's defaults.
If the reconciler is configured differently, the mismatch is
tolerated on purpose (alerts firing while the reconciler is
already working = correct behavior).
* Query error sets gauge to -1 (sentinel for "sampler broken").
* marketplace package routes through monitoring recorders so it
doesn't import prometheus directly.
* Sampler runs regardless of Hyperswitch enablement; gauges
default 0 when pipeline idle.
* Graceful shutdown wired in cmd/api/main.go.
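Because the -1 sentinel never satisfies a `> 0` alert expression, a
broken sampler stays silent unless something watches for it. A
minimal sketch of such a rule, assuming every series under the
veza_ledger_ prefix is one of the five gauges above. The group name,
alert name, severity and `for` window are hypothetical and not part
of this commit:

```yaml
groups:
  - name: veza_ledger_sampler            # hypothetical group name
    rules:
      - alert: VezaLedgerSamplerBroken   # hypothetical, not shipped here
        # min() collapses all veza_ledger_* gauges into one series, so
        # the alert fires once if any gauge sits at the -1 sentinel.
        expr: min({__name__=~"veza_ledger_.+"}) < 0
        for: 5m
        labels:
          severity: ticket               # assumption: not page-worthy alone
          team: payments
        annotations:
          summary: "A veza_ledger_* gauge reads -1 (sampler query failing)"
```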
Alert rules in config/alertmanager/ledger.yml with runbook
pointers + detailed descriptions — each alert explains WHAT
happened, WHY the reconciler may not resolve it, and WHERE to
look first.
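A minimal sketch of wiring the rule file into Prometheus, assuming
prometheus.yml sits at the repo root. Prometheus resolves relative
rule_files entries against the directory that holds prometheus.yml,
so adjust the path to the actual layout:

```yaml
# prometheus.yml (fragment)
rule_files:
  - "config/alertmanager/ledger.yml"
```

The file can be checked before a reload with
`promtool check rules config/alertmanager/ledger.yml`.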
Grafana dashboard config/grafana/dashboards/ledger-health.json —
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick
+ orphan cumulative.
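The reconciler panels can be expressed as recording rules. This is a
sketch, not the queries stored in the dashboard JSON. It assumes
sweep_duration_seconds is a classic histogram exposing `_bucket`
series. The group name, record names and rate windows are
hypothetical:

```yaml
groups:
  - name: veza_reconciler_dashboard      # hypothetical group name
    interval: 1m
    rules:
      # Sweep duration p95. The 6h window is an assumption, chosen to
      # span several sweeps at the default 1h reconcile interval.
      - record: veza_reconciler:sweep_duration_seconds:p95_6h
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(veza_reconciler_sweep_duration_seconds_bucket[6h]))
          )
      # Reconciler action rate, split by phase.
      - record: veza_reconciler:actions:rate1h
        expr: sum by (phase) (rate(veza_reconciler_actions_total[1h]))
      # Seconds since the last reconciler tick.
      - record: veza_reconciler:seconds_since_last_run
        expr: time() - veza_reconciler_last_run_timestamp
```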
Tests — 6 cases, all green (sqlite :memory:):
* CountsStuckOrdersPending (includes the filter on
non-empty payment_id)
* StuckOrdersZeroWhenAllCompleted
* CountsOrphanRefunds (THE canary)
* CountsStuckRefundsWithHsID (gauge-orthogonality check)
* CountsFailedAndReversalPendingTransfers
* ReconcilerRecorders (counter + gauge shape)
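The alert rules themselves can be exercised with promtool unit
tests. A sketch, assuming a hypothetical ledger_test.yml placed next
to ledger.yml. It only asserts the non-firing cases (the -1 sentinel
and a value still inside the `for` window), because asserting a
firing alert requires reproducing its annotations verbatim:

```yaml
# ledger_test.yml (hypothetical); run: promtool test rules ledger_test.yml
rule_files:
  - ledger.yml
evaluation_interval: 30s
tests:
  # Sampler broken: gauge pinned at the -1 sentinel must not page.
  - interval: 1m
    input_series:
      - series: 'veza_ledger_orphan_refund_rows'
        values: '-1+0x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: VezaOrphanRefundRows
        exp_alerts: []
  # Orphan row present for only 3m: still pending, not yet firing.
  - interval: 1m
    input_series:
      - series: 'veza_ledger_orphan_refund_rows'
        values: '1+0x3'
    alert_rule_test:
      - eval_time: 3m
        alertname: VezaOrphanRefundRows
        exp_alerts: []
```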
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Prometheus alert rules for VEZA ledger health (v1.0.7 item F).
#
# Loaded by Prometheus via its `rule_files:` directive: point your
# prometheus.yml at this file (or a glob that covers it) and reload.
#
# Only two of the five ledger gauges page in v1.0.7, intentionally
# (the third rule below covers reconciler liveness). The other three
# ledger gauges (stuck_refunds, failed_transfers, reversal_pending)
# are visible on the dashboard and deserve *human* investigation
# when they trend up, not a page that fires at 3am: they're symptoms
# of slower PSP / Connect health issues, not data-integrity bugs.
#
# A non-zero orphan_refund_rows is different: it means a Phase 2
# crash happened (between our DB commit and the Hyperswitch PSP
# call) and the row sits with money on our side but no refund in
# flight. That's a real two-phase-commit bug; the reconciler will
# auto-fail the row at its next tick once it is older than 5m, but
# ops needs to know WHY Phase 2 crashed.

groups:
  - name: veza_ledger_health
    interval: 30s
    rules:
      - alert: VezaStuckOrdersPending
        expr: veza_ledger_stuck_orders_pending > 0
        for: 10m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/stuck-orders.md"
        annotations:
          summary: "{{ $value }} order(s) stuck in `pending` for >30m"
          description: |
            An order sat in status=pending for more than 30 minutes
            with a non-empty hyperswitch_payment_id. This means we
            opened the payment at Hyperswitch but never received the
            terminal webhook. The ReconcileHyperswitchWorker should
            resolve this automatically at its next tick (default 1h).
            If the count keeps growing across ticks, the reconciler
            itself is stuck; check veza_reconciler_last_run_timestamp.

      - alert: VezaOrphanRefundRows
        expr: veza_ledger_orphan_refund_rows > 0
        for: 5m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/orphan-refunds.md"
        annotations:
          summary: "{{ $value }} orphan refund row(s) - Phase 2 crash"
          description: |
            A Refund row exists in 'pending' with no
            hyperswitch_refund_id, older than 5 minutes. This is a
            bug in the two-phase commit between our DB and
            Hyperswitch: Phase 1 (create pending Refund row +
            flip order to refund_pending) ran, Phase 2 (POST
            /refunds at Hyperswitch) never did. The reconciler
            will auto-fail the row at its next tick, but the ROOT
            CAUSE of the Phase 2 crash must be investigated; this
            indicates a panic, OOM, or network timeout in
            RefundOrder. Check app logs for the affected refund_id
            timestamp and look for the crash signal.

      # -- Reconciler liveness (item C self-monitoring) ---------
      # Fires if the reconciler hasn't ticked within 2 intervals.
      # RECONCILE_INTERVAL default is 1h, so 2h without a tick.
      - alert: VezaReconcilerStale
        expr: time() - veza_reconciler_last_run_timestamp > 7200
        for: 5m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/reconciler-stale.md"
        annotations:
          summary: "Reconciliation worker has not run in >2h"
          description: |
            veza_reconciler_last_run_timestamp is stale by more than
            2 * RECONCILE_INTERVAL (default 1h, so 2h threshold).
            Either the worker's goroutine crashed or ctx was
            cancelled without restart. Without the reconciler,
            stuck orders + orphan refunds accumulate indefinitely.
            Restart the backend; if it persists, check the logs
            for 'ReconcileHyperswitchWorker stopped' or a panic
            trace.