veza/config/alertmanager/ledger.yml

# Prometheus alert rules for VEZA ledger health (v1.0.7 item F).
#
# Loaded by Prometheus via its `rule_files:` directive — point your
# prometheus.yml at this file (or a glob that covers it) and reload.
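#
# A minimal sketch of the matching prometheus.yml stanza. The path below
# is an assumption; adjust it to wherever this file is deployed:
#
#   rule_files:
#     - "config/alertmanager/ledger.yml"
#
# Running `promtool check rules` on this file before the reload catches
# syntax errors early.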
#
# Only two of the five ledger gauges page in v1.0.7, intentionally
# (a third rule below covers reconciler liveness). The other three
# gauges (stuck_refunds, failed_transfers, reversal_pending) are
# visible on the dashboard and deserve *human* investigation when
# they trend up, not a page that fires at 3am — they're symptoms of
# slower PSP / Connect health issues, not data-integrity bugs.
#
# A non-zero orphan_refund_rows is different: it means a Phase 2
# crash happened (between our DB commit and the Hyperswitch PSP
# call) and the row sits with money on our side but no refund in
# flight. That's a real two-phase-commit bug; the reconciler will
# auto-fail the row at its next tick once it is older than 5m, but
# ops needs to know WHY Phase 2 crashed.

groups:
  - name: veza_ledger_health
    interval: 30s
    rules:
      - alert: VezaStuckOrdersPending
        expr: veza_ledger_stuck_orders_pending > 0
        for: 10m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/stuck-orders.md"
        annotations:
          summary: "{{ $value }} order(s) stuck in `pending` for >30m"
          description: |
            An order sat in status=pending for more than 30 minutes
            with a non-empty hyperswitch_payment_id. This means we
            opened the payment at Hyperswitch but never received the
            terminal webhook. The ReconcileHyperswitchWorker should
            resolve this automatically at its next tick (default 1h).
            If the count keeps growing across ticks, the reconciler
            itself is stuck — check veza_reconciler_last_run_timestamp.

      - alert: VezaOrphanRefundRows
        expr: veza_ledger_orphan_refund_rows > 0
        for: 5m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/orphan-refunds.md"
        annotations:
          summary: "{{ $value }} orphan refund row(s) — Phase 2 crash"
          description: |
            A Refund row exists in 'pending' with no
            hyperswitch_refund_id, older than 5 minutes. This is a
            bug in the two-phase commit between our DB and
            Hyperswitch: Phase 1 (create pending Refund row +
            flip order to refund_pending) ran, Phase 2 (POST
            /refunds at Hyperswitch) never did. The reconciler
            will auto-fail the row at its next tick, but the ROOT
            CAUSE of the Phase 2 crash must be investigated: it
            points to a panic, OOM, or network timeout in
            RefundOrder. Check the app logs around the affected
            refund_id's timestamp and look for the crash signal.

      # -- Reconciler liveness (item C self-monitoring) ---------
      # Fires if the reconciler hasn't ticked within 2 intervals.
      # RECONCILE_INTERVAL default is 1h, so 2h without a tick.
      - alert: VezaReconcilerStale
        expr: time() - veza_reconciler_last_run_timestamp > 7200
        for: 5m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/reconciler-stale.md"
        annotations:
          summary: "Reconciliation worker has not run in >2h"
          description: |
            veza_reconciler_last_run_timestamp is stale by more than
            2 * RECONCILE_INTERVAL (default 1h, so 2h threshold).
            Either the worker's goroutine crashed or ctx was
            cancelled without restart. Without the reconciler,
            stuck orders + orphan refunds accumulate indefinitely.
            Restart the backend; if it persists, check the logs
            for 'ReconcileHyperswitchWorker stopped' or a panic
            trace.
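
      # -- Sampler health (illustrative sketch, NOT part of v1.0.7) ------
      # The ledger sampler is described as setting a gauge to -1 when its
      # query fails, and the `> 0` rules above stay silent on that value.
      # One possible way to surface it, kept commented out. The alert name,
      # severity, and `for:` window are assumptions, not shipped config:
      #
      # - alert: VezaLedgerSamplerBroken
      #   expr: veza_ledger_orphan_refund_rows < 0 or veza_ledger_stuck_orders_pending < 0
      #   for: 10m
      #   labels:
      #     severity: ticket
      #     team: payments
      #   annotations:
      #     summary: "Ledger sampler reports the -1 error sentinel"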

feat(metrics): ledger-health gauges + alert rules - v1.0.7 item F

Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").

Gauges (veza_ledger_, sampled every 60s):
* orphan_refund_rows - THE canary. Pending refunds with empty
  hyperswitch_refund_id older than 5m = Phase 2 crash in RefundOrder.
  Alert: > 0 for 5m → page.
* stuck_orders_pending - order pending > 30m with non-empty
  payment_id. Alert: > 0 for 10m → page.
* stuck_refunds_pending - refund pending > 30m with hs_id.
* failed_transfers_at_max_retry - permanently_failed rows.
* reversal_pending_transfers - item B rows stuck > 30m.

Reconciler metrics (veza_reconciler_):
* actions_total{phase} - counter by phase.
* orphan_refunds_total - two-phase-bug canary.
* sweep_duration_seconds - exponential histogram.
* last_run_timestamp - alert: stale > 2h → page (worker dead).

Implementation notes:
* Sampler thresholds hardcoded to match reconciler defaults;
  intentional mismatch allowed (alerts fire while reconciler already
  working = correct behavior).
* Query error sets gauge to -1 (sentinel for "sampler broken").
* marketplace package routes through monitoring recorders so it
  doesn't import prometheus directly.
* Sampler runs regardless of Hyperswitch enablement; gauges default 0
  when pipeline idle.
* Graceful shutdown wired in cmd/api/main.go.

Alert rules in config/alertmanager/ledger.yml with runbook pointers +
detailed descriptions: each alert explains WHAT happened, WHY the
reconciler may not resolve it, and WHERE to look first.

Grafana dashboard config/grafana/dashboards/ledger-health.json:
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick +
orphan cumulative.

Tests - 6 cases, all green (sqlite :memory:):
* CountsStuckOrdersPending (includes the filter on non-empty payment_id)
* StuckOrdersZeroWhenAllCompleted
* CountsOrphanRefunds (THE canary)
* CountsStuckRefundsWithHsID (gauge-orthogonality check)
* CountsFailedAndReversalPendingTransfers
* ReconcilerRecorders (counter + gauge shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 01:40:14 +00:00