# Prometheus alert rules for VEZA ledger health (v1.0.7 item F).
#
# Loaded by Prometheus via its `rule_files:` directive: point your
# prometheus.yml at this file (or a glob that covers it) and reload.
#
# Only two of the five ledger gauges alert in v1.0.7, intentionally.
# The other three (stuck_refunds, failed_transfers, reversal_pending)
# are visible on the dashboard and deserve *human* investigation when
# they trend up, not a page that fires at 3am. They are symptoms of
# slower PSP / Connect health issues, not data-integrity bugs.
#
# A non-zero orphan_refund_rows is different: it means a Phase 2
# crash happened (between our DB commit and the Hyperswitch PSP
# call) and the row sits with money on our side but no refund in
# flight. That's a real two-phase-commit bug; the reconciler will
# auto-fail it after 5m but ops needs to know WHY Phase 2 crashed.

groups:
  - name: veza_ledger_health
    interval: 30s
    rules:
      - alert: VezaStuckOrdersPending
        expr: veza_ledger_stuck_orders_pending > 0
        for: 10m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/stuck-orders.md"
        annotations:
          summary: "{{ $value }} order(s) stuck in `pending` for >30m"
          description: |
            An order sat in status=pending for more than 30 minutes
            with a non-empty hyperswitch_payment_id. This means we
            opened the payment at Hyperswitch but never received the
            terminal webhook. The ReconcileHyperswitchWorker should
            resolve this automatically at its next tick (default 1h).
            If the count keeps growing across ticks, the reconciler
            itself is stuck; check veza_reconciler_last_run_timestamp.

      - alert: VezaOrphanRefundRows
        expr: veza_ledger_orphan_refund_rows > 0
        for: 5m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/orphan-refunds.md"
        annotations:
          summary: "{{ $value }} orphan refund row(s) (Phase 2 crash)"
          description: |
            A Refund row exists in 'pending' with no
            hyperswitch_refund_id, older than 5 minutes.
            This is a bug in the two-phase commit between our DB and
            Hyperswitch: Phase 1 (create pending Refund row + flip
            order to refund_pending) ran, Phase 2 (POST /refunds at
            Hyperswitch) never did. The reconciler will auto-fail the
            row at its next tick, but the ROOT CAUSE of the Phase 2
            crash must be investigated: this indicates a panic, OOM,
            or network timeout in RefundOrder. Check app logs for the
            affected refund_id timestamp and look for the crash signal.

      # -- Reconciler liveness (item C self-monitoring) ---------
      # Fires if the reconciler hasn't ticked within 2 intervals.
      # RECONCILE_INTERVAL default is 1h, so 2h without a tick.
      - alert: VezaReconcilerStale
        expr: time() - veza_reconciler_last_run_timestamp > 7200
        for: 5m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/reconciler-stale.md"
        annotations:
          summary: "Reconciliation worker has not run in >2h"
          description: |
            veza_reconciler_last_run_timestamp is stale by more than
            2 * RECONCILE_INTERVAL (default 1h, so 2h threshold).
            Either the worker's goroutine crashed or ctx was
            cancelled without restart. Without the reconciler,
            stuck orders + orphan refunds accumulate indefinitely.
            Restart the backend; if it persists, check the logs for
            'ReconcileHyperswitchWorker stopped' or a panic trace.
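
# -- Example: loading and validating this file (sketch) ----
# A minimal sketch of the prometheus.yml wiring mentioned in the
# header. The path is hypothetical (this file's real name and
# location are not stated here); substitute your own.
#
#   rule_files:
#     - "rules/veza_ledger_alerts.yml"   # hypothetical path; a glob such as "rules/*.yml" also works
#
# Before reloading, the rules can be syntax-checked with promtool:
#
#   promtool check rules rules/veza_ledger_alerts.yml
#
# Prometheus re-reads rule files on SIGHUP, or on POST /-/reload
# when it was started with --web.enable-lifecycle.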