# Prometheus alert rules for VEZA ledger health (v1.0.7 item F).
#
# Loaded by Prometheus via its `rule_files:` directive: point your
# prometheus.yml at this file (or a glob that covers it) and reload.
#
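# A minimal sketch of the prometheus.yml side, assuming this file is
# deployed as /etc/prometheus/rules/veza_ledger.yml (a hypothetical
# path; substitute wherever your deployment actually puts it):
#
#   rule_files:
#     - /etc/prometheus/rules/veza_ledger.yml
#     # or a glob that covers it:
#     # - /etc/prometheus/rules/*.yml
#
# Prometheus re-reads rule files on SIGHUP, or on POST /-/reload when
# it was started with --web.enable-lifecycle.
#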
# Only two ledger-gauge alerts for v1.0.7, intentionally (the third
# rule below is reconciler liveness, not a ledger gauge). The other
# three ledger gauges (stuck_refunds, failed_transfers,
# reversal_pending) are visible on the dashboard and deserve *human*
# investigation when they trend up, not a page that fires at 3am:
# they're symptoms of slower PSP / Connect health issues, not
# data-integrity bugs.
#
# A non-zero orphan_refund_rows is different: it means a Phase 2
# crash happened (between our DB commit and the Hyperswitch PSP
# call) and the row sits with money on our side but no refund in
# flight. That's a real two-phase-commit bug; the reconciler will
# auto-fail it after 5m but ops needs to know WHY Phase 2 crashed.

groups:
  - name: veza_ledger_health
    interval: 30s
    rules:
      - alert: VezaStuckOrdersPending
        expr: veza_ledger_stuck_orders_pending > 0
        for: 10m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/stuck-orders.md"
        annotations:
          summary: "{{ $value }} order(s) stuck in `pending` for >30m"
          description: |
            An order sat in status=pending for more than 30 minutes
            with a non-empty hyperswitch_payment_id. This means we
            opened the payment at Hyperswitch but never received the
            terminal webhook. The ReconcileHyperswitchWorker should
            resolve this automatically at its next tick (default 1h).
            If the count keeps growing across ticks, the reconciler
            itself is stuck; check veza_reconciler_last_run_timestamp.

      - alert: VezaOrphanRefundRows
        expr: veza_ledger_orphan_refund_rows > 0
        for: 5m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/orphan-refunds.md"
        annotations:
          summary: "{{ $value }} orphan refund row(s): Phase 2 crash"
          description: |
            A Refund row exists in 'pending' with no
            hyperswitch_refund_id, older than 5 minutes. This is a
            bug in the two-phase commit between our DB and
            Hyperswitch: Phase 1 (create pending Refund row +
            flip order to refund_pending) ran, Phase 2 (POST
            /refunds at Hyperswitch) never did. The reconciler
            will auto-fail the row at its next tick, but the ROOT
            CAUSE of the Phase 2 crash must be investigated. Likely
            causes are a panic, OOM, or network timeout in
            RefundOrder. Check the app logs around the timestamp of
            the affected refund_id and look for the crash signal.

      # -- Reconciler liveness (item C self-monitoring) ---------
      # Fires if the reconciler hasn't ticked within 2 intervals.
      # RECONCILE_INTERVAL default is 1h, so 2h without a tick.
      - alert: VezaReconcilerStale
        expr: time() - veza_reconciler_last_run_timestamp > 7200
        for: 5m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/reconciler-stale.md"
        annotations:
          summary: "Reconciliation worker has not run in >2h"
          description: |
            veza_reconciler_last_run_timestamp is stale by more than
            2 * RECONCILE_INTERVAL (default 1h, so 2h threshold).
            Either the worker's goroutine crashed or ctx was
            cancelled without restart. Without the reconciler,
            stuck orders + orphan refunds accumulate indefinitely.
            Restart the backend; if it persists, check the logs
            for 'ReconcileHyperswitchWorker stopped' or a panic
            trace.