feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F
Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").
Gauges (veza_ledger_, sampled every 60s):
* orphan_refund_rows — THE canary. Pending refunds with empty
hyperswitch_refund_id older than 5m = Phase 2 crash in
RefundOrder. Alert: > 0 for 5m → page.
* stuck_orders_pending — order pending > 30m with non-empty
payment_id. Alert: > 0 for 10m → page.
* stuck_refunds_pending — refund pending > 30m with hs_id.
* failed_transfers_at_max_retry — permanently_failed rows.
* reversal_pending_transfers — item B rows stuck > 30m.
Reconciler metrics (veza_reconciler_):
* actions_total{phase} — counter by phase.
* orphan_refunds_total — two-phase-bug canary.
* sweep_duration_seconds — exponential histogram.
* last_run_timestamp — alert: stale > 2h → page (worker dead).
Implementation notes:
* Sampler thresholds hardcoded to the reconciler defaults; if ops
  re-tunes the reconciler via env vars, the resulting mismatch is
  intentional (alerts fire while the reconciler is already working
  on the issue = correct behavior).
* Query error sets gauge to -1 (sentinel for "sampler broken").
* marketplace package routes through monitoring recorders so it
doesn't import prometheus directly.
* Sampler runs regardless of Hyperswitch enablement; gauges
default 0 when pipeline idle.
* Graceful shutdown wired in cmd/api/main.go.
Alert rules in config/alertmanager/ledger.yml with runbook
pointers + detailed descriptions — each alert explains WHAT
happened, WHY the reconciler may not resolve it, and WHERE to
look first.
Grafana dashboard config/grafana/dashboards/ledger-health.json —
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick
+ orphan cumulative.
Tests — 6 cases, all green (sqlite :memory:):
* CountsStuckOrdersPending (includes the filter on
non-empty payment_id)
* StuckOrdersZeroWhenAllCompleted
* CountsOrphanRefunds (THE canary)
* CountsStuckRefundsWithHsID (gauge-orthogonality check)
* CountsFailedAndReversalPendingTransfers
* ReconcilerRecorders (counter + gauge shape)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 645fd23e22
commit 94dfc80b73
7 changed files with 744 additions and 0 deletions
CHANGELOG.md (+60)

@@ -38,6 +38,66 @@ auto-reversed; the backfill CLI queries Stripe's transfers.List by
metadata[order_id] to populate missing ids, acceptable to leave NULL
per v107-plan.

### Item F — ledger-health metrics + alerts

Five Prometheus gauges expose money-movement pipeline state so ops
dashboards + alert rules can spot a stall before a customer does.
Paired with counter/histogram metrics for the item-C reconciler so
the dashboard tells the whole story at a glance ("we have N stuck
orders and the reconciler has resolved M of them today").

Gauges (sampled every 60s via `ScheduleLedgerHealthSampler`):
* `veza_ledger_orphan_refund_rows` — THE alert gauge. Pending
  refunds with empty hyperswitch_refund_id older than 5m.
  Non-zero = Phase 2 crash in RefundOrder. Pages on > 0 for 5m.
* `veza_ledger_stuck_orders_pending` — orders pending > 30m with
  non-empty payment_id (webhook never arrived). Pages on > 0 for 10m.
* `veza_ledger_stuck_refunds_pending` — refunds with hs_id but
  still pending > 30m.
* `veza_ledger_failed_transfers_at_max_retry` — seller_transfers
  in permanently_failed.
* `veza_ledger_reversal_pending_transfers` — item B rows stuck
  in reversal_pending > 30m (worker behind or Stripe down).

Reconciler metrics (item F extends item C observability):
* `veza_reconciler_actions_total{phase}` — counter labelled by
  phase (stuck_orders | stuck_refunds | orphan_refunds).
* `veza_reconciler_orphan_refunds_total` — dedicated counter for
  the two-phase-commit-bug canary.
* `veza_reconciler_sweep_duration_seconds` — histogram with 10
  exponential buckets (0.1s up to 51.2s, +Inf above).
* `veza_reconciler_last_run_timestamp` — unix ts of last tick.
  Alert fires if `time() - ts > 7200` (2 * default
  RECONCILE_INTERVAL).
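
For reference, the PromQL built on these (the same expressions the
Grafana panels below use):

```promql
# Reconciler action rate by phase (dashboard: "Reconciler actions").
rate(veza_reconciler_actions_total[5m])

# Sweep latency p95 from the histogram buckets.
histogram_quantile(0.95, rate(veza_reconciler_sweep_duration_seconds_bucket[5m]))

# Seconds since the last tick (the staleness alert's input).
time() - veza_reconciler_last_run_timestamp
```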

Sampler queries are all indexed on `status + created_at` (or
`status + updated_at` for reversal_pending). Query errors set the
gauge to -1 — a distinctive value dashboards filter on ("sampler
broken, don't trust the number") instead of leaking a stale value.
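
A sketch of such a filter (not part of the shipped dashboard; PromQL
comparisons drop non-matching samples, so the sentinel simply
disappears from the panel):

```promql
# Keep only samples where the sampler succeeded; -1 means "query failed".
veza_ledger_orphan_refund_rows >= 0
```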

Alert rules in `config/alertmanager/ledger.yml`:
* `VezaOrphanRefundRows` — page on > 0 for 5m (two-phase bug)
* `VezaStuckOrdersPending` — page on > 0 for 10m (webhook
  pipeline stuck)
* `VezaReconcilerStale` — page on last-run > 2h (worker dead,
  stuck/orphan rows accumulating)
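
Loading them is one `rule_files:` entry in prometheus.yml; a minimal
sketch, assuming the repo's config directory is visible to Prometheus
(adjust the path to the deployment):

```yaml
# prometheus.yml (fragment)
rule_files:
  - "config/alertmanager/ledger.yml"
```

`promtool check rules config/alertmanager/ledger.yml` validates the
file before a reload.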

Grafana dashboard `config/grafana/dashboards/ledger-health.json`:
5 stat panels (top row) + stuck-state timeseries + reconciler
action rate + sweep duration quantiles + seconds-since-last-tick
+ orphan refunds cumulative.
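
The JSON is a plain dashboard model, so it can be imported by hand or
picked up by Grafana's file provisioning. A hedged sketch of a
provisioning entry follows; the provider name and paths are
assumptions, not part of this commit:

```yaml
# e.g. /etc/grafana/provisioning/dashboards/veza.yml (hypothetical path)
apiVersion: 1
providers:
  - name: veza-ledger   # any unique provider name
    type: file
    options:
      path: /etc/grafana/dashboards   # directory ledger-health.json is copied into
```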

Worker instrumentation: ReconcileHyperswitchWorker now emits
RecordReconcilerAction / RecordReconcilerOrphanRefund /
RecordReconcilerSweepDuration at the right points. Tests cover
the sampler's count queries (5 cases, all branches) plus the
recorder shape.

Sampler wired in cmd/api/main.go with graceful shutdown; runs
regardless of Hyperswitch enablement (gauges default to 0, which
is the correct story for "Hyperswitch not configured").

### Item C — Hyperswitch reconciliation sweep

New `ReconcileHyperswitchWorker` sweeps for pending orders and
config/alertmanager/ledger.yml (+82, new file)

@@ -0,0 +1,82 @@
# Prometheus alert rules for VEZA ledger health (v1.0.7 item F).
#
# Loaded by Prometheus via its `rule_files:` directive — point your
# prometheus.yml at this file (or a glob that covers it) and reload.
#
# Only two of the five ledger gauges get alert rules in v1.0.7,
# intentionally (the third rule below watches reconciler liveness,
# not a gauge). The other three gauges (stuck_refunds,
# failed_transfers, reversal_pending) are visible on the dashboard
# and deserve *human* investigation when they trend up, not a page
# that fires at 3am — they're symptoms of slower PSP / Connect
# health issues, not data-integrity bugs.
#
# A non-zero orphan_refund_rows is different: it means a Phase 2
# crash happened (between our DB commit and the Hyperswitch PSP
# call) and the row sits with money on our side but no refund in
# flight. That's a real two-phase-commit bug; the reconciler will
# auto-fail it after 5m but ops needs to know WHY Phase 2 crashed.

groups:
  - name: veza_ledger_health
    interval: 30s
    rules:
      - alert: VezaStuckOrdersPending
        expr: veza_ledger_stuck_orders_pending > 0
        for: 10m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/stuck-orders.md"
        annotations:
          summary: "{{ $value }} order(s) stuck in `pending` for >30m"
          description: |
            An order sat in status=pending for more than 30 minutes
            with a non-empty hyperswitch_payment_id. This means we
            opened the payment at Hyperswitch but never received the
            terminal webhook. The ReconcileHyperswitchWorker should
            resolve this automatically at its next tick (default 1h).
            If the count keeps growing across ticks, the reconciler
            itself is stuck — check veza_reconciler_last_run_timestamp.

      - alert: VezaOrphanRefundRows
        expr: veza_ledger_orphan_refund_rows > 0
        for: 5m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/orphan-refunds.md"
        annotations:
          summary: "{{ $value }} orphan refund row(s) — Phase 2 crash"
          description: |
            A Refund row exists in 'pending' with no
            hyperswitch_refund_id, older than 5 minutes. This is a
            bug in the two-phase commit between our DB and
            Hyperswitch: Phase 1 (create pending Refund row +
            flip order to refund_pending) ran, Phase 2 (POST
            /refunds at Hyperswitch) never did. The reconciler
            will auto-fail the row at its next tick, but the ROOT
            CAUSE of the Phase 2 crash must be investigated — this
            indicates a panic, OOM, or network timeout in
            RefundOrder. Check app logs around the affected refund_id
            timestamp and look for the crash signal.

      # -- Reconciler liveness (item C self-monitoring) ---------
      # Fires if the reconciler hasn't ticked within 2 intervals.
      # RECONCILE_INTERVAL default is 1h, so 2h without a tick.
      - alert: VezaReconcilerStale
        expr: time() - veza_reconciler_last_run_timestamp > 7200
        for: 5m
        labels:
          severity: page
          team: payments
          runbook: "docs/runbooks/reconciler-stale.md"
        annotations:
          summary: "Reconciliation worker has not run in >2h"
          description: |
            veza_reconciler_last_run_timestamp is stale by more than
            2 * RECONCILE_INTERVAL (default 1h, so 2h threshold).
            Either the worker's goroutine crashed or ctx was
            cancelled without restart. Without the reconciler,
            stuck orders + orphan refunds accumulate indefinitely.
            Restart the backend; if it persists, check the logs
            for 'ReconcileHyperswitchWorker stopped' or a panic
            trace.
config/grafana/dashboards/ledger-health.json (+136, new file)

@@ -0,0 +1,136 @@
{
  "title": "VEZA Ledger Health (v1.0.7)",
  "description": "Five stuck-state gauges + reconciler action rate. The top row tells you 'is money stuck right now?', the bottom row tells you 'is the reconciler keeping up?'. Paired with alert rules in config/alertmanager/ledger.yml — orphan_refund_rows > 0 for 5m pages ops.",
  "tags": ["veza", "ledger", "money-movement", "v1.0.7"],
  "timezone": "browser",
  "refresh": "1m",
  "schemaVersion": 39,
  "version": 1,
  "panels": [
    {
      "id": 1,
      "title": "Orphan refund rows (PAGE if > 0 for 5m)",
      "type": "stat",
      "gridPos": {"x": 0, "y": 0, "w": 5, "h": 5},
      "targets": [{"expr": "veza_ledger_orphan_refund_rows", "refId": "A"}],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "red", "value": 1}
            ]
          }
        }
      }
    },
    {
      "id": 2,
      "title": "Stuck orders (pending > 30m)",
      "type": "stat",
      "gridPos": {"x": 5, "y": 0, "w": 5, "h": 5},
      "targets": [{"expr": "veza_ledger_stuck_orders_pending", "refId": "A"}],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 1},
              {"color": "red", "value": 5}
            ]
          }
        }
      }
    },
    {
      "id": 3,
      "title": "Stuck refunds (pending w/ PSP id > 30m)",
      "type": "stat",
      "gridPos": {"x": 10, "y": 0, "w": 5, "h": 5},
      "targets": [{"expr": "veza_ledger_stuck_refunds_pending", "refId": "A"}]
    },
    {
      "id": 4,
      "title": "Reversal pending transfers (> 30m)",
      "type": "stat",
      "gridPos": {"x": 15, "y": 0, "w": 5, "h": 5},
      "targets": [{"expr": "veza_ledger_reversal_pending_transfers", "refId": "A"}]
    },
    {
      "id": 5,
      "title": "Permanently-failed transfers",
      "type": "stat",
      "gridPos": {"x": 20, "y": 0, "w": 4, "h": 5},
      "targets": [{"expr": "veza_ledger_failed_transfers_at_max_retry", "refId": "A"}]
    },
    {
      "id": 6,
      "title": "Stuck-state trends (last 6h)",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 5, "w": 12, "h": 8},
      "targets": [
        {"expr": "veza_ledger_stuck_orders_pending", "refId": "A", "legendFormat": "stuck orders"},
        {"expr": "veza_ledger_stuck_refunds_pending", "refId": "B", "legendFormat": "stuck refunds"},
        {"expr": "veza_ledger_orphan_refund_rows", "refId": "C", "legendFormat": "orphan refunds"},
        {"expr": "veza_ledger_reversal_pending_transfers", "refId": "D", "legendFormat": "reversal pending"},
        {"expr": "veza_ledger_failed_transfers_at_max_retry", "refId": "E", "legendFormat": "permanently failed transfers"}
      ]
    },
    {
      "id": 7,
      "title": "Reconciler actions (rate, by phase)",
      "type": "timeseries",
      "gridPos": {"x": 12, "y": 5, "w": 12, "h": 8},
      "description": "Actions the reconciler took per phase. A healthy system shows occasional stuck_orders + stuck_refunds spikes (PSP hiccups) and near-zero orphan_refunds (Phase 2 crashes should be rare).",
      "targets": [
        {"expr": "rate(veza_reconciler_actions_total[5m])", "refId": "A", "legendFormat": "{{phase}}"}
      ]
    },
    {
      "id": 8,
      "title": "Reconciler sweep duration",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 13, "w": 8, "h": 6},
      "targets": [
        {"expr": "histogram_quantile(0.5, rate(veza_reconciler_sweep_duration_seconds_bucket[5m]))", "refId": "A", "legendFormat": "p50"},
        {"expr": "histogram_quantile(0.95, rate(veza_reconciler_sweep_duration_seconds_bucket[5m]))", "refId": "B", "legendFormat": "p95"},
        {"expr": "histogram_quantile(0.99, rate(veza_reconciler_sweep_duration_seconds_bucket[5m]))", "refId": "C", "legendFormat": "p99"}
      ]
    },
    {
      "id": 9,
      "title": "Seconds since last reconciler tick",
      "type": "stat",
      "gridPos": {"x": 8, "y": 13, "w": 8, "h": 6},
      "description": "Alerts fire if this exceeds 2h (2 * RECONCILE_INTERVAL default). Sustained growth = worker is dead.",
      "targets": [
        {"expr": "time() - veza_reconciler_last_run_timestamp", "refId": "A"}
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 3600},
              {"color": "red", "value": 7200}
            ]
          }
        }
      }
    },
    {
      "id": 10,
      "title": "Orphan refunds auto-failed (cumulative)",
      "type": "timeseries",
      "gridPos": {"x": 16, "y": 13, "w": 8, "h": 6},
      "description": "Each increment = a Phase 2 crash the reconciler caught. Non-zero rate = investigate root cause.",
      "targets": [
        {"expr": "veza_reconciler_orphan_refunds_total", "refId": "A", "legendFormat": "total"}
      ]
    }
  ]
}
cmd/api/main.go

@@ -24,6 +24,7 @@ import (
	vezaes "veza-backend-api/internal/elasticsearch"
	"veza-backend-api/internal/jobs"
	"veza-backend-api/internal/metrics"
	"veza-backend-api/internal/monitoring"
	"veza-backend-api/internal/services"
	"veza-backend-api/internal/services/hyperswitch"
	"veza-backend-api/internal/shutdown"

@@ -350,6 +351,16 @@ func main() {
	// backlog doesn't lock the table.
	jobs.ScheduleHyperswitchWebhookLogCleanup(db, logger, cfg.HyperswitchWebhookLogRetentionDays)

	// v1.0.7 item F: 60s sampler feeds five ledger-health gauges +
	// reconciler_* counters. Grafana dashboard in
	// config/grafana/dashboards/ledger-health.json, alert rules in config/alertmanager/ledger.yml.
	ledgerSamplerCtx, ledgerSamplerCancel := context.WithCancel(context.Background())
	monitoring.ScheduleLedgerHealthSampler(ledgerSamplerCtx, db.GormDB, logger)
	shutdownManager.Register(shutdown.NewShutdownFunc("ledger_health_sampler", func(ctx context.Context) error {
		ledgerSamplerCancel()
		return nil
	}))

	// HTTP server configuration
	port := fmt.Sprintf("%d", cfg.AppPort)
	if cfg.AppPort == 0 {
veza-backend-api/internal/core/marketplace/reconcile_hyperswitch.go

@@ -7,6 +7,8 @@ import (
"go.uber.org/zap"
|
||||
"gorm.io/gorm"
|
||||
|
||||
"veza-backend-api/internal/monitoring"
|
||||
)
|
||||
|
||||
// HyperswitchReadClient is the abstraction ReconcileHyperswitchWorker
|
||||
|
|
@ -142,9 +144,11 @@ func (w *ReconcileHyperswitchWorker) Start(ctx context.Context) {
|
|||
// during incident response). Three independent phases, each bounded
// by batchLimit; failures in one phase don't stop the others.
func (w *ReconcileHyperswitchWorker) RunOnce(ctx context.Context) {
	start := time.Now()
	w.reconcileStuckOrders(ctx)
	w.reconcileOrphanRefunds(ctx)
	w.reconcileStuckRefunds(ctx)
	monitoring.RecordReconcilerSweepDuration(time.Since(start))
}

// reconcileStuckOrders finds orders in `pending` older than

@@ -188,6 +192,7 @@ func (w *ReconcileHyperswitchWorker) syncOrder(ctx context.Context, o *Order) {
			zap.Error(err))
		return
	}
	monitoring.RecordReconcilerAction("stuck_orders")
	w.logger.Info("reconcile: stuck order synced via synthetic webhook",
		zap.String("order_id", o.ID.String()),
		zap.String("payment_id", o.HyperswitchPaymentID),

@@ -239,6 +244,8 @@ func (w *ReconcileHyperswitchWorker) reconcileOrphanRefunds(ctx context.Context)
			zap.Error(err))
		continue
	}
	monitoring.RecordReconcilerAction("orphan_refunds")
	monitoring.RecordReconcilerOrphanRefund()
	w.logger.Error("reconcile: ORPHAN REFUND auto-failed — operator attention needed",
		zap.String("refund_id", r.ID.String()),
		zap.String("order_id", r.OrderID.String()),

@@ -288,6 +295,7 @@ func (w *ReconcileHyperswitchWorker) syncRefund(ctx context.Context, r *Refund)
			zap.Error(err))
		return
	}
	monitoring.RecordReconcilerAction("stuck_refunds")
	w.logger.Info("reconcile: stuck refund synced via synthetic webhook",
		zap.String("refund_id", r.ID.String()),
		zap.String("hyperswitch_refund_id", r.HyperswitchRefundID),
veza-backend-api/internal/monitoring/ledger_metrics.go (+225, new file)

@@ -0,0 +1,225 @@
package monitoring

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.uber.org/zap"
	"gorm.io/gorm"
)

// Ledger-health metrics (v1.0.7 item F, audit P1.8).
//
// Five gauges expose stuck-state counts so ops dashboards + alert
// rules can spot a money-movement pipeline stall before a customer
// does. Sampled every LedgerSamplerInterval via a 60s ticker that
// runs a cheap indexed SELECT COUNT(*) per gauge. The query cost is
// bounded: each clause filters on `status + created_at`, both
// indexed on the relevant tables.
//
// Paired with reconciler_* counters so the dashboard can tell the
// whole story at a glance: "we have N stuck orders and the
// reconciler has resolved M of them today."
//
// Plus two alert rules in config/alertmanager/ledger.yml:
//   - veza_ledger_stuck_orders_pending > 0 for 10m → page
//   - veza_ledger_orphan_refund_rows > 0 for 5m → page (bug in
//     two-phase commit between DB and PSP — immediate ops attention)
var (
	// LedgerStuckOrdersPending is the count of orders sitting in
	// `pending` past the staleness threshold (30m by default). Should
	// be 0 in steady state; non-zero means webhooks from the PSP
	// stopped arriving or our endpoint is rejecting them.
	LedgerStuckOrdersPending = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "veza_ledger_stuck_orders_pending",
		Help: "Orders in 'pending' status older than the staleness threshold (30m). Non-zero triggers ops alert.",
	})

	// LedgerStuckRefundsPending is the count of refunds with a PSP id
	// but still in `pending` past the threshold. Symptom: the refund
	// was accepted by Hyperswitch but our webhook handler never
	// received the terminal event.
	LedgerStuckRefundsPending = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "veza_ledger_stuck_refunds_pending",
		Help: "Refunds with hyperswitch_refund_id set but status still 'pending' older than 30m.",
	})

	// LedgerFailedTransfersAtMaxRetry is the count of seller_transfers
	// that exhausted the retry worker's attempts. Non-zero = ops
	// investigation required; the Stripe Connect side is stuck
	// somehow.
	LedgerFailedTransfersAtMaxRetry = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "veza_ledger_failed_transfers_at_max_retry",
		Help: "seller_transfers with status='permanently_failed' — retry worker gave up.",
	})

	// LedgerOrphanRefundRows is THE alert gauge. Non-zero means a
	// Refund row exists in 'pending' with no hyperswitch_refund_id
	// (past 5m) — i.e., Phase 1 ran, Phase 2 crashed. This is a
	// two-phase-commit bug. Page on > 0 for 5m.
	LedgerOrphanRefundRows = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "veza_ledger_orphan_refund_rows",
		Help: "Refunds pending with empty hyperswitch_refund_id older than 5m. Non-zero = crash between Phase 1 and Phase 2 of RefundOrder, page ops immediately.",
	})

	// LedgerReversalPendingTransfers tracks rows waiting for the
	// Stripe reversal worker (v1.0.7 item B). Non-zero during a
	// reversal is normal and transient; sustained > 0 means the
	// reversal worker is stuck or Stripe Connect is down.
	LedgerReversalPendingTransfers = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "veza_ledger_reversal_pending_transfers",
		Help: "seller_transfers in 'reversal_pending' older than 30m — reversal worker is behind.",
	})

	// --- Reconciler metrics (v1.0.7 item F, covers item C) ---

	// ReconcilerActionsTotal counts actions the reconciliation worker
	// has taken, labelled by phase. Lets dashboards show "reconciler
	// fixed N stuck orders this week" vs "orphan refunds auto-failed
	// this week" without re-parsing logs.
	ReconcilerActionsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "veza_reconciler_actions_total",
		Help: "Reconciler actions taken, by phase (stuck_orders|stuck_refunds|orphan_refunds).",
	}, []string{"phase"})

	// ReconcilerOrphanRefundsTotal is the load-bearing counter for
	// detecting two-phase-commit bugs. Each orphan refund the
	// reconciler auto-fails increments this; a sustained non-zero
	// rate is the ledger-health equivalent of a fire alarm.
	ReconcilerOrphanRefundsTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "veza_reconciler_orphan_refunds_total",
		Help: "Orphan refunds (Phase 2 crash) auto-failed by the reconciler. Non-zero rate = investigate root cause.",
	})

	// ReconcilerSweepDurationSeconds measures one RunOnce tick so
	// slow sweeps show up in alerting.
	ReconcilerSweepDurationSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "veza_reconciler_sweep_duration_seconds",
		Help:    "Duration of one ReconcileHyperswitchWorker.RunOnce tick in seconds.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 10), // 0.1s, 0.2s, ... 51.2s (+Inf above)
	})

	// ReconcilerLastRunTimestamp is set to time.Now().Unix() at the
	// end of every RunOnce. Alert rule: `time() - timestamp >
	// 2 * RECONCILE_INTERVAL` → worker is dead.
	ReconcilerLastRunTimestamp = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "veza_reconciler_last_run_timestamp",
		Help: "Unix timestamp of the last successful ReconcileHyperswitchWorker tick. Stale = worker dead.",
	})
)

// Convenience recorders for the reconciler worker. Called from
// internal/core/marketplace/reconcile_hyperswitch.go so that package
// doesn't import Prometheus directly (keeps marketplace clean of
// observability plumbing).
func RecordReconcilerAction(phase string) {
	ReconcilerActionsTotal.WithLabelValues(phase).Inc()
}

func RecordReconcilerOrphanRefund() {
	ReconcilerOrphanRefundsTotal.Inc()
}

func RecordReconcilerSweepDuration(d time.Duration) {
	ReconcilerSweepDurationSeconds.Observe(d.Seconds())
	ReconcilerLastRunTimestamp.Set(float64(time.Now().Unix()))
}
// --- Sampler ---

// LedgerSamplerInterval is how often the sampler re-queries the DB.
// 60s is the sweet spot for our volumes — scrape cost negligible, and
// stale-by-up-to-a-minute is fine for the alert-rule windows (10m /
// 5m).
const LedgerSamplerInterval = 60 * time.Second

// Staleness thresholds match the reconciler's defaults. If ops tunes
// the reconciler thresholds via env vars, the sampler still reports
// against these constants — intentional: the two serve different
// audiences (reconciler = auto-recovery, sampler = human visibility).
// A mismatch means alerts fire while the reconciler has already
// started working on the issue, which is the correct behavior.
const (
	ledgerStuckOrderAgeThreshold   = 30 * time.Minute
	ledgerStuckRefundAgeThreshold  = 30 * time.Minute
	ledgerOrphanRefundAgeThreshold = 5 * time.Minute
	ledgerReversalPendingThreshold = 30 * time.Minute
)
// SampleLedgerHealth runs the five count queries and updates the
// gauges. Safe to call concurrently (gauge writes are atomic). Any
// query error sets the corresponding gauge to -1 — a distinctive
// value that dashboards can filter on ("sampler is broken, don't
// trust the number").
//
// Exposed as a function rather than a method so tests can drive it
// directly against a sqlite in-memory DB.
func SampleLedgerHealth(ctx context.Context, db *gorm.DB, logger *zap.Logger) {
	now := time.Now()

	sample := func(name string, gauge prometheus.Gauge, query string, args ...interface{}) {
		var count int64
		if err := db.WithContext(ctx).Raw(query, args...).Scan(&count).Error; err != nil {
			logger.Error("ledger sampler: query failed",
				zap.String("gauge", name),
				zap.Error(err))
			gauge.Set(-1)
			return
		}
		gauge.Set(float64(count))
	}

	sample("stuck_orders_pending", LedgerStuckOrdersPending,
		`SELECT COUNT(*) FROM orders
		 WHERE status = 'pending'
		   AND hyperswitch_payment_id IS NOT NULL AND hyperswitch_payment_id <> ''
		   AND created_at < ?`,
		now.Add(-ledgerStuckOrderAgeThreshold))

	sample("stuck_refunds_pending", LedgerStuckRefundsPending,
		`SELECT COUNT(*) FROM refunds
		 WHERE status = 'pending'
		   AND hyperswitch_refund_id IS NOT NULL AND hyperswitch_refund_id <> ''
		   AND created_at < ?`,
		now.Add(-ledgerStuckRefundAgeThreshold))

	sample("failed_transfers_at_max_retry", LedgerFailedTransfersAtMaxRetry,
		`SELECT COUNT(*) FROM seller_transfers WHERE status = 'permanently_failed'`)

	sample("orphan_refund_rows", LedgerOrphanRefundRows,
		`SELECT COUNT(*) FROM refunds
		 WHERE status = 'pending'
		   AND (hyperswitch_refund_id IS NULL OR hyperswitch_refund_id = '')
		   AND created_at < ?`,
		now.Add(-ledgerOrphanRefundAgeThreshold))

	sample("reversal_pending_transfers", LedgerReversalPendingTransfers,
		`SELECT COUNT(*) FROM seller_transfers
		 WHERE status = 'reversal_pending'
		   AND updated_at < ?`,
		now.Add(-ledgerReversalPendingThreshold))
}
// ScheduleLedgerHealthSampler runs SampleLedgerHealth once at startup
// (so dashboards aren't blank for the first minute) and then every
// LedgerSamplerInterval until ctx is cancelled.
func ScheduleLedgerHealthSampler(ctx context.Context, db *gorm.DB, logger *zap.Logger) {
	ticker := time.NewTicker(LedgerSamplerInterval)
	go func() {
		defer ticker.Stop()
		SampleLedgerHealth(ctx, db, logger)
		for {
			select {
			case <-ctx.Done():
				logger.Info("Ledger health sampler stopped")
				return
			case <-ticker.C:
				SampleLedgerHealth(ctx, db, logger)
			}
		}
	}()
	logger.Info("Ledger health sampler scheduled",
		zap.Duration("interval", LedgerSamplerInterval))
}
veza-backend-api/internal/monitoring/ledger_metrics_test.go (+222, new file)

@@ -0,0 +1,222 @@
package monitoring

import (
	"context"
	"testing"
	"time"

	"github.com/google/uuid"
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"go.uber.org/zap"
	"gorm.io/driver/sqlite"
	"gorm.io/gorm"
)

// --- test DB helpers --------------------------------------------------------

// minimal table schemas: we don't import marketplace here to avoid
// circular deps (monitoring is a leaf); we create just what the
// sampler queries.
type testOrder struct {
	ID                   uuid.UUID `gorm:"type:uuid;primaryKey"`
	Status               string
	HyperswitchPaymentID string
	CreatedAt            time.Time
}

func (testOrder) TableName() string { return "orders" }

type testRefund struct {
	ID                  uuid.UUID `gorm:"type:uuid;primaryKey"`
	Status              string
	HyperswitchRefundID string
	CreatedAt           time.Time
}

func (testRefund) TableName() string { return "refunds" }

type testSellerTransfer struct {
	ID        uuid.UUID `gorm:"type:uuid;primaryKey"`
	Status    string
	UpdatedAt time.Time
}

func (testSellerTransfer) TableName() string { return "seller_transfers" }

func setupSamplerTestDB(t *testing.T) *gorm.DB {
	t.Helper()
	db, err := gorm.Open(sqlite.Open(":memory:"), &gorm.Config{})
	require.NoError(t, err)
	require.NoError(t, db.AutoMigrate(&testOrder{}, &testRefund{}, &testSellerTransfer{}))
	return db
}

// gaugeValue reads the raw value of a Prometheus gauge for assertion.
func gaugeValue(t *testing.T, g prometheus.Gauge) float64 {
	t.Helper()
	m := &dto.Metric{}
	require.NoError(t, g.Write(m))
	return m.GetGauge().GetValue()
}

// --- stuck_orders_pending gauge ---------------------------------------------

func TestSampler_CountsStuckOrdersPending(t *testing.T) {
	db := setupSamplerTestDB(t)
	now := time.Now()

	// 2 stuck (over 30m old) + 1 recent (under 30m) + 1 stuck but no payment_id.
	require.NoError(t, db.Create(&testOrder{
		ID: uuid.New(), Status: "pending", HyperswitchPaymentID: "pay_1",
		CreatedAt: now.Add(-1 * time.Hour),
	}).Error)
	require.NoError(t, db.Create(&testOrder{
		ID: uuid.New(), Status: "pending", HyperswitchPaymentID: "pay_2",
		CreatedAt: now.Add(-45 * time.Minute),
	}).Error)
	require.NoError(t, db.Create(&testOrder{
		ID: uuid.New(), Status: "pending", HyperswitchPaymentID: "pay_recent",
		CreatedAt: now.Add(-5 * time.Minute),
	}).Error)
	// Stuck BUT no payment_id → excluded (pre-PSP, not the gauge's concern).
	require.NoError(t, db.Create(&testOrder{
		ID: uuid.New(), Status: "pending", HyperswitchPaymentID: "",
		CreatedAt: now.Add(-1 * time.Hour),
	}).Error)

	SampleLedgerHealth(context.Background(), db, zap.NewNop())

	assert.Equal(t, 2.0, gaugeValue(t, LedgerStuckOrdersPending),
		"gauge must count only rows pending > 30m with non-empty payment_id")
}

func TestSampler_StuckOrdersZeroWhenAllCompleted(t *testing.T) {
	db := setupSamplerTestDB(t)
	require.NoError(t, db.Create(&testOrder{
		ID: uuid.New(), Status: "completed", HyperswitchPaymentID: "pay_done",
		CreatedAt: time.Now().Add(-1 * time.Hour),
	}).Error)

	SampleLedgerHealth(context.Background(), db, zap.NewNop())
	assert.Equal(t, 0.0, gaugeValue(t, LedgerStuckOrdersPending))
}

// --- orphan_refund_rows gauge (THE alert gauge) -----------------------------

func TestSampler_CountsOrphanRefunds(t *testing.T) {
	db := setupSamplerTestDB(t)
	now := time.Now()

	// Three orphans (empty hs_id, >5m old) + one recent orphan (should not count)
	// + one with an hs_id (should not count).
	for i := 0; i < 3; i++ {
		require.NoError(t, db.Create(&testRefund{
			ID: uuid.New(), Status: "pending", HyperswitchRefundID: "",
			CreatedAt: now.Add(-10 * time.Minute),
		}).Error)
	}
	require.NoError(t, db.Create(&testRefund{
		ID: uuid.New(), Status: "pending", HyperswitchRefundID: "",
		CreatedAt: now.Add(-2 * time.Minute), // too recent
	}).Error)
	require.NoError(t, db.Create(&testRefund{
		ID: uuid.New(), Status: "pending", HyperswitchRefundID: "ref_ok",
		CreatedAt: now.Add(-10 * time.Minute), // has hs_id → not orphan
	}).Error)

	SampleLedgerHealth(context.Background(), db, zap.NewNop())
	assert.Equal(t, 3.0, gaugeValue(t, LedgerOrphanRefundRows),
		"orphan gauge counts pending refunds with empty hs_id older than 5m — the two-phase-commit-bug canary")
}

// --- stuck_refunds_pending gauge --------------------------------------------

func TestSampler_CountsStuckRefundsWithHsID(t *testing.T) {
	db := setupSamplerTestDB(t)
	now := time.Now()

	require.NoError(t, db.Create(&testRefund{
		ID: uuid.New(), Status: "pending", HyperswitchRefundID: "ref_stuck",
		CreatedAt: now.Add(-45 * time.Minute),
	}).Error)
	// Orphan, counted by a different gauge, not this one:
	require.NoError(t, db.Create(&testRefund{
		ID: uuid.New(), Status: "pending", HyperswitchRefundID: "",
		CreatedAt: now.Add(-45 * time.Minute),
	}).Error)

	SampleLedgerHealth(context.Background(), db, zap.NewNop())
	assert.Equal(t, 1.0, gaugeValue(t, LedgerStuckRefundsPending))
	// And orphan gauge catches the other one.
	assert.Equal(t, 1.0, gaugeValue(t, LedgerOrphanRefundRows))
}

// --- permanently_failed transfers + reversal_pending ------------------------

func TestSampler_CountsFailedAndReversalPendingTransfers(t *testing.T) {
	db := setupSamplerTestDB(t)

	// 2 permanently_failed, 1 completed (ignored by failed gauge),
	// 1 reversal_pending updated 45m ago, 1 reversal_pending recent (ignored).
	require.NoError(t, db.Create(&testSellerTransfer{
		ID: uuid.New(), Status: "permanently_failed", UpdatedAt: time.Now(),
	}).Error)
	require.NoError(t, db.Create(&testSellerTransfer{
		ID: uuid.New(), Status: "permanently_failed", UpdatedAt: time.Now().Add(-1 * time.Hour),
	}).Error)
	require.NoError(t, db.Create(&testSellerTransfer{
		ID: uuid.New(), Status: "completed", UpdatedAt: time.Now(),
	}).Error)
	require.NoError(t, db.Create(&testSellerTransfer{
		ID: uuid.New(), Status: "reversal_pending", UpdatedAt: time.Now().Add(-45 * time.Minute),
	}).Error)
	require.NoError(t, db.Create(&testSellerTransfer{
		ID: uuid.New(), Status: "reversal_pending", UpdatedAt: time.Now().Add(-5 * time.Minute),
	}).Error)

	SampleLedgerHealth(context.Background(), db, zap.NewNop())
	assert.Equal(t, 2.0, gaugeValue(t, LedgerFailedTransfersAtMaxRetry))
	assert.Equal(t, 1.0, gaugeValue(t, LedgerReversalPendingTransfers),
		"reversal_pending counts only rows that have been stuck past the threshold")
}

// --- reconciler counters are writable + readable ----------------------------

func TestReconcilerRecorders(t *testing.T) {
	// Capture starting values so parallel test runs don't assume a
	// clean slate (Prometheus registries are global).
	actionsStart := func() float64 {
		m := &dto.Metric{}
		require.NoError(t, ReconcilerActionsTotal.WithLabelValues("stuck_orders").Write(m))
		return m.GetCounter().GetValue()
	}()
	orphanStart := func() float64 {
		m := &dto.Metric{}
		require.NoError(t, ReconcilerOrphanRefundsTotal.Write(m))
		return m.GetCounter().GetValue()
	}()

	RecordReconcilerAction("stuck_orders")
	RecordReconcilerAction("stuck_orders")
	RecordReconcilerOrphanRefund()
	RecordReconcilerSweepDuration(500 * time.Millisecond)

	actionsAfter := func() float64 {
		m := &dto.Metric{}
		require.NoError(t, ReconcilerActionsTotal.WithLabelValues("stuck_orders").Write(m))
		return m.GetCounter().GetValue()
	}()
	orphanAfter := func() float64 {
		m := &dto.Metric{}
		require.NoError(t, ReconcilerOrphanRefundsTotal.Write(m))
		return m.GetCounter().GetValue()
	}()
	lastRun := gaugeValue(t, ReconcilerLastRunTimestamp)

	assert.Equal(t, 2.0, actionsAfter-actionsStart)
	assert.Equal(t, 1.0, orphanAfter-orphanStart)
	assert.Greater(t, lastRun, float64(0), "RecordReconcilerSweepDuration must stamp last-run timestamp")
}