veza/veza-backend-api/internal
senke 94dfc80b73 feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F
Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").

Gauges (veza_ledger_, sampled every 60s):
  * orphan_refund_rows — THE canary. Pending refunds with empty
    hyperswitch_refund_id older than 5m = Phase 2 crash in
    RefundOrder. Alert: > 0 for 5m → page.
  * stuck_orders_pending — order pending > 30m with non-empty
    payment_id. Alert: > 0 for 10m → page.
  * stuck_refunds_pending — refund pending > 30m with hs_id.
  * failed_transfers_at_max_retry — permanently_failed rows.
  * reversal_pending_transfers — item B rows stuck > 30m.

Reconciler metrics (veza_reconciler_):
  * actions_total{phase} — counter by phase.
  * orphan_refunds_total — two-phase-bug canary.
  * sweep_duration_seconds — exponential histogram.
  * last_run_timestamp — alert: stale > 2h → page (worker dead).

Implementation notes:
  * Sampler thresholds hardcoded to match reconciler defaults —
    intentional mismatch allowed (alerts fire while reconciler
    already working = correct behavior).
  * Query error sets gauge to -1 (sentinel for "sampler broken").
  * marketplace package routes through monitoring recorders so it
    doesn't import prometheus directly.
  * Sampler runs regardless of Hyperswitch enablement; gauges
    default 0 when pipeline idle.
  * Graceful shutdown wired in cmd/api/main.go.

Alert rules in config/alertmanager/ledger.yml with runbook
pointers + detailed descriptions — each alert explains WHAT
happened, WHY the reconciler may not resolve it, and WHERE to
look first.

Grafana dashboard config/grafana/dashboards/ledger-health.json —
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick
+ orphan cumulative.

Tests — 6 cases, all green (sqlite :memory:):
  * CountsStuckOrdersPending (includes the filter on
    non-empty payment_id)
  * StuckOrdersZeroWhenAllCompleted
  * CountsOrphanRefunds (THE canary)
  * CountsStuckRefundsWithHsID (gauge-orthogonality check)
  * CountsFailedAndReversalPendingTransfers
  * ReconcilerRecorders (counter + gauge shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 03:40:14 +02:00
..
api feat(webhooks): persist raw hyperswitch payloads to audit log — v1.0.7 item E 2026-04-18 02:44:58 +02:00
common v0.9.2 2026-03-05 19:27:34 +01:00
config feat(workers): hyperswitch reconciliation sweep for stuck pending states — v1.0.7 item C 2026-04-18 03:08:15 +02:00
core feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F 2026-04-18 03:40:14 +02:00
database v0.9.4 2026-03-05 23:03:43 +01:00
dto Phase 2 stabilisation: code mort, Modal→Dialog, feature flags, tests, router split, Rust legacy 2026-02-14 17:23:32 +01:00
elasticsearch style(backend): gofmt -w on 85 files (whitespace only) 2026-04-14 12:22:14 +02:00
email refactor(backend,infra): unify SMTP env schema on canonical SMTP_* names 2026-04-16 20:44:09 +02:00
errors v0.9.8 2026-03-06 19:13:16 +01:00
eventbus fix(eventbus): log RabbitMQ publish failures instead of silent drop 2026-04-16 20:50:51 +02:00
features adding initial backend API (Go) 2025-12-03 20:29:37 +01:00
handlers feat(marketplace): async stripe connect reversal worker — v1.0.7 item B day 2 2026-04-17 15:34:29 +02:00
infrastructure v0.9.4 2026-03-05 23:03:43 +01:00
integration style(backend): gofmt -w on 85 files (whitespace only) 2026-04-14 12:22:14 +02:00
interfaces adding initial backend API (Go) 2025-12-03 20:29:37 +01:00
jobs feat(webhooks): persist raw hyperswitch payloads to audit log — v1.0.7 item E 2026-04-18 02:44:58 +02:00
logging style(backend): gofmt -w on 85 files (whitespace only) 2026-04-14 12:22:14 +02:00
metrics v0.9.4 2026-03-05 23:03:43 +01:00
middleware fix(middleware): bypass response cache for range-aware media endpoints 2026-04-16 16:13:02 +02:00
models feat(backend,web): self-service creator role upgrade via /settings 2026-04-16 18:35:07 +02:00
monitoring feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F 2026-04-18 03:40:14 +02:00
pagination v0.9.8 2026-03-06 19:13:16 +01:00
recovery chore(v0.102): consolidate remaining changes — docs, frontend, backend 2026-02-20 13:02:12 +01:00
repositories fix(v0.12.6.1): remediate 2 CRITICAL + 10 HIGH + 1 MEDIUM pentest findings 2026-03-12 05:40:53 +01:00
repository fix(v0.12.6.1): update in-memory UserRepositoryImpl to accept context.Context 2026-03-12 05:47:47 +01:00
resilience chore: consolidate CI, E2E, backend and frontend updates 2026-02-17 16:43:21 +01:00
response fix: stabilize builds, tests, and lint across all stacks 2026-04-05 16:48:07 +02:00
security refactor(backend): replace 40 fmt.Printf calls with zap structured logging 2026-02-22 17:44:38 +01:00
services feat(workers): hyperswitch reconciliation sweep for stuck pending states — v1.0.7 item C 2026-04-18 03:08:15 +02:00
shutdown incus deployement fully implemented, Makefile updated and make fmt ran 2026-01-13 19:47:57 +01:00
testutils ci: retire legacy backend-ci.yml, centralize Docker probe in SkipIfNoIntegration 2026-04-15 16:12:45 +02:00
tracing incus deployement fully implemented, Makefile updated and make fmt ran 2026-01-13 19:47:57 +01:00
types feat(profile): add profile privacy toggle (B3) 2026-02-20 15:10:02 +01:00
upload [INT-015] int: Add file upload format standardization 2025-12-25 15:40:01 +01:00
utils fix(v0.12.6): apply all pentest remediations — 36 findings across 36 files 2026-03-14 00:44:46 +01:00
validators feat(v0.13.3): complete - Polish Sécurité Avancée 2026-03-13 10:09:01 +01:00
websocket style(backend): gofmt -w on 85 files (whitespace only) 2026-04-14 12:22:14 +02:00
workers fix(backend): J4 — GDPR-compliant hard delete with Redis and ES cleanup 2026-04-15 12:25:39 +02:00