senke
|
94dfc80b73
|
feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F
Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").
Gauges (veza_ledger_, sampled every 60s):
* orphan_refund_rows — THE canary. Pending refunds with empty
hyperswitch_refund_id older than 5m = Phase 2 crash in
RefundOrder. Alert: > 0 for 5m → page.
* stuck_orders_pending — order pending > 30m with non-empty
payment_id. Alert: > 0 for 10m → page.
* stuck_refunds_pending — refund pending > 30m with hs_id.
* failed_transfers_at_max_retry — permanently_failed rows.
* reversal_pending_transfers — item B rows stuck > 30m.
Reconciler metrics (veza_reconciler_):
* actions_total{phase} — counter by phase.
* orphan_refunds_total — two-phase-bug canary.
* sweep_duration_seconds — exponential histogram.
* last_run_timestamp — alert: stale > 2h → page (worker dead).
Implementation notes:
* Sampler thresholds hardcoded to match reconciler defaults —
intentional mismatch allowed (alerts fire while reconciler
already working = correct behavior).
* Query error sets gauge to -1 (sentinel for "sampler broken").
* marketplace package routes through monitoring recorders so it
doesn't import prometheus directly.
* Sampler runs regardless of Hyperswitch enablement; gauges
default 0 when pipeline idle.
* Graceful shutdown wired in cmd/api/main.go.
Alert rules in config/alertmanager/ledger.yml with runbook
pointers + detailed descriptions — each alert explains WHAT
happened, WHY the reconciler may not resolve it, and WHERE to
look first.
Grafana dashboard config/grafana/dashboards/ledger-health.json —
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick
+ orphan cumulative.
Tests — 6 cases, all green (sqlite :memory:):
* CountsStuckOrdersPending (includes the filter on
non-empty payment_id)
* StuckOrdersZeroWhenAllCompleted
* CountsOrphanRefunds (THE canary)
* CountsStuckRefundsWithHsID (gauge-orthogonality check)
* CountsFailedAndReversalPendingTransfers
* ReconcilerRecorders (counter + gauge shape)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-18 03:40:14 +02:00 |
|
senke
|
83ed4f315b
|
chore(release): v0.602 — Payout, Dette Technique & Tests E2E
Backend API CI / test-unit (push) Failing after 0s
Backend API CI / test-integration (push) Failing after 0s
Frontend CI / test (push) Failing after 0s
Storybook Audit / Build & audit Storybook (push) Failing after 0s
- Stripe Connect: onboarding, balance, SellerDashboardView
- Interceptors: auth.ts, error.ts extracted, facade
- Grafana: dashboards enriched (p50, top endpoints, 4xx, WS, commerce)
- E2E commerce: product->order->review->invoice
- SMOKE_TEST_V0602, RETROSPECTIVE_V0602, PAYOUT_MANUAL
- Archive V0_602 scope, V0_603 placeholder, SCOPE_CONTROL v0.603
- Fix sanitizer regex (Go no backreferences)
- Marketplace test schema: product_licenses, product_images, orders, licenses
|
2026-02-23 22:32:01 +01:00 |
|
senke
|
c002e74031
|
feat(monitoring): add 3 Grafana dashboards (API, Chat, Commerce)
- api-overview.json: request rate, p95 latency, 5xx errors, DB pool
- chat-overview.json: WebSocket upgrade rate, chat API
- commerce-overview.json: marketplace/commerce/orders metrics
- system-overview.json: replaces veza-dashboard.json
|
2026-02-23 19:54:01 +01:00 |
|