veza/veza-backend-api/cmd/api at 7e180a2c08750629c75be2200bb9f33eba66c4f1 - senke/veza

main.go	feat(workers): hyperswitch reconciliation sweep for stuck pending states — v1.0.7 item C	2026-04-18 03:08:15 +02:00

senke 7e180a2c08 feat(workers): hyperswitch reconciliation sweep for stuck pending states — v1.0.7 item C

New ReconcileHyperswitchWorker sweeps for pending orders and refunds
whose terminal webhook never arrived. Pulls live PSP state for each
stuck row and synthesises a webhook payload to feed the normal
ProcessPaymentWebhook / ProcessRefundWebhook dispatcher. The existing
terminal-state guards on those handlers make reconciliation
idempotent against real webhooks — a late webhook after the reconciler
resolved the row is a no-op.
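
To illustrate why a late webhook is harmless, here is a minimal sketch of
a terminal-state guard. The Order type, the status constants and
applyPaymentEvent are hypothetical names chosen for this example; the
real handlers are ProcessPaymentWebhook / ProcessRefundWebhook, whose
bodies are not shown in this commit.

```go
package main

// OrderStatus is a hypothetical status type used only in this sketch.
type OrderStatus string

const (
	StatusPending   OrderStatus = "pending"
	StatusCompleted OrderStatus = "completed"
	StatusFailed    OrderStatus = "failed"
)

// Order is a stand-in for the persisted order row.
type Order struct {
	ID     string
	Status OrderStatus
}

// isTerminal reports whether a status can no longer change.
func isTerminal(s OrderStatus) bool {
	return s == StatusCompleted || s == StatusFailed
}

// applyPaymentEvent applies a webhook-style status to an order.
// It returns false and leaves the order untouched when the order is
// already terminal, which is what makes replays and late webhooks safe.
func applyPaymentEvent(o *Order, incoming OrderStatus) bool {
	if isTerminal(o.Status) {
		return false
	}
	o.Status = incoming
	return true
}
```

With a guard of this shape, it does not matter whether the reconciler or
the real webhook arrives first: the second one to arrive finds a terminal
row and does nothing.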

Three stuck-state classes covered:
  1. Stuck orders (pending > 30m, non-empty payment_id) → GetPaymentStatus
     + synthetic payment.<status> webhook.
  2. Stuck refunds with PSP id (pending > 30m, non-empty
     hyperswitch_refund_id) → GetRefundStatus + synthetic
     refund.<status> webhook (error_message forwarded).
  3. Orphan refunds (pending > 5m, EMPTY hyperswitch_refund_id) →
     mark failed + roll order back to completed + log ERROR. This
     is the "we crashed between Phase 1 and Phase 2 of RefundOrder"
     case, operator-attention territory.
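
The three classes can be read as a small decision function. The sketch
below is an assumption-laden illustration, not the worker's code: Row,
Action, Thresholds and classify are hypothetical names, and the row's
age is passed in directly because the commit does not say which
timestamp column the worker measures from.

```go
package main

import "time"

// Action is the reconciliation step chosen for a row (hypothetical).
type Action int

const (
	ActionNone Action = iota
	ActionSyncOrder
	ActionSyncRefund
	ActionFailOrphanRefund
)

// Row is a simplified view of an order or refund row.
type Row struct {
	Kind   string        // "order" or "refund"
	Status string        // only "pending" rows are candidates
	PSPID  string        // payment_id or hyperswitch_refund_id
	Age    time.Duration // time spent pending (assumed to be precomputed)
}

// Thresholds mirrors the three RECONCILE_*_AFTER settings.
type Thresholds struct {
	OrderStuck   time.Duration
	RefundStuck  time.Duration
	RefundOrphan time.Duration
}

// defaultThresholds returns the defaults listed in the commit message.
func defaultThresholds() Thresholds {
	return Thresholds{
		OrderStuck:   30 * time.Minute,
		RefundStuck:  30 * time.Minute,
		RefundOrphan: 5 * time.Minute,
	}
}

// classify maps a row to one of the three stuck-state classes.
func classify(r Row, t Thresholds) Action {
	if r.Status != "pending" {
		return ActionNone
	}
	switch r.Kind {
	case "order":
		if r.PSPID != "" && r.Age > t.OrderStuck {
			return ActionSyncOrder
		}
	case "refund":
		if r.PSPID == "" {
			// Orphan: no PSP id was ever recorded for this refund.
			if r.Age > t.RefundOrphan {
				return ActionFailOrphanRefund
			}
			return ActionNone
		}
		if r.Age > t.RefundStuck {
			return ActionSyncRefund
		}
	}
	return ActionNone
}
```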

New interfaces:
  * marketplace.HyperswitchReadClient — read-only PSP surface the
    worker depends on (GetPaymentStatus, GetRefundStatus). The
    worker never calls CreatePayment / CreateRefund.
  * hyperswitch.Client.GetRefund + RefundStatus struct added.
  * hyperswitch.Provider gains GetRefundStatus + GetPaymentStatus
    pass-throughs that satisfy the marketplace interface.
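
A rough sketch of what a read-only PSP surface could look like follows.
Only the names HyperswitchReadClient, GetPaymentStatus, GetRefundStatus
and RefundStatus come from this commit; the method signatures, the
struct fields, the fakePSP test double and syntheticPaymentEvent are
assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
)

// RefundStatus carries the PSP's view of a refund (fields assumed).
type RefundStatus struct {
	Status       string
	ErrorMessage string
}

// HyperswitchReadClient is the read-only surface the worker depends on.
// The signatures below are guesses, not the real interface.
type HyperswitchReadClient interface {
	GetPaymentStatus(ctx context.Context, paymentID string) (string, error)
	GetRefundStatus(ctx context.Context, refundID string) (RefundStatus, error)
}

// fakePSP is an in-memory stand-in, handy for tests.
type fakePSP struct {
	payments map[string]string
	refunds  map[string]RefundStatus
}

func (f *fakePSP) GetPaymentStatus(_ context.Context, id string) (string, error) {
	s, ok := f.payments[id]
	if !ok {
		return "", errors.New("payment not found")
	}
	return s, nil
}

func (f *fakePSP) GetRefundStatus(_ context.Context, id string) (RefundStatus, error) {
	s, ok := f.refunds[id]
	if !ok {
		return RefundStatus{}, errors.New("refund not found")
	}
	return s, nil
}

// Compile-time check that the fake satisfies the interface.
var _ HyperswitchReadClient = (*fakePSP)(nil)

// syntheticPaymentEvent builds the "payment.<status>" event type from
// live PSP state. On a read error it returns the error so the caller
// can leave the row untouched.
func syntheticPaymentEvent(ctx context.Context, c HyperswitchReadClient, paymentID string) (string, error) {
	status, err := c.GetPaymentStatus(ctx, paymentID)
	if err != nil {
		return "", err
	}
	return "payment." + status, nil
}
```

Keeping the interface free of CreatePayment / CreateRefund means the
worker cannot move money even by accident; it can only observe.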

Configuration (all env-var tunable with sensible defaults):
  * RECONCILE_WORKER_ENABLED=true
  * RECONCILE_INTERVAL=1h (ops can drop to 5m during incident
    response without a code change)
  * RECONCILE_ORDER_STUCK_AFTER=30m
  * RECONCILE_REFUND_STUCK_AFTER=30m
  * RECONCILE_REFUND_ORPHAN_AFTER=5m (shorter because "app crashed"
    is a different signal from "network hiccup")
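
As an illustration of env-var tuning with defaults, the sketch below
parses the five variables listed above. ReconcileConfig,
loadReconcileConfig and the helper functions are hypothetical names;
the project's actual config loader is not shown in this commit and may
behave differently (for example on invalid values).

```go
package main

import (
	"os"
	"strconv"
	"time"
)

// ReconcileConfig holds the worker settings (hypothetical struct).
type ReconcileConfig struct {
	Enabled           bool
	Interval          time.Duration
	OrderStuckAfter   time.Duration
	RefundStuckAfter  time.Duration
	RefundOrphanAfter time.Duration
}

// durationFromEnv parses values such as "5m" or "1h". Unset, invalid
// or non-positive values fall back to the default.
func durationFromEnv(getenv func(string) string, key string, def time.Duration) time.Duration {
	if d, err := time.ParseDuration(getenv(key)); err == nil && d > 0 {
		return d
	}
	return def
}

// boolFromEnv parses "true"/"false" style values, with a fallback.
func boolFromEnv(getenv func(string) string, key string, def bool) bool {
	if b, err := strconv.ParseBool(getenv(key)); err == nil {
		return b
	}
	return def
}

// loadReconcileConfig reads settings through getenv so tests can
// inject a map instead of touching the process environment.
func loadReconcileConfig(getenv func(string) string) ReconcileConfig {
	return ReconcileConfig{
		Enabled:           boolFromEnv(getenv, "RECONCILE_WORKER_ENABLED", true),
		Interval:          durationFromEnv(getenv, "RECONCILE_INTERVAL", time.Hour),
		OrderStuckAfter:   durationFromEnv(getenv, "RECONCILE_ORDER_STUCK_AFTER", 30*time.Minute),
		RefundStuckAfter:  durationFromEnv(getenv, "RECONCILE_REFUND_STUCK_AFTER", 30*time.Minute),
		RefundOrphanAfter: durationFromEnv(getenv, "RECONCILE_REFUND_ORPHAN_AFTER", 5*time.Minute),
	}
}

// loadReconcileConfigFromOS is the production entry point.
func loadReconcileConfigFromOS() ReconcileConfig {
	return loadReconcileConfig(os.Getenv)
}
```

Under these assumptions, setting RECONCILE_INTERVAL=5m and restarting
the process is enough to tighten the sweep during an incident.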

Operational details:
  * Batch limit 50 rows per phase per tick so a 10k-row backlog
    doesn't hammer Hyperswitch. Next tick picks up the rest.
  * PSP read errors leave the row untouched — next tick retries.
    Reconciliation is always safe to replay.
  * Structured log on every action so `grep reconcile` tells the
    ops story: which order/refund got synced, against what status,
    how long it was stuck.
  * Worker wired in cmd/api/main.go, gated on
    HyperswitchEnabled + HyperswitchAPIKey. Graceful shutdown
    registered.
  * RunOnce exposed as public API for ad-hoc ops trigger during
    incident response.
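
The batch limit, retry-on-next-tick and RunOnce behaviour described
above could be sketched as follows. Everything here apart from the name
RunOnce and the limit of 50 is an assumption: the Worker fields, integer
row IDs and ErrPSPUnavailable are invented for the example, and context
plumbing inside RunOnce is omitted for brevity.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// batchLimit caps the rows handled per phase per tick.
const batchLimit = 50

// ErrPSPUnavailable is a hypothetical sentinel for PSP read failures.
var ErrPSPUnavailable = errors.New("psp unavailable")

// Worker is a simplified reconciliation worker.
type Worker struct {
	interval time.Duration
	// fetchStuck returns at most limit IDs of stuck rows.
	fetchStuck func(limit int) ([]int, error)
	// reconcile resolves one row; on error the row is left untouched.
	reconcile func(id int) error
}

// RunOnce performs a single sweep and returns how many rows were
// resolved. Rows whose reconcile call fails are skipped, so the next
// tick sees them again.
func (w *Worker) RunOnce() (int, error) {
	ids, err := w.fetchStuck(batchLimit)
	if err != nil {
		return 0, err
	}
	resolved := 0
	for _, id := range ids {
		if err := w.reconcile(id); err != nil {
			continue // leave the row as-is; retried on the next tick
		}
		resolved++
	}
	return resolved, nil
}

// Start sweeps on every tick until ctx is cancelled, which is one way
// to support graceful shutdown.
func (w *Worker) Start(ctx context.Context) {
	ticker := time.NewTicker(w.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, _ = w.RunOnce()
		}
	}
}
```

Because each sweep handles at most 50 rows, a large backlog drains over
several ticks rather than in one burst of PSP calls.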

Tests — 10 cases, all green (sqlite :memory:):
  * TestReconcile_StuckOrder_SyncsViaSyntheticWebhook
  * TestReconcile_RecentOrder_NotTouched
  * TestReconcile_CompletedOrder_NotTouched
  * TestReconcile_OrderWithEmptyPaymentID_NotTouched
  * TestReconcile_PSPReadErrorLeavesRowIntact
  * TestReconcile_OrphanRefund_AutoFails_OrderRollsBack
  * TestReconcile_RecentOrphanRefund_NotTouched
  * TestReconcile_StuckRefund_SyncsViaSyntheticWebhook
  * TestReconcile_StuckRefund_FailureStatus_PassesErrorMessage
  * TestReconcile_AllTerminalStates_NoOp

CHANGELOG v1.0.7-rc1 updated with the full item C section between D
and the existing E block, matching the order convention (ship order:
A → D → B → E → C, CHANGELOG order follows).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 03:08:15 +02:00
