veza/docs/audit-2026-04/v107-plan.md
senke 3cd82ba5be fix(hyperswitch): idempotency-key on create-payment and create-refund — v1.0.7 item D
Every outbound POST /payments and POST /refunds from the Hyperswitch
client now carries an Idempotency-Key HTTP header. Key values are
explicit parameters at every call site — no context-carrier magic,
no auto-generation. An empty key is a loud error from the client
(not silent header omission) so a future new call site that forgets
to supply one fails immediately, not months later under an obscure
replay scenario.

Key choices, both stable across HTTP retries of the same logical
call:
  * CreatePayment → order.ID.String() (GORM BeforeCreate populates
    order.ID before the PSP call in ConfirmOrder).
  * CreateRefund → pendingRefund.ID.String() (populated by the
    Phase 1 tx.Create in RefundOrder, available for the Phase 2 PSP
    call).

Scope note (reproduced here for the next reader who grep-s the
commit log for "Idempotency-Key"):

  Idempotency-Key covers HTTP-transport retry (TLS reconnect,
  proxy retry, DNS flap) within a single CreatePayment /
  CreateRefund invocation. It does NOT cover application-level
  replay (user double-click, form double-submit, retry after crash
  before DB write). That class of bug requires state-machine
  preconditions on VEZA side — already addressed by the order
  state machine + the handler-level guards on POST
  /api/v1/payments (for payments) and the partial UNIQUE on
  `refunds.hyperswitch_refund_id` landed in v1.0.6.1 (for refunds).

  Hyperswitch TTL on Idempotency-Key: typically 24h-7d server-side
  (verify against current PSP docs). Beyond TTL, a retry with the
  same key is treated as a new request. Not a concern at current
  volumes; document if retry logic ever extends beyond 1 hour.

Explicitly out of scope: item D does NOT add application-level
retry logic. The current "try once, fail loudly" behavior on PSP
errors is preserved. Adding retries is a separate design exercise
(backoff, max attempts, circuit breaker) not part of this commit.

Interfaces changed:
  * hyperswitch.Client.CreatePayment(ctx, idempotencyKey, ...)
  * hyperswitch.Client.CreatePaymentSimple(...) convenience wrapper
  * hyperswitch.Client.CreateRefund(ctx, idempotencyKey, ...)
  * hyperswitch.Provider.CreatePayment threads through
  * hyperswitch.Provider.CreateRefund threads through
  * marketplace.PaymentProvider interface — first param after ctx
  * marketplace.refundProvider interface — first param after ctx

Removed:
  * hyperswitch.Provider.Refund (zero callers, superseded by
    CreateRefund which returns (refund_id, status, err) and is the
    only method marketplace's refundProvider cares about).

Tests:
  * Two new httptest.Server-backed tests (client_test.go) pin the
    Idempotency-Key header value for CreatePayment and CreateRefund.
  * Two new empty-key tests confirm the client errors rather than
    silently sending no header.
  * TestRefundOrder_OpensPendingRefund gains an assertion that
    f.provider.lastIdempotencyKey == refund.ID.String() — if a
    future refactor threads the key from somewhere else (paymentID,
    uuid.New() per call, etc.) the test fails loudly.
  * Four pre-existing test mocks updated for the new signature
    (mockRefundPaymentProvider in marketplace, mockPaymentProvider
    in tests/integration and tests/contract, mockRefundPayment
    Provider in tests/integration/refund_flow).

Subscription's CreateSubscriptionPayment interface declares its own
shape and has no live Hyperswitch-backed implementation today —
v1.0.6.2 noted this as the payment-gate bypass surface, v1.0.7
item G will ship the real provider. When that lands, item G's
implementation threads the idempotency key through in the same
pattern (documented in v107-plan.md item G acceptance).

CHANGELOG v1.0.7-rc1 entry updated with the full item D scope note
and the "out of scope: retries" caveat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 02:30:02 +02:00

16 KiB

v1.0.7 — plan structuré

Derived from axis-1-correctness.md. The v1.0.6 CHANGELOG listed 4 "parked v1.0.7" items; the axis-1 audit added 4 P0 findings. De-duplicated and sequenced below.

The 9 items (was 8, de-duplicated → 6; +1 after the 2026-04-17 Q2 probe)

# From Title Effort
A audit P0.1 Persist stripe_transfer_id in seller_transfers S
B audit P0.2 ≡ CHANGELOG "Stripe Connect reversal" Connect reversal via reversal_pending state + async worker M
C audit P0.3 Reconciliation sweep for stuck orders / refunds M
D audit P0.4 Idempotency-Key on CreatePayment / CreateRefund XS
E audit P1.5 Webhook raw-payload log table + insert S
F audit P1.8 Ledger-health Prometheus metrics + alerts S
G audit P0.12 follow-up (post v1.0.6.2 hotfix) Subscription pending_payment state + webhook-driven activation; replace if s.paymentProvider != nil short-circuit M

Dropped from v1.0.7 scope:

  • Partial refunds (CHANGELOG-parked) — P2 in audit, feature-class, defer to v1.0.8
  • CloudUploadModal single-source-of-truth (CHANGELOG-parked) — P2, out of money-movement scope
  • Sandbox smoke-test documentation — landed de facto in v1.0.6.1 via the partial-UNIQUE hotfix + the smoke harness artefacts

Effort legend — XS ≤ 2h, S ≤ 1 day, M ≤ 3 days, L > 3 days.

Dependency graph

      ┌──────────────────┐
      │  D  Idempotency  │  independent, can land first as quick win
      └──────────────────┘

      ┌──────────────────┐
      │  A  transfer_id  │  prerequisite for B
      └────────┬─────────┘
               │
               ▼
      ┌──────────────────┐
      │  B  reversal     │  uses transfer_id persisted by A;
      │     worker       │  introduces `reversal_pending` status
      └──────────────────┘

      ┌──────────────────┐
      │  E  webhook log  │  independent; prerequisite for P1.6 (disputes)
      └──────────────────┘

      ┌──────────────────┐
      │  C  reconciler   │  independent, but:
      │     sweep        │  metrics (F) track its effectiveness
      └────────┬─────────┘
               │
               ▼
      ┌──────────────────┐
      │  F  metrics      │  needs the state-shape C produces; alerts wire
      │     + alerts     │  to the buckets the reconciler defines
      └──────────────────┘

      ┌──────────────────┐
      │  G  subscription │  independent of A/B/C/D/E/F; shares the
      │     pending_     │  `pending_payment` pattern with B's
      │     payment      │  `reversal_pending`. Builds on v1.0.6.2
      └──────────────────┘  hotfix (which compensates via filter).

Three parallel tracks:

  • Track 1 (reversal correctness) — A → B
  • Track 2 (operational visibility) — D, E, C → F
  • Track 3 (subscription creation path) — G (single item, independent)

Two developers can work in parallel without stepping on each other. A single developer sequences as ordered: D first (XS quick win, earns trust + unblocks "pre-open" checklist), then A→B, then E, then C→F, then G. G can also run in parallel at any point after D — it shares no data-model surface with the other items.

Commit sequence (single-developer path)

Each item lands as its own commit in the existing v1.0.6-style cadence (per-commit tests + CHANGELOG-worthy).

1. fix(hyperswitch): idempotency-key on create-payment and create-refund — D

Effort: XS. Pure header addition. Tests: the 15-case refund suite already exists; add 2 cases verifying the header is set correctly (httptest.Server assertion on r.Header.Get("Idempotency-Key")).

Acceptance (landed in commit TBD — this entry pinned ahead):

  • Every outbound POST /payments carries Idempotency-Key: <order.id>.
  • Every outbound POST /refunds carries Idempotency-Key: <refund.id>.
  • No implicit-via-ctx magic: each call site sets the header explicitly, greppable.
  • Empty idempotency key returns an error from the client (loud failure, not silent header omission).
  • CHANGELOG entry cross-references P0.4 + its scope note (HTTP retry only, not app-level replay).

Status — landed 2026-04-18 alongside item B day 3 closure. Subscription's CreateSubscriptionPayment interface still lacks a live Hyperswitch impl (deferred to item G); that's where the remaining idempotency-key plumbing goes.

TTL caveat — Hyperswitch (like most PSPs) honours Idempotency-Key server-side only for a finite window: 24 h is common, 7 d at the high end. Beyond the TTL, a replayed call with the same key is treated as a new request. The header therefore protects against HTTP-layer retries within a single request cycle, not against long-tail application replay scenarios (for which the application-level idempotency primitives — order.id on payments, the partial UNIQUE on refunds.hyperswitch_refund_id landed in v1.0.6.1 — are the load-bearing guards). Verify the exact TTL against current Hyperswitch docs before landing and note it in the CHANGELOG scope sentence so anyone reading later knows the envelope.

Ship as v1.0.7-alpha-1 for sandbox testing, don't wait for B/C.

2. refactor(connect): persist stripe_transfer_id on create + retry — A

Effort: S. Touches the TransferService interface (minor breaking change — but only internal callers). Migration: 981_seller_transfers_stripe_reversal_id.sql adds stripe_reversal_id nullable column (prepares ground for B). Note — bumped from 980 to 981 because v1.0.6.2 used 980 for the unpaid-subscription cleanup; all subsequent v1.0.7 migration numbers in this plan shift by +1 when they land.

Acceptance:

  • TransferService.CreateTransfer(...) (string, error) — returns the Stripe transfer id.
  • processSellerTransfers persists st.StripeTransferID before tx.Create(&st).
  • TransferRetryWorker.retryOne also persists on retry success.
  • Backfill: one-shot migration query that fills known transfer_ids for past orders by calling Stripe's transfers.List(Destination=..., Metadata[order_id]=...). Acceptable to leave NULL where Stripe has no match (document: "pre-v1.0.7 transfers cannot be reversed automatically; use admin API P2.9").

3. feat(marketplace): async stripe connect reversal worker — B

Effort: M. The big one. Introduces seller_transfers.status = 'reversal_pending' as a new intermediate terminal-avoidance state. Migration: 981_seller_transfers_reversal_pending_enum.sql.

Acceptance:

  • reverseSellerAccounting transitions seller_transfers.status to reversal_pending, sets next_retry_at = NOW(). Refund finalization completes end-to-end — buyer never sees a Stripe-health-dependent refund UX.
  • New StripeReversalWorker in internal/workers/stripe_reversal.go runs every TRANSFER_RETRY_INTERVAL (reuse existing env var), processes reversal_pending rows. Calls transfer.NewReversal(stripeTransferID, &params).
  • On success → status='reversed', persist stripe_reversal_id.
  • On Stripe 404 → status='reversed' + log INFO (treat as "already reversed out-of-band").
  • On other errors → exponential backoff via next_retry_at + retry_count, hit permanently_failed ceiling after N retries (mirror TransferRetryWorker).
  • Tests: mock the stripe SDK via a local httptest.Server (harder than the Hyperswitch mock because stripe-go's HTTP layer is less trivial to re-point, but feasible via stripe.SetBackend). 8 cases:
    • happy path: pending → reversal_pending → reversed + id persisted
    • Stripe 404: reversal_pending → reversed + log
    • Stripe 5xx transient: retry_count increments, backoff set
    • max retries: → permanently_failed
    • concurrent worker (two instances pick same row): lock wins
    • StripeTransferID empty (legacy row): skip + ERROR log + counter
    • reversal idempotency (Stripe dedupes same transferID): no-op
    • worker graceful shutdown mid-reversal (context cancellation)

4. feat(webhooks): persist raw hyperswitch payloads to audit log — E

Effort: S. Migration: 982_hyperswitch_webhook_log.sql. Insert before signature verification (so we capture attack attempts too).

Acceptance:

  • Every webhook landing on /webhooks/hyperswitch produces exactly one row, regardless of signature-valid or processing outcome.
  • processing_result field captures 'ok', 'error: <msg>', or 'skipped'.
  • Retention: a CleanupWebhookLog worker in the same internal/jobs/ package as the orphan-tracks cleaner, daily, deletes rows older than 90 days.
  • Tests: 3 cases (valid signature + processing ok; invalid signature; processing error).

5. feat(workers): hyperswitch reconciliation sweep for stuck pending states — C

Effort: M. New ReconcileHyperswitchWorker in internal/jobs/. Hourly by default (RECONCILE_INTERVAL=1h), but exposed so ops can drop to 5m during incident response.

Acceptance:

  • Orders in pending > 30min: GET /payments/:id, call the same ProcessPaymentWebhook internal dispatcher with a synthesised payload. Idempotent with real webhooks via the existing terminal-state guard.
  • Refunds in pending with non-empty hyperswitch_refund_id > 30min: GET /refunds/:id, same pattern with ProcessRefundWebhook.
  • Refunds in pending with EMPTY hyperswitch_refund_id > 5min: mark failed, roll order back to completed, log ERROR (operator attention needed — something crashed between Phase 1 and Phase 2 of RefundOrder).
  • Tests: happy sync, no-op when everything is terminal, the empty-refund_id auto-fail case.
  • Structured log on every action taken so grep reconcile tells the ops story.

6. feat(metrics): ledger-health gauges + alert rules — F

Effort: S. New file internal/metrics/ledger_health.go with a 60s sampler. Grafana dashboard JSON in config/grafana/ledger.json. Prometheus alert rules in config/alertmanager/ledger.yml.

Acceptance:

  • 5 gauge metrics (listed in P1.8 action).
  • 2 alert rules (stuck_orders > 0 for 10m, orphan_refunds > 0 for 5m).
  • Sampler queries are cheap: SELECT COUNT(*) WHERE ... AND created_at < NOW() - INTERVAL '30 min' per metric, indexed by status + created_at.
  • If indexes are missing: migration 983_ledger_health_indexes.sql.

7. feat(subscription): pending_payment state + webhook-driven activation — G

Effort: M. The follow-up to v1.0.6.2's gate-filter compensation. v1.0.6.2 closed the feature bypass at the consumption site (GetUserSubscription filters fantôme rows out); G replaces the creation path so no fantôme rows get written in the first place.

Migrations:

  • 984_subscription_pending_payment_enum.sql adds 'pending_payment' to the user_subscriptions.status VARCHAR (no enum at DB level, so just a documentation + backfill step for Go constants).
  • 985_backfill_hs_subscription_id_from_invoice.sql (optional) — backpopulates user_subscriptions.hyperswitch_subscription_id from the attached invoice's PSP intent where available. Documents the join rule for the webhook to reconcile on.

Acceptance:

  • subscription/service.go:createNewSubscription creates paid-plan rows in pending_payment state (never active). Transition to active only via ProcessSubscriptionWebhook on payment_succeeded. On payment_failed: transition to expired
    • log INFO; no invoice charge, no DB fantôme.
  • if s.paymentProvider != nil short-circuit deleted — paid plans without a payment provider configured return 503 payment provider not configured (env misconfig is an ops issue, not a silently-free subscription).
  • GET /me/subscription handles pending_payment explicitly — returns status: pending_payment, client_secret echoed back so the frontend can resume a stalled flow.
  • Recovery endpoint POST /api/v1/subscriptions/complete/:id (or reuse the existing Subscribe response's client_secret) that the frontend can route the user to when the distribution handler returns the "complete payment" message. Without a real endpoint, the v1.0.6.2 error message is a dead end for users who landed in fantôme state via a broken flow (no payment method saved, network error mid-confirmation, etc.). Document the target route the frontend should redirect to in the handler response payload.
  • distribution.checkEligibility treats pending_payment as ineligible (same as the v1.0.6.2 ErrSubscriptionNoPayment path).
  • Remove the TODO(v1.0.7-item-G) annotation; remove the v1.0.6.2 filter from GetUserSubscription (redundant once the creation path is correct) — OR keep the filter as defence-in-depth and note it in the code comment.
  • Webhook dispatcher: new subscription event family. Reuse the webhook_raw_payloads table from item E for persistence.
  • Tests:
    • Subscribe to paid plan with Hyperswitch enabled → pending_payment row + PSP intent, no feature access yet.
    • Webhook subscription.payment_succeeded → row transitions to active, feature access granted.
    • Webhook subscription.payment_failed → row transitions to expired, no charge, no access.
    • Webhook replay (same payment_id) is idempotent.
    • Subscribe with provider misconfigured → 503, no row created.
    • Migration of v1.0.6.2 voided rows — check voided_subscriptions_20260417 entries stay readable and not re-pickable by the new flow.
    • E2E Playwright @critical: POST /subscribe followed by POST /distribution/submit asserts 403 with the "complete payment" message until the payment webhook fires. Today's regression coverage is the shell probe + Go unit tests — neither runs on every commit. Wiring a Playwright @critical test turns the probe into a gate so a refactor of Subscribe or checkEligibility cannot silently re-open the bypass.

Independent of A/B/C/D/E/F. Can land at any point after D.

Release gating

Before tagging v1.0.7:

  • All seven items landed as separate commits.
  • Refund smoke test (see docs/audit-2026-04/smoke/refund.md — TBD) re-run against sandbox with the reversal path: assert Stripe side shows the reversal as well as the refund.
  • Manual reconciliation pass: query the live DB against Hyperswitch dashboard for all completed orders in the past week → zero drift. Document the script used (scripts/reconcile-check.sh), keep for periodic re-use.
  • CHANGELOG entry lists each commit + criticity it resolves.

Cut v1.0.7 as a minor release (not patch), because the interface change in A is technically breaking for any external caller of TransferService (there are none today, but naming it 1.0.7 signals the API change).

Effort total

Worst-case single-developer sequential: ~12-13 working days (XS + S + M + S + M + S + M = 2h + 1d + 3d + 1d + 3d + 1d + 3d).

With two devs on the parallel tracks: ~7 working days end-to-end — track 1 (A→B, 4d), track 2 (D, E, C→F, 5d), track 3 (G, 3d, can run at any point after D). Track 3 extends the critical path unless a third dev picks it up; with 3 devs it lands in the track-2 window.

Unknowns to resolve before starting

  1. Volume order of magnitude — informs whether the reconciliation sweep's default interval of 1h is appropriate. If we're processing

    1000 orders/day, drop to 15m. Ops confirmation needed.

  2. Stripe Connect account state at time of v1.0.7 deploy — any pre-v1.0.7 transfers lack stripe_transfer_id in DB. The backfill migration in A needs to know how many rows to probe and how many will successfully match. Acceptable ceiling: if > 5% of historical transfers fail to backfill, escalate to manual reconciliation session.
  3. Hyperswitch sandbox vs prod webhook secret rotation — item D changes outbound request headers only, no impact. Items B+C call Hyperswitch read endpoints, which use the same HYPERSWITCH_API_KEY. No secret-rotation concern.