Every outbound POST /payments and POST /refunds from the Hyperswitch
client now carries an Idempotency-Key HTTP header. Key values are
explicit parameters at every call site — no context-carrier magic,
no auto-generation. An empty key is a loud error from the client
(not silent header omission) so a future new call site that forgets
to supply one fails immediately, not months later under an obscure
replay scenario.
Key choices, both stable across HTTP retries of the same logical
call:
* CreatePayment → order.ID.String() (GORM BeforeCreate populates
order.ID before the PSP call in ConfirmOrder).
* CreateRefund → pendingRefund.ID.String() (populated by the
Phase 1 tx.Create in RefundOrder, available for the Phase 2 PSP
call).
Scope note (reproduced here for the next reader who grep-s the
commit log for "Idempotency-Key"):
Idempotency-Key covers HTTP-transport retry (TLS reconnect,
proxy retry, DNS flap) within a single CreatePayment /
CreateRefund invocation. It does NOT cover application-level
replay (user double-click, form double-submit, retry after crash
before DB write). That class of bug requires state-machine
preconditions on VEZA side — already addressed by the order
state machine + the handler-level guards on POST
/api/v1/payments (for payments) and the partial UNIQUE on
`refunds.hyperswitch_refund_id` landed in v1.0.6.1 (for refunds).
Hyperswitch TTL on Idempotency-Key: typically 24h-7d server-side
(verify against current PSP docs). Beyond TTL, a retry with the
same key is treated as a new request. Not a concern at current
volumes; document if retry logic ever extends beyond 1 hour.
Explicitly out of scope: item D does NOT add application-level
retry logic. The current "try once, fail loudly" behavior on PSP
errors is preserved. Adding retries is a separate design exercise
(backoff, max attempts, circuit breaker) not part of this commit.
Interfaces changed:
* hyperswitch.Client.CreatePayment(ctx, idempotencyKey, ...)
* hyperswitch.Client.CreatePaymentSimple(...) convenience wrapper
* hyperswitch.Client.CreateRefund(ctx, idempotencyKey, ...)
* hyperswitch.Provider.CreatePayment threads through
* hyperswitch.Provider.CreateRefund threads through
* marketplace.PaymentProvider interface — first param after ctx
* marketplace.refundProvider interface — first param after ctx
Removed:
* hyperswitch.Provider.Refund (zero callers, superseded by
CreateRefund which returns (refund_id, status, err) and is the
only method marketplace's refundProvider cares about).
Tests:
* Two new httptest.Server-backed tests (client_test.go) pin the
Idempotency-Key header value for CreatePayment and CreateRefund.
* Two new empty-key tests confirm the client errors rather than
silently sending no header.
* TestRefundOrder_OpensPendingRefund gains an assertion that
f.provider.lastIdempotencyKey == refund.ID.String() — if a
future refactor threads the key from somewhere else (paymentID,
uuid.New() per call, etc.) the test fails loudly.
* Four pre-existing test mocks updated for the new signature
(mockRefundPaymentProvider in marketplace, mockPaymentProvider
in tests/integration and tests/contract, mockRefundPayment
Provider in tests/integration/refund_flow).
Subscription's CreateSubscriptionPayment interface declares its own
shape and has no live Hyperswitch-backed implementation today —
v1.0.6.2 noted this as the payment-gate bypass surface, v1.0.7
item G will ship the real provider. When that lands, item G's
implementation threads the idempotency key through in the same
pattern (documented in v107-plan.md item G acceptance).
CHANGELOG v1.0.7-rc1 entry updated with the full item D scope note
and the "out of scope: retries" caveat.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
16 KiB
v1.0.7 — plan structuré
Derived from
axis-1-correctness.md. The v1.0.6 CHANGELOG listed 4 "parked v1.0.7" items; the axis-1 audit added 4 P0 findings. De-duplicated and sequenced below.
The 9 items (was 8, de-duplicated → 6; +1 after the 2026-04-17 Q2 probe)
| # | From | Title | Effort |
|---|---|---|---|
| A | audit P0.1 | Persist stripe_transfer_id in seller_transfers |
S |
| B | audit P0.2 ≡ CHANGELOG "Stripe Connect reversal" | Connect reversal via reversal_pending state + async worker |
M |
| C | audit P0.3 | Reconciliation sweep for stuck orders / refunds | M |
| D | audit P0.4 | Idempotency-Key on CreatePayment / CreateRefund |
XS |
| E | audit P1.5 | Webhook raw-payload log table + insert | S |
| F | audit P1.8 | Ledger-health Prometheus metrics + alerts | S |
| G | audit P0.12 follow-up (post v1.0.6.2 hotfix) | Subscription pending_payment state + webhook-driven activation; replace if s.paymentProvider != nil short-circuit |
M |
Dropped from v1.0.7 scope:
- Partial refunds (CHANGELOG-parked) — P2 in audit, feature-class, defer to v1.0.8
CloudUploadModalsingle-source-of-truth (CHANGELOG-parked) — P2, out of money-movement scope- Sandbox smoke-test documentation — landed de facto in v1.0.6.1 via the partial-UNIQUE hotfix + the smoke harness artefacts
Effort legend — XS ≤ 2h, S ≤ 1 day, M ≤ 3 days, L > 3 days.
Dependency graph
┌──────────────────┐
│ D Idempotency │ independent, can land first as quick win
└──────────────────┘
┌──────────────────┐
│ A transfer_id │ prerequisite for B
└────────┬─────────┘
│
▼
┌──────────────────┐
│ B reversal │ uses transfer_id persisted by A;
│ worker │ introduces `reversal_pending` status
└──────────────────┘
┌──────────────────┐
│ E webhook log │ independent; prerequisite for P1.6 (disputes)
└──────────────────┘
┌──────────────────┐
│ C reconciler │ independent, but:
│ sweep │ metrics (F) track its effectiveness
└────────┬─────────┘
│
▼
┌──────────────────┐
│ F metrics │ needs the state-shape C produces; alerts wire
│ + alerts │ to the buckets the reconciler defines
└──────────────────┘
┌──────────────────┐
│ G subscription │ independent of A/B/C/D/E/F; shares the
│ pending_ │ `pending_payment` pattern with B's
│ payment │ `reversal_pending`. Builds on v1.0.6.2
└──────────────────┘ hotfix (which compensates via filter).
Three parallel tracks:
- Track 1 (reversal correctness) — A → B
- Track 2 (operational visibility) — D, E, C → F
- Track 3 (subscription creation path) — G (single item, independent)
Two developers can work in parallel without stepping on each other. A single developer sequences as ordered: D first (XS quick win, earns trust + unblocks "pre-open" checklist), then A→B, then E, then C→F, then G. G can also run in parallel at any point after D — it shares no data-model surface with the other items.
Commit sequence (single-developer path)
Each item lands as its own commit in the existing v1.0.6-style cadence (per-commit tests + CHANGELOG-worthy).
1. fix(hyperswitch): idempotency-key on create-payment and create-refund — D
Effort: XS. Pure header addition. Tests: the 15-case refund suite
already exists; add 2 cases verifying the header is set correctly
(httptest.Server assertion on r.Header.Get("Idempotency-Key")).
Acceptance (landed in commit TBD — this entry pinned ahead):
- Every outbound
POST /paymentscarriesIdempotency-Key: <order.id>. - Every outbound
POST /refundscarriesIdempotency-Key: <refund.id>. - No implicit-via-ctx magic: each call site sets the header explicitly, greppable.
- Empty idempotency key returns an error from the client (loud failure, not silent header omission).
- CHANGELOG entry cross-references P0.4 + its scope note (HTTP retry only, not app-level replay).
Status — landed 2026-04-18 alongside item B day 3 closure. Subscription's CreateSubscriptionPayment interface still lacks a live Hyperswitch impl (deferred to item G); that's where the remaining idempotency-key plumbing goes.
TTL caveat — Hyperswitch (like most PSPs) honours Idempotency-Key
server-side only for a finite window: 24 h is common, 7 d at the high
end. Beyond the TTL, a replayed call with the same key is treated as
a new request. The header therefore protects against HTTP-layer retries
within a single request cycle, not against long-tail application
replay scenarios (for which the application-level idempotency primitives
— order.id on payments, the partial UNIQUE on refunds.hyperswitch_refund_id
landed in v1.0.6.1 — are the load-bearing guards). Verify the exact TTL
against current Hyperswitch docs before landing and note it in the
CHANGELOG scope sentence so anyone reading later knows the envelope.
Ship as v1.0.7-alpha-1 for sandbox testing, don't wait for B/C.
2. refactor(connect): persist stripe_transfer_id on create + retry — A
Effort: S. Touches the TransferService interface (minor breaking
change — but only internal callers). Migration:
981_seller_transfers_stripe_reversal_id.sql adds stripe_reversal_id
nullable column (prepares ground for B). Note — bumped from 980 to
981 because v1.0.6.2 used 980 for the unpaid-subscription cleanup;
all subsequent v1.0.7 migration numbers in this plan shift by +1 when
they land.
Acceptance:
TransferService.CreateTransfer(...) (string, error)— returns the Stripe transfer id.processSellerTransferspersistsst.StripeTransferIDbeforetx.Create(&st).TransferRetryWorker.retryOnealso persists on retry success.- Backfill: one-shot migration query that fills known transfer_ids
for past orders by calling Stripe's
transfers.List(Destination=..., Metadata[order_id]=...). Acceptable to leaveNULLwhere Stripe has no match (document: "pre-v1.0.7 transfers cannot be reversed automatically; use admin API P2.9").
3. feat(marketplace): async stripe connect reversal worker — B
Effort: M. The big one. Introduces seller_transfers.status = 'reversal_pending' as a new intermediate terminal-avoidance state.
Migration: 981_seller_transfers_reversal_pending_enum.sql.
Acceptance:
reverseSellerAccountingtransitionsseller_transfers.statustoreversal_pending, setsnext_retry_at = NOW(). Refund finalization completes end-to-end — buyer never sees a Stripe-health-dependent refund UX.- New
StripeReversalWorkerininternal/workers/stripe_reversal.goruns everyTRANSFER_RETRY_INTERVAL(reuse existing env var), processesreversal_pendingrows. Callstransfer.NewReversal(stripeTransferID, ¶ms). - On success →
status='reversed', persiststripe_reversal_id. - On Stripe 404 →
status='reversed'+ log INFO (treat as "already reversed out-of-band"). - On other errors → exponential backoff via
next_retry_at+retry_count, hitpermanently_failedceiling after N retries (mirrorTransferRetryWorker). - Tests: mock the stripe SDK via a local httptest.Server (harder than
the Hyperswitch mock because stripe-go's HTTP layer is less trivial
to re-point, but feasible via
stripe.SetBackend). 8 cases:- happy path: pending → reversal_pending → reversed + id persisted
- Stripe 404: reversal_pending → reversed + log
- Stripe 5xx transient: retry_count increments, backoff set
- max retries: → permanently_failed
- concurrent worker (two instances pick same row): lock wins
- StripeTransferID empty (legacy row): skip + ERROR log + counter
- reversal idempotency (Stripe dedupes same transferID): no-op
- worker graceful shutdown mid-reversal (context cancellation)
4. feat(webhooks): persist raw hyperswitch payloads to audit log — E
Effort: S. Migration: 982_hyperswitch_webhook_log.sql. Insert
before signature verification (so we capture attack attempts too).
Acceptance:
- Every webhook landing on
/webhooks/hyperswitchproduces exactly one row, regardless of signature-valid or processing outcome. processing_resultfield captures'ok','error: <msg>', or'skipped'.- Retention: a
CleanupWebhookLogworker in the sameinternal/jobs/package as the orphan-tracks cleaner, daily, deletes rows older than 90 days. - Tests: 3 cases (valid signature + processing ok; invalid signature; processing error).
5. feat(workers): hyperswitch reconciliation sweep for stuck pending states — C
Effort: M. New ReconcileHyperswitchWorker in internal/jobs/.
Hourly by default (RECONCILE_INTERVAL=1h), but exposed so ops can
drop to 5m during incident response.
Acceptance:
- Orders in
pending> 30min:GET /payments/:id, call the sameProcessPaymentWebhookinternal dispatcher with a synthesised payload. Idempotent with real webhooks via the existing terminal-state guard. - Refunds in
pendingwith non-emptyhyperswitch_refund_id> 30min:GET /refunds/:id, same pattern withProcessRefundWebhook. - Refunds in
pendingwith EMPTYhyperswitch_refund_id> 5min: markfailed, roll order back tocompleted, log ERROR (operator attention needed — something crashed between Phase 1 and Phase 2 ofRefundOrder). - Tests: happy sync, no-op when everything is terminal, the empty-refund_id auto-fail case.
- Structured log on every action taken so
grep reconciletells the ops story.
6. feat(metrics): ledger-health gauges + alert rules — F
Effort: S. New file internal/metrics/ledger_health.go with a
60s sampler. Grafana dashboard JSON in config/grafana/ledger.json.
Prometheus alert rules in config/alertmanager/ledger.yml.
Acceptance:
- 5 gauge metrics (listed in P1.8 action).
- 2 alert rules (
stuck_orders > 0 for 10m,orphan_refunds > 0 for 5m). - Sampler queries are cheap:
SELECT COUNT(*) WHERE ... AND created_at < NOW() - INTERVAL '30 min'per metric, indexed bystatus + created_at. - If indexes are missing: migration
983_ledger_health_indexes.sql.
7. feat(subscription): pending_payment state + webhook-driven activation — G
Effort: M. The follow-up to v1.0.6.2's gate-filter compensation.
v1.0.6.2 closed the feature bypass at the consumption site
(GetUserSubscription filters fantôme rows out); G replaces the
creation path so no fantôme rows get written in the first place.
Migrations:
984_subscription_pending_payment_enum.sqladds'pending_payment'to theuser_subscriptions.statusVARCHAR (no enum at DB level, so just a documentation + backfill step for Go constants).985_backfill_hs_subscription_id_from_invoice.sql(optional) — backpopulatesuser_subscriptions.hyperswitch_subscription_idfrom the attached invoice's PSP intent where available. Documents the join rule for the webhook to reconcile on.
Acceptance:
subscription/service.go:createNewSubscriptioncreates paid-plan rows inpending_paymentstate (neveractive). Transition toactiveonly viaProcessSubscriptionWebhookonpayment_succeeded. Onpayment_failed: transition toexpired- log INFO; no invoice charge, no DB fantôme.
if s.paymentProvider != nilshort-circuit deleted — paid plans without a payment provider configured return503 payment provider not configured(env misconfig is an ops issue, not a silently-free subscription).GET /me/subscriptionhandlespending_paymentexplicitly — returnsstatus: pending_payment,client_secretechoed back so the frontend can resume a stalled flow.- Recovery endpoint
POST /api/v1/subscriptions/complete/:id(or reuse the existing Subscribe response'sclient_secret) that the frontend can route the user to when the distribution handler returns the "complete payment" message. Without a real endpoint, the v1.0.6.2 error message is a dead end for users who landed in fantôme state via a broken flow (no payment method saved, network error mid-confirmation, etc.). Document the target route the frontend should redirect to in the handler response payload. distribution.checkEligibilitytreatspending_paymentas ineligible (same as the v1.0.6.2ErrSubscriptionNoPaymentpath).- Remove the TODO(v1.0.7-item-G) annotation; remove the v1.0.6.2
filter from
GetUserSubscription(redundant once the creation path is correct) — OR keep the filter as defence-in-depth and note it in the code comment. - Webhook dispatcher: new
subscriptionevent family. Reuse thewebhook_raw_payloadstable from item E for persistence. - Tests:
- Subscribe to paid plan with Hyperswitch enabled →
pending_paymentrow + PSP intent, no feature access yet. - Webhook
subscription.payment_succeeded→ row transitions toactive, feature access granted. - Webhook
subscription.payment_failed→ row transitions toexpired, no charge, no access. - Webhook replay (same payment_id) is idempotent.
- Subscribe with provider misconfigured → 503, no row created.
- Migration of v1.0.6.2 voided rows — check
voided_subscriptions_20260417entries stay readable and not re-pickable by the new flow. - E2E Playwright @critical:
POST /subscribefollowed byPOST /distribution/submitasserts 403 with the "complete payment" message until the payment webhook fires. Today's regression coverage is the shell probe + Go unit tests — neither runs on every commit. Wiring a Playwright @critical test turns the probe into a gate so a refactor ofSubscribeorcheckEligibilitycannot silently re-open the bypass.
- Subscribe to paid plan with Hyperswitch enabled →
Independent of A/B/C/D/E/F. Can land at any point after D.
Release gating
Before tagging v1.0.7:
- All seven items landed as separate commits.
- Refund smoke test (see
docs/audit-2026-04/smoke/refund.md— TBD) re-run against sandbox with the reversal path: assert Stripe side shows the reversal as well as the refund. - Manual reconciliation pass: query the live DB against Hyperswitch
dashboard for all
completedorders in the past week → zero drift. Document the script used (scripts/reconcile-check.sh), keep for periodic re-use. - CHANGELOG entry lists each commit + criticity it resolves.
Cut v1.0.7 as a minor release (not patch), because the interface
change in A is technically breaking for any external caller of
TransferService (there are none today, but naming it 1.0.7 signals
the API change).
Effort total
Worst-case single-developer sequential: ~12-13 working days (XS + S + M + S + M + S + M = 2h + 1d + 3d + 1d + 3d + 1d + 3d).
With two devs on the parallel tracks: ~7 working days end-to-end — track 1 (A→B, 4d), track 2 (D, E, C→F, 5d), track 3 (G, 3d, can run at any point after D). Track 3 extends the critical path unless a third dev picks it up; with 3 devs it lands in the track-2 window.
Unknowns to resolve before starting
- Volume order of magnitude — informs whether the reconciliation
sweep's default interval of 1h is appropriate. If we're processing
1000 orders/day, drop to 15m. Ops confirmation needed.
- Stripe Connect account state at time of v1.0.7 deploy — any
pre-v1.0.7 transfers lack
stripe_transfer_idin DB. The backfill migration in A needs to know how many rows to probe and how many will successfully match. Acceptable ceiling: if > 5% of historical transfers fail to backfill, escalate to manual reconciliation session. - Hyperswitch sandbox vs prod webhook secret rotation — item D
changes outbound request headers only, no impact. Items B+C call
Hyperswitch read endpoints, which use the same
HYPERSWITCH_API_KEY. No secret-rotation concern.