Every POST /webhooks/hyperswitch delivery now writes a row to
`hyperswitch_webhook_log` regardless of signature-valid or
processing outcome. Captures both legitimate deliveries and attack
probes — a forensics query now has the actual bytes to read, not
just a "webhook rejected" log line. Disputes (axis-1 P1.6) ride
along: the log captures dispute.* events alongside payment and
refund events, ready for when disputes get a handler.
Table shape (migration 984):
* payload TEXT — readable in psql, invalid UTF-8 replaced with
empty (forensics value is in headers + ip + timing for those
attacks, not the binary body).
* signature_valid BOOLEAN + partial index so "show me attack
attempts" queries are instantaneous.
* processing_result TEXT — 'ok' / 'error: <msg>' /
'signature_invalid' / 'skipped'. Matches the P1.5 action
semantic exactly.
* source_ip, user_agent, request_id — forensics essentials.
request_id is captured from Hyperswitch's X-Request-Id header
when present, else a server-side UUID so every row correlates
to VEZA's structured logs.
* event_type — best-effort extract from the JSON payload, NULL
on malformed input.
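A minimal sketch of the best-effort extraction and the payload-to-TEXT rule described above. This is not the repository's actual extractEventType; the top-level "event_type" JSON key and the helper name payloadText are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// extractEventType returns the event type from a webhook body, or ok=false
// when the body is not JSON or carries no usable string value. The caller
// would store NULL in event_type when ok is false.
// Assumption: the type lives in a top-level "event_type" key.
func extractEventType(body []byte) (eventType string, ok bool) {
	var envelope struct {
		EventType *string `json:"event_type"`
	}
	if err := json.Unmarshal(body, &envelope); err != nil {
		return "", false
	}
	if envelope.EventType == nil || *envelope.EventType == "" {
		return "", false
	}
	return *envelope.EventType, true
}

// payloadText returns the body as text for the payload column, or the empty
// string when the bytes are not valid UTF-8.
func payloadText(body []byte) string {
	if !utf8.Valid(body) {
		return ""
	}
	return string(body)
}

func main() {
	for _, body := range []string{
		`{"event_type":"payment_succeeded"}`,
		`{"event_type":42}`,
		`not json`,
	} {
		eventType, ok := extractEventType([]byte(body))
		fmt.Printf("%q -> type=%q ok=%v\n", body, eventType, ok)
	}
}
```

Returning a separate ok flag keeps "absent" distinct from "empty string", which maps directly onto a nullable column.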
Hardening:
* 64KB body cap via io.LimitReader rejects oversize with 413
before any INSERT — prevents log-spam DoS.
* Single INSERT per delivery with final state; no two-phase
update race on signature-failure path. signature_invalid and
processing-error rows both land.
* DB persistence failures are logged but swallowed — the
endpoint's contract is to ack Hyperswitch, not perfect audit.
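The body cap and the swallow-on-persist-failure rule can be sketched as follows. This is an illustration under assumptions, not the shipped handler: webhookHandler, readCapped, the persist callback and statusFor are hypothetical names, and signature verification is left out to keep the sketch short.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

const maxWebhookBody = 64 << 10 // 64 KiB cap

var errBodyTooLarge = errors.New("webhook body exceeds cap")

// readCapped reads at most limit bytes. It lets the LimitReader pass one
// extra byte so an oversize body is detected instead of silently truncated.
func readCapped(r io.Reader, limit int64) ([]byte, error) {
	body, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(body)) > limit {
		return nil, errBodyTooLarge
	}
	return body, nil
}

// webhookHandler rejects oversize bodies with 413 before persist is called,
// and still acks with 200 when persist itself fails.
func webhookHandler(persist func(body []byte) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := readCapped(r.Body, maxWebhookBody)
		if errors.Is(err, errBodyTooLarge) {
			http.Error(w, "payload too large", http.StatusRequestEntityTooLarge)
			return
		}
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		if err := persist(body); err != nil {
			// Logged but swallowed: an audit failure must not block the ack.
			fmt.Println("webhook log insert failed:", err)
		}
		w.WriteHeader(http.StatusOK)
	}
}

// statusFor drives the handler with an n-byte body and reports the HTTP
// status and how many times persist was called.
func statusFor(n int, persistErr error) (status, inserts int) {
	h := webhookHandler(func([]byte) error { inserts++; return persistErr })
	req := httptest.NewRequest(http.MethodPost, "/webhooks/hyperswitch", bytes.NewReader(make([]byte, n)))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, inserts
}

func main() {
	fmt.Println(statusFor(1024, nil))
	fmt.Println(statusFor(maxWebhookBody+1, nil))
}
```

Reading limit+1 bytes is the design choice that matters here: a bare io.LimitReader at exactly the cap cannot distinguish a body of exactly 64 KiB from a larger one.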
Retention sweep:
* CleanupHyperswitchWebhookLog in internal/jobs, daily tick,
batched DELETE (10k rows + 100ms pause) so a large backlog
doesn't lock the table.
* HYPERSWITCH_WEBHOOK_LOG_RETENTION_DAYS (default 90).
* Same goroutine-ticker pattern as ScheduleOrphanTracksCleanup.
* Wired in cmd/api/main.go alongside the existing cleanup jobs.
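A sketch of the batched sweep loop, assuming the database call is hidden behind a small function type so the loop can be exercised without Postgres. The names batchDeleter, sweep, fakeBacklog and drain are hypothetical, as are the column names in the SQL comment.

```go
package main

import (
	"context"
	"fmt"
	"strconv"
	"time"
)

// batchDeleter deletes up to limit rows older than cutoff and reports how
// many it removed. A Postgres-backed version might run something like:
//
//	DELETE FROM hyperswitch_webhook_log WHERE id IN (
//	  SELECT id FROM hyperswitch_webhook_log WHERE created_at < $1 LIMIT $2)
//
// (id and created_at are assumed column names).
type batchDeleter func(ctx context.Context, cutoff time.Time, limit int) (int64, error)

// retentionDays parses the HYPERSWITCH_WEBHOOK_LOG_RETENTION_DAYS value and
// falls back to 90 when it is unset, invalid, or not positive.
func retentionDays(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return 90
	}
	return n
}

// sweep deletes in batches, pausing between full batches so a large backlog
// does not hold locks for long. It stops on a short batch, an error, or
// context cancellation.
func sweep(ctx context.Context, del batchDeleter, cutoff time.Time, batchSize int, pause time.Duration) (int64, error) {
	var total int64
	for {
		n, err := del(ctx, cutoff, batchSize)
		total += n
		if err != nil {
			return total, err
		}
		if n < int64(batchSize) {
			return total, nil // backlog drained
		}
		select {
		case <-ctx.Done():
			return total, ctx.Err()
		case <-time.After(pause):
		}
	}
}

// fakeBacklog is an in-memory deleter holding the given number of expired rows.
func fakeBacklog(rows int64) batchDeleter {
	return func(_ context.Context, _ time.Time, limit int) (int64, error) {
		n := int64(limit)
		if rows < n {
			n = rows
		}
		rows -= n
		return n, nil
	}
}

// drain runs sweep against a fake backlog with no pause.
func drain(rows int64, batchSize int) (int64, error) {
	cutoff := time.Now().AddDate(0, 0, -retentionDays(""))
	return sweep(context.Background(), fakeBacklog(rows), cutoff, batchSize, 0)
}

func main() {
	total, err := drain(25, 10)
	fmt.Println(total, err)
}
```

In the real job the batch size and pause would be the 10k rows and 100ms quoted above.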
Tests: 5 in webhook_log_test.go (persistence, request_id auto-gen,
invalid-JSON leaves event_type empty, invalid-signature capture,
extractEventType 5 sub-cases) + 4 in
cleanup_hyperswitch_webhook_log_test.go (deletes-older-than, noop,
default-on-zero, context-cancel). Migration 984 applied cleanly to
local Postgres; all indexes present.
Also (v107-plan.md):
* Item G acceptance gains an explicit Idempotency-Key threading
requirement with an empty-key loud-fail test — "literally
copy-paste D's 4-line test skeleton". Closes the risk that
item G silently reopens the HTTP-retry duplicate-charge
exposure D closed.
Out of scope for E (noted in CHANGELOG):
* Rate limit on the endpoint — pre-existing middleware covers
it at the router level; adding a per-endpoint limit is
separate scope.
* Readable-payload SQL view — deferred, the TEXT column is
already human-readable; a convenience view is a nice-to-have
not a ship-blocker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1.0.7: structured plan
Derived from axis-1-correctness.md. The v1.0.6 CHANGELOG listed 4 "parked v1.0.7" items; the axis-1 audit added 4 P0 findings. De-duplicated and sequenced below.
The 7 items (was 8, de-duplicated → 6; +1 after the 2026-04-17 Q2 probe)
| # | From | Title | Effort |
|---|---|---|---|
| A | audit P0.1 | Persist `stripe_transfer_id` in `seller_transfers` | S |
| B | audit P0.2 ≡ CHANGELOG "Stripe Connect reversal" | Connect reversal via `reversal_pending` state + async worker | M |
| C | audit P0.3 | Reconciliation sweep for stuck orders / refunds | M |
| D | audit P0.4 | Idempotency-Key on CreatePayment / CreateRefund | XS |
| E | audit P1.5 | Webhook raw-payload log table + insert | S |
| F | audit P1.8 | Ledger-health Prometheus metrics + alerts | S |
| G | audit P0.12 follow-up (post v1.0.6.2 hotfix) | Subscription `pending_payment` state + webhook-driven activation; replace `if s.paymentProvider != nil` short-circuit | M |
Dropped from v1.0.7 scope:
- Partial refunds (CHANGELOG-parked): P2 in audit, feature-class, defer to v1.0.8
- `CloudUploadModal` single-source-of-truth (CHANGELOG-parked): P2, out of money-movement scope
- Sandbox smoke-test documentation: landed de facto in v1.0.6.1 via the partial-UNIQUE hotfix + the smoke harness artefacts
Effort legend — XS ≤ 2h, S ≤ 1 day, M ≤ 3 days, L > 3 days.
Dependency graph
┌──────────────────┐
│ D Idempotency │ independent, can land first as quick win
└──────────────────┘
┌──────────────────┐
│ A transfer_id │ prerequisite for B
└────────┬─────────┘
│
▼
┌──────────────────┐
│ B reversal │ uses transfer_id persisted by A;
│ worker │ introduces `reversal_pending` status
└──────────────────┘
┌──────────────────┐
│ E webhook log │ independent; prerequisite for P1.6 (disputes)
└──────────────────┘
┌──────────────────┐
│ C reconciler │ independent, but:
│ sweep │ metrics (F) track its effectiveness
└────────┬─────────┘
│
▼
┌──────────────────┐
│ F metrics │ needs the state-shape C produces; alerts wire
│ + alerts │ to the buckets the reconciler defines
└──────────────────┘
┌──────────────────┐
│ G subscription │ independent of A/B/C/D/E/F; shares the
│ pending_ │ `pending_payment` pattern with B's
│ payment │ `reversal_pending`. Builds on v1.0.6.2
└──────────────────┘ hotfix (which compensates via filter).
Three parallel tracks:
- Track 1 (reversal correctness) — A → B
- Track 2 (operational visibility) — D, E, C → F
- Track 3 (subscription creation path) — G (single item, independent)
Two developers can work in parallel without stepping on each other. A single developer sequences as ordered: D first (XS quick win, earns trust + unblocks "pre-open" checklist), then A→B, then E, then C→F, then G. G can also run in parallel at any point after D — it shares no data-model surface with the other items.
Commit sequence (single-developer path)
Each item lands as its own commit in the existing v1.0.6-style cadence (per-commit tests + CHANGELOG-worthy).
1. fix(hyperswitch): idempotency-key on create-payment and create-refund — D
Effort: XS. Pure header addition. Tests: the 15-case refund suite
already exists; add 2 cases verifying the header is set correctly
(httptest.Server assertion on r.Header.Get("Idempotency-Key")).
Acceptance (landed in commit TBD — this entry pinned ahead):
- Every outbound `POST /payments` carries `Idempotency-Key: <order.id>`.
- Every outbound `POST /refunds` carries `Idempotency-Key: <refund.id>`.
- No implicit-via-ctx magic: each call site sets the header explicitly, greppable.
- Empty idempotency key returns an error from the client (loud failure, not silent header omission).
- CHANGELOG entry cross-references P0.4 + its scope note (HTTP retry only, not app-level replay).
Status — landed 2026-04-18 alongside item B day 3 closure. Subscription's CreateSubscriptionPayment interface still lacks a live Hyperswitch impl (deferred to item G); that's where the remaining idempotency-key plumbing goes.
TTL caveat — Hyperswitch (like most PSPs) honours Idempotency-Key
server-side only for a finite window: 24 h is common, 7 d at the high
end. Beyond the TTL, a replayed call with the same key is treated as
a new request. The header therefore protects against HTTP-layer retries
within a single request cycle, not against long-tail application
replay scenarios (for which the application-level idempotency primitives
— order.id on payments, the partial UNIQUE on refunds.hyperswitch_refund_id
landed in v1.0.6.1 — are the load-bearing guards). Verify the exact TTL
against current Hyperswitch docs before landing and note it in the
CHANGELOG scope sentence so anyone reading later knows the envelope.
Ship as v1.0.7-alpha-1 for sandbox testing, don't wait for B/C.
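The acceptance criteria for item D (explicit header at the call site, loud failure on an empty key, httptest.Server assertion) could look roughly like the sketch below. The `client` type, its `post` method and the `headerSeenFor` helper are hypothetical stand-ins, not the project's Hyperswitch client.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

var errEmptyIdempotencyKey = errors.New("idempotency key must not be empty")

// client is a hypothetical stand-in for the outbound Hyperswitch client.
type client struct {
	baseURL string
	http    *http.Client
}

// post sends body to path with an explicit Idempotency-Key header.
// An empty key is an error, never a silently omitted header.
func (c *client) post(ctx context.Context, path, idempotencyKey string, body []byte) (int, error) {
	if idempotencyKey == "" {
		return 0, errEmptyIdempotencyKey
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.baseURL+path, bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Idempotency-Key", idempotencyKey)
	resp, err := c.http.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// headerSeenFor posts to a local test server and returns the
// Idempotency-Key value the server observed.
func headerSeenFor(key string) (string, error) {
	seen := make(chan string, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen <- r.Header.Get("Idempotency-Key")
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	c := &client{baseURL: srv.URL, http: srv.Client()}
	if _, err := c.post(context.Background(), "/payments", key, []byte(`{}`)); err != nil {
		return "", err
	}
	return <-seen, nil
}

func main() {
	got, err := headerSeenFor("order-123")
	fmt.Println(got, err)
	_, err = headerSeenFor("")
	fmt.Println(err)
}
```

Taking the key as a plain parameter, rather than reading it from the context, is what keeps every call site greppable.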
2. refactor(connect): persist stripe_transfer_id on create + retry — A
Effort: S. Touches the TransferService interface (minor breaking
change — but only internal callers). Migration:
981_seller_transfers_stripe_reversal_id.sql adds stripe_reversal_id
nullable column (prepares ground for B). Note — bumped from 980 to
981 because v1.0.6.2 used 980 for the unpaid-subscription cleanup;
all subsequent v1.0.7 migration numbers in this plan shift by +1 when
they land.
Acceptance:
- `TransferService.CreateTransfer(...) (string, error)`: returns the Stripe transfer id.
- `processSellerTransfers` persists `st.StripeTransferID` before `tx.Create(&st)`.
- `TransferRetryWorker.retryOne` also persists on retry success.
- Backfill: one-shot migration query that fills known transfer_ids for past orders by calling Stripe's `transfers.List(Destination=..., Metadata[order_id]=...)`. Acceptable to leave `NULL` where Stripe has no match (document: "pre-v1.0.7 transfers cannot be reversed automatically; use admin API P2.9").
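To illustrate the ordering requirement (obtain the transfer id, set it on the row, then persist), here is a small sketch. The plan elides the parameter list of `CreateTransfer`, so the parameters below are hypothetical, as are `SellerTransfer`, `recordTransfer`, `fakeStripe` and the `save` callback standing in for `tx.Create`.

```go
package main

import (
	"errors"
	"fmt"
)

// TransferService reflects the changed return type: the Stripe transfer id
// is returned alongside the error. The parameters are illustrative only.
type TransferService interface {
	CreateTransfer(destination string, amountCents int64, orderID string) (string, error)
}

// SellerTransfer is a cut-down, hypothetical row shape.
type SellerTransfer struct {
	OrderID          string
	StripeTransferID string
}

// recordTransfer calls Stripe first, copies the returned id onto the row,
// and only then hands the row to save, so a persisted row always carries
// the id that a later reversal needs.
func recordTransfer(svc TransferService, save func(*SellerTransfer) error, destination string, amountCents int64, orderID string) (*SellerTransfer, error) {
	id, err := svc.CreateTransfer(destination, amountCents, orderID)
	if err != nil {
		return nil, fmt.Errorf("create transfer: %w", err)
	}
	if id == "" {
		return nil, errors.New("create transfer: empty transfer id")
	}
	st := &SellerTransfer{OrderID: orderID, StripeTransferID: id}
	if err := save(st); err != nil {
		return nil, fmt.Errorf("persist transfer: %w", err)
	}
	return st, nil
}

// fakeStripe is a test double returning a fixed id or error.
type fakeStripe struct {
	id  string
	err error
}

func (f fakeStripe) CreateTransfer(string, int64, string) (string, error) {
	return f.id, f.err
}

func main() {
	st, err := recordTransfer(fakeStripe{id: "tr_123"}, func(*SellerTransfer) error { return nil }, "acct_1", 500, "order-1")
	fmt.Println(st, err)
}
```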
3. feat(marketplace): async stripe connect reversal worker — B
Effort: M. The big one. Introduces seller_transfers.status = 'reversal_pending' as a new intermediate terminal-avoidance state.
Migration: 981_seller_transfers_reversal_pending_enum.sql.
Acceptance:
- `reverseSellerAccounting` transitions `seller_transfers.status` to `reversal_pending`, sets `next_retry_at = NOW()`. Refund finalization completes end-to-end: the buyer never sees a Stripe-health-dependent refund UX.
- New `StripeReversalWorker` in `internal/workers/stripe_reversal.go` runs every `TRANSFER_RETRY_INTERVAL` (reuse existing env var), processes `reversal_pending` rows. Calls `transfer.NewReversal(stripeTransferID, &params)`.
- On success → `status='reversed'`, persist `stripe_reversal_id`.
- On Stripe 404 → `status='reversed'` + log INFO (treat as "already reversed out-of-band").
- On other errors → exponential backoff via `next_retry_at` + `retry_count`, hit `permanently_failed` ceiling after N retries (mirror `TransferRetryWorker`).
- Tests: mock the stripe SDK via a local httptest.Server (harder than the Hyperswitch mock because stripe-go's HTTP layer is less trivial to re-point, but feasible via `stripe.SetBackend`). 8 cases:
  - happy path: pending → reversal_pending → reversed + id persisted
  - Stripe 404: reversal_pending → reversed + log
  - Stripe 5xx transient: retry_count increments, backoff set
  - max retries: → permanently_failed
  - concurrent worker (two instances pick same row): lock wins
  - StripeTransferID empty (legacy row): skip + ERROR log + counter
  - reversal idempotency (Stripe dedupes same transferID): no-op
  - worker graceful shutdown mid-reversal (context cancellation)
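The outcome handling in the acceptance list (success, 404, other errors with backoff and a retry ceiling) can be modelled as a pure function over the row, which makes most of the eight cases testable without Stripe. This is a sketch under assumptions: `maxRetries`, `baseBackoff`, the sentinel errors and the `transferRow` shape are all hypothetical, since the plan only says "N retries" and "mirror `TransferRetryWorker`".

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	statusReversalPending   = "reversal_pending"
	statusReversed          = "reversed"
	statusPermanentlyFailed = "permanently_failed"

	maxRetries  = 5           // hypothetical ceiling; the plan only says "N"
	baseBackoff = time.Minute // hypothetical first retry delay
)

// Stand-ins for errors a real worker would derive from Stripe responses.
var (
	errStripeNotFound    = errors.New("stripe: 404 not found")
	errStripeUnavailable = errors.New("stripe: 503 unavailable")
)

type transferRow struct {
	Status           string
	StripeReversalID string
	RetryCount       int
	NextRetryAt      time.Time
}

// backoffFor doubles the delay on every retry: base, 2*base, 4*base, ...
func backoffFor(retryCount int) time.Duration {
	if retryCount < 1 {
		return baseBackoff
	}
	return baseBackoff << (retryCount - 1)
}

// applyReversalResult folds the result of one reversal attempt into the row.
func applyReversalResult(row transferRow, reversalID string, err error, now time.Time) transferRow {
	switch {
	case err == nil:
		row.Status = statusReversed
		row.StripeReversalID = reversalID
	case errors.Is(err, errStripeNotFound):
		// Treated as already reversed out-of-band; the caller logs at INFO.
		row.Status = statusReversed
	default:
		row.RetryCount++
		if row.RetryCount >= maxRetries {
			row.Status = statusPermanentlyFailed
			return row
		}
		row.NextRetryAt = now.Add(backoffFor(row.RetryCount))
	}
	return row
}

func main() {
	row := transferRow{Status: statusReversalPending}
	now := time.Now()
	for row.Status == statusReversalPending {
		row = applyReversalResult(row, "", errStripeUnavailable, now)
		fmt.Println(row.Status, row.RetryCount, row.NextRetryAt.Sub(now))
	}
}
```

Row locking for the concurrent-worker case and context cancellation are outside this sketch; they belong to the loop that selects rows and calls Stripe.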
4. feat(webhooks): persist raw hyperswitch payloads to audit log — E
Effort: S. Migration: 982_hyperswitch_webhook_log.sql. Insert
before signature verification (so we capture attack attempts too).
Acceptance:
- Every webhook landing on `/webhooks/hyperswitch` produces exactly one row, regardless of signature-valid or processing outcome.
- `processing_result` field captures `'ok'`, `'error: <msg>'`, or `'skipped'`.
- Retention: a `CleanupWebhookLog` worker in the same `internal/jobs/` package as the orphan-tracks cleaner, daily, deletes rows older than 90 days.
- Tests: 3 cases (valid signature + processing ok; invalid signature; processing error).
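One way to guarantee exactly one row per delivery is to compute the final state first and insert once, which is the approach the commit message at the top describes (it also adds a `'signature_invalid'` result). The sketch below follows that approach; `webhookLogRow`, `buildRow`, the `verify` and `process` callbacks, and the `errSkipped` sentinel are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// webhookLogRow is a cut-down, hypothetical shape of one log row.
type webhookLogRow struct {
	Payload          string
	SignatureValid   bool
	ProcessingResult string
}

// errSkipped is a hypothetical sentinel for events that have no handler yet.
var errSkipped = errors.New("event skipped")

// buildRow works out the final state of a delivery up front, so the caller
// can issue a single INSERT instead of an insert followed by an update.
func buildRow(payload string, verify func(string) bool, process func(string) error) webhookLogRow {
	row := webhookLogRow{Payload: payload}
	if !verify(payload) {
		// Invalid signatures are still recorded, but never processed.
		row.ProcessingResult = "signature_invalid"
		return row
	}
	row.SignatureValid = true
	switch err := process(payload); {
	case err == nil:
		row.ProcessingResult = "ok"
	case errors.Is(err, errSkipped):
		row.ProcessingResult = "skipped"
	default:
		row.ProcessingResult = "error: " + err.Error()
	}
	return row
}

func main() {
	valid := func(string) bool { return true }
	fmt.Printf("%+v\n", buildRow(`{}`, valid, func(string) error { return nil }))
	fmt.Printf("%+v\n", buildRow(`{}`, valid, func(string) error { return errors.New("db timeout") }))
}
```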
5. feat(workers): hyperswitch reconciliation sweep for stuck pending states — C
Effort: M. New ReconcileHyperswitchWorker in internal/jobs/.
Hourly by default (RECONCILE_INTERVAL=1h), but exposed so ops can
drop to 5m during incident response.
Acceptance:
- Orders in `pending` > 30min: `GET /payments/:id`, call the same `ProcessPaymentWebhook` internal dispatcher with a synthesised payload. Idempotent with real webhooks via the existing terminal-state guard.
- Refunds in `pending` with non-empty `hyperswitch_refund_id` > 30min: `GET /refunds/:id`, same pattern with `ProcessRefundWebhook`.
- Refunds in `pending` with EMPTY `hyperswitch_refund_id` > 5min: mark `failed`, roll order back to `completed`, log ERROR (operator attention needed: something crashed between Phase 1 and Phase 2 of `RefundOrder`).
- Tests: happy sync, no-op when everything is terminal, the empty-refund_id auto-fail case.
- Structured log on every action taken so `grep reconcile` tells the ops story.
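The three reconciliation rules reduce to a small decision function per row. The sketch below captures only that classification; the `action` type and the function names are hypothetical, and the actual HTTP calls and dispatcher hand-off are left out.

```go
package main

import (
	"fmt"
	"time"
)

type action string

const (
	actionNone action = "none"
	actionSync action = "sync" // fetch state from Hyperswitch, feed the dispatcher
	actionFail action = "fail" // mark refund failed, roll the order back
)

const (
	syncAfter = 30 * time.Minute // stuck threshold for rows with a PSP id
	failAfter = 5 * time.Minute  // threshold for refunds that never got a PSP id
)

// orderAction decides what the sweep does with one order.
func orderAction(status string, age time.Duration) action {
	if status == "pending" && age > syncAfter {
		return actionSync
	}
	return actionNone
}

// refundAction decides what the sweep does with one refund.
func refundAction(status, hyperswitchRefundID string, age time.Duration) action {
	if status != "pending" {
		return actionNone // terminal rows are never touched
	}
	if hyperswitchRefundID == "" {
		if age > failAfter {
			return actionFail
		}
		return actionNone
	}
	if age > syncAfter {
		return actionSync
	}
	return actionNone
}

func main() {
	fmt.Println(orderAction("pending", 45*time.Minute))
	fmt.Println(refundAction("pending", "", 6*time.Minute))
	fmt.Println(refundAction("pending", "ref_1", 6*time.Minute))
}
```

Keeping the decision separate from the side effects makes the "no-op when everything is terminal" test a plain table test.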
6. feat(metrics): ledger-health gauges + alert rules — F
Effort: S. New file internal/metrics/ledger_health.go with a
60s sampler. Grafana dashboard JSON in config/grafana/ledger.json.
Prometheus alert rules in config/alertmanager/ledger.yml.
Acceptance:
- 5 gauge metrics (listed in P1.8 action).
- 2 alert rules (`stuck_orders > 0 for 10m`, `orphan_refunds > 0 for 5m`).
- Sampler queries are cheap: `SELECT COUNT(*) WHERE ... AND created_at < NOW() - INTERVAL '30 min'` per metric, indexed by `status + created_at`.
- If indexes are missing: migration `983_ledger_health_indexes.sql`.
7. feat(subscription): pending_payment state + webhook-driven activation — G
Effort: M. The follow-up to v1.0.6.2's gate-filter compensation.
v1.0.6.2 closed the feature bypass at the consumption site
(GetUserSubscription filters phantom rows out); G replaces the
creation path so no phantom rows get written in the first place.
Migrations:
- `984_subscription_pending_payment_enum.sql` adds `'pending_payment'` to the `user_subscriptions.status` VARCHAR (no enum at DB level, so just a documentation + backfill step for Go constants).
- `985_backfill_hs_subscription_id_from_invoice.sql` (optional): backpopulates `user_subscriptions.hyperswitch_subscription_id` from the attached invoice's PSP intent where available. Documents the join rule for the webhook to reconcile on.
Acceptance:
- `subscription/service.go:createNewSubscription` creates paid-plan rows in `pending_payment` state (never `active`). Transition to `active` only via `ProcessSubscriptionWebhook` on `payment_succeeded`. On `payment_failed`: transition to `expired` and log INFO; no invoice charge, no phantom DB row.
- `if s.paymentProvider != nil` short-circuit deleted: paid plans without a payment provider configured return `503 payment provider not configured` (env misconfig is an ops issue, not a silently-free subscription).
- `GET /me/subscription` handles `pending_payment` explicitly: returns `status: pending_payment`, `client_secret` echoed back so the frontend can resume a stalled flow.
- Recovery endpoint `POST /api/v1/subscriptions/complete/:id` (or reuse the existing Subscribe response's `client_secret`) that the frontend can route the user to when the distribution handler returns the "complete payment" message. Without a real endpoint, the v1.0.6.2 error message is a dead end for users who landed in phantom state via a broken flow (no payment method saved, network error mid-confirmation, etc.). Document the target route the frontend should redirect to in the handler response payload.
- `distribution.checkEligibility` treats `pending_payment` as ineligible (same as the v1.0.6.2 `ErrSubscriptionNoPayment` path).
- Remove the TODO(v1.0.7-item-G) annotation; remove the v1.0.6.2 filter from `GetUserSubscription` (redundant once the creation path is correct), OR keep the filter as defence-in-depth and note it in the code comment.
- Webhook dispatcher: new `subscription` event family. Reuse the `webhook_raw_payloads` table from item E for persistence.
- Tests:
  - Subscribe to paid plan with Hyperswitch enabled → `pending_payment` row + PSP intent, no feature access yet.
  - Webhook `subscription.payment_succeeded` → row transitions to `active`, feature access granted.
  - Webhook `subscription.payment_failed` → row transitions to `expired`, no charge, no access.
  - Webhook replay (same payment_id) is idempotent.
  - Subscribe with provider misconfigured → 503, no row created.
  - Migration of v1.0.6.2 voided rows: check `voided_subscriptions_20260417` entries stay readable and not re-pickable by the new flow.
  - Idempotency-Key threading (inherited from item D): the new Hyperswitch-backed subscription payment provider MUST accept an explicit `idempotencyKey` parameter and send it as the `Idempotency-Key` HTTP header, using `subscription.ID` (UUID) as the key. An empty-key loud-fail test is required: same pattern as D's `TestClient_CreatePayment_RejectsEmptyIdempotencyKey`, literally copy-paste the 4-line test skeleton with `CreateSubscriptionPayment` substituted for `CreateRefund`. Without this check, item G silently reopens the HTTP-retry duplicate-charge exposure that D closed.
  - E2E Playwright @critical: `POST /subscribe` followed by `POST /distribution/submit` asserts 403 with the "complete payment" message until the payment webhook fires. Today's regression coverage is the shell probe + Go unit tests, and neither runs on every commit. Wiring a Playwright @critical test turns the probe into a gate so a refactor of `Subscribe` or `checkEligibility` cannot silently re-open the bypass.
Independent of A/B/C/D/E/F. Can land at any point after D.
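The webhook-driven activation in item G is a three-state machine, and a sketch of it shows why replayed webhooks are harmless. The function names `nextStatus` and `eligible` are hypothetical; the status values and event names are the ones used in the acceptance list.

```go
package main

import "fmt"

const (
	statusPendingPayment = "pending_payment"
	statusActive         = "active"
	statusExpired        = "expired"

	eventPaymentSucceeded = "subscription.payment_succeeded"
	eventPaymentFailed    = "subscription.payment_failed"
)

// nextStatus returns the status after applying a webhook event and whether
// the row changed. Only pending_payment rows can move, so a replayed or
// out-of-order delivery is a no-op.
func nextStatus(current, event string) (string, bool) {
	if current != statusPendingPayment {
		return current, false
	}
	switch event {
	case eventPaymentSucceeded:
		return statusActive, true
	case eventPaymentFailed:
		return statusExpired, true
	default:
		return current, false
	}
}

// eligible reports whether a status grants feature access;
// pending_payment is deliberately treated as ineligible.
func eligible(status string) bool {
	return status == statusActive
}

func main() {
	status := statusPendingPayment
	fmt.Println(status, eligible(status))
	status, _ = nextStatus(status, eventPaymentSucceeded)
	fmt.Println(status, eligible(status))
	_, changed := nextStatus(status, eventPaymentSucceeded) // replay
	fmt.Println("replay changed row:", changed)
}
```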
Release gating
Before tagging v1.0.7:
- All seven items landed as separate commits.
- Refund smoke test (see `docs/audit-2026-04/smoke/refund.md`, TBD) re-run against sandbox with the reversal path: assert Stripe side shows the reversal as well as the refund.
- Manual reconciliation pass: query the live DB against Hyperswitch dashboard for all `completed` orders in the past week → zero drift. Document the script used (`scripts/reconcile-check.sh`), keep for periodic re-use.
- CHANGELOG entry lists each commit + the criticality it resolves.
Cut v1.0.7 as a minor release (not patch), because the interface
change in A is technically breaking for any external caller of
TransferService (there are none today, but naming it 1.0.7 signals
the API change).
Effort total
Worst-case single-developer sequential: ~12-13 working days (XS + S + M + S + M + S + M = 2h + 1d + 3d + 1d + 3d + 1d + 3d).
With two devs on the parallel tracks: ~7 working days end-to-end — track 1 (A→B, 4d), track 2 (D, E, C→F, 5d), track 3 (G, 3d, can run at any point after D). Track 3 extends the critical path unless a third dev picks it up; with 3 devs it lands in the track-2 window.
Unknowns to resolve before starting
- Volume order of magnitude: informs whether the reconciliation sweep's default interval of 1h is appropriate. If we're processing 1000 orders/day, drop to 15m. Ops confirmation needed.
- Stripe Connect account state at time of v1.0.7 deploy: any pre-v1.0.7 transfers lack `stripe_transfer_id` in DB. The backfill migration in A needs to know how many rows to probe and how many will successfully match. Acceptable ceiling: if > 5% of historical transfers fail to backfill, escalate to manual reconciliation session.
- Hyperswitch sandbox vs prod webhook secret rotation: item D changes outbound request headers only, no impact. Items B+C call Hyperswitch read endpoints, which use the same `HYPERSWITCH_API_KEY`. No secret-rotation concern.