Every POST /webhooks/hyperswitch delivery now writes a row to
`hyperswitch_webhook_log` regardless of signature-valid or
processing outcome. Captures both legitimate deliveries and attack
probes — a forensics query now has the actual bytes to read, not
just a "webhook rejected" log line. Disputes (axis-1 P1.6) ride
along: the log captures dispute.* events alongside payment and
refund events, ready for when disputes get a handler.
Table shape (migration 984):
* payload TEXT — readable in psql, invalid UTF-8 replaced with
empty (forensics value is in headers + ip + timing for those
attacks, not the binary body).
* signature_valid BOOLEAN + partial index for "show me attack
attempts" being instantaneous.
* processing_result TEXT — 'ok' / 'error: <msg>' /
'signature_invalid' / 'skipped'. Matches the P1.5 action
semantic exactly.
* source_ip, user_agent, request_id — forensics essentials.
request_id is captured from Hyperswitch's X-Request-Id header
when present, else a server-side UUID so every row correlates
to VEZA's structured logs.
* event_type — best-effort extract from the JSON payload, NULL
on malformed input.
Hardening:
* 64KB body cap via io.LimitReader rejects oversize with 413
before any INSERT — prevents log-spam DoS.
* Single INSERT per delivery with final state; no two-phase
update race on signature-failure path. signature_invalid and
processing-error rows both land.
* DB persistence failures are logged but swallowed — the
endpoint's contract is to ack Hyperswitch, not perfect audit.
Retention sweep:
* CleanupHyperswitchWebhookLog in internal/jobs, daily tick,
batched DELETE (10k rows + 100ms pause) so a large backlog
doesn't lock the table.
* HYPERSWITCH_WEBHOOK_LOG_RETENTION_DAYS (default 90).
* Same goroutine-ticker pattern as ScheduleOrphanTracksCleanup.
* Wired in cmd/api/main.go alongside the existing cleanup jobs.
Tests: 5 in webhook_log_test.go (persistence, request_id auto-gen,
invalid-JSON leaves event_type empty, invalid-signature capture,
extractEventType 5 sub-cases) + 4 in cleanup_hyperswitch_webhook_
log_test.go (deletes-older-than, noop, default-on-zero,
context-cancel). Migration 984 applied cleanly to local Postgres;
all indexes present.
Also (v107-plan.md):
* Item G acceptance gains an explicit Idempotency-Key threading
requirement with an empty-key loud-fail test — "literally
copy-paste D's 4-line test skeleton". Closes the risk that
item G silently reopens the HTTP-retry duplicate-charge
exposure D closed.
Out of scope for E (noted in CHANGELOG):
* Rate limit on the endpoint — pre-existing middleware covers
it at the router level; adding a per-endpoint limit is
separate scope.
* Readable-payload SQL view — deferred, the TEXT column is
already human-readable; a convenience view is a nice-to-have
not a ship-blocker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
338 lines
17 KiB
Markdown
338 lines
17 KiB
Markdown
# v1.0.7 — plan structuré
|
|
|
|
> Derived from [`axis-1-correctness.md`](./axis-1-correctness.md).
|
|
> The v1.0.6 CHANGELOG listed 4 "parked v1.0.7" items; the axis-1 audit
|
|
> added 4 P0 findings. De-duplicated and sequenced below.
|
|
|
|
## The 9 items (was 8, de-duplicated → 6; +1 after the 2026-04-17 Q2 probe)
|
|
|
|
| # | From | Title | Effort |
|
|
|---|---|---|---|
|
|
| A | audit P0.1 | Persist `stripe_transfer_id` in `seller_transfers` | **S** |
|
|
| B | audit P0.2 ≡ CHANGELOG "Stripe Connect reversal" | Connect reversal via `reversal_pending` state + async worker | **M** |
|
|
| C | audit P0.3 | Reconciliation sweep for stuck orders / refunds | **M** |
|
|
| D | audit P0.4 | `Idempotency-Key` on `CreatePayment` / `CreateRefund` | **XS** |
|
|
| E | audit P1.5 | Webhook raw-payload log table + insert | **S** |
|
|
| F | audit P1.8 | Ledger-health Prometheus metrics + alerts | **S** |
|
|
| **G** | **audit P0.12 follow-up (post v1.0.6.2 hotfix)** | **Subscription `pending_payment` state + webhook-driven activation; replace `if s.paymentProvider != nil` short-circuit** | **M** |
|
|
|
|
Dropped from v1.0.7 scope:
|
|
- Partial refunds (CHANGELOG-parked) — P2 in audit, feature-class, defer to v1.0.8
|
|
- `CloudUploadModal` single-source-of-truth (CHANGELOG-parked) — P2, out of money-movement scope
|
|
- Sandbox smoke-test documentation — landed de facto in v1.0.6.1 via the partial-UNIQUE hotfix + the smoke harness artefacts
|
|
|
|
Effort legend — **XS** ≤ 2h, **S** ≤ 1 day, **M** ≤ 3 days, **L** > 3 days.
|
|
|
|
## Dependency graph
|
|
|
|
```
|
|
┌──────────────────┐
|
|
│ D Idempotency │ independent, can land first as quick win
|
|
└──────────────────┘
|
|
|
|
┌──────────────────┐
|
|
│ A transfer_id │ prerequisite for B
|
|
└────────┬─────────┘
|
|
│
|
|
▼
|
|
┌──────────────────┐
|
|
│ B reversal │ uses transfer_id persisted by A;
|
|
│ worker │ introduces `reversal_pending` status
|
|
└──────────────────┘
|
|
|
|
┌──────────────────┐
|
|
│ E webhook log │ independent; prerequisite for P1.6 (disputes)
|
|
└──────────────────┘
|
|
|
|
┌──────────────────┐
|
|
│ C reconciler │ independent, but:
|
|
│ sweep │ metrics (F) track its effectiveness
|
|
└────────┬─────────┘
|
|
│
|
|
▼
|
|
┌──────────────────┐
|
|
│ F metrics │ needs the state-shape C produces; alerts wire
|
|
│ + alerts │ to the buckets the reconciler defines
|
|
└──────────────────┘
|
|
|
|
┌──────────────────┐
|
|
│ G subscription │ independent of A/B/C/D/E/F; shares the
|
|
│ pending_ │ `pending_payment` pattern with B's
|
|
│ payment │ `reversal_pending`. Builds on v1.0.6.2
|
|
└──────────────────┘ hotfix (which compensates via filter).
|
|
```
|
|
|
|
**Three parallel tracks:**
|
|
- **Track 1 (reversal correctness)** — A → B
|
|
- **Track 2 (operational visibility)** — D, E, C → F
|
|
- **Track 3 (subscription creation path)** — G (single item, independent)
|
|
|
|
Two developers can work in parallel without stepping on each other. A
|
|
single developer sequences as ordered: D first (XS quick win, earns
|
|
trust + unblocks "pre-open" checklist), then A→B, then E, then C→F,
|
|
then G. G can also run in parallel at any point after D — it shares no
|
|
data-model surface with the other items.
|
|
|
|
## Commit sequence (single-developer path)
|
|
|
|
Each item lands as its own commit in the existing v1.0.6-style cadence
|
|
(per-commit tests + CHANGELOG-worthy).
|
|
|
|
### 1. `fix(hyperswitch): idempotency-key on create-payment and create-refund` — D
|
|
|
|
Effort: **XS**. Pure header addition. Tests: the 15-case refund suite
|
|
already exists; add 2 cases verifying the header is set correctly
|
|
(httptest.Server assertion on `r.Header.Get("Idempotency-Key")`).
|
|
|
|
Acceptance (landed in commit TBD — this entry pinned ahead):
|
|
- Every outbound `POST /payments` carries `Idempotency-Key: <order.id>`.
|
|
- Every outbound `POST /refunds` carries `Idempotency-Key: <refund.id>`.
|
|
- No implicit-via-ctx magic: each call site sets the header explicitly,
|
|
greppable.
|
|
- Empty idempotency key returns an error from the client (loud failure,
|
|
not silent header omission).
|
|
- CHANGELOG entry cross-references P0.4 + its scope note (HTTP retry
|
|
only, not app-level replay).
|
|
|
|
**Status** — landed 2026-04-18 alongside item B day 3 closure.
|
|
Subscription's CreateSubscriptionPayment interface still lacks a live
|
|
Hyperswitch impl (deferred to item G); that's where the remaining
|
|
idempotency-key plumbing goes.
|
|
|
|
**TTL caveat** — Hyperswitch (like most PSPs) honours `Idempotency-Key`
|
|
server-side only for a finite window: 24 h is common, 7 d at the high
|
|
end. Beyond the TTL, a replayed call with the same key is treated as
|
|
a new request. The header therefore protects against HTTP-layer retries
|
|
within a single request cycle, not against long-tail application
|
|
replay scenarios (for which the application-level idempotency primitives
|
|
— order.id on payments, the partial UNIQUE on `refunds.hyperswitch_refund_id`
|
|
landed in v1.0.6.1 — are the load-bearing guards). Verify the exact TTL
|
|
against current Hyperswitch docs before landing and note it in the
|
|
CHANGELOG scope sentence so anyone reading later knows the envelope.
|
|
|
|
**Ship as v1.0.7-alpha-1 for sandbox testing**, don't wait for B/C.
|
|
|
|
### 2. `refactor(connect): persist stripe_transfer_id on create + retry` — A
|
|
|
|
Effort: **S**. Touches the `TransferService` interface (minor breaking
|
|
change — but only internal callers). Migration:
|
|
`981_seller_transfers_stripe_reversal_id.sql` adds `stripe_reversal_id`
|
|
nullable column (prepares ground for B). Note — bumped from 980 to
|
|
981 because v1.0.6.2 used 980 for the unpaid-subscription cleanup;
|
|
all subsequent v1.0.7 migration numbers in this plan shift by +1 when
|
|
they land.
|
|
|
|
Acceptance:
|
|
- `TransferService.CreateTransfer(...) (string, error)` — returns the
|
|
Stripe transfer id.
|
|
- `processSellerTransfers` persists `st.StripeTransferID` before
|
|
`tx.Create(&st)`.
|
|
- `TransferRetryWorker.retryOne` also persists on retry success.
|
|
- Backfill: one-shot migration query that fills known transfer_ids
|
|
for past orders by calling Stripe's
|
|
`transfers.List(Destination=..., Metadata[order_id]=...)`. Acceptable
|
|
to leave `NULL` where Stripe has no match (document: "pre-v1.0.7
|
|
transfers cannot be reversed automatically; use admin API P2.9").
|
|
|
|
### 3. `feat(marketplace): async stripe connect reversal worker` — B
|
|
|
|
Effort: **M**. The big one. Introduces `seller_transfers.status =
|
|
'reversal_pending'` as a new intermediate terminal-avoidance state.
|
|
Migration: `981_seller_transfers_reversal_pending_enum.sql`.
|
|
|
|
Acceptance:
|
|
- `reverseSellerAccounting` transitions `seller_transfers.status` to
|
|
`reversal_pending`, sets `next_retry_at = NOW()`. Refund
|
|
finalization completes end-to-end — buyer never sees a
|
|
Stripe-health-dependent refund UX.
|
|
- New `StripeReversalWorker` in `internal/workers/stripe_reversal.go`
|
|
runs every `TRANSFER_RETRY_INTERVAL` (reuse existing env var),
|
|
processes `reversal_pending` rows. Calls
|
|
`transfer.NewReversal(stripeTransferID, ¶ms)`.
|
|
- On success → `status='reversed'`, persist `stripe_reversal_id`.
|
|
- On Stripe 404 → `status='reversed'` + log INFO (treat as "already
|
|
reversed out-of-band").
|
|
- On other errors → exponential backoff via `next_retry_at` + `retry_count`,
|
|
hit `permanently_failed` ceiling after N retries (mirror
|
|
`TransferRetryWorker`).
|
|
- Tests: mock the stripe SDK via a local httptest.Server (harder than
|
|
the Hyperswitch mock because stripe-go's HTTP layer is less trivial
|
|
to re-point, but feasible via `stripe.SetBackend`). 8 cases:
|
|
- happy path: pending → reversal_pending → reversed + id persisted
|
|
- Stripe 404: reversal_pending → reversed + log
|
|
- Stripe 5xx transient: retry_count increments, backoff set
|
|
- max retries: → permanently_failed
|
|
- concurrent worker (two instances pick same row): lock wins
|
|
- StripeTransferID empty (legacy row): skip + ERROR log + counter
|
|
- reversal idempotency (Stripe dedupes same transferID): no-op
|
|
- worker graceful shutdown mid-reversal (context cancellation)
|
|
|
|
### 4. `feat(webhooks): persist raw hyperswitch payloads to audit log` — E
|
|
|
|
Effort: **S**. Migration: `982_hyperswitch_webhook_log.sql`. Insert
|
|
before signature verification (so we capture attack attempts too).
|
|
|
|
Acceptance:
|
|
- Every webhook landing on `/webhooks/hyperswitch` produces exactly one
|
|
row, regardless of signature-valid or processing outcome.
|
|
- `processing_result` field captures `'ok'`, `'error: <msg>'`, or
|
|
`'skipped'`.
|
|
- Retention: a `CleanupWebhookLog` worker in the same
|
|
`internal/jobs/` package as the orphan-tracks cleaner, daily, deletes
|
|
rows older than 90 days.
|
|
- Tests: 3 cases (valid signature + processing ok; invalid signature;
|
|
processing error).
|
|
|
|
### 5. `feat(workers): hyperswitch reconciliation sweep for stuck pending states` — C
|
|
|
|
Effort: **M**. New `ReconcileHyperswitchWorker` in `internal/jobs/`.
|
|
Hourly by default (`RECONCILE_INTERVAL=1h`), but exposed so ops can
|
|
drop to 5m during incident response.
|
|
|
|
Acceptance:
|
|
- Orders in `pending` > 30min: `GET /payments/:id`, call the same
|
|
`ProcessPaymentWebhook` internal dispatcher with a synthesised
|
|
payload. Idempotent with real webhooks via the existing terminal-state
|
|
guard.
|
|
- Refunds in `pending` with non-empty `hyperswitch_refund_id` > 30min:
|
|
`GET /refunds/:id`, same pattern with `ProcessRefundWebhook`.
|
|
- Refunds in `pending` with EMPTY `hyperswitch_refund_id` > 5min:
|
|
mark `failed`, roll order back to `completed`, log ERROR
|
|
(operator attention needed — something crashed between Phase 1 and
|
|
Phase 2 of `RefundOrder`).
|
|
- Tests: happy sync, no-op when everything is terminal, the
|
|
empty-refund_id auto-fail case.
|
|
- Structured log on every action taken so `grep reconcile` tells the
|
|
ops story.
|
|
|
|
### 6. `feat(metrics): ledger-health gauges + alert rules` — F
|
|
|
|
Effort: **S**. New file `internal/metrics/ledger_health.go` with a
|
|
60s sampler. Grafana dashboard JSON in `config/grafana/ledger.json`.
|
|
Prometheus alert rules in `config/alertmanager/ledger.yml`.
|
|
|
|
Acceptance:
|
|
- 5 gauge metrics (listed in P1.8 action).
|
|
- 2 alert rules (`stuck_orders > 0 for 10m`, `orphan_refunds > 0 for 5m`).
|
|
- Sampler queries are cheap: `SELECT COUNT(*) WHERE ... AND created_at < NOW() - INTERVAL '30 min'` per metric, indexed by `status + created_at`.
|
|
- If indexes are missing: migration `983_ledger_health_indexes.sql`.
|
|
|
|
### 7. `feat(subscription): pending_payment state + webhook-driven activation` — G
|
|
|
|
Effort: **M**. The follow-up to v1.0.6.2's gate-filter compensation.
|
|
v1.0.6.2 closed the feature bypass at the consumption site
|
|
(`GetUserSubscription` filters fantôme rows out); G replaces the
|
|
creation path so no fantôme rows get written in the first place.
|
|
|
|
Migrations:
|
|
- `984_subscription_pending_payment_enum.sql` adds `'pending_payment'`
|
|
to the `user_subscriptions.status` VARCHAR (no enum at DB level, so
|
|
just a documentation + backfill step for Go constants).
|
|
- `985_backfill_hs_subscription_id_from_invoice.sql` (optional) —
|
|
backpopulates `user_subscriptions.hyperswitch_subscription_id` from
|
|
the attached invoice's PSP intent where available. Documents the
|
|
join rule for the webhook to reconcile on.
|
|
|
|
Acceptance:
|
|
- `subscription/service.go:createNewSubscription` creates paid-plan
|
|
rows in `pending_payment` state (never `active`). Transition to
|
|
`active` only via `ProcessSubscriptionWebhook` on
|
|
`payment_succeeded`. On `payment_failed`: transition to `expired`
|
|
+ log INFO; no invoice charge, no DB fantôme.
|
|
- `if s.paymentProvider != nil` short-circuit deleted — paid plans
|
|
without a payment provider configured return `503 payment provider
|
|
not configured` (env misconfig is an ops issue, not a silently-free
|
|
subscription).
|
|
- `GET /me/subscription` handles `pending_payment` explicitly —
|
|
returns `status: pending_payment`, `client_secret` echoed back so
|
|
the frontend can resume a stalled flow.
|
|
- **Recovery endpoint** `POST /api/v1/subscriptions/complete/:id`
|
|
(or reuse the existing Subscribe response's `client_secret`) that
|
|
the frontend can route the user to when the distribution handler
|
|
returns the "complete payment" message. Without a real endpoint,
|
|
the v1.0.6.2 error message is a dead end for users who landed in
|
|
fantôme state via a broken flow (no payment method saved, network
|
|
error mid-confirmation, etc.). Document the target route the
|
|
frontend should redirect to in the handler response payload.
|
|
- `distribution.checkEligibility` treats `pending_payment` as
|
|
ineligible (same as the v1.0.6.2 `ErrSubscriptionNoPayment` path).
|
|
- Remove the TODO(v1.0.7-item-G) annotation; remove the v1.0.6.2
|
|
filter from `GetUserSubscription` (redundant once the creation
|
|
path is correct) — OR keep the filter as defence-in-depth and note
|
|
it in the code comment.
|
|
- Webhook dispatcher: new `subscription` event family. Reuse the
|
|
`webhook_raw_payloads` table from item E for persistence.
|
|
- Tests:
|
|
- Subscribe to paid plan with Hyperswitch enabled →
|
|
`pending_payment` row + PSP intent, no feature access yet.
|
|
- Webhook `subscription.payment_succeeded` → row transitions to
|
|
`active`, feature access granted.
|
|
- Webhook `subscription.payment_failed` → row transitions to
|
|
`expired`, no charge, no access.
|
|
- Webhook replay (same payment_id) is idempotent.
|
|
- Subscribe with provider misconfigured → 503, no row created.
|
|
- Migration of v1.0.6.2 voided rows — check `voided_subscriptions_20260417`
|
|
entries stay readable and not re-pickable by the new flow.
|
|
- **Idempotency-Key threading (inherited from item D)**: the
|
|
new Hyperswitch-backed subscription payment provider MUST
|
|
accept an explicit `idempotencyKey` parameter and send it as
|
|
the `Idempotency-Key` HTTP header, using `subscription.ID`
|
|
(UUID) as the key. An empty-key loud-fail test is required —
|
|
same pattern as D's `TestClient_CreatePayment_RejectsEmpty
|
|
IdempotencyKey`, literally copy-paste the 4-line test
|
|
skeleton with `CreateSubscriptionPayment` substituted for
|
|
`CreateRefund`. Without this check, item G silently reopens
|
|
the HTTP-retry duplicate-charge exposure that D closed.
|
|
- **E2E Playwright @critical**: `POST /subscribe` followed by
|
|
`POST /distribution/submit` asserts 403 with the "complete
|
|
payment" message until the payment webhook fires. Today's
|
|
regression coverage is the shell probe + Go unit tests —
|
|
neither runs on every commit. Wiring a Playwright @critical
|
|
test turns the probe into a gate so a refactor of `Subscribe`
|
|
or `checkEligibility` cannot silently re-open the bypass.
|
|
|
|
Independent of A/B/C/D/E/F. Can land at any point after D.
|
|
|
|
## Release gating
|
|
|
|
**Before tagging v1.0.7:**
|
|
- All seven items landed as separate commits.
|
|
- Refund smoke test (see `docs/audit-2026-04/smoke/refund.md` — TBD)
|
|
re-run against sandbox with the reversal path: assert Stripe side
|
|
shows the reversal as well as the refund.
|
|
- Manual reconciliation pass: query the live DB against Hyperswitch
|
|
dashboard for all `completed` orders in the past week → zero drift.
|
|
Document the script used (`scripts/reconcile-check.sh`), keep for
|
|
periodic re-use.
|
|
- CHANGELOG entry lists each commit + criticity it resolves.
|
|
|
|
**Cut v1.0.7 as a minor release** (not patch), because the interface
|
|
change in A is technically breaking for any external caller of
|
|
`TransferService` (there are none today, but naming it 1.0.7 signals
|
|
the API change).
|
|
|
|
## Effort total
|
|
|
|
Worst-case single-developer sequential: **~12-13 working days**
|
|
(XS + S + M + S + M + S + M = 2h + 1d + 3d + 1d + 3d + 1d + 3d).
|
|
|
|
With two devs on the parallel tracks: **~7 working days** end-to-end
|
|
— track 1 (A→B, 4d), track 2 (D, E, C→F, 5d), track 3 (G, 3d, can
|
|
run at any point after D). Track 3 extends the critical path unless
|
|
a third dev picks it up; with 3 devs it lands in the track-2 window.
|
|
|
|
## Unknowns to resolve before starting
|
|
|
|
1. **Volume order of magnitude** — informs whether the reconciliation
|
|
sweep's default interval of 1h is appropriate. If we're processing
|
|
> 1000 orders/day, drop to 15m. Ops confirmation needed.
|
|
2. **Stripe Connect account state at time of v1.0.7 deploy** — any
|
|
pre-v1.0.7 transfers lack `stripe_transfer_id` in DB. The backfill
|
|
migration in A needs to know how many rows to probe and how many
|
|
will successfully match. Acceptable ceiling: if > 5% of historical
|
|
transfers fail to backfill, escalate to manual reconciliation
|
|
session.
|
|
3. **Hyperswitch sandbox vs prod webhook secret rotation** — item D
|
|
changes outbound request headers only, no impact. Items B+C call
|
|
Hyperswitch read endpoints, which use the same `HYPERSWITCH_API_KEY`.
|
|
No secret-rotation concern.
|