# v1.0.7 — plan structuré > Derived from [`axis-1-correctness.md`](./axis-1-correctness.md). > The v1.0.6 CHANGELOG listed 4 "parked v1.0.7" items; the axis-1 audit > added 4 P0 findings. De-duplicated and sequenced below. ## The 9 items (was 8, de-duplicated → 6; +1 after the 2026-04-17 Q2 probe) | # | From | Title | Effort | |---|---|---|---| | A | audit P0.1 | Persist `stripe_transfer_id` in `seller_transfers` | **S** | | B | audit P0.2 ≡ CHANGELOG "Stripe Connect reversal" | Connect reversal via `reversal_pending` state + async worker | **M** | | C | audit P0.3 | Reconciliation sweep for stuck orders / refunds | **M** | | D | audit P0.4 | `Idempotency-Key` on `CreatePayment` / `CreateRefund` | **XS** | | E | audit P1.5 | Webhook raw-payload log table + insert | **S** | | F | audit P1.8 | Ledger-health Prometheus metrics + alerts | **S** | | **G** | **audit P0.12 follow-up (post v1.0.6.2 hotfix)** | **Subscription `pending_payment` state + webhook-driven activation; replace `if s.paymentProvider != nil` short-circuit** | **M** | Dropped from v1.0.7 scope: - Partial refunds (CHANGELOG-parked) — P2 in audit, feature-class, defer to v1.0.8 - `CloudUploadModal` single-source-of-truth (CHANGELOG-parked) — P2, out of money-movement scope - Sandbox smoke-test documentation — landed de facto in v1.0.6.1 via the partial-UNIQUE hotfix + the smoke harness artefacts Effort legend — **XS** ≤ 2h, **S** ≤ 1 day, **M** ≤ 3 days, **L** > 3 days. ## Dependency graph ``` ┌──────────────────┐ │ D Idempotency │ independent, can land first as quick win └──────────────────┘ ┌──────────────────┐ │ A transfer_id │ prerequisite for B └────────┬─────────┘ │ ▼ ┌──────────────────┐ │ B reversal │ uses transfer_id persisted by A; │ worker │ introduces `reversal_pending` status └──────────────────┘ ┌──────────────────┐ │ E webhook log │ independent; prerequisite for P1.6 (disputes) └──────────────────┘ ┌──────────────────┐ │ C reconciler │ independent, but: │ sweep │ metrics (F) track its effectiveness └────────┬─────────┘ │ ▼ ┌──────────────────┐ │ F metrics │ needs the state-shape C produces; alerts wire │ + alerts │ to the buckets the reconciler defines └──────────────────┘ ┌──────────────────┐ │ G subscription │ independent of A/B/C/D/E/F; shares the │ pending_ │ `pending_payment` pattern with B's │ payment │ `reversal_pending`. Builds on v1.0.6.2 └──────────────────┘ hotfix (which compensates via filter). ``` **Three parallel tracks:** - **Track 1 (reversal correctness)** — A → B - **Track 2 (operational visibility)** — D, E, C → F - **Track 3 (subscription creation path)** — G (single item, independent) Two developers can work in parallel without stepping on each other. A single developer sequences as ordered: D first (XS quick win, earns trust + unblocks "pre-open" checklist), then A→B, then E, then C→F, then G. G can also run in parallel at any point after D — it shares no data-model surface with the other items. ## Commit sequence (single-developer path) Each item lands as its own commit in the existing v1.0.6-style cadence (per-commit tests + CHANGELOG-worthy). ### 1. `fix(hyperswitch): idempotency-key on create-payment and create-refund` — D Effort: **XS**. Pure header addition. Tests: the 15-case refund suite already exists; add 2 cases verifying the header is set correctly (httptest.Server assertion on `r.Header.Get("Idempotency-Key")`). Acceptance (landed in commit TBD — this entry pinned ahead): - Every outbound `POST /payments` carries `Idempotency-Key: `. - Every outbound `POST /refunds` carries `Idempotency-Key: `. - No implicit-via-ctx magic: each call site sets the header explicitly, greppable. - Empty idempotency key returns an error from the client (loud failure, not silent header omission). - CHANGELOG entry cross-references P0.4 + its scope note (HTTP retry only, not app-level replay). **Status** — landed 2026-04-18 alongside item B day 3 closure. Subscription's CreateSubscriptionPayment interface still lacks a live Hyperswitch impl (deferred to item G); that's where the remaining idempotency-key plumbing goes. **TTL caveat** — Hyperswitch (like most PSPs) honours `Idempotency-Key` server-side only for a finite window: 24 h is common, 7 d at the high end. Beyond the TTL, a replayed call with the same key is treated as a new request. The header therefore protects against HTTP-layer retries within a single request cycle, not against long-tail application replay scenarios (for which the application-level idempotency primitives — order.id on payments, the partial UNIQUE on `refunds.hyperswitch_refund_id` landed in v1.0.6.1 — are the load-bearing guards). Verify the exact TTL against current Hyperswitch docs before landing and note it in the CHANGELOG scope sentence so anyone reading later knows the envelope. **Ship as v1.0.7-alpha-1 for sandbox testing**, don't wait for B/C. ### 2. `refactor(connect): persist stripe_transfer_id on create + retry` — A Effort: **S**. Touches the `TransferService` interface (minor breaking change — but only internal callers). Migration: `981_seller_transfers_stripe_reversal_id.sql` adds `stripe_reversal_id` nullable column (prepares ground for B). Note — bumped from 980 to 981 because v1.0.6.2 used 980 for the unpaid-subscription cleanup; all subsequent v1.0.7 migration numbers in this plan shift by +1 when they land. Acceptance: - `TransferService.CreateTransfer(...) (string, error)` — returns the Stripe transfer id. - `processSellerTransfers` persists `st.StripeTransferID` before `tx.Create(&st)`. - `TransferRetryWorker.retryOne` also persists on retry success. - Backfill: one-shot migration query that fills known transfer_ids for past orders by calling Stripe's `transfers.List(Destination=..., Metadata[order_id]=...)`. Acceptable to leave `NULL` where Stripe has no match (document: "pre-v1.0.7 transfers cannot be reversed automatically; use admin API P2.9"). ### 3. `feat(marketplace): async stripe connect reversal worker` — B Effort: **M**. The big one. Introduces `seller_transfers.status = 'reversal_pending'` as a new intermediate terminal-avoidance state. Migration: `981_seller_transfers_reversal_pending_enum.sql`. Acceptance: - `reverseSellerAccounting` transitions `seller_transfers.status` to `reversal_pending`, sets `next_retry_at = NOW()`. Refund finalization completes end-to-end — buyer never sees a Stripe-health-dependent refund UX. - New `StripeReversalWorker` in `internal/workers/stripe_reversal.go` runs every `TRANSFER_RETRY_INTERVAL` (reuse existing env var), processes `reversal_pending` rows. Calls `transfer.NewReversal(stripeTransferID, ¶ms)`. - On success → `status='reversed'`, persist `stripe_reversal_id`. - On Stripe 404 → `status='reversed'` + log INFO (treat as "already reversed out-of-band"). - On other errors → exponential backoff via `next_retry_at` + `retry_count`, hit `permanently_failed` ceiling after N retries (mirror `TransferRetryWorker`). - Tests: mock the stripe SDK via a local httptest.Server (harder than the Hyperswitch mock because stripe-go's HTTP layer is less trivial to re-point, but feasible via `stripe.SetBackend`). 8 cases: - happy path: pending → reversal_pending → reversed + id persisted - Stripe 404: reversal_pending → reversed + log - Stripe 5xx transient: retry_count increments, backoff set - max retries: → permanently_failed - concurrent worker (two instances pick same row): lock wins - StripeTransferID empty (legacy row): skip + ERROR log + counter - reversal idempotency (Stripe dedupes same transferID): no-op - worker graceful shutdown mid-reversal (context cancellation) ### 4. `feat(webhooks): persist raw hyperswitch payloads to audit log` — E Effort: **S**. Migration: `982_hyperswitch_webhook_log.sql`. Insert before signature verification (so we capture attack attempts too). Acceptance: - Every webhook landing on `/webhooks/hyperswitch` produces exactly one row, regardless of signature-valid or processing outcome. - `processing_result` field captures `'ok'`, `'error: '`, or `'skipped'`. - Retention: a `CleanupWebhookLog` worker in the same `internal/jobs/` package as the orphan-tracks cleaner, daily, deletes rows older than 90 days. - Tests: 3 cases (valid signature + processing ok; invalid signature; processing error). ### 5. `feat(workers): hyperswitch reconciliation sweep for stuck pending states` — C Effort: **M**. New `ReconcileHyperswitchWorker` in `internal/jobs/`. Hourly by default (`RECONCILE_INTERVAL=1h`), but exposed so ops can drop to 5m during incident response. Acceptance: - Orders in `pending` > 30min: `GET /payments/:id`, call the same `ProcessPaymentWebhook` internal dispatcher with a synthesised payload. Idempotent with real webhooks via the existing terminal-state guard. - Refunds in `pending` with non-empty `hyperswitch_refund_id` > 30min: `GET /refunds/:id`, same pattern with `ProcessRefundWebhook`. - Refunds in `pending` with EMPTY `hyperswitch_refund_id` > 5min: mark `failed`, roll order back to `completed`, log ERROR (operator attention needed — something crashed between Phase 1 and Phase 2 of `RefundOrder`). - Tests: happy sync, no-op when everything is terminal, the empty-refund_id auto-fail case. - Structured log on every action taken so `grep reconcile` tells the ops story. ### 6. `feat(metrics): ledger-health gauges + alert rules` — F Effort: **S**. New file `internal/metrics/ledger_health.go` with a 60s sampler. Grafana dashboard JSON in `config/grafana/ledger.json`. Prometheus alert rules in `config/alertmanager/ledger.yml`. Acceptance: - 5 gauge metrics (listed in P1.8 action). - 2 alert rules (`stuck_orders > 0 for 10m`, `orphan_refunds > 0 for 5m`). - Sampler queries are cheap: `SELECT COUNT(*) WHERE ... AND created_at < NOW() - INTERVAL '30 min'` per metric, indexed by `status + created_at`. - If indexes are missing: migration `983_ledger_health_indexes.sql`. ### 7. `feat(subscription): pending_payment state + webhook-driven activation` — G Effort: **M**. The follow-up to v1.0.6.2's gate-filter compensation. v1.0.6.2 closed the feature bypass at the consumption site (`GetUserSubscription` filters fantôme rows out); G replaces the creation path so no fantôme rows get written in the first place. Migrations: - `984_subscription_pending_payment_enum.sql` adds `'pending_payment'` to the `user_subscriptions.status` VARCHAR (no enum at DB level, so just a documentation + backfill step for Go constants). - `985_backfill_hs_subscription_id_from_invoice.sql` (optional) — backpopulates `user_subscriptions.hyperswitch_subscription_id` from the attached invoice's PSP intent where available. Documents the join rule for the webhook to reconcile on. Acceptance: - `subscription/service.go:createNewSubscription` creates paid-plan rows in `pending_payment` state (never `active`). Transition to `active` only via `ProcessSubscriptionWebhook` on `payment_succeeded`. On `payment_failed`: transition to `expired` + log INFO; no invoice charge, no DB fantôme. - `if s.paymentProvider != nil` short-circuit deleted — paid plans without a payment provider configured return `503 payment provider not configured` (env misconfig is an ops issue, not a silently-free subscription). - `GET /me/subscription` handles `pending_payment` explicitly — returns `status: pending_payment`, `client_secret` echoed back so the frontend can resume a stalled flow. - **Recovery endpoint** `POST /api/v1/subscriptions/complete/:id` (or reuse the existing Subscribe response's `client_secret`) that the frontend can route the user to when the distribution handler returns the "complete payment" message. Without a real endpoint, the v1.0.6.2 error message is a dead end for users who landed in fantôme state via a broken flow (no payment method saved, network error mid-confirmation, etc.). Document the target route the frontend should redirect to in the handler response payload. - `distribution.checkEligibility` treats `pending_payment` as ineligible (same as the v1.0.6.2 `ErrSubscriptionNoPayment` path). - Remove the TODO(v1.0.7-item-G) annotation; remove the v1.0.6.2 filter from `GetUserSubscription` (redundant once the creation path is correct) — OR keep the filter as defence-in-depth and note it in the code comment. - Webhook dispatcher: new `subscription` event family. Reuse the `webhook_raw_payloads` table from item E for persistence. - Tests: - Subscribe to paid plan with Hyperswitch enabled → `pending_payment` row + PSP intent, no feature access yet. - Webhook `subscription.payment_succeeded` → row transitions to `active`, feature access granted. - Webhook `subscription.payment_failed` → row transitions to `expired`, no charge, no access. - Webhook replay (same payment_id) is idempotent. - Subscribe with provider misconfigured → 503, no row created. - Migration of v1.0.6.2 voided rows — check `voided_subscriptions_20260417` entries stay readable and not re-pickable by the new flow. - **Idempotency-Key threading (inherited from item D)**: the new Hyperswitch-backed subscription payment provider MUST accept an explicit `idempotencyKey` parameter and send it as the `Idempotency-Key` HTTP header, using `subscription.ID` (UUID) as the key. An empty-key loud-fail test is required — same pattern as D's `TestClient_CreatePayment_RejectsEmpty IdempotencyKey`, literally copy-paste the 4-line test skeleton with `CreateSubscriptionPayment` substituted for `CreateRefund`. Without this check, item G silently reopens the HTTP-retry duplicate-charge exposure that D closed. - **E2E Playwright @critical**: `POST /subscribe` followed by `POST /distribution/submit` asserts 403 with the "complete payment" message until the payment webhook fires. Today's regression coverage is the shell probe + Go unit tests — neither runs on every commit. Wiring a Playwright @critical test turns the probe into a gate so a refactor of `Subscribe` or `checkEligibility` cannot silently re-open the bypass. Independent of A/B/C/D/E/F. Can land at any point after D. ## Release gating **Before tagging v1.0.7:** - All seven items landed as separate commits. - Refund smoke test (see `docs/audit-2026-04/smoke/refund.md` — TBD) re-run against sandbox with the reversal path: assert Stripe side shows the reversal as well as the refund. - Manual reconciliation pass: query the live DB against Hyperswitch dashboard for all `completed` orders in the past week → zero drift. Document the script used (`scripts/reconcile-check.sh`), keep for periodic re-use. - CHANGELOG entry lists each commit + criticity it resolves. **Cut v1.0.7 as a minor release** (not patch), because the interface change in A is technically breaking for any external caller of `TransferService` (there are none today, but naming it 1.0.7 signals the API change). ## Effort total Worst-case single-developer sequential: **~12-13 working days** (XS + S + M + S + M + S + M = 2h + 1d + 3d + 1d + 3d + 1d + 3d). With two devs on the parallel tracks: **~7 working days** end-to-end — track 1 (A→B, 4d), track 2 (D, E, C→F, 5d), track 3 (G, 3d, can run at any point after D). Track 3 extends the critical path unless a third dev picks it up; with 3 devs it lands in the track-2 window. ## Unknowns to resolve before starting 1. **Volume order of magnitude** — informs whether the reconciliation sweep's default interval of 1h is appropriate. If we're processing > 1000 orders/day, drop to 15m. Ops confirmation needed. 2. **Stripe Connect account state at time of v1.0.7 deploy** — any pre-v1.0.7 transfers lack `stripe_transfer_id` in DB. The backfill migration in A needs to know how many rows to probe and how many will successfully match. Acceptable ceiling: if > 5% of historical transfers fail to backfill, escalate to manual reconciliation session. 3. **Hyperswitch sandbox vs prod webhook secret rotation** — item D changes outbound request headers only, no impact. Items B+C call Hyperswitch read endpoints, which use the same `HYPERSWITCH_API_KEY`. No secret-rotation concern.