veza/docs/audit-2026-04/v107-plan.md
senke eedaad9f83 refactor(connect): persist stripe_transfer_id on create + retry — v1.0.7 item A
TransferService.CreateTransfer signature changes from (...) error to
(...) (string, error) — the caller now captures the Stripe transfer
identifier and persists it on the SellerTransfer row. Pre-v1.0.7 the
stripe_transfer_id column was declared on the model and table but
never written to, which blocked the reversal worker (v1.0.7 item B)
from identifying which transfer to reverse on refund.

Changes:
  * `TransferService` interface and `StripeConnectService.CreateTransfer`
    both return the Stripe transfer id alongside the error.
  * `processSellerTransfers` (marketplace service) persists the id on
    success before `tx.Create(&st)` so a crash between Stripe ACK and
    DB commit leaves no inconsistency.
  * `TransferRetryWorker.retryOne` persists on retry success — a row
    that failed on first attempt and succeeded via the worker is
    reversal-ready all the same.
  * `admin_transfer_handler.RetryTransfer` (manual retry) persists too.
  * `SellerPayout.ExternalPayoutID` is populated by the Connect payout
    flow (`payout.go`) — the field existed but was never written.
  * Four test mocks updated; two tests assert the id is persisted on
    the happy path, one on the failure path confirms we don't write a
    fake id when the provider errors.

Migration `981_seller_transfers_stripe_reversal_id.sql`:
  * Adds nullable `stripe_reversal_id` column for item B.
  * Partial UNIQUE indexes on both stripe_transfer_id and
    stripe_reversal_id (WHERE IS NOT NULL AND <> ''), mirroring the
    v1.0.6.1 pattern for refunds.hyperswitch_refund_id.
  * Logs a count of historical completed transfers that lack an id —
    these are candidates for the backfill CLI follow-up task.

Backfill for historical rows is a separate follow-up (cmd/tools/
backfill_stripe_transfer_ids, calling Stripe's transfers.List with
Destination + Metadata[order_id]). Pre-v1.0.7 transfers without a
backfilled id cannot be auto-reversed on refund — document in P2.9
admin-recovery when it lands. Acceptable scope per v107-plan.

Migration number bumped 980 → 981 because v1.0.6.2 used 980 for the
unpaid-subscription cleanup; v107-plan updated with the note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 13:08:39 +02:00

16 KiB

v1.0.7 — plan structuré

Derived from axis-1-correctness.md. The v1.0.6 CHANGELOG listed 4 "parked v1.0.7" items; the axis-1 audit added 4 P0 findings. De-duplicated and sequenced below.

The 9 items (was 8, de-duplicated → 6; +1 after the 2026-04-17 Q2 probe)

# From Title Effort
A audit P0.1 Persist stripe_transfer_id in seller_transfers S
B audit P0.2 ≡ CHANGELOG "Stripe Connect reversal" Connect reversal via reversal_pending state + async worker M
C audit P0.3 Reconciliation sweep for stuck orders / refunds M
D audit P0.4 Idempotency-Key on CreatePayment / CreateRefund XS
E audit P1.5 Webhook raw-payload log table + insert S
F audit P1.8 Ledger-health Prometheus metrics + alerts S
G audit P0.12 follow-up (post v1.0.6.2 hotfix) Subscription pending_payment state + webhook-driven activation; replace if s.paymentProvider != nil short-circuit M

Dropped from v1.0.7 scope:

  • Partial refunds (CHANGELOG-parked) — P2 in audit, feature-class, defer to v1.0.8
  • CloudUploadModal single-source-of-truth (CHANGELOG-parked) — P2, out of money-movement scope
  • Sandbox smoke-test documentation — landed de facto in v1.0.6.1 via the partial-UNIQUE hotfix + the smoke harness artefacts

Effort legend — XS ≤ 2h, S ≤ 1 day, M ≤ 3 days, L > 3 days.

Dependency graph

      ┌──────────────────┐
      │  D  Idempotency  │  independent, can land first as quick win
      └──────────────────┘

      ┌──────────────────┐
      │  A  transfer_id  │  prerequisite for B
      └────────┬─────────┘
               │
               ▼
      ┌──────────────────┐
      │  B  reversal     │  uses transfer_id persisted by A;
      │     worker       │  introduces `reversal_pending` status
      └──────────────────┘

      ┌──────────────────┐
      │  E  webhook log  │  independent; prerequisite for P1.6 (disputes)
      └──────────────────┘

      ┌──────────────────┐
      │  C  reconciler   │  independent, but:
      │     sweep        │  metrics (F) track its effectiveness
      └────────┬─────────┘
               │
               ▼
      ┌──────────────────┐
      │  F  metrics      │  needs the state-shape C produces; alerts wire
      │     + alerts     │  to the buckets the reconciler defines
      └──────────────────┘

      ┌──────────────────┐
      │  G  subscription │  independent of A/B/C/D/E/F; shares the
      │     pending_     │  `pending_payment` pattern with B's
      │     payment      │  `reversal_pending`. Builds on v1.0.6.2
      └──────────────────┘  hotfix (which compensates via filter).

Three parallel tracks:

  • Track 1 (reversal correctness) — A → B
  • Track 2 (operational visibility) — D, E, C → F
  • Track 3 (subscription creation path) — G (single item, independent)

Two developers can work in parallel without stepping on each other. A single developer sequences as ordered: D first (XS quick win, earns trust + unblocks "pre-open" checklist), then A→B, then E, then C→F, then G. G can also run in parallel at any point after D — it shares no data-model surface with the other items.

Commit sequence (single-developer path)

Each item lands as its own commit in the existing v1.0.6-style cadence (per-commit tests + CHANGELOG-worthy).

1. fix(hyperswitch): idempotency-key on create-payment and create-refund — D

Effort: XS. Pure header addition. Tests: the 15-case refund suite already exists; add 2 cases verifying the header is set correctly (httptest.Server assertion on r.Header.Get("Idempotency-Key")).

Acceptance:

  • Every outbound POST /payments carries Idempotency-Key: <order.id>.
  • Every outbound POST /refunds carries Idempotency-Key: <refund.id>.
  • No implicit-via-ctx magic: each call site sets the header explicitly, greppable.
  • CHANGELOG entry cross-references P0.4 + its scope note (HTTP retry only, not app-level replay).

TTL caveat — Hyperswitch (like most PSPs) honours Idempotency-Key server-side only for a finite window: 24 h is common, 7 d at the high end. Beyond the TTL, a replayed call with the same key is treated as a new request. The header therefore protects against HTTP-layer retries within a single request cycle, not against long-tail application replay scenarios (for which the application-level idempotency primitives — order.id on payments, the partial UNIQUE on refunds.hyperswitch_refund_id landed in v1.0.6.1 — are the load-bearing guards). Verify the exact TTL against current Hyperswitch docs before landing and note it in the CHANGELOG scope sentence so anyone reading later knows the envelope.

Ship as v1.0.7-alpha-1 for sandbox testing, don't wait for B/C.

2. refactor(connect): persist stripe_transfer_id on create + retry — A

Effort: S. Touches the TransferService interface (minor breaking change — but only internal callers). Migration: 981_seller_transfers_stripe_reversal_id.sql adds stripe_reversal_id nullable column (prepares ground for B). Note — bumped from 980 to 981 because v1.0.6.2 used 980 for the unpaid-subscription cleanup; all subsequent v1.0.7 migration numbers in this plan shift by +1 when they land.

Acceptance:

  • TransferService.CreateTransfer(...) (string, error) — returns the Stripe transfer id.
  • processSellerTransfers persists st.StripeTransferID before tx.Create(&st).
  • TransferRetryWorker.retryOne also persists on retry success.
  • Backfill: one-shot migration query that fills known transfer_ids for past orders by calling Stripe's transfers.List(Destination=..., Metadata[order_id]=...). Acceptable to leave NULL where Stripe has no match (document: "pre-v1.0.7 transfers cannot be reversed automatically; use admin API P2.9").

3. feat(marketplace): async stripe connect reversal worker — B

Effort: M. The big one. Introduces seller_transfers.status = 'reversal_pending' as a new intermediate terminal-avoidance state. Migration: 981_seller_transfers_reversal_pending_enum.sql.

Acceptance:

  • reverseSellerAccounting transitions seller_transfers.status to reversal_pending, sets next_retry_at = NOW(). Refund finalization completes end-to-end — buyer never sees a Stripe-health-dependent refund UX.
  • New StripeReversalWorker in internal/workers/stripe_reversal.go runs every TRANSFER_RETRY_INTERVAL (reuse existing env var), processes reversal_pending rows. Calls transfer.NewReversal(stripeTransferID, &params).
  • On success → status='reversed', persist stripe_reversal_id.
  • On Stripe 404 → status='reversed' + log INFO (treat as "already reversed out-of-band").
  • On other errors → exponential backoff via next_retry_at + retry_count, hit permanently_failed ceiling after N retries (mirror TransferRetryWorker).
  • Tests: mock the stripe SDK via a local httptest.Server (harder than the Hyperswitch mock because stripe-go's HTTP layer is less trivial to re-point, but feasible via stripe.SetBackend). 8 cases:
    • happy path: pending → reversal_pending → reversed + id persisted
    • Stripe 404: reversal_pending → reversed + log
    • Stripe 5xx transient: retry_count increments, backoff set
    • max retries: → permanently_failed
    • concurrent worker (two instances pick same row): lock wins
    • StripeTransferID empty (legacy row): skip + ERROR log + counter
    • reversal idempotency (Stripe dedupes same transferID): no-op
    • worker graceful shutdown mid-reversal (context cancellation)

4. feat(webhooks): persist raw hyperswitch payloads to audit log — E

Effort: S. Migration: 982_hyperswitch_webhook_log.sql. Insert before signature verification (so we capture attack attempts too).

Acceptance:

  • Every webhook landing on /webhooks/hyperswitch produces exactly one row, regardless of signature-valid or processing outcome.
  • processing_result field captures 'ok', 'error: <msg>', or 'skipped'.
  • Retention: a CleanupWebhookLog worker in the same internal/jobs/ package as the orphan-tracks cleaner, daily, deletes rows older than 90 days.
  • Tests: 3 cases (valid signature + processing ok; invalid signature; processing error).

5. feat(workers): hyperswitch reconciliation sweep for stuck pending states — C

Effort: M. New ReconcileHyperswitchWorker in internal/jobs/. Hourly by default (RECONCILE_INTERVAL=1h), but exposed so ops can drop to 5m during incident response.

Acceptance:

  • Orders in pending > 30min: GET /payments/:id, call the same ProcessPaymentWebhook internal dispatcher with a synthesised payload. Idempotent with real webhooks via the existing terminal-state guard.
  • Refunds in pending with non-empty hyperswitch_refund_id > 30min: GET /refunds/:id, same pattern with ProcessRefundWebhook.
  • Refunds in pending with EMPTY hyperswitch_refund_id > 5min: mark failed, roll order back to completed, log ERROR (operator attention needed — something crashed between Phase 1 and Phase 2 of RefundOrder).
  • Tests: happy sync, no-op when everything is terminal, the empty-refund_id auto-fail case.
  • Structured log on every action taken so grep reconcile tells the ops story.

6. feat(metrics): ledger-health gauges + alert rules — F

Effort: S. New file internal/metrics/ledger_health.go with a 60s sampler. Grafana dashboard JSON in config/grafana/ledger.json. Prometheus alert rules in config/alertmanager/ledger.yml.

Acceptance:

  • 5 gauge metrics (listed in P1.8 action).
  • 2 alert rules (stuck_orders > 0 for 10m, orphan_refunds > 0 for 5m).
  • Sampler queries are cheap: SELECT COUNT(*) WHERE ... AND created_at < NOW() - INTERVAL '30 min' per metric, indexed by status + created_at.
  • If indexes are missing: migration 983_ledger_health_indexes.sql.

7. feat(subscription): pending_payment state + webhook-driven activation — G

Effort: M. The follow-up to v1.0.6.2's gate-filter compensation. v1.0.6.2 closed the feature bypass at the consumption site (GetUserSubscription filters fantôme rows out); G replaces the creation path so no fantôme rows get written in the first place.

Migrations:

  • 984_subscription_pending_payment_enum.sql adds 'pending_payment' to the user_subscriptions.status VARCHAR (no enum at DB level, so just a documentation + backfill step for Go constants).
  • 985_backfill_hs_subscription_id_from_invoice.sql (optional) — backpopulates user_subscriptions.hyperswitch_subscription_id from the attached invoice's PSP intent where available. Documents the join rule for the webhook to reconcile on.

Acceptance:

  • subscription/service.go:createNewSubscription creates paid-plan rows in pending_payment state (never active). Transition to active only via ProcessSubscriptionWebhook on payment_succeeded. On payment_failed: transition to expired
    • log INFO; no invoice charge, no DB fantôme.
  • if s.paymentProvider != nil short-circuit deleted — paid plans without a payment provider configured return 503 payment provider not configured (env misconfig is an ops issue, not a silently-free subscription).
  • GET /me/subscription handles pending_payment explicitly — returns status: pending_payment, client_secret echoed back so the frontend can resume a stalled flow.
  • Recovery endpoint POST /api/v1/subscriptions/complete/:id (or reuse the existing Subscribe response's client_secret) that the frontend can route the user to when the distribution handler returns the "complete payment" message. Without a real endpoint, the v1.0.6.2 error message is a dead end for users who landed in fantôme state via a broken flow (no payment method saved, network error mid-confirmation, etc.). Document the target route the frontend should redirect to in the handler response payload.
  • distribution.checkEligibility treats pending_payment as ineligible (same as the v1.0.6.2 ErrSubscriptionNoPayment path).
  • Remove the TODO(v1.0.7-item-G) annotation; remove the v1.0.6.2 filter from GetUserSubscription (redundant once the creation path is correct) — OR keep the filter as defence-in-depth and note it in the code comment.
  • Webhook dispatcher: new subscription event family. Reuse the webhook_raw_payloads table from item E for persistence.
  • Tests:
    • Subscribe to paid plan with Hyperswitch enabled → pending_payment row + PSP intent, no feature access yet.
    • Webhook subscription.payment_succeeded → row transitions to active, feature access granted.
    • Webhook subscription.payment_failed → row transitions to expired, no charge, no access.
    • Webhook replay (same payment_id) is idempotent.
    • Subscribe with provider misconfigured → 503, no row created.
    • Migration of v1.0.6.2 voided rows — check voided_subscriptions_20260417 entries stay readable and not re-pickable by the new flow.
    • E2E Playwright @critical: POST /subscribe followed by POST /distribution/submit asserts 403 with the "complete payment" message until the payment webhook fires. Today's regression coverage is the shell probe + Go unit tests — neither runs on every commit. Wiring a Playwright @critical test turns the probe into a gate so a refactor of Subscribe or checkEligibility cannot silently re-open the bypass.

Independent of A/B/C/D/E/F. Can land at any point after D.

Release gating

Before tagging v1.0.7:

  • All seven items landed as separate commits.
  • Refund smoke test (see docs/audit-2026-04/smoke/refund.md — TBD) re-run against sandbox with the reversal path: assert Stripe side shows the reversal as well as the refund.
  • Manual reconciliation pass: query the live DB against Hyperswitch dashboard for all completed orders in the past week → zero drift. Document the script used (scripts/reconcile-check.sh), keep for periodic re-use.
  • CHANGELOG entry lists each commit + criticity it resolves.

Cut v1.0.7 as a minor release (not patch), because the interface change in A is technically breaking for any external caller of TransferService (there are none today, but naming it 1.0.7 signals the API change).

Effort total

Worst-case single-developer sequential: ~12-13 working days (XS + S + M + S + M + S + M = 2h + 1d + 3d + 1d + 3d + 1d + 3d).

With two devs on the parallel tracks: ~7 working days end-to-end — track 1 (A→B, 4d), track 2 (D, E, C→F, 5d), track 3 (G, 3d, can run at any point after D). Track 3 extends the critical path unless a third dev picks it up; with 3 devs it lands in the track-2 window.

Unknowns to resolve before starting

  1. Volume order of magnitude — informs whether the reconciliation sweep's default interval of 1h is appropriate. If we're processing

    1000 orders/day, drop to 15m. Ops confirmation needed.

  2. Stripe Connect account state at time of v1.0.7 deploy — any pre-v1.0.7 transfers lack stripe_transfer_id in DB. The backfill migration in A needs to know how many rows to probe and how many will successfully match. Acceptable ceiling: if > 5% of historical transfers fail to backfill, escalate to manual reconciliation session.
  3. Hyperswitch sandbox vs prod webhook secret rotation — item D changes outbound request headers only, no impact. Items B+C call Hyperswitch read endpoints, which use the same HYPERSWITCH_API_KEY. No secret-rotation concern.