docs(audit): 2026-04 correctness/accounting findings (axis 1)

Axis 1 of the 5-axis VEZA audit, scoped to money-movement correctness and ledger↔PSP reconciliation. Layout: one file per axis under docs/audit-2026-04/, README index, v107-plan.md derived. P0 findings (block v1.0.7 "ready-to-show" gate): * P0.1 — SellerTransfer.StripeTransferID declared but never populated. stripe_connect_service.CreateTransfer discards the *stripe.Transfer return value (`_, err := transfer.New(params)`), so the column in models.go:237 is dead. Structural blocker for the CHANGELOG-parked v1.0.7 "Stripe Connect reversal" item. * P0.2 — No Stripe Connect reversal on refund.succeeded. Every refund today creates a permanent VEZA↔Stripe ledger gap. Action reworked to decouple via a new `seller_transfers.status = 'reversal_pending'` state + async worker, so Stripe flaps never block buyer-facing refund UX. * P0.3 — No reconciliation sweep for stuck orders / refunds / refund rows with empty hyperswitch_refund_id. Hourly worker recommended, same pattern as v1.0.5 Fix 6 orphan-tracks cleaner. * P0.4 — No Idempotency-Key on outbound Hyperswitch POST /payments and POST /refunds. Action includes an explicit scope note: the header covers HTTP-transport retry only, NOT application-level replay (for which the fix is a state-machine precondition). P1 findings: * P1.5 — Webhook raw payloads not persisted (blocks dispute forensics) * P1.6 — Disputes / chargebacks silently dropped (new, surfaced during review; dispute.* webhooks fall through the default case) * P1.7 — Subscription money-movement not covered by v1.0.6 hardening * P1.8 — No ledger-health Prometheus metrics P2 findings: * P2.9 — No admin API for manual override * P2.10 — Partial refund latent compromise (amount *int64 always nil) wontfix: * wontfix.11 — Per-seller retry interval (re-evaluate at 10× load) Derived deliverable: v107-plan.md sequences the 6 de-duplicated items (4 P0 + 2 P1) with a dependency graph, two parallel tracks, per-commit effort estimates (D→A→B; E→C→F), release gating and open questions (volume magnitude, Connect backfill %). Info needed from ops (tracked in axis-1 doc, not determinable from code): last manual reconciliation date, whether subscriptions are currently sold, current order/refund volume. Axes 2-5 deferred: README.md marks axis 2 (state machines) as gated on v1.0.7 landing first, otherwise the transition matrix captures a v1.0.6.1 snapshot that's immediately stale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 03:21:33 +02:00 · 2026-04-17 03:21:33 +02:00 · 6b345ede9f
commit 6b345ede9f
parent 5e3964b989
3 changed files with 811 additions and 0 deletions
--- a/docs/audit-2026-04/README.md
+++ b/docs/audit-2026-04/README.md
@ -0,0 +1,52 @@
+# VEZA Audit — 2026-04
+
+> **Scope** — VEZA backend (Go) + web (TypeScript).
+> TALAS software (firmware, PCB reverse-engineering pipeline) is **out of scope**
+> and will be audited separately when its phase stabilises.
+>
+> **Source state** — commits up to `a57bb6f78` (v1.0.6.1, 2026-04-17).
+>
+> **Auditor** — Claude Opus 4.7 (1M context).
+
+## Axes
+
+| # | File | Status |
+|---|---|---|
+| 1 | [`axis-1-correctness.md`](./axis-1-correctness.md) — correctness / accounting | ✅ delivered |
+| 2 | `axis-2-state-machines.md` — transition matrix + illegal-transition tests | 🔲 pending v1.0.7 |
+| 3 | `axis-3-security.md` — attack surface (signatures, rate limits, authz, secrets) | 🔲 pending |
+| 4 | `axis-4-tests.md` — coverage vs reality, failure-injection gap | 🔲 pending |
+| 5 | `axis-5-debt.md` — documented debt vs hidden debt (TODO/FIXME inventory) | 🔲 pending |
+
+Axis 2 is gated on v1.0.7 landing first — otherwise the transition matrix
+captures a v1.0.6.1 snapshot that's immediately stale. See
+[`v107-plan.md`](./v107-plan.md) for the sequencing.
+
+## Reading conventions
+
+Every finding cites `file:line` evidence. Structure:
+
+```
+### P{0|1|2}.N — short title
+**Evidence** — concrete cites
+**Consequence** — what breaks today / tomorrow
+**Action** — what to do, with enough detail that an implementer can start
+**Criticity** — P0 / P1 / P2 / wontfix (with justification)
+```
+
+**P0** = fix within v1.0.7 or earlier (ledger diverges today, or a v1.0.7
+commitment is structurally blocked).
+**P1** = v1.0.7 target. Operational visibility / correctness hardening.
+**P2** = v1.0.8+. Nice-to-have.
+**wontfix** = justified non-action.
+
+## Info needed from ops (not determinable from code)
+
+Tracked in [`axis-1-correctness.md`](./axis-1-correctness.md#info-needed-from-ops).
+Absence of answers becomes a finding in its own right.
+
+## Derived deliverables
+
+- [`v107-plan.md`](./v107-plan.md) — sequencing, dependencies and relative
+  effort for the axis-1 P0 findings + the CHANGELOG-parked v1.0.7 items.
+  Read this before picking up v1.0.7 work.
--- a/docs/audit-2026-04/axis-1-correctness.md
+++ b/docs/audit-2026-04/axis-1-correctness.md
@ -0,0 +1,529 @@
+# Axis 1 — Correctness / accounting
+
+> See [`README.md`](./README.md) for scope, auditor, source state and
+> reading conventions.
+
+## Surface summary
+
+| PSP | Package | Endpoints called | Env gate |
+|---|---|---|---|
+| Hyperswitch (payments + refunds) | `internal/services/hyperswitch/` | `POST /payments`, `GET /payments/:id`, `POST /refunds` | `HYPERSWITCH_ENABLED` |
+| Stripe Connect (seller payouts + transfers) | `internal/services/stripe_connect_service.go` | `POST /v1/account_links`, `GET /v1/accounts/:id`, `GET /v1/balance`, `POST /v1/transfers` | `STRIPE_CONNECT_ENABLED` |
+
+| Table | Status column | Terminal states | Non-terminal |
+|---|---|---|---|
+| `orders` | `status` | completed, failed, cancelled, refunded | pending, refund_pending |
+| `refunds` (v1.0.6) | `status` | succeeded, failed | pending |
+| `seller_transfers` | `status` | completed, permanently_failed, reversed | pending, failed |
+| `seller_payouts` | `status` | completed, failed | pending, processing |
+| `licenses` | `revoked_at` (nullable) | NULL (active) / timestamp (revoked) | — |
+| `user_subscriptions` | `status` (enum) | canceled, expired | active, trialing, past_due |
+
+Webhook entry: `POST /api/v1/webhooks/hyperswitch`
+(`routes_webhooks.go:55`), HMAC-SHA512 verified, dispatches to
+`ProcessPaymentWebhook` or `ProcessRefundWebhook` via
+`HyperswitchWebhookPayload.IsRefundEvent()` (`service.go:700`).
+
+Existing reconciliation: **only** the `TransferRetryWorker` for
+`seller_transfers.status='failed'` (`transfer_retry.go:57`), interval
+`TRANSFER_RETRY_INTERVAL` default `5m` (`config.go:399`).
+
+---
+
+## P0.1 — `SellerTransfer.StripeTransferID` declared but never populated
+
+**Evidence**
+- Column declared: `internal/core/marketplace/models.go:237`
+  `StripeTransferID string \`gorm:"size:255" json:"stripe_transfer_id,omitempty"\``
+- Zero write-sites (`grep 'StripeTransferID\s*='` returns 0 hits in production code).
+- Upstream returns the transfer but we discard it:
+  `internal/services/stripe_connect_service.go:205`
+  `_, err := transfer.New(params)` — the `*stripe.Transfer` pointer carrying
+  `.ID` is underscored away.
+- `CreateTransfer` signature returns only `error`
+  (`internal/services/stripe_connect_service.go:179`), so callers physically
+  cannot persist the id even if they wanted to.
+
+**Consequence**
+Direct blocker for the v1.0.6 CHANGELOG-documented "parked v1.0.7 —
+Stripe Connect Transfers:reversal". When anyone starts that work, there
+is no `stripe_transfer_id` in the DB to pass to
+`transfers.CreateReversal(...)`. The only fallbacks would be:
+- Query Stripe with `transfers.List(Destination=account_id, Metadata[order_id]=...)` — fragile, rate-limited, no unique guarantee.
+- Do a full upstream refactor of the transfer worker — larger scope than the TODO implies.
+
+This is exactly the "temporary compromise frozen into permanent" pattern
+the audit charter targets.
+
+**Action (prerequisite for any v1.0.7 reversal work)**
+1. Change `TransferService.CreateTransfer` signature:
+   `(ctx, sellerID, amount, currency, orderID string) (stripeTransferID string, err error)`.
+2. Update `processSellerTransfers` (`service.go:762`) to capture the id
+   and set `st.StripeTransferID = stripeTransferID` before `tx.Create(&st)`.
+3. Update `TransferRetryWorker.retryOne` (`transfer_retry.go:116`) to
+   persist the id on successful retry too.
+
+**Criticity** — P0. Must land before the reversal commit, not with it.
+
+---
+
+## P0.2 — No Stripe Connect reversal on refund.succeeded
+
+**Evidence**
+- `internal/core/marketplace/service.go:1529` `reverseSellerAccounting`
+  does three things on refund webhook succeeded:
+  1. `DebitSellerBalance` (in-DB balance row decrement)
+  2. `seller_transfers.status = "reversed"` (in-DB ledger flag)
+  3. **Nothing else.**
+- The CHANGELOG v1.0.6 entry explicitly flags this: "Stripe Connect
+  Transfers:reversal call flagged TODO(v1.0.7). [...] the missing piece
+  is the money-movement round-trip at Stripe."
+
+**Consequence**
+For every refunded order, the buyer is credited by Hyperswitch (verified
+via webhook) and the VEZA internal seller balance is debited correctly.
+But the Stripe-Connect-side transfer that moved the seller's share into
+their connected account is **not** reversed. The seller's Stripe account
+still holds that money. VEZA's books say "debited", Stripe says "paid
+out". On every refund this creates a real reconciliation gap that a
+seller's monthly Stripe statement will show.
+
+This isn't a ledger-consistency issue in VEZA's own DB — it's a
+VEZA-vs-Stripe divergence. It grows linearly with refund volume.
+
+**Action — decoupled from the buyer-refund critical path**
+The intuitive fix ("call `transfers.NewReversal` inside
+`reverseSellerAccounting` and return on error to let the webhook retry")
+is wrong: Hyperswitch has already credited the buyer, the buyer is
+watching their order page, and a Stripe flap would leave the buyer in
+`refund_pending` until Stripe comes back. That couples buyer UX to the
+weakest dependency in the whole flow.
+
+Correct pattern — split the reversal into two concerns, with an
+intermediate state that says "VEZA books updated, PSP reversal still
+owed":
+
+1. **Migration** — extend `seller_transfers.status` to include a new
+   value `reversal_pending`. Update the state-value documentation
+   comment in `internal/core/marketplace/models.go:242`.
+2. **Refund finalization** (`service.go:1466 finalizeSuccessfulRefund`)
+   — no change to its flow except: instead of transitioning
+   `seller_transfers.status` directly to `reversed` in
+   `reverseSellerAccounting`, transition to `reversal_pending` and set
+   `next_retry_at = NOW()`. The buyer-facing flow finalizes immediately:
+   order=refunded, licenses revoked, webhook returns 200.
+3. **New worker** `StripeReversalWorker` — scans
+   `seller_transfers WHERE status='reversal_pending' AND (next_retry_at IS NULL OR next_retry_at <= NOW())`.
+   For each row:
+     - if `StripeTransferID == ""` (see P0.1), skip + log ERROR "cannot
+       reverse — transfer id missing", alert ops.
+     - else call
+       `transfer.NewReversal(stripeTransferID, &stripe.TransferReversalParams{Amount: amountCents})`.
+     - on success: `status='reversed'`, persist the reversal id on a new
+       `stripe_reversal_id` column.
+     - on Stripe 404 (transfer not found): assume already reversed
+       out-of-band, set `status='reversed'`, log INFO.
+     - on any other error: exponential backoff on `next_retry_at`,
+       increment `retry_count`. Hit the same `permanently_failed` ceiling
+       the TransferRetryWorker uses today.
+
+This gives us three properties:
+- Buyer refund never blocks on Stripe health.
+- VEZA accounting is consistent with Hyperswitch refund state at all times.
+- The VEZA↔Stripe gap is bounded (every reversal either completes or
+  ends in `permanently_failed` with a clear ops action).
+
+**Criticity** — P0. Pre-public-opening of marketplace.
+
+---
+
+## P0.3 — No reconciliation sweep for stuck pending states
+
+**Evidence**
+- Inventory search returned zero scheduled jobs that compare DB state
+  to PSP state for payments or refunds.
+- `TransferRetryWorker` exists (`transfer_retry.go`) but only reconciles
+  `seller_transfers` — not orders, not refunds.
+- Agent inventory: "No startup sanity checks comparing DB to PSP for
+  orders/refunds. No admin endpoints for manual order/refund/transfer
+  verification. No sweeps for stuck orders in `refund_pending` or orders
+  exceeding 14-day deadline."
+
+**Consequence**
+If a Hyperswitch webhook is never delivered — they've been down, our
+`/webhooks/hyperswitch` was down during their retry window, or their
+delivery system has a bug — then:
+- An order stuck in `pending` stays forever. Customer paid, got no
+  license. Only surfaces via customer complaint.
+- A refund stuck in `refund_pending` stays forever. Customer asked for
+  refund, got nothing. Order stuck in `refund_pending` also blocks
+  re-attempts (`ErrRefundAlreadyRequested` at `service.go:1303`).
+- A refund row with empty `hyperswitch_refund_id` (Phase 1 completed,
+  Phase 2 crashed between PSP call and DB UPDATE) has no way to
+  correlate to the PSP. Nothing queries them.
+
+**Action**
+Hourly worker (same pattern as `CleanupOrphanTracks` from v1.0.5 Fix 6):
+
+```go
+// internal/jobs/reconcile_hyperswitch.go
+func ReconcilePendingOrders(ctx context.Context, db *gorm.DB, hs *hyperswitch.Client) error {
+    // Orders in 'pending' older than 30m: GET /payments/:id, sync
+    // Refunds in 'pending' > 30m with non-empty refund_id: GET /refunds/:id, sync
+    // Refunds in 'pending' > 5m with EMPTY refund_id: mark failed, roll back order
+}
+```
+
+Wire in `cmd/api/main.go` next to `ScheduleOrphanTracksCleanup`.
+
+**Criticity** — P0. The gap is open today on every deployed instance.
+
+---
+
+## P0.4 — `CreatePayment` / `CreateRefund` not idempotent at the HTTP layer
+
+**Evidence**
+- `internal/services/hyperswitch/client.go:73-78` and
+  `internal/services/hyperswitch/client.go:170-175` — only
+  `Content-Type` and `api-key` headers are set. No `Idempotency-Key`.
+- Hyperswitch's API supports an `Idempotency-Key` header (standard PSP
+  convention).
+
+**Consequence**
+If the HTTP call from VEZA to Hyperswitch retries at the transport
+layer (curl-level / net/http Transport-level retry, proxy retry, TLS
+reconnect after a flap) during a single logical call, a second payment
+or refund object can be created for the same intent. Customer
+double-charged, or two refunds issued. The order row's
+`hyperswitch_payment_id` column is set to the *first* payment_id we
+hear back about, so the second shadow payment is orphaned in Hyperswitch
+with no VEZA row pointing at it.
+
+**Scope note — what Idempotency-Key does and does NOT fix**
+This header protects against **HTTP-transport retry** only. It does
+**not** fix, and should not be conflated with:
+- **Application-level replay** — buyer clicks "Pay" twice, frontend
+  double-submits a form, VEZA application code retries after a DB
+  transaction aborts. These paths create distinct `order.id` values, so
+  each call sees a different natural key and the PSP legitimately
+  creates different payments. The fix there is a state-machine
+  precondition (order dedup by `(buyer_id, cart_hash)` within a window,
+  frontend debounce, CSRF single-use token), not the header.
+- **Crash-between-call-and-save** — VEZA POST succeeds, Hyperswitch
+  returns payment_id, VEZA crashes before persisting it. A subsequent
+  application retry with the same `order.id` *does* recover thanks to
+  the Idempotency-Key (Hyperswitch returns the cached response), but
+  this only works if the application retry re-uses the same order row.
+  That requires the order row to exist in a `pending_payment_submitted`
+  intermediate state before the PSP call — which the current code
+  doesn't have. Documenting as a follow-up.
+
+**Action**
+Add `Idempotency-Key: <natural key>` header on every mutating call:
+- `CreatePayment` → `order_id` (UUID, guaranteed unique per order row).
+- `CreateRefund` → `refund.ID` (UUID from the pending `refunds` row —
+  populated before the PSP call in Phase 1, so the same key is used
+  across any HTTP-level retry within that call).
+
+The `http.Client` in `hyperswitch.Client` should NOT set this
+implicitly; callers must pass the key explicitly so the semantic is
+auditable at each call site. Two changes, each ~3 lines.
+
+The state-machine-level fix for application retry is out of scope here;
+it goes on the v1.0.8+ roadmap (documented in `v107-plan.md`).
+
+**Criticity** — P0. Low probability per call but unbounded blast radius
+(customer double-charge).
+
+---
+
+## P1.5 — Webhook raw payloads not persisted
+
+**Evidence**
+- `routes_webhooks.go:59-82` reads body, verifies signature, dispatches.
+- No table or log file captures the raw webhook body, signature header,
+  or processing result for later audit.
+
+**Consequence**
+When a customer disputes ("I never got refunded"), we cannot reproduce
+what Hyperswitch told us. The only evidence is in the backend `log` at
+whatever retention the host keeps — and structured logs don't carry the
+raw body (we explicitly redact some fields for security).
+
+**Action**
+New table `hyperswitch_webhook_log`:
+
+```sql
+CREATE TABLE hyperswitch_webhook_log (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    signature_header TEXT NOT NULL,
+    signature_valid BOOLEAN NOT NULL,
+    event_type TEXT,
+    payment_id TEXT,
+    refund_id TEXT,
+    payload BYTEA NOT NULL,
+    processing_result TEXT NOT NULL,  -- 'ok', 'error: <msg>', 'skipped'
+    processed_at TIMESTAMPTZ
+);
+CREATE INDEX ON hyperswitch_webhook_log (payment_id) WHERE payment_id IS NOT NULL;
+CREATE INDEX ON hyperswitch_webhook_log (refund_id) WHERE refund_id IS NOT NULL;
+```
+
+Insert in `routes_webhooks.go:hyperswitchWebhookHandler` **before**
+signature verification — so we capture attack attempts too. Retention
+90 days via a separate sweep job.
+
+**Criticity** — P1. Not ledger-diverging, but the first ops question
+on any refund/dispute claim is "what did the webhook say", and today we
+can't answer it.
+
+---
+
+## P1.6 — Disputes / chargebacks silently dropped
+
+**Evidence**
+- `HyperswitchWebhookPayload.IsRefundEvent()` (`service.go:700`) only
+  distinguishes refund-vs-non-refund. No dispute discriminator.
+- Zero grep hits for `dispute|chargeback|disputed` in
+  `internal/core/marketplace/`, `internal/services/hyperswitch/`, or
+  anywhere in the webhook handler path.
+- Hyperswitch emits `dispute.opened`, `dispute.evidence_required`,
+  `dispute.won`, `dispute.lost`, `dispute.expired` — all land on
+  `POST /webhooks/hyperswitch` and currently fall through
+  `ProcessPaymentWebhook`'s switch to the `default` case
+  (`service.go:718`), which logs Debug "ignoring status" and returns 200.
+
+**Consequence**
+A dispute/chargeback is the case where the ledger diverges most
+violently, because:
+1. It's **not initiated by VEZA** — we learn about it passively when
+   the buyer's bank pushes a reversal through the card network.
+2. On `dispute.lost` (the common case if we don't respond in time),
+   Hyperswitch debits our acquirer balance for the original amount
+   **plus** a chargeback fee (typically 15-25€ depending on acquirer).
+3. The VEZA `orders` row stays `completed`, the license stays active,
+   the seller stays paid. Buyer has their money back via the card
+   network. Every table in VEZA thinks everything is fine. The only
+   signal is the acquirer statement, and it's weeks delayed.
+4. Chargeback rate is a seller health metric: an account with 1% CB
+   rate gets flagged by Visa/MC and the whole VEZA merchant account
+   can be put under review. Without counting disputes per-seller, we
+   have no early-warning system.
+
+**Action (staged)**
+Minimum (ships with P1.5 webhook log): every `dispute.*` event lands in
+`hyperswitch_webhook_log` with the payload captured. Ops can query.
+
+Recommended (v1.0.8 sprint):
+1. New `disputes` table:
+   ```sql
+   CREATE TABLE disputes (
+       id UUID PRIMARY KEY,
+       order_id UUID NOT NULL REFERENCES orders(id),
+       hyperswitch_dispute_id TEXT UNIQUE NOT NULL,
+       status TEXT NOT NULL, -- opened, evidence_required, won, lost, expired
+       amount_cents BIGINT NOT NULL,
+       reason TEXT,
+       evidence_due_at TIMESTAMPTZ,
+       opened_at TIMESTAMPTZ NOT NULL,
+       resolved_at TIMESTAMPTZ,
+       created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+       updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
+   );
+   ```
+2. Extend `IsRefundEvent()` discriminator to a three-way
+   `EventKind()` that returns `payment | refund | dispute`.
+3. New `ProcessDisputeWebhook` — mirrors `ProcessRefundWebhook`'s
+   idempotency pattern (unique `hyperswitch_dispute_id` + row lock).
+4. On `dispute.lost`: same seller-side reversal machinery as refund
+   (P0.2), because the money effect is identical.
+5. Per-seller chargeback counter on `seller_balances` or a new
+   `seller_risk_metrics` table, exposed to admin UI.
+
+**Criticity** — P1 minimum (capture + log). P0 if volume justifies it
+once public-opened. At current volume (pre-opening), accepting the gap
+is defensible; post-opening, a single dispute we miss is worse than
+the engineering cost of handling it.
+
+---
+
+## P1.7 — Subscription money-movement not covered by v1.0.6 hardening
+
+**Evidence**
+- Subscription model has Hyperswitch keys:
+  `subscription/models.go:95` `HyperswitchSubscriptionID`, line 96
+  `HyperswitchCustomerID`, line 136 `HyperswitchPaymentID`.
+- Status enum: `active, trialing, canceled, past_due, expired`
+  (`subscription/models.go:20-28`).
+- `Service.Subscribe` (`subscription/service.go:149`) and
+  `Service.CancelSubscription` (line 390) exist.
+- **Zero refund code in the subscription package.** Cancellation marks
+  the row `canceled` but does not initiate any PSP reimbursement.
+- No visible webhook handler branch for `subscription.*` events — the
+  dispatcher only knows payment vs refund.
+- No code path was found that transitions a subscription INTO
+  `past_due` (which implies it's driven by an external event the
+  webhook isn't consuming, or it's dead state).
+
+**Consequence**
+The marketplace refund hardening only covers one-shot purchases. A
+subscription cancellation inside the paid period currently doesn't
+refund anything, and a failed recurring charge doesn't move the
+subscription to `past_due` because nothing wires it up.
+
+**Action**
+Dedicated sprint for subscription state-machine hardening. Scope for
+v1.0.8. Breakdown:
+1. Inventory every state in `user_subscriptions.status` with evidence
+   of which code path lands there.
+2. Handle `subscription.payment_succeeded`, `subscription.payment_failed`,
+   `subscription.canceled` webhook events in a
+   `ProcessSubscriptionWebhook`.
+3. Decide product policy on mid-period cancellation (pro-rata refund?
+   honour until end of period? forfeit?).
+
+**Criticity** — P1 for the hardening, P0 only if subscriptions are
+currently sold. Status: needs confirmation from ops — not determinable
+from code alone. See [Info needed from ops](#info-needed-from-ops) Q2.
+
+---
+
+## P1.8 — No ledger-health metrics
+
+**Evidence**
+No grep hit for `prometheus.*counter.*order|refund|transfer|payout` in
+`internal/metrics/`. Agent confirms no dashboards.
+
+**Consequence**
+Ledger state is invisible until someone complains. Specifically:
+- Count of orders in `pending` older than 30 min (webhook never arrived)
+- Count of orders in `refund_pending` older than 30 min
+- Count of seller_transfers in `failed` at max `retry_count`
+- Count of refunds with empty `hyperswitch_refund_id` older than 5 min
+- Count of seller_transfers in `reversal_pending` older than 30 min (new state from P0.2)
+- Oldest timestamp in each of the above
+
+**Action**
+Prometheus gauge metrics updated by a 60s sampler (cheap query each):
+```
+ledger_stuck_orders_pending_total{age_bucket="30m_1h|1h_1d|1d+"}
+ledger_stuck_refunds_pending_total{age_bucket=...}
+ledger_failed_transfers_at_max_retry_total
+ledger_orphan_refund_rows_total        // empty hyperswitch_refund_id
+ledger_reversal_pending_transfers_total
+```
+
+Grafana panel with 5 number panels + 5 time-series. Alert rules:
+- `ledger_stuck_orders_pending_total > 0 for 10m` → page
+- `ledger_orphan_refund_rows_total > 0 for 5m` → page (= bug in the
+  two-phase commit between DB and PSP)
+
+**Criticity** — P1. Ships with P0.3 (the reconciliation sweep) so the
+metrics track the sweep's effectiveness.
+
+---
+
+## P2.9 — No admin API for manual refund / order override
+
+**Evidence**
+No admin endpoint found that can force an order or refund into a
+specific terminal state. Ops would have to issue raw SQL in prod.
+
+**Consequence**
+When something genuinely gets stuck (Hyperswitch confirms out-of-band
+that a refund succeeded but their webhook is never going to fire for
+some freak reason), there's no sanctioned path to resolve without raw
+DB access.
+
+**Action**
+`POST /api/v1/admin/orders/:id/force-status` +
+`POST /api/v1/admin/refunds/:id/force-status`, gated by
+`AuthMiddleware.RequireAdminRole`, with an audit-log entry. Body:
+`{ new_status: "...", reason: "..." }`. Validates the target state is
+reachable from the current one per the documented state machine
+(depends on Axis 2).
+
+**Criticity** — P2. Not urgent; raw SQL works until it doesn't.
+
+---
+
+## P2.10 — Partial refund latent compromise
+
+**Evidence**
+`client.go:142` `CreateRefundRequest.Amount *int64 // nil = full refund`.
+`service.go:1330` passes `nil` as amount on every call.
+CHANGELOG v1.0.6: "full-only for v1.0.6 — partial refund UX is deferred
+to v1.0.7."
+
+**Consequence**
+The `amount *int64` parameter is a permanent half-wired compromise. When
+partial refunds land, the implementer must remember:
+- Multiple refunds per order must be allowed (change the double-submit
+  guard at `service.go:1303`).
+- `SellerBalance` debit must be proportional to the partial amount, not
+  full transfer amount.
+- The partial UNIQUE index already tolerates the multi-refund case
+  (v1.0.6.1 migration 979) — but only if each refund has its own
+  PSP-assigned refund_id, which Hyperswitch does.
+- Webhook lookup already uses `hyperswitch_refund_id` so no change.
+
+**Action**
+When partial lands: extensive state-table tests (can the same order
+reach (`status=refunded`, amount < total)? should it be a new
+`partially_refunded` status?). Flag this item in the v1.0.7 plan
+explicitly so it's scoped, not drive-by.
+
+**Criticity** — P2. No active bug; future hazard only.
+
+---
+
+## wontfix.11 — `TransferRetryInterval` not per-seller configurable
+
+`config.go:399` default `5m`. A single global interval works fine at
+current traffic. Re-evaluate at 10× load.
+
+---
+
+## Axis 1 summary
+
+| # | Finding | Criticity | Gate |
+|---|---|---|---|
+| 1 | `SellerTransfer.StripeTransferID` never set | P0 | prerequisite for v1.0.7 reversal |
+| 2 | No Stripe Connect reversal on refund.succeeded | P0 | v1.0.7, decoupled via `reversal_pending` state |
+| 3 | No reconciliation sweep for stuck pending states | P0 | v1.0.7 (hourly job) |
+| 4 | No `Idempotency-Key` on Hyperswitch calls | P0 | v1.0.7 (2 header lines) |
+| 5 | Webhook raw payloads not persisted | P1 | v1.0.7 (migration + insert) |
+| 6 | Disputes / chargebacks silently dropped | P1 | v1.0.8 (min: capture in P1.5 log) |
+| 7 | Subscription money-movement surface | P1 | v1.0.8 sprint |
+| 8 | No ledger-health metrics | P1 | ships with #3 |
+| 9 | No admin API for manual override | P2 | v1.0.8+ |
+| 10 | Partial refund latent compromise | P2 | flag in v1.0.7 plan |
+| 11 | Per-seller retry interval | wontfix | — |
+
+**Blockers for "ready to show"** — 1, 2, 3, 4 must land before the
+marketplace is used by anyone external (grants demo, beta, DIY
+community). After those four, VEZA's accounting is self-consistent
+AND consistent with Stripe+Hyperswitch, which is the real bar.
+
+See [`v107-plan.md`](./v107-plan.md) for sequencing and effort.
+
+---
+
+## Info needed from ops
+
+Not determinable from code. Answers become part of the audit:
+
+1. **Last manual reconciliation between VEZA DB and Stripe/Hyperswitch
+   dashboards.** Date + scope + outcome.
+   _If the answer is "jamais" — that's itself a P1 process finding:
+   no reconciliation cadence established. Absence of documented debt
+   = hidden debt._
+2. **Are subscriptions currently sold** (i.e., is the paid-subscription
+   feature turned on for end-users), or is the code live but the UI
+   hides the entry point? If yes, P1.7 → P0. If hidden-by-UI,
+   additional Axis-3 check needed: is the API reachable despite the
+   UI hiding? Front-flag-only gating is a security finding.
+3. **Current volume order of magnitude** — number of orders/refunds
+   processed per week — so criticity can be re-checked against blast
+   radius.
--- a/docs/audit-2026-04/v107-plan.md
+++ b/docs/audit-2026-04/v107-plan.md
@ -0,0 +1,230 @@
+# v1.0.7 — plan structuré
+
+> Derived from [`axis-1-correctness.md`](./axis-1-correctness.md).
+> The v1.0.6 CHANGELOG listed 4 "parked v1.0.7" items; the axis-1 audit
+> added 4 P0 findings. De-duplicated and sequenced below.
+
+## The 8 items, de-duplicated → 6
+
+| # | From | Title | Effort |
+|---|---|---|---|
+| A | audit P0.1 | Persist `stripe_transfer_id` in `seller_transfers` | **S** |
+| B | audit P0.2 ≡ CHANGELOG "Stripe Connect reversal" | Connect reversal via `reversal_pending` state + async worker | **M** |
+| C | audit P0.3 | Reconciliation sweep for stuck orders / refunds | **M** |
+| D | audit P0.4 | `Idempotency-Key` on `CreatePayment` / `CreateRefund` | **XS** |
+| E | audit P1.5 | Webhook raw-payload log table + insert | **S** |
+| F | audit P1.8 | Ledger-health Prometheus metrics + alerts | **S** |
+
+Dropped from v1.0.7 scope:
+- Partial refunds (CHANGELOG-parked) — P2 in audit, feature-class, defer to v1.0.8
+- `CloudUploadModal` single-source-of-truth (CHANGELOG-parked) — P2, out of money-movement scope
+- Sandbox smoke-test documentation — landed de facto in v1.0.6.1 via the partial-UNIQUE hotfix + the smoke harness artefacts
+
+Effort legend — **XS** ≤ 2h, **S** ≤ 1 day, **M** ≤ 3 days, **L** > 3 days.
+
+## Dependency graph
+
+```
+      ┌──────────────────┐
+      │  D  Idempotency  │  independent, can land first as quick win
+      └──────────────────┘
+
+      ┌──────────────────┐
+      │  A  transfer_id  │  prerequisite for B
+      └────────┬─────────┘
+               │
+               ▼
+      ┌──────────────────┐
+      │  B  reversal     │  uses transfer_id persisted by A;
+      │     worker       │  introduces `reversal_pending` status
+      └──────────────────┘
+
+      ┌──────────────────┐
+      │  E  webhook log  │  independent; prerequisite for P1.6 (disputes)
+      └──────────────────┘
+
+      ┌──────────────────┐
+      │  C  reconciler   │  independent, but:
+      │     sweep        │  metrics (F) track its effectiveness
+      └────────┬─────────┘
+               │
+               ▼
+      ┌──────────────────┐
+      │  F  metrics      │  needs the state-shape C produces; alerts wire
+      │     + alerts     │  to the buckets the reconciler defines
+      └──────────────────┘
+```
+
+**Two parallel tracks:**
+- **Track 1 (reversal correctness)** — A → B
+- **Track 2 (operational visibility)** — D, E, C → F
+
+Two developers can work in parallel without stepping on each other. A
+single developer sequences as ordered: D first (XS quick win, earns
+trust + unblocks "pre-open" checklist), then A→B, then E, then C→F.
+
+## Commit sequence (single-developer path)
+
+Each item lands as its own commit in the existing v1.0.6-style cadence
+(per-commit tests + CHANGELOG-worthy).
+
+### 1. `fix(hyperswitch): idempotency-key on create-payment and create-refund` — D
+
+Effort: **XS**. Pure header addition. Tests: the 15-case refund suite
+already exists; add 2 cases verifying the header is set correctly
+(httptest.Server assertion on `r.Header.Get("Idempotency-Key")`).
+
+Acceptance:
+- Every outbound `POST /payments` carries `Idempotency-Key: <order.id>`.
+- Every outbound `POST /refunds` carries `Idempotency-Key: <refund.id>`.
+- No implicit-via-ctx magic: each call site sets the header explicitly,
+  greppable.
+- CHANGELOG entry cross-references P0.4 + its scope note (HTTP retry
+  only, not app-level replay).
+
+**Ship as v1.0.7-alpha-1 for sandbox testing**, don't wait for B/C.
+
+### 2. `refactor(connect): persist stripe_transfer_id on create + retry` — A
+
+Effort: **S**. Touches the `TransferService` interface (minor breaking
+change — but only internal callers). Migration:
+`980_seller_transfers_stripe_reversal_id.sql` adds `stripe_reversal_id`
+nullable column (prepares ground for B).
+
+Acceptance:
+- `TransferService.CreateTransfer(...) (string, error)` — returns the
+  Stripe transfer id.
+- `processSellerTransfers` persists `st.StripeTransferID` before
+  `tx.Create(&st)`.
+- `TransferRetryWorker.retryOne` also persists on retry success.
+- Backfill: one-shot migration query that fills known transfer_ids
+  for past orders by calling Stripe's
+  `transfers.List(Destination=..., Metadata[order_id]=...)`. Acceptable
+  to leave `NULL` where Stripe has no match (document: "pre-v1.0.7
+  transfers cannot be reversed automatically; use admin API P2.9").
+
+### 3. `feat(marketplace): async stripe connect reversal worker` — B
+
+Effort: **M**. The big one. Introduces `seller_transfers.status =
+'reversal_pending'` as a new intermediate terminal-avoidance state.
+Migration: `981_seller_transfers_reversal_pending_enum.sql`.
+
+Acceptance:
+- `reverseSellerAccounting` transitions `seller_transfers.status` to
+  `reversal_pending`, sets `next_retry_at = NOW()`. Refund
+  finalization completes end-to-end — buyer never sees a
+  Stripe-health-dependent refund UX.
+- New `StripeReversalWorker` in `internal/workers/stripe_reversal.go`
+  runs every `TRANSFER_RETRY_INTERVAL` (reuse existing env var),
+  processes `reversal_pending` rows. Calls
+  `transfer.NewReversal(stripeTransferID, &params)`.
+- On success → `status='reversed'`, persist `stripe_reversal_id`.
+- On Stripe 404 → `status='reversed'` + log INFO (treat as "already
+  reversed out-of-band").
+- On other errors → exponential backoff via `next_retry_at` + `retry_count`,
+  hit `permanently_failed` ceiling after N retries (mirror
+  `TransferRetryWorker`).
+- Tests: mock the stripe SDK via a local httptest.Server (harder than
+  the Hyperswitch mock because stripe-go's HTTP layer is less trivial
+  to re-point, but feasible via `stripe.SetBackend`). 8 cases:
+    - happy path: pending → reversal_pending → reversed + id persisted
+    - Stripe 404: reversal_pending → reversed + log
+    - Stripe 5xx transient: retry_count increments, backoff set
+    - max retries: → permanently_failed
+    - concurrent worker (two instances pick same row): lock wins
+    - StripeTransferID empty (legacy row): skip + ERROR log + counter
+    - reversal idempotency (Stripe dedupes same transferID): no-op
+    - worker graceful shutdown mid-reversal (context cancellation)
+
+### 4. `feat(webhooks): persist raw hyperswitch payloads to audit log` — E
+
+Effort: **S**. Migration: `982_hyperswitch_webhook_log.sql`. Insert
+before signature verification (so we capture attack attempts too).
+
+Acceptance:
+- Every webhook landing on `/webhooks/hyperswitch` produces exactly one
+  row, regardless of signature-valid or processing outcome.
+- `processing_result` field captures `'ok'`, `'error: <msg>'`, or
+  `'skipped'`.
+- Retention: a `CleanupWebhookLog` worker in the same
+  `internal/jobs/` package as the orphan-tracks cleaner, daily, deletes
+  rows older than 90 days.
+- Tests: 3 cases (valid signature + processing ok; invalid signature;
+  processing error).
+
+### 5. `feat(workers): hyperswitch reconciliation sweep for stuck pending states` — C
+
+Effort: **M**. New `ReconcileHyperswitchWorker` in `internal/jobs/`.
+Hourly by default (`RECONCILE_INTERVAL=1h`), but exposed so ops can
+drop to 5m during incident response.
+
+Acceptance:
+- Orders in `pending` > 30min: `GET /payments/:id`, call the same
+  `ProcessPaymentWebhook` internal dispatcher with a synthesised
+  payload. Idempotent with real webhooks via the existing terminal-state
+  guard.
+- Refunds in `pending` with non-empty `hyperswitch_refund_id` > 30min:
+  `GET /refunds/:id`, same pattern with `ProcessRefundWebhook`.
+- Refunds in `pending` with EMPTY `hyperswitch_refund_id` > 5min:
+  mark `failed`, roll order back to `completed`, log ERROR
+  (operator attention needed — something crashed between Phase 1 and
+  Phase 2 of `RefundOrder`).
+- Tests: happy sync, no-op when everything is terminal, the
+  empty-refund_id auto-fail case.
+- Structured log on every action taken so `grep reconcile` tells the
+  ops story.
+
+### 6. `feat(metrics): ledger-health gauges + alert rules` — F
+
+Effort: **S**. New file `internal/metrics/ledger_health.go` with a
+60s sampler. Grafana dashboard JSON in `config/grafana/ledger.json`.
+Prometheus alert rules in `config/alertmanager/ledger.yml`.
+
+Acceptance:
+- 5 gauge metrics (listed in P1.8 action).
+- 2 alert rules (`stuck_orders > 0 for 10m`, `orphan_refunds > 0 for 5m`).
+- Sampler queries are cheap: `SELECT COUNT(*) WHERE ... AND created_at < NOW() - INTERVAL '30 min'` per metric, indexed by `status + created_at`.
+- If indexes are missing: migration `983_ledger_health_indexes.sql`.
+
+## Release gating
+
+**Before tagging v1.0.7:**
+- All six items landed as separate commits.
+- Refund smoke test (see `docs/audit-2026-04/smoke/refund.md` — TBD)
+  re-run against sandbox with the reversal path: assert Stripe side
+  shows the reversal as well as the refund.
+- Manual reconciliation pass: query the live DB against Hyperswitch
+  dashboard for all `completed` orders in the past week → zero drift.
+  Document the script used (`scripts/reconcile-check.sh`), keep for
+  periodic re-use.
+- CHANGELOG entry lists each commit + criticity it resolves.
+
+**Cut v1.0.7 as a minor release** (not patch), because the interface
+change in A is technically breaking for any external caller of
+`TransferService` (there are none today, but naming it 1.0.7 signals
+the API change).
+
+## Effort total
+
+Worst-case single-developer sequential: **~9-10 working days**
+(XS + S + M + S + M + S = 2h + 1d + 3d + 1d + 3d + 1d).
+
+With two devs on the parallel tracks: **~5 working days** end-to-end,
+because track 1 (A→B, 4d) and track 2 (D, E, C→F done serially = 5d)
+overlap.
+
+## Unknowns to resolve before starting
+
+1. **Volume order of magnitude** — informs whether the reconciliation
+   sweep's default interval of 1h is appropriate. If we're processing
+   > 1000 orders/day, drop to 15m. Ops confirmation needed.
+2. **Stripe Connect account state at time of v1.0.7 deploy** — any
+   pre-v1.0.7 transfers lack `stripe_transfer_id` in DB. The backfill
+   migration in A needs to know how many rows to probe and how many
+   will successfully match. Acceptable ceiling: if > 5% of historical
+   transfers fail to backfill, escalate to manual reconciliation
+   session.
+3. **Hyperswitch sandbox vs prod webhook secret rotation** — item D
+   changes outbound request headers only, no impact. Items B+C call
+   Hyperswitch read endpoints, which use the same `HYPERSWITCH_API_KEY`.
+   No secret-rotation concern.