From 6b345ede9fc8a249a29a6c8bbcabcb199f946bb5 Mon Sep 17 00:00:00 2001 From: senke Date: Fri, 17 Apr 2026 03:21:33 +0200 Subject: [PATCH] docs(audit): 2026-04 correctness/accounting findings (axis 1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Axis 1 of the 5-axis VEZA audit, scoped to money-movement correctness and ledger↔PSP reconciliation. Layout: one file per axis under docs/audit-2026-04/, README index, v107-plan.md derived. P0 findings (block v1.0.7 "ready-to-show" gate): * P0.1 — SellerTransfer.StripeTransferID declared but never populated. stripe_connect_service.CreateTransfer discards the *stripe.Transfer return value (`_, err := transfer.New(params)`), so the column in models.go:237 is dead. Structural blocker for the CHANGELOG-parked v1.0.7 "Stripe Connect reversal" item. * P0.2 — No Stripe Connect reversal on refund.succeeded. Every refund today creates a permanent VEZA↔Stripe ledger gap. Action reworked to decouple via a new `seller_transfers.status = 'reversal_pending'` state + async worker, so Stripe flaps never block buyer-facing refund UX. * P0.3 — No reconciliation sweep for stuck orders / refunds / refund rows with empty hyperswitch_refund_id. Hourly worker recommended, same pattern as v1.0.5 Fix 6 orphan-tracks cleaner. * P0.4 — No Idempotency-Key on outbound Hyperswitch POST /payments and POST /refunds. Action includes an explicit scope note: the header covers HTTP-transport retry only, NOT application-level replay (for which the fix is a state-machine precondition). P1 findings: * P1.5 — Webhook raw payloads not persisted (blocks dispute forensics) * P1.6 — Disputes / chargebacks silently dropped (new, surfaced during review; dispute.* webhooks fall through the default case) * P1.7 — Subscription money-movement not covered by v1.0.6 hardening * P1.8 — No ledger-health Prometheus metrics P2 findings: * P2.9 — No admin API for manual override * P2.10 — Partial refund latent compromise (amount *int64 always nil) wontfix: * wontfix.11 — Per-seller retry interval (re-evaluate at 10× load) Derived deliverable: v107-plan.md sequences the 6 de-duplicated items (4 P0 + 2 P1) with a dependency graph, two parallel tracks, per-commit effort estimates (D→A→B; E→C→F), release gating and open questions (volume magnitude, Connect backfill %). Info needed from ops (tracked in axis-1 doc, not determinable from code): last manual reconciliation date, whether subscriptions are currently sold, current order/refund volume. Axes 2-5 deferred: README.md marks axis 2 (state machines) as gated on v1.0.7 landing first, otherwise the transition matrix captures a v1.0.6.1 snapshot that's immediately stale. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/audit-2026-04/README.md | 52 +++ docs/audit-2026-04/axis-1-correctness.md | 529 +++++++++++++++++++++++ docs/audit-2026-04/v107-plan.md | 230 ++++++++++ 3 files changed, 811 insertions(+) create mode 100644 docs/audit-2026-04/README.md create mode 100644 docs/audit-2026-04/axis-1-correctness.md create mode 100644 docs/audit-2026-04/v107-plan.md diff --git a/docs/audit-2026-04/README.md b/docs/audit-2026-04/README.md new file mode 100644 index 000000000..5f832ee86 --- /dev/null +++ b/docs/audit-2026-04/README.md @@ -0,0 +1,52 @@ +# VEZA Audit — 2026-04 + +> **Scope** — VEZA backend (Go) + web (TypeScript). +> TALAS software (firmware, PCB reverse-engineering pipeline) is **out of scope** +> and will be audited separately when its phase stabilises. +> +> **Source state** — commits up to `a57bb6f78` (v1.0.6.1, 2026-04-17). +> +> **Auditor** — Claude Opus 4.7 (1M context). + +## Axes + +| # | File | Status | +|---|---|---| +| 1 | [`axis-1-correctness.md`](./axis-1-correctness.md) — correctness / accounting | ✅ delivered | +| 2 | `axis-2-state-machines.md` — transition matrix + illegal-transition tests | 🔲 pending v1.0.7 | +| 3 | `axis-3-security.md` — attack surface (signatures, rate limits, authz, secrets) | 🔲 pending | +| 4 | `axis-4-tests.md` — coverage vs reality, failure-injection gap | 🔲 pending | +| 5 | `axis-5-debt.md` — documented debt vs hidden debt (TODO/FIXME inventory) | 🔲 pending | + +Axis 2 is gated on v1.0.7 landing first — otherwise the transition matrix +captures a v1.0.6.1 snapshot that's immediately stale. See +[`v107-plan.md`](./v107-plan.md) for the sequencing. + +## Reading conventions + +Every finding cites `file:line` evidence. Structure: + +``` +### P{0|1|2}.N — short title +**Evidence** — concrete cites +**Consequence** — what breaks today / tomorrow +**Action** — what to do, with enough detail that an implementer can start +**Criticity** — P0 / P1 / P2 / wontfix (with justification) +``` + +**P0** = fix within v1.0.7 or earlier (ledger diverges today, or a v1.0.7 +commitment is structurally blocked). +**P1** = v1.0.7 target. Operational visibility / correctness hardening. +**P2** = v1.0.8+. Nice-to-have. +**wontfix** = justified non-action. + +## Info needed from ops (not determinable from code) + +Tracked in [`axis-1-correctness.md`](./axis-1-correctness.md#info-needed-from-ops). +Absence of answers becomes a finding in its own right. + +## Derived deliverables + +- [`v107-plan.md`](./v107-plan.md) — sequencing, dependencies and relative + effort for the axis-1 P0 findings + the CHANGELOG-parked v1.0.7 items. + Read this before picking up v1.0.7 work. diff --git a/docs/audit-2026-04/axis-1-correctness.md b/docs/audit-2026-04/axis-1-correctness.md new file mode 100644 index 000000000..0d0845674 --- /dev/null +++ b/docs/audit-2026-04/axis-1-correctness.md @@ -0,0 +1,529 @@ +# Axis 1 — Correctness / accounting + +> See [`README.md`](./README.md) for scope, auditor, source state and +> reading conventions. + +## Surface summary + +| PSP | Package | Endpoints called | Env gate | +|---|---|---|---| +| Hyperswitch (payments + refunds) | `internal/services/hyperswitch/` | `POST /payments`, `GET /payments/:id`, `POST /refunds` | `HYPERSWITCH_ENABLED` | +| Stripe Connect (seller payouts + transfers) | `internal/services/stripe_connect_service.go` | `POST /v1/account_links`, `GET /v1/accounts/:id`, `GET /v1/balance`, `POST /v1/transfers` | `STRIPE_CONNECT_ENABLED` | + +| Table | Status column | Terminal states | Non-terminal | +|---|---|---|---| +| `orders` | `status` | completed, failed, cancelled, refunded | pending, refund_pending | +| `refunds` (v1.0.6) | `status` | succeeded, failed | pending | +| `seller_transfers` | `status` | completed, permanently_failed, reversed | pending, failed | +| `seller_payouts` | `status` | completed, failed | pending, processing | +| `licenses` | `revoked_at` (nullable) | NULL (active) / timestamp (revoked) | — | +| `user_subscriptions` | `status` (enum) | canceled, expired | active, trialing, past_due | + +Webhook entry: `POST /api/v1/webhooks/hyperswitch` +(`routes_webhooks.go:55`), HMAC-SHA512 verified, dispatches to +`ProcessPaymentWebhook` or `ProcessRefundWebhook` via +`HyperswitchWebhookPayload.IsRefundEvent()` (`service.go:700`). + +Existing reconciliation: **only** the `TransferRetryWorker` for +`seller_transfers.status='failed'` (`transfer_retry.go:57`), interval +`TRANSFER_RETRY_INTERVAL` default `5m` (`config.go:399`). + +--- + +## P0.1 — `SellerTransfer.StripeTransferID` declared but never populated + +**Evidence** +- Column declared: `internal/core/marketplace/models.go:237` + `StripeTransferID string \`gorm:"size:255" json:"stripe_transfer_id,omitempty"\`` +- Zero write-sites (`grep 'StripeTransferID\s*='` returns 0 hits in production code). +- Upstream returns the transfer but we discard it: + `internal/services/stripe_connect_service.go:205` + `_, err := transfer.New(params)` — the `*stripe.Transfer` pointer carrying + `.ID` is underscored away. +- `CreateTransfer` signature returns only `error` + (`internal/services/stripe_connect_service.go:179`), so callers physically + cannot persist the id even if they wanted to. + +**Consequence** +Direct blocker for the v1.0.6 CHANGELOG-documented "parked v1.0.7 — +Stripe Connect Transfers:reversal". When anyone starts that work, there +is no `stripe_transfer_id` in the DB to pass to +`transfers.CreateReversal(...)`. The only fallbacks would be: +- Query Stripe with `transfers.List(Destination=account_id, Metadata[order_id]=...)` — fragile, rate-limited, no unique guarantee. +- Do a full upstream refactor of the transfer worker — larger scope than the TODO implies. + +This is exactly the "temporary compromise frozen into permanent" pattern +the audit charter targets. + +**Action (prerequisite for any v1.0.7 reversal work)** +1. Change `TransferService.CreateTransfer` signature: + `(ctx, sellerID, amount, currency, orderID string) (stripeTransferID string, err error)`. +2. Update `processSellerTransfers` (`service.go:762`) to capture the id + and set `st.StripeTransferID = stripeTransferID` before `tx.Create(&st)`. +3. Update `TransferRetryWorker.retryOne` (`transfer_retry.go:116`) to + persist the id on successful retry too. + +**Criticity** — P0. Must land before the reversal commit, not with it. + +--- + +## P0.2 — No Stripe Connect reversal on refund.succeeded + +**Evidence** +- `internal/core/marketplace/service.go:1529` `reverseSellerAccounting` + does three things on refund webhook succeeded: + 1. `DebitSellerBalance` (in-DB balance row decrement) + 2. `seller_transfers.status = "reversed"` (in-DB ledger flag) + 3. **Nothing else.** +- The CHANGELOG v1.0.6 entry explicitly flags this: "Stripe Connect + Transfers:reversal call flagged TODO(v1.0.7). [...] the missing piece + is the money-movement round-trip at Stripe." + +**Consequence** +For every refunded order, the buyer is credited by Hyperswitch (verified +via webhook) and the VEZA internal seller balance is debited correctly. +But the Stripe-Connect-side transfer that moved the seller's share into +their connected account is **not** reversed. The seller's Stripe account +still holds that money. VEZA's books say "debited", Stripe says "paid +out". On every refund this creates a real reconciliation gap that a +seller's monthly Stripe statement will show. + +This isn't a ledger-consistency issue in VEZA's own DB — it's a +VEZA-vs-Stripe divergence. It grows linearly with refund volume. + +**Action — decoupled from the buyer-refund critical path** +The intuitive fix ("call `transfers.NewReversal` inside +`reverseSellerAccounting` and return on error to let the webhook retry") +is wrong: Hyperswitch has already credited the buyer, the buyer is +watching their order page, and a Stripe flap would leave the buyer in +`refund_pending` until Stripe comes back. That couples buyer UX to the +weakest dependency in the whole flow. + +Correct pattern — split the reversal into two concerns, with an +intermediate state that says "VEZA books updated, PSP reversal still +owed": + +1. **Migration** — extend `seller_transfers.status` to include a new + value `reversal_pending`. Update the state-value documentation + comment in `internal/core/marketplace/models.go:242`. +2. **Refund finalization** (`service.go:1466 finalizeSuccessfulRefund`) + — no change to its flow except: instead of transitioning + `seller_transfers.status` directly to `reversed` in + `reverseSellerAccounting`, transition to `reversal_pending` and set + `next_retry_at = NOW()`. The buyer-facing flow finalizes immediately: + order=refunded, licenses revoked, webhook returns 200. +3. **New worker** `StripeReversalWorker` — scans + `seller_transfers WHERE status='reversal_pending' AND (next_retry_at IS NULL OR next_retry_at <= NOW())`. + For each row: + - if `StripeTransferID == ""` (see P0.1), skip + log ERROR "cannot + reverse — transfer id missing", alert ops. + - else call + `transfer.NewReversal(stripeTransferID, &stripe.TransferReversalParams{Amount: amountCents})`. + - on success: `status='reversed'`, persist the reversal id on a new + `stripe_reversal_id` column. + - on Stripe 404 (transfer not found): assume already reversed + out-of-band, set `status='reversed'`, log INFO. + - on any other error: exponential backoff on `next_retry_at`, + increment `retry_count`. Hit the same `permanently_failed` ceiling + the TransferRetryWorker uses today. + +This gives us three properties: +- Buyer refund never blocks on Stripe health. +- VEZA accounting is consistent with Hyperswitch refund state at all times. +- The VEZA↔Stripe gap is bounded (every reversal either completes or + ends in `permanently_failed` with a clear ops action). + +**Criticity** — P0. Pre-public-opening of marketplace. + +--- + +## P0.3 — No reconciliation sweep for stuck pending states + +**Evidence** +- Inventory search returned zero scheduled jobs that compare DB state + to PSP state for payments or refunds. +- `TransferRetryWorker` exists (`transfer_retry.go`) but only reconciles + `seller_transfers` — not orders, not refunds. +- Agent inventory: "No startup sanity checks comparing DB to PSP for + orders/refunds. No admin endpoints for manual order/refund/transfer + verification. No sweeps for stuck orders in `refund_pending` or orders + exceeding 14-day deadline." + +**Consequence** +If a Hyperswitch webhook is never delivered — they've been down, our +`/webhooks/hyperswitch` was down during their retry window, or their +delivery system has a bug — then: +- An order stuck in `pending` stays forever. Customer paid, got no + license. Only surfaces via customer complaint. +- A refund stuck in `refund_pending` stays forever. Customer asked for + refund, got nothing. Order stuck in `refund_pending` also blocks + re-attempts (`ErrRefundAlreadyRequested` at `service.go:1303`). +- A refund row with empty `hyperswitch_refund_id` (Phase 1 completed, + Phase 2 crashed between PSP call and DB UPDATE) has no way to + correlate to the PSP. Nothing queries them. + +**Action** +Hourly worker (same pattern as `CleanupOrphanTracks` from v1.0.5 Fix 6): + +```go +// internal/jobs/reconcile_hyperswitch.go +func ReconcilePendingOrders(ctx context.Context, db *gorm.DB, hs *hyperswitch.Client) error { + // Orders in 'pending' older than 30m: GET /payments/:id, sync + // Refunds in 'pending' > 30m with non-empty refund_id: GET /refunds/:id, sync + // Refunds in 'pending' > 5m with EMPTY refund_id: mark failed, roll back order +} +``` + +Wire in `cmd/api/main.go` next to `ScheduleOrphanTracksCleanup`. + +**Criticity** — P0. The gap is open today on every deployed instance. + +--- + +## P0.4 — `CreatePayment` / `CreateRefund` not idempotent at the HTTP layer + +**Evidence** +- `internal/services/hyperswitch/client.go:73-78` and + `internal/services/hyperswitch/client.go:170-175` — only + `Content-Type` and `api-key` headers are set. No `Idempotency-Key`. +- Hyperswitch's API supports an `Idempotency-Key` header (standard PSP + convention). + +**Consequence** +If the HTTP call from VEZA to Hyperswitch retries at the transport +layer (curl-level / net/http Transport-level retry, proxy retry, TLS +reconnect after a flap) during a single logical call, a second payment +or refund object can be created for the same intent. Customer +double-charged, or two refunds issued. The order row's +`hyperswitch_payment_id` column is set to the *first* payment_id we +hear back about, so the second shadow payment is orphaned in Hyperswitch +with no VEZA row pointing at it. + +**Scope note — what Idempotency-Key does and does NOT fix** +This header protects against **HTTP-transport retry** only. It does +**not** fix, and should not be conflated with: +- **Application-level replay** — buyer clicks "Pay" twice, frontend + double-submits a form, VEZA application code retries after a DB + transaction aborts. These paths create distinct `order.id` values, so + each call sees a different natural key and the PSP legitimately + creates different payments. The fix there is a state-machine + precondition (order dedup by `(buyer_id, cart_hash)` within a window, + frontend debounce, CSRF single-use token), not the header. +- **Crash-between-call-and-save** — VEZA POST succeeds, Hyperswitch + returns payment_id, VEZA crashes before persisting it. A subsequent + application retry with the same `order.id` *does* recover thanks to + the Idempotency-Key (Hyperswitch returns the cached response), but + this only works if the application retry re-uses the same order row. + That requires the order row to exist in a `pending_payment_submitted` + intermediate state before the PSP call — which the current code + doesn't have. Documenting as a follow-up. + +**Action** +Add `Idempotency-Key: ` header on every mutating call: +- `CreatePayment` → `order_id` (UUID, guaranteed unique per order row). +- `CreateRefund` → `refund.ID` (UUID from the pending `refunds` row — + populated before the PSP call in Phase 1, so the same key is used + across any HTTP-level retry within that call). + +The `http.Client` in `hyperswitch.Client` should NOT set this +implicitly; callers must pass the key explicitly so the semantic is +auditable at each call site. Two changes, each ~3 lines. + +The state-machine-level fix for application retry is out of scope here; +it goes on the v1.0.8+ roadmap (documented in `v107-plan.md`). + +**Criticity** — P0. Low probability per call but unbounded blast radius +(customer double-charge). + +--- + +## P1.5 — Webhook raw payloads not persisted + +**Evidence** +- `routes_webhooks.go:59-82` reads body, verifies signature, dispatches. +- No table or log file captures the raw webhook body, signature header, + or processing result for later audit. + +**Consequence** +When a customer disputes ("I never got refunded"), we cannot reproduce +what Hyperswitch told us. The only evidence is in the backend `log` at +whatever retention the host keeps — and structured logs don't carry the +raw body (we explicitly redact some fields for security). + +**Action** +New table `hyperswitch_webhook_log`: + +```sql +CREATE TABLE hyperswitch_webhook_log ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + signature_header TEXT NOT NULL, + signature_valid BOOLEAN NOT NULL, + event_type TEXT, + payment_id TEXT, + refund_id TEXT, + payload BYTEA NOT NULL, + processing_result TEXT NOT NULL, -- 'ok', 'error: ', 'skipped' + processed_at TIMESTAMPTZ +); +CREATE INDEX ON hyperswitch_webhook_log (payment_id) WHERE payment_id IS NOT NULL; +CREATE INDEX ON hyperswitch_webhook_log (refund_id) WHERE refund_id IS NOT NULL; +``` + +Insert in `routes_webhooks.go:hyperswitchWebhookHandler` **before** +signature verification — so we capture attack attempts too. Retention +90 days via a separate sweep job. + +**Criticity** — P1. Not ledger-diverging, but the first ops question +on any refund/dispute claim is "what did the webhook say", and today we +can't answer it. + +--- + +## P1.6 — Disputes / chargebacks silently dropped + +**Evidence** +- `HyperswitchWebhookPayload.IsRefundEvent()` (`service.go:700`) only + distinguishes refund-vs-non-refund. No dispute discriminator. +- Zero grep hits for `dispute|chargeback|disputed` in + `internal/core/marketplace/`, `internal/services/hyperswitch/`, or + anywhere in the webhook handler path. +- Hyperswitch emits `dispute.opened`, `dispute.evidence_required`, + `dispute.won`, `dispute.lost`, `dispute.expired` — all land on + `POST /webhooks/hyperswitch` and currently fall through + `ProcessPaymentWebhook`'s switch to the `default` case + (`service.go:718`), which logs Debug "ignoring status" and returns 200. + +**Consequence** +A dispute/chargeback is the case where the ledger diverges most +violently, because: +1. It's **not initiated by VEZA** — we learn about it passively when + the buyer's bank pushes a reversal through the card network. +2. On `dispute.lost` (the common case if we don't respond in time), + Hyperswitch debits our acquirer balance for the original amount + **plus** a chargeback fee (typically 15-25€ depending on acquirer). +3. The VEZA `orders` row stays `completed`, the license stays active, + the seller stays paid. Buyer has their money back via the card + network. Every table in VEZA thinks everything is fine. The only + signal is the acquirer statement, and it's weeks delayed. +4. Chargeback rate is a seller health metric: an account with 1% CB + rate gets flagged by Visa/MC and the whole VEZA merchant account + can be put under review. Without counting disputes per-seller, we + have no early-warning system. + +**Action (staged)** +Minimum (ships with P1.5 webhook log): every `dispute.*` event lands in +`hyperswitch_webhook_log` with the payload captured. Ops can query. + +Recommended (v1.0.8 sprint): +1. New `disputes` table: + ```sql + CREATE TABLE disputes ( + id UUID PRIMARY KEY, + order_id UUID NOT NULL REFERENCES orders(id), + hyperswitch_dispute_id TEXT UNIQUE NOT NULL, + status TEXT NOT NULL, -- opened, evidence_required, won, lost, expired + amount_cents BIGINT NOT NULL, + reason TEXT, + evidence_due_at TIMESTAMPTZ, + opened_at TIMESTAMPTZ NOT NULL, + resolved_at TIMESTAMPTZ, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() + ); + ``` +2. Extend `IsRefundEvent()` discriminator to a three-way + `EventKind()` that returns `payment | refund | dispute`. +3. New `ProcessDisputeWebhook` — mirrors `ProcessRefundWebhook`'s + idempotency pattern (unique `hyperswitch_dispute_id` + row lock). +4. On `dispute.lost`: same seller-side reversal machinery as refund + (P0.2), because the money effect is identical. +5. Per-seller chargeback counter on `seller_balances` or a new + `seller_risk_metrics` table, exposed to admin UI. + +**Criticity** — P1 minimum (capture + log). P0 if volume justifies it +once public-opened. At current volume (pre-opening), accepting the gap +is defensible; post-opening, a single dispute we miss is worse than +the engineering cost of handling it. + +--- + +## P1.7 — Subscription money-movement not covered by v1.0.6 hardening + +**Evidence** +- Subscription model has Hyperswitch keys: + `subscription/models.go:95` `HyperswitchSubscriptionID`, line 96 + `HyperswitchCustomerID`, line 136 `HyperswitchPaymentID`. +- Status enum: `active, trialing, canceled, past_due, expired` + (`subscription/models.go:20-28`). +- `Service.Subscribe` (`subscription/service.go:149`) and + `Service.CancelSubscription` (line 390) exist. +- **Zero refund code in the subscription package.** Cancellation marks + the row `canceled` but does not initiate any PSP reimbursement. +- No visible webhook handler branch for `subscription.*` events — the + dispatcher only knows payment vs refund. +- No code path was found that transitions a subscription INTO + `past_due` (which implies it's driven by an external event the + webhook isn't consuming, or it's dead state). + +**Consequence** +The marketplace refund hardening only covers one-shot purchases. A +subscription cancellation inside the paid period currently doesn't +refund anything, and a failed recurring charge doesn't move the +subscription to `past_due` because nothing wires it up. + +**Action** +Dedicated sprint for subscription state-machine hardening. Scope for +v1.0.8. Breakdown: +1. Inventory every state in `user_subscriptions.status` with evidence + of which code path lands there. +2. Handle `subscription.payment_succeeded`, `subscription.payment_failed`, + `subscription.canceled` webhook events in a + `ProcessSubscriptionWebhook`. +3. Decide product policy on mid-period cancellation (pro-rata refund? + honour until end of period? forfeit?). + +**Criticity** — P1 for the hardening, P0 only if subscriptions are +currently sold. Status: needs confirmation from ops — not determinable +from code alone. See [Info needed from ops](#info-needed-from-ops) Q2. + +--- + +## P1.8 — No ledger-health metrics + +**Evidence** +No grep hit for `prometheus.*counter.*order|refund|transfer|payout` in +`internal/metrics/`. Agent confirms no dashboards. + +**Consequence** +Ledger state is invisible until someone complains. Specifically: +- Count of orders in `pending` older than 30 min (webhook never arrived) +- Count of orders in `refund_pending` older than 30 min +- Count of seller_transfers in `failed` at max `retry_count` +- Count of refunds with empty `hyperswitch_refund_id` older than 5 min +- Count of seller_transfers in `reversal_pending` older than 30 min (new state from P0.2) +- Oldest timestamp in each of the above + +**Action** +Prometheus gauge metrics updated by a 60s sampler (cheap query each): +``` +ledger_stuck_orders_pending_total{age_bucket="30m_1h|1h_1d|1d+"} +ledger_stuck_refunds_pending_total{age_bucket=...} +ledger_failed_transfers_at_max_retry_total +ledger_orphan_refund_rows_total // empty hyperswitch_refund_id +ledger_reversal_pending_transfers_total +``` + +Grafana panel with 5 number panels + 5 time-series. Alert rules: +- `ledger_stuck_orders_pending_total > 0 for 10m` → page +- `ledger_orphan_refund_rows_total > 0 for 5m` → page (= bug in the + two-phase commit between DB and PSP) + +**Criticity** — P1. Ships with P0.3 (the reconciliation sweep) so the +metrics track the sweep's effectiveness. + +--- + +## P2.9 — No admin API for manual refund / order override + +**Evidence** +No admin endpoint found that can force an order or refund into a +specific terminal state. Ops would have to issue raw SQL in prod. + +**Consequence** +When something genuinely gets stuck (Hyperswitch confirms out-of-band +that a refund succeeded but their webhook is never going to fire for +some freak reason), there's no sanctioned path to resolve without raw +DB access. + +**Action** +`POST /api/v1/admin/orders/:id/force-status` + +`POST /api/v1/admin/refunds/:id/force-status`, gated by +`AuthMiddleware.RequireAdminRole`, with an audit-log entry. Body: +`{ new_status: "...", reason: "..." }`. Validates the target state is +reachable from the current one per the documented state machine +(depends on Axis 2). + +**Criticity** — P2. Not urgent; raw SQL works until it doesn't. + +--- + +## P2.10 — Partial refund latent compromise + +**Evidence** +`client.go:142` `CreateRefundRequest.Amount *int64 // nil = full refund`. +`service.go:1330` passes `nil` as amount on every call. +CHANGELOG v1.0.6: "full-only for v1.0.6 — partial refund UX is deferred +to v1.0.7." + +**Consequence** +The `amount *int64` parameter is a permanent half-wired compromise. When +partial refunds land, the implementer must remember: +- Multiple refunds per order must be allowed (change the double-submit + guard at `service.go:1303`). +- `SellerBalance` debit must be proportional to the partial amount, not + full transfer amount. +- The partial UNIQUE index already tolerates the multi-refund case + (v1.0.6.1 migration 979) — but only if each refund has its own + PSP-assigned refund_id, which Hyperswitch does. +- Webhook lookup already uses `hyperswitch_refund_id` so no change. + +**Action** +When partial lands: extensive state-table tests (can the same order +reach (`status=refunded`, amount < total)? should it be a new +`partially_refunded` status?). Flag this item in the v1.0.7 plan +explicitly so it's scoped, not drive-by. + +**Criticity** — P2. No active bug; future hazard only. + +--- + +## wontfix.11 — `TransferRetryInterval` not per-seller configurable + +`config.go:399` default `5m`. A single global interval works fine at +current traffic. Re-evaluate at 10× load. + +--- + +## Axis 1 summary + +| # | Finding | Criticity | Gate | +|---|---|---|---| +| 1 | `SellerTransfer.StripeTransferID` never set | P0 | prerequisite for v1.0.7 reversal | +| 2 | No Stripe Connect reversal on refund.succeeded | P0 | v1.0.7, decoupled via `reversal_pending` state | +| 3 | No reconciliation sweep for stuck pending states | P0 | v1.0.7 (hourly job) | +| 4 | No `Idempotency-Key` on Hyperswitch calls | P0 | v1.0.7 (2 header lines) | +| 5 | Webhook raw payloads not persisted | P1 | v1.0.7 (migration + insert) | +| 6 | Disputes / chargebacks silently dropped | P1 | v1.0.8 (min: capture in P1.5 log) | +| 7 | Subscription money-movement surface | P1 | v1.0.8 sprint | +| 8 | No ledger-health metrics | P1 | ships with #3 | +| 9 | No admin API for manual override | P2 | v1.0.8+ | +| 10 | Partial refund latent compromise | P2 | flag in v1.0.7 plan | +| 11 | Per-seller retry interval | wontfix | — | + +**Blockers for "ready to show"** — 1, 2, 3, 4 must land before the +marketplace is used by anyone external (grants demo, beta, DIY +community). After those four, VEZA's accounting is self-consistent +AND consistent with Stripe+Hyperswitch, which is the real bar. + +See [`v107-plan.md`](./v107-plan.md) for sequencing and effort. + +--- + +## Info needed from ops + +Not determinable from code. Answers become part of the audit: + +1. **Last manual reconciliation between VEZA DB and Stripe/Hyperswitch + dashboards.** Date + scope + outcome. + _If the answer is "jamais" — that's itself a P1 process finding: + no reconciliation cadence established. Absence of documented debt + = hidden debt._ +2. **Are subscriptions currently sold** (i.e., is the paid-subscription + feature turned on for end-users), or is the code live but the UI + hides the entry point? If yes, P1.7 → P0. If hidden-by-UI, + additional Axis-3 check needed: is the API reachable despite the + UI hiding? Front-flag-only gating is a security finding. +3. **Current volume order of magnitude** — number of orders/refunds + processed per week — so criticity can be re-checked against blast + radius. diff --git a/docs/audit-2026-04/v107-plan.md b/docs/audit-2026-04/v107-plan.md new file mode 100644 index 000000000..51a86d991 --- /dev/null +++ b/docs/audit-2026-04/v107-plan.md @@ -0,0 +1,230 @@ +# v1.0.7 — plan structuré + +> Derived from [`axis-1-correctness.md`](./axis-1-correctness.md). +> The v1.0.6 CHANGELOG listed 4 "parked v1.0.7" items; the axis-1 audit +> added 4 P0 findings. De-duplicated and sequenced below. + +## The 8 items, de-duplicated → 6 + +| # | From | Title | Effort | +|---|---|---|---| +| A | audit P0.1 | Persist `stripe_transfer_id` in `seller_transfers` | **S** | +| B | audit P0.2 ≡ CHANGELOG "Stripe Connect reversal" | Connect reversal via `reversal_pending` state + async worker | **M** | +| C | audit P0.3 | Reconciliation sweep for stuck orders / refunds | **M** | +| D | audit P0.4 | `Idempotency-Key` on `CreatePayment` / `CreateRefund` | **XS** | +| E | audit P1.5 | Webhook raw-payload log table + insert | **S** | +| F | audit P1.8 | Ledger-health Prometheus metrics + alerts | **S** | + +Dropped from v1.0.7 scope: +- Partial refunds (CHANGELOG-parked) — P2 in audit, feature-class, defer to v1.0.8 +- `CloudUploadModal` single-source-of-truth (CHANGELOG-parked) — P2, out of money-movement scope +- Sandbox smoke-test documentation — landed de facto in v1.0.6.1 via the partial-UNIQUE hotfix + the smoke harness artefacts + +Effort legend — **XS** ≤ 2h, **S** ≤ 1 day, **M** ≤ 3 days, **L** > 3 days. + +## Dependency graph + +``` + ┌──────────────────┐ + │ D Idempotency │ independent, can land first as quick win + └──────────────────┘ + + ┌──────────────────┐ + │ A transfer_id │ prerequisite for B + └────────┬─────────┘ + │ + ▼ + ┌──────────────────┐ + │ B reversal │ uses transfer_id persisted by A; + │ worker │ introduces `reversal_pending` status + └──────────────────┘ + + ┌──────────────────┐ + │ E webhook log │ independent; prerequisite for P1.6 (disputes) + └──────────────────┘ + + ┌──────────────────┐ + │ C reconciler │ independent, but: + │ sweep │ metrics (F) track its effectiveness + └────────┬─────────┘ + │ + ▼ + ┌──────────────────┐ + │ F metrics │ needs the state-shape C produces; alerts wire + │ + alerts │ to the buckets the reconciler defines + └──────────────────┘ +``` + +**Two parallel tracks:** +- **Track 1 (reversal correctness)** — A → B +- **Track 2 (operational visibility)** — D, E, C → F + +Two developers can work in parallel without stepping on each other. A +single developer sequences as ordered: D first (XS quick win, earns +trust + unblocks "pre-open" checklist), then A→B, then E, then C→F. + +## Commit sequence (single-developer path) + +Each item lands as its own commit in the existing v1.0.6-style cadence +(per-commit tests + CHANGELOG-worthy). + +### 1. `fix(hyperswitch): idempotency-key on create-payment and create-refund` — D + +Effort: **XS**. Pure header addition. Tests: the 15-case refund suite +already exists; add 2 cases verifying the header is set correctly +(httptest.Server assertion on `r.Header.Get("Idempotency-Key")`). + +Acceptance: +- Every outbound `POST /payments` carries `Idempotency-Key: `. +- Every outbound `POST /refunds` carries `Idempotency-Key: `. +- No implicit-via-ctx magic: each call site sets the header explicitly, + greppable. +- CHANGELOG entry cross-references P0.4 + its scope note (HTTP retry + only, not app-level replay). + +**Ship as v1.0.7-alpha-1 for sandbox testing**, don't wait for B/C. + +### 2. `refactor(connect): persist stripe_transfer_id on create + retry` — A + +Effort: **S**. Touches the `TransferService` interface (minor breaking +change — but only internal callers). Migration: +`980_seller_transfers_stripe_reversal_id.sql` adds `stripe_reversal_id` +nullable column (prepares ground for B). + +Acceptance: +- `TransferService.CreateTransfer(...) (string, error)` — returns the + Stripe transfer id. +- `processSellerTransfers` persists `st.StripeTransferID` before + `tx.Create(&st)`. +- `TransferRetryWorker.retryOne` also persists on retry success. +- Backfill: one-shot migration query that fills known transfer_ids + for past orders by calling Stripe's + `transfers.List(Destination=..., Metadata[order_id]=...)`. Acceptable + to leave `NULL` where Stripe has no match (document: "pre-v1.0.7 + transfers cannot be reversed automatically; use admin API P2.9"). + +### 3. `feat(marketplace): async stripe connect reversal worker` — B + +Effort: **M**. The big one. Introduces `seller_transfers.status = +'reversal_pending'` as a new intermediate terminal-avoidance state. +Migration: `981_seller_transfers_reversal_pending_enum.sql`. + +Acceptance: +- `reverseSellerAccounting` transitions `seller_transfers.status` to + `reversal_pending`, sets `next_retry_at = NOW()`. Refund + finalization completes end-to-end — buyer never sees a + Stripe-health-dependent refund UX. +- New `StripeReversalWorker` in `internal/workers/stripe_reversal.go` + runs every `TRANSFER_RETRY_INTERVAL` (reuse existing env var), + processes `reversal_pending` rows. Calls + `transfer.NewReversal(stripeTransferID, ¶ms)`. +- On success → `status='reversed'`, persist `stripe_reversal_id`. +- On Stripe 404 → `status='reversed'` + log INFO (treat as "already + reversed out-of-band"). +- On other errors → exponential backoff via `next_retry_at` + `retry_count`, + hit `permanently_failed` ceiling after N retries (mirror + `TransferRetryWorker`). +- Tests: mock the stripe SDK via a local httptest.Server (harder than + the Hyperswitch mock because stripe-go's HTTP layer is less trivial + to re-point, but feasible via `stripe.SetBackend`). 8 cases: + - happy path: pending → reversal_pending → reversed + id persisted + - Stripe 404: reversal_pending → reversed + log + - Stripe 5xx transient: retry_count increments, backoff set + - max retries: → permanently_failed + - concurrent worker (two instances pick same row): lock wins + - StripeTransferID empty (legacy row): skip + ERROR log + counter + - reversal idempotency (Stripe dedupes same transferID): no-op + - worker graceful shutdown mid-reversal (context cancellation) + +### 4. `feat(webhooks): persist raw hyperswitch payloads to audit log` — E + +Effort: **S**. Migration: `982_hyperswitch_webhook_log.sql`. Insert +before signature verification (so we capture attack attempts too). + +Acceptance: +- Every webhook landing on `/webhooks/hyperswitch` produces exactly one + row, regardless of signature-valid or processing outcome. +- `processing_result` field captures `'ok'`, `'error: '`, or + `'skipped'`. +- Retention: a `CleanupWebhookLog` worker in the same + `internal/jobs/` package as the orphan-tracks cleaner, daily, deletes + rows older than 90 days. +- Tests: 3 cases (valid signature + processing ok; invalid signature; + processing error). + +### 5. `feat(workers): hyperswitch reconciliation sweep for stuck pending states` — C + +Effort: **M**. New `ReconcileHyperswitchWorker` in `internal/jobs/`. +Hourly by default (`RECONCILE_INTERVAL=1h`), but exposed so ops can +drop to 5m during incident response. + +Acceptance: +- Orders in `pending` > 30min: `GET /payments/:id`, call the same + `ProcessPaymentWebhook` internal dispatcher with a synthesised + payload. Idempotent with real webhooks via the existing terminal-state + guard. +- Refunds in `pending` with non-empty `hyperswitch_refund_id` > 30min: + `GET /refunds/:id`, same pattern with `ProcessRefundWebhook`. +- Refunds in `pending` with EMPTY `hyperswitch_refund_id` > 5min: + mark `failed`, roll order back to `completed`, log ERROR + (operator attention needed — something crashed between Phase 1 and + Phase 2 of `RefundOrder`). +- Tests: happy sync, no-op when everything is terminal, the + empty-refund_id auto-fail case. +- Structured log on every action taken so `grep reconcile` tells the + ops story. + +### 6. `feat(metrics): ledger-health gauges + alert rules` — F + +Effort: **S**. New file `internal/metrics/ledger_health.go` with a +60s sampler. Grafana dashboard JSON in `config/grafana/ledger.json`. +Prometheus alert rules in `config/alertmanager/ledger.yml`. + +Acceptance: +- 5 gauge metrics (listed in P1.8 action). +- 2 alert rules (`stuck_orders > 0 for 10m`, `orphan_refunds > 0 for 5m`). +- Sampler queries are cheap: `SELECT COUNT(*) WHERE ... AND created_at < NOW() - INTERVAL '30 min'` per metric, indexed by `status + created_at`. +- If indexes are missing: migration `983_ledger_health_indexes.sql`. + +## Release gating + +**Before tagging v1.0.7:** +- All six items landed as separate commits. +- Refund smoke test (see `docs/audit-2026-04/smoke/refund.md` — TBD) + re-run against sandbox with the reversal path: assert Stripe side + shows the reversal as well as the refund. +- Manual reconciliation pass: query the live DB against Hyperswitch + dashboard for all `completed` orders in the past week → zero drift. + Document the script used (`scripts/reconcile-check.sh`), keep for + periodic re-use. +- CHANGELOG entry lists each commit + criticity it resolves. + +**Cut v1.0.7 as a minor release** (not patch), because the interface +change in A is technically breaking for any external caller of +`TransferService` (there are none today, but naming it 1.0.7 signals +the API change). + +## Effort total + +Worst-case single-developer sequential: **~9-10 working days** +(XS + S + M + S + M + S = 2h + 1d + 3d + 1d + 3d + 1d). + +With two devs on the parallel tracks: **~5 working days** end-to-end, +because track 1 (A→B, 4d) and track 2 (D, E, C→F done serially = 5d) +overlap. + +## Unknowns to resolve before starting + +1. **Volume order of magnitude** — informs whether the reconciliation + sweep's default interval of 1h is appropriate. If we're processing + > 1000 orders/day, drop to 15m. Ops confirmation needed. +2. **Stripe Connect account state at time of v1.0.7 deploy** — any + pre-v1.0.7 transfers lack `stripe_transfer_id` in DB. The backfill + migration in A needs to know how many rows to probe and how many + will successfully match. Acceptable ceiling: if > 5% of historical + transfers fail to backfill, escalate to manual reconciliation + session. +3. **Hyperswitch sandbox vs prod webhook secret rotation** — item D + changes outbound request headers only, no impact. Items B+C call + Hyperswitch read endpoints, which use the same `HYPERSWITCH_API_KEY`. + No secret-rotation concern.