docs(audit): 2026-04 correctness/accounting findings (axis 1)

Axis 1 of the 5-axis VEZA audit, scoped to money-movement correctness
and ledger↔PSP reconciliation. Layout: one file per axis under
docs/audit-2026-04/, README index, v107-plan.md derived.

P0 findings (block v1.0.7 "ready-to-show" gate):
  * P0.1 — SellerTransfer.StripeTransferID declared but never populated.
    stripe_connect_service.CreateTransfer discards the *stripe.Transfer
    return value (`_, err := transfer.New(params)`), so the column in
    models.go:237 is dead. Structural blocker for the CHANGELOG-parked
    v1.0.7 "Stripe Connect reversal" item.
  * P0.2 — No Stripe Connect reversal on refund.succeeded. Every refund
    today creates a permanent VEZA↔Stripe ledger gap. Action reworked
    to decouple via a new `seller_transfers.status = 'reversal_pending'`
    state + async worker, so Stripe flaps never block buyer-facing
    refund UX.
  * P0.3 — No reconciliation sweep for stuck orders / refunds / refund
    rows with empty hyperswitch_refund_id. Hourly worker recommended,
    same pattern as v1.0.5 Fix 6 orphan-tracks cleaner.
  * P0.4 — No Idempotency-Key on outbound Hyperswitch POST /payments and
    POST /refunds. Action includes an explicit scope note: the header
    covers HTTP-transport retry only, NOT application-level replay (for
    which the fix is a state-machine precondition).

P1 findings:
  * P1.5 — Webhook raw payloads not persisted (blocks dispute forensics)
  * P1.6 — Disputes / chargebacks silently dropped (new, surfaced during
    review; dispute.* webhooks fall through the default case)
  * P1.7 — Subscription money-movement not covered by v1.0.6 hardening
  * P1.8 — No ledger-health Prometheus metrics

P2 findings:
  * P2.9 — No admin API for manual override
  * P2.10 — Partial refund latent compromise (amount *int64 always nil)

wontfix:
  * wontfix.11 — Per-seller retry interval (re-evaluate at 10× load)

Derived deliverable: v107-plan.md sequences the 6 de-duplicated items
(4 P0 + 2 P1) with a dependency graph, two parallel tracks, per-commit
effort estimates (D→A→B; E→C→F), release gating and open questions
(volume magnitude, Connect backfill %).

Info needed from ops (tracked in axis-1 doc, not determinable from
code): last manual reconciliation date, whether subscriptions are
currently sold, current order/refund volume.

Axes 2-5 deferred: README.md marks axis 2 (state machines) as gated
on v1.0.7 landing first, otherwise the transition matrix captures a
v1.0.6.1 snapshot that's immediately stale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
senke 2026-04-17 03:21:33 +02:00
parent 5e3964b989
commit 6b345ede9f
3 changed files with 811 additions and 0 deletions

View file

@ -0,0 +1,52 @@
# VEZA Audit — 2026-04
> **Scope** — VEZA backend (Go) + web (TypeScript).
> TALAS software (firmware, PCB reverse-engineering pipeline) is **out of scope**
> and will be audited separately when its phase stabilises.
>
> **Source state** — commits up to `a57bb6f78` (v1.0.6.1, 2026-04-17).
>
> **Auditor** — Claude Opus 4.7 (1M context).
## Axes
| # | File | Status |
|---|---|---|
| 1 | [`axis-1-correctness.md`](./axis-1-correctness.md) — correctness / accounting | ✅ delivered |
| 2 | `axis-2-state-machines.md` — transition matrix + illegal-transition tests | 🔲 pending v1.0.7 |
| 3 | `axis-3-security.md` — attack surface (signatures, rate limits, authz, secrets) | 🔲 pending |
| 4 | `axis-4-tests.md` — coverage vs reality, failure-injection gap | 🔲 pending |
| 5 | `axis-5-debt.md` — documented debt vs hidden debt (TODO/FIXME inventory) | 🔲 pending |
Axis 2 is gated on v1.0.7 landing first — otherwise the transition matrix
captures a v1.0.6.1 snapshot that's immediately stale. See
[`v107-plan.md`](./v107-plan.md) for the sequencing.
## Reading conventions
Every finding cites `file:line` evidence. Structure:
```
### P{0|1|2}.N — short title
**Evidence** — concrete cites
**Consequence** — what breaks today / tomorrow
**Action** — what to do, with enough detail that an implementer can start
**Criticity** — P0 / P1 / P2 / wontfix (with justification)
```
**P0** = fix within v1.0.7 or earlier (ledger diverges today, or a v1.0.7
commitment is structurally blocked).
**P1** = v1.0.7 target. Operational visibility / correctness hardening.
**P2** = v1.0.8+. Nice-to-have.
**wontfix** = justified non-action.
## Info needed from ops (not determinable from code)
Tracked in [`axis-1-correctness.md`](./axis-1-correctness.md#info-needed-from-ops).
Absence of answers becomes a finding in its own right.
## Derived deliverables
- [`v107-plan.md`](./v107-plan.md) — sequencing, dependencies and relative
effort for the axis-1 P0 findings + the CHANGELOG-parked v1.0.7 items.
Read this before picking up v1.0.7 work.

View file

@ -0,0 +1,529 @@
# Axis 1 — Correctness / accounting
> See [`README.md`](./README.md) for scope, auditor, source state and
> reading conventions.
## Surface summary
| PSP | Package | Endpoints called | Env gate |
|---|---|---|---|
| Hyperswitch (payments + refunds) | `internal/services/hyperswitch/` | `POST /payments`, `GET /payments/:id`, `POST /refunds` | `HYPERSWITCH_ENABLED` |
| Stripe Connect (seller payouts + transfers) | `internal/services/stripe_connect_service.go` | `POST /v1/account_links`, `GET /v1/accounts/:id`, `GET /v1/balance`, `POST /v1/transfers` | `STRIPE_CONNECT_ENABLED` |
| Table | Status column | Terminal states | Non-terminal |
|---|---|---|---|
| `orders` | `status` | completed, failed, cancelled, refunded | pending, refund_pending |
| `refunds` (v1.0.6) | `status` | succeeded, failed | pending |
| `seller_transfers` | `status` | completed, permanently_failed, reversed | pending, failed |
| `seller_payouts` | `status` | completed, failed | pending, processing |
| `licenses` | `revoked_at` (nullable) | NULL (active) / timestamp (revoked) | — |
| `user_subscriptions` | `status` (enum) | canceled, expired | active, trialing, past_due |
Webhook entry: `POST /api/v1/webhooks/hyperswitch`
(`routes_webhooks.go:55`), HMAC-SHA512 verified, dispatches to
`ProcessPaymentWebhook` or `ProcessRefundWebhook` via
`HyperswitchWebhookPayload.IsRefundEvent()` (`service.go:700`).
Existing reconciliation: **only** the `TransferRetryWorker` for
`seller_transfers.status='failed'` (`transfer_retry.go:57`), interval
`TRANSFER_RETRY_INTERVAL` default `5m` (`config.go:399`).
---
## P0.1 — `SellerTransfer.StripeTransferID` declared but never populated
**Evidence**
- Column declared: `internal/core/marketplace/models.go:237`
`StripeTransferID string \`gorm:"size:255" json:"stripe_transfer_id,omitempty"\``
- Zero write-sites (`grep 'StripeTransferID\s*='` returns 0 hits in production code).
- Upstream returns the transfer but we discard it:
`internal/services/stripe_connect_service.go:205`
`_, err := transfer.New(params)` — the `*stripe.Transfer` pointer carrying
`.ID` is underscored away.
- `CreateTransfer` signature returns only `error`
(`internal/services/stripe_connect_service.go:179`), so callers physically
cannot persist the id even if they wanted to.
**Consequence**
Direct blocker for the v1.0.6 CHANGELOG-documented "parked v1.0.7 —
Stripe Connect Transfers:reversal". When anyone starts that work, there
is no `stripe_transfer_id` in the DB to pass to
`transfers.CreateReversal(...)`. The only fallbacks would be:
- Query Stripe with `transfers.List(Destination=account_id, Metadata[order_id]=...)` — fragile, rate-limited, no unique guarantee.
- Do a full upstream refactor of the transfer worker — larger scope than the TODO implies.
This is exactly the "temporary compromise frozen into permanent" pattern
the audit charter targets.
**Action (prerequisite for any v1.0.7 reversal work)**
1. Change `TransferService.CreateTransfer` signature:
`(ctx, sellerID, amount, currency, orderID string) (stripeTransferID string, err error)`.
2. Update `processSellerTransfers` (`service.go:762`) to capture the id
and set `st.StripeTransferID = stripeTransferID` before `tx.Create(&st)`.
3. Update `TransferRetryWorker.retryOne` (`transfer_retry.go:116`) to
persist the id on successful retry too.
**Criticity** — P0. Must land before the reversal commit, not with it.
---
## P0.2 — No Stripe Connect reversal on refund.succeeded
**Evidence**
- `internal/core/marketplace/service.go:1529` `reverseSellerAccounting`
does three things on refund webhook succeeded:
1. `DebitSellerBalance` (in-DB balance row decrement)
2. `seller_transfers.status = "reversed"` (in-DB ledger flag)
3. **Nothing else.**
- The CHANGELOG v1.0.6 entry explicitly flags this: "Stripe Connect
Transfers:reversal call flagged TODO(v1.0.7). [...] the missing piece
is the money-movement round-trip at Stripe."
**Consequence**
For every refunded order, the buyer is credited by Hyperswitch (verified
via webhook) and the VEZA internal seller balance is debited correctly.
But the Stripe-Connect-side transfer that moved the seller's share into
their connected account is **not** reversed. The seller's Stripe account
still holds that money. VEZA's books say "debited", Stripe says "paid
out". On every refund this creates a real reconciliation gap that a
seller's monthly Stripe statement will show.
This isn't a ledger-consistency issue in VEZA's own DB — it's a
VEZA-vs-Stripe divergence. It grows linearly with refund volume.
**Action — decoupled from the buyer-refund critical path**
The intuitive fix ("call `transfers.NewReversal` inside
`reverseSellerAccounting` and return on error to let the webhook retry")
is wrong: Hyperswitch has already credited the buyer, the buyer is
watching their order page, and a Stripe flap would leave the buyer in
`refund_pending` until Stripe comes back. That couples buyer UX to the
weakest dependency in the whole flow.
Correct pattern — split the reversal into two concerns, with an
intermediate state that says "VEZA books updated, PSP reversal still
owed":
1. **Migration** — extend `seller_transfers.status` to include a new
value `reversal_pending`. Update the state-value documentation
comment in `internal/core/marketplace/models.go:242`.
2. **Refund finalization** (`service.go:1466 finalizeSuccessfulRefund`)
— no change to its flow except: instead of transitioning
`seller_transfers.status` directly to `reversed` in
`reverseSellerAccounting`, transition to `reversal_pending` and set
`next_retry_at = NOW()`. The buyer-facing flow finalizes immediately:
order=refunded, licenses revoked, webhook returns 200.
3. **New worker** `StripeReversalWorker` — scans
`seller_transfers WHERE status='reversal_pending' AND (next_retry_at IS NULL OR next_retry_at <= NOW())`.
For each row:
- if `StripeTransferID == ""` (see P0.1), skip + log ERROR "cannot
reverse — transfer id missing", alert ops.
- else call
`transfer.NewReversal(stripeTransferID, &stripe.TransferReversalParams{Amount: amountCents})`.
- on success: `status='reversed'`, persist the reversal id on a new
`stripe_reversal_id` column.
- on Stripe 404 (transfer not found): assume already reversed
out-of-band, set `status='reversed'`, log INFO.
- on any other error: exponential backoff on `next_retry_at`,
increment `retry_count`. Hit the same `permanently_failed` ceiling
the TransferRetryWorker uses today.
This gives us three properties:
- Buyer refund never blocks on Stripe health.
- VEZA accounting is consistent with Hyperswitch refund state at all times.
- The VEZA↔Stripe gap is bounded (every reversal either completes or
ends in `permanently_failed` with a clear ops action).
**Criticity** — P0. Pre-public-opening of marketplace.
---
## P0.3 — No reconciliation sweep for stuck pending states
**Evidence**
- Inventory search returned zero scheduled jobs that compare DB state
to PSP state for payments or refunds.
- `TransferRetryWorker` exists (`transfer_retry.go`) but only reconciles
`seller_transfers` — not orders, not refunds.
- Agent inventory: "No startup sanity checks comparing DB to PSP for
orders/refunds. No admin endpoints for manual order/refund/transfer
verification. No sweeps for stuck orders in `refund_pending` or orders
exceeding 14-day deadline."
**Consequence**
If a Hyperswitch webhook is never delivered — they've been down, our
`/webhooks/hyperswitch` was down during their retry window, or their
delivery system has a bug — then:
- An order stuck in `pending` stays forever. Customer paid, got no
license. Only surfaces via customer complaint.
- A refund stuck in `refund_pending` stays forever. Customer asked for
refund, got nothing. Order stuck in `refund_pending` also blocks
re-attempts (`ErrRefundAlreadyRequested` at `service.go:1303`).
- A refund row with empty `hyperswitch_refund_id` (Phase 1 completed,
Phase 2 crashed between PSP call and DB UPDATE) has no way to
correlate to the PSP. Nothing queries them.
**Action**
Hourly worker (same pattern as `CleanupOrphanTracks` from v1.0.5 Fix 6):
```go
// internal/jobs/reconcile_hyperswitch.go
func ReconcilePendingOrders(ctx context.Context, db *gorm.DB, hs *hyperswitch.Client) error {
// Orders in 'pending' older than 30m: GET /payments/:id, sync
// Refunds in 'pending' > 30m with non-empty refund_id: GET /refunds/:id, sync
// Refunds in 'pending' > 5m with EMPTY refund_id: mark failed, roll back order
}
```
Wire in `cmd/api/main.go` next to `ScheduleOrphanTracksCleanup`.
**Criticity** — P0. The gap is open today on every deployed instance.
---
## P0.4 — `CreatePayment` / `CreateRefund` not idempotent at the HTTP layer
**Evidence**
- `internal/services/hyperswitch/client.go:73-78` and
`internal/services/hyperswitch/client.go:170-175` — only
`Content-Type` and `api-key` headers are set. No `Idempotency-Key`.
- Hyperswitch's API supports an `Idempotency-Key` header (standard PSP
convention).
**Consequence**
If the HTTP call from VEZA to Hyperswitch retries at the transport
layer (curl-level / net/http Transport-level retry, proxy retry, TLS
reconnect after a flap) during a single logical call, a second payment
or refund object can be created for the same intent. Customer
double-charged, or two refunds issued. The order row's
`hyperswitch_payment_id` column is set to the *first* payment_id we
hear back about, so the second shadow payment is orphaned in Hyperswitch
with no VEZA row pointing at it.
**Scope note — what Idempotency-Key does and does NOT fix**
This header protects against **HTTP-transport retry** only. It does
**not** fix, and should not be conflated with:
- **Application-level replay** — buyer clicks "Pay" twice, frontend
double-submits a form, VEZA application code retries after a DB
transaction aborts. These paths create distinct `order.id` values, so
each call sees a different natural key and the PSP legitimately
creates different payments. The fix there is a state-machine
precondition (order dedup by `(buyer_id, cart_hash)` within a window,
frontend debounce, CSRF single-use token), not the header.
- **Crash-between-call-and-save** — VEZA POST succeeds, Hyperswitch
returns payment_id, VEZA crashes before persisting it. A subsequent
application retry with the same `order.id` *does* recover thanks to
the Idempotency-Key (Hyperswitch returns the cached response), but
this only works if the application retry re-uses the same order row.
That requires the order row to exist in a `pending_payment_submitted`
intermediate state before the PSP call — which the current code
doesn't have. Documenting as a follow-up.
**Action**
Add `Idempotency-Key: <natural key>` header on every mutating call:
- `CreatePayment``order_id` (UUID, guaranteed unique per order row).
- `CreateRefund``refund.ID` (UUID from the pending `refunds` row —
populated before the PSP call in Phase 1, so the same key is used
across any HTTP-level retry within that call).
The `http.Client` in `hyperswitch.Client` should NOT set this
implicitly; callers must pass the key explicitly so the semantic is
auditable at each call site. Two changes, each ~3 lines.
The state-machine-level fix for application retry is out of scope here;
it goes on the v1.0.8+ roadmap (documented in `v107-plan.md`).
**Criticity** — P0. Low probability per call but unbounded blast radius
(customer double-charge).
---
## P1.5 — Webhook raw payloads not persisted
**Evidence**
- `routes_webhooks.go:59-82` reads body, verifies signature, dispatches.
- No table or log file captures the raw webhook body, signature header,
or processing result for later audit.
**Consequence**
When a customer disputes ("I never got refunded"), we cannot reproduce
what Hyperswitch told us. The only evidence is in the backend `log` at
whatever retention the host keeps — and structured logs don't carry the
raw body (we explicitly redact some fields for security).
**Action**
New table `hyperswitch_webhook_log`:
```sql
CREATE TABLE hyperswitch_webhook_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
signature_header TEXT NOT NULL,
signature_valid BOOLEAN NOT NULL,
event_type TEXT,
payment_id TEXT,
refund_id TEXT,
payload BYTEA NOT NULL,
processing_result TEXT NOT NULL, -- 'ok', 'error: <msg>', 'skipped'
processed_at TIMESTAMPTZ
);
CREATE INDEX ON hyperswitch_webhook_log (payment_id) WHERE payment_id IS NOT NULL;
CREATE INDEX ON hyperswitch_webhook_log (refund_id) WHERE refund_id IS NOT NULL;
```
Insert in `routes_webhooks.go:hyperswitchWebhookHandler` **before**
signature verification — so we capture attack attempts too. Retention
90 days via a separate sweep job.
**Criticity** — P1. Not ledger-diverging, but the first ops question
on any refund/dispute claim is "what did the webhook say", and today we
can't answer it.
---
## P1.6 — Disputes / chargebacks silently dropped
**Evidence**
- `HyperswitchWebhookPayload.IsRefundEvent()` (`service.go:700`) only
distinguishes refund-vs-non-refund. No dispute discriminator.
- Zero grep hits for `dispute|chargeback|disputed` in
`internal/core/marketplace/`, `internal/services/hyperswitch/`, or
anywhere in the webhook handler path.
- Hyperswitch emits `dispute.opened`, `dispute.evidence_required`,
`dispute.won`, `dispute.lost`, `dispute.expired` — all land on
`POST /webhooks/hyperswitch` and currently fall through
`ProcessPaymentWebhook`'s switch to the `default` case
(`service.go:718`), which logs Debug "ignoring status" and returns 200.
**Consequence**
A dispute/chargeback is the case where the ledger diverges most
violently, because:
1. It's **not initiated by VEZA** — we learn about it passively when
the buyer's bank pushes a reversal through the card network.
2. On `dispute.lost` (the common case if we don't respond in time),
Hyperswitch debits our acquirer balance for the original amount
**plus** a chargeback fee (typically 15-25€ depending on acquirer).
3. The VEZA `orders` row stays `completed`, the license stays active,
the seller stays paid. Buyer has their money back via the card
network. Every table in VEZA thinks everything is fine. The only
signal is the acquirer statement, and it's weeks delayed.
4. Chargeback rate is a seller health metric: an account with 1% CB
rate gets flagged by Visa/MC and the whole VEZA merchant account
can be put under review. Without counting disputes per-seller, we
have no early-warning system.
**Action (staged)**
Minimum (ships with P1.5 webhook log): every `dispute.*` event lands in
`hyperswitch_webhook_log` with the payload captured. Ops can query.
Recommended (v1.0.8 sprint):
1. New `disputes` table:
```sql
CREATE TABLE disputes (
id UUID PRIMARY KEY,
order_id UUID NOT NULL REFERENCES orders(id),
hyperswitch_dispute_id TEXT UNIQUE NOT NULL,
status TEXT NOT NULL, -- opened, evidence_required, won, lost, expired
amount_cents BIGINT NOT NULL,
reason TEXT,
evidence_due_at TIMESTAMPTZ,
opened_at TIMESTAMPTZ NOT NULL,
resolved_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
2. Extend `IsRefundEvent()` discriminator to a three-way
`EventKind()` that returns `payment | refund | dispute`.
3. New `ProcessDisputeWebhook` — mirrors `ProcessRefundWebhook`'s
idempotency pattern (unique `hyperswitch_dispute_id` + row lock).
4. On `dispute.lost`: same seller-side reversal machinery as refund
(P0.2), because the money effect is identical.
5. Per-seller chargeback counter on `seller_balances` or a new
`seller_risk_metrics` table, exposed to admin UI.
**Criticity** — P1 minimum (capture + log). P0 if volume justifies it
once public-opened. At current volume (pre-opening), accepting the gap
is defensible; post-opening, a single dispute we miss is worse than
the engineering cost of handling it.
---
## P1.7 — Subscription money-movement not covered by v1.0.6 hardening
**Evidence**
- Subscription model has Hyperswitch keys:
`subscription/models.go:95` `HyperswitchSubscriptionID`, line 96
`HyperswitchCustomerID`, line 136 `HyperswitchPaymentID`.
- Status enum: `active, trialing, canceled, past_due, expired`
(`subscription/models.go:20-28`).
- `Service.Subscribe` (`subscription/service.go:149`) and
`Service.CancelSubscription` (line 390) exist.
- **Zero refund code in the subscription package.** Cancellation marks
the row `canceled` but does not initiate any PSP reimbursement.
- No visible webhook handler branch for `subscription.*` events — the
dispatcher only knows payment vs refund.
- No code path was found that transitions a subscription INTO
`past_due` (which implies it's driven by an external event the
webhook isn't consuming, or it's dead state).
**Consequence**
The marketplace refund hardening only covers one-shot purchases. A
subscription cancellation inside the paid period currently doesn't
refund anything, and a failed recurring charge doesn't move the
subscription to `past_due` because nothing wires it up.
**Action**
Dedicated sprint for subscription state-machine hardening. Scope for
v1.0.8. Breakdown:
1. Inventory every state in `user_subscriptions.status` with evidence
of which code path lands there.
2. Handle `subscription.payment_succeeded`, `subscription.payment_failed`,
`subscription.canceled` webhook events in a
`ProcessSubscriptionWebhook`.
3. Decide product policy on mid-period cancellation (pro-rata refund?
honour until end of period? forfeit?).
**Criticity** — P1 for the hardening, P0 only if subscriptions are
currently sold. Status: needs confirmation from ops — not determinable
from code alone. See [Info needed from ops](#info-needed-from-ops) Q2.
---
## P1.8 — No ledger-health metrics
**Evidence**
No grep hit for `prometheus.*counter.*order|refund|transfer|payout` in
`internal/metrics/`. Agent confirms no dashboards.
**Consequence**
Ledger state is invisible until someone complains. Specifically:
- Count of orders in `pending` older than 30 min (webhook never arrived)
- Count of orders in `refund_pending` older than 30 min
- Count of seller_transfers in `failed` at max `retry_count`
- Count of refunds with empty `hyperswitch_refund_id` older than 5 min
- Count of seller_transfers in `reversal_pending` older than 30 min (new state from P0.2)
- Oldest timestamp in each of the above
**Action**
Prometheus gauge metrics updated by a 60s sampler (cheap query each):
```
ledger_stuck_orders_pending_total{age_bucket="30m_1h|1h_1d|1d+"}
ledger_stuck_refunds_pending_total{age_bucket=...}
ledger_failed_transfers_at_max_retry_total
ledger_orphan_refund_rows_total // empty hyperswitch_refund_id
ledger_reversal_pending_transfers_total
```
Grafana panel with 5 number panels + 5 time-series. Alert rules:
- `ledger_stuck_orders_pending_total > 0 for 10m` → page
- `ledger_orphan_refund_rows_total > 0 for 5m` → page (= bug in the
two-phase commit between DB and PSP)
**Criticity** — P1. Ships with P0.3 (the reconciliation sweep) so the
metrics track the sweep's effectiveness.
---
## P2.9 — No admin API for manual refund / order override
**Evidence**
No admin endpoint found that can force an order or refund into a
specific terminal state. Ops would have to issue raw SQL in prod.
**Consequence**
When something genuinely gets stuck (Hyperswitch confirms out-of-band
that a refund succeeded but their webhook is never going to fire for
some freak reason), there's no sanctioned path to resolve without raw
DB access.
**Action**
`POST /api/v1/admin/orders/:id/force-status` +
`POST /api/v1/admin/refunds/:id/force-status`, gated by
`AuthMiddleware.RequireAdminRole`, with an audit-log entry. Body:
`{ new_status: "...", reason: "..." }`. Validates the target state is
reachable from the current one per the documented state machine
(depends on Axis 2).
**Criticity** — P2. Not urgent; raw SQL works until it doesn't.
---
## P2.10 — Partial refund latent compromise
**Evidence**
`client.go:142` `CreateRefundRequest.Amount *int64 // nil = full refund`.
`service.go:1330` passes `nil` as amount on every call.
CHANGELOG v1.0.6: "full-only for v1.0.6 — partial refund UX is deferred
to v1.0.7."
**Consequence**
The `amount *int64` parameter is a permanent half-wired compromise. When
partial refunds land, the implementer must remember:
- Multiple refunds per order must be allowed (change the double-submit
guard at `service.go:1303`).
- `SellerBalance` debit must be proportional to the partial amount, not
full transfer amount.
- The partial UNIQUE index already tolerates the multi-refund case
(v1.0.6.1 migration 979) — but only if each refund has its own
PSP-assigned refund_id, which Hyperswitch does.
- Webhook lookup already uses `hyperswitch_refund_id` so no change.
**Action**
When partial lands: extensive state-table tests (can the same order
reach (`status=refunded`, amount < total)? should it be a new
`partially_refunded` status?). Flag this item in the v1.0.7 plan
explicitly so it's scoped, not drive-by.
**Criticity** — P2. No active bug; future hazard only.
---
## wontfix.11 — `TransferRetryInterval` not per-seller configurable
`config.go:399` default `5m`. A single global interval works fine at
current traffic. Re-evaluate at 10× load.
---
## Axis 1 summary
| # | Finding | Criticity | Gate |
|---|---|---|---|
| 1 | `SellerTransfer.StripeTransferID` never set | P0 | prerequisite for v1.0.7 reversal |
| 2 | No Stripe Connect reversal on refund.succeeded | P0 | v1.0.7, decoupled via `reversal_pending` state |
| 3 | No reconciliation sweep for stuck pending states | P0 | v1.0.7 (hourly job) |
| 4 | No `Idempotency-Key` on Hyperswitch calls | P0 | v1.0.7 (2 header lines) |
| 5 | Webhook raw payloads not persisted | P1 | v1.0.7 (migration + insert) |
| 6 | Disputes / chargebacks silently dropped | P1 | v1.0.8 (min: capture in P1.5 log) |
| 7 | Subscription money-movement surface | P1 | v1.0.8 sprint |
| 8 | No ledger-health metrics | P1 | ships with #3 |
| 9 | No admin API for manual override | P2 | v1.0.8+ |
| 10 | Partial refund latent compromise | P2 | flag in v1.0.7 plan |
| 11 | Per-seller retry interval | wontfix | — |
**Blockers for "ready to show"** — 1, 2, 3, 4 must land before the
marketplace is used by anyone external (grants demo, beta, DIY
community). After those four, VEZA's accounting is self-consistent
AND consistent with Stripe+Hyperswitch, which is the real bar.
See [`v107-plan.md`](./v107-plan.md) for sequencing and effort.
---
## Info needed from ops
Not determinable from code. Answers become part of the audit:
1. **Last manual reconciliation between VEZA DB and Stripe/Hyperswitch
dashboards.** Date + scope + outcome.
_If the answer is "jamais" — that's itself a P1 process finding:
no reconciliation cadence established. Absence of documented debt
= hidden debt._
2. **Are subscriptions currently sold** (i.e., is the paid-subscription
feature turned on for end-users), or is the code live but the UI
hides the entry point? If yes, P1.7 → P0. If hidden-by-UI,
additional Axis-3 check needed: is the API reachable despite the
UI hiding? Front-flag-only gating is a security finding.
3. **Current volume order of magnitude** — number of orders/refunds
processed per week — so criticity can be re-checked against blast
radius.

View file

@ -0,0 +1,230 @@
# v1.0.7 — plan structuré
> Derived from [`axis-1-correctness.md`](./axis-1-correctness.md).
> The v1.0.6 CHANGELOG listed 4 "parked v1.0.7" items; the axis-1 audit
> added 4 P0 findings. De-duplicated and sequenced below.
## The 8 items, de-duplicated → 6
| # | From | Title | Effort |
|---|---|---|---|
| A | audit P0.1 | Persist `stripe_transfer_id` in `seller_transfers` | **S** |
| B | audit P0.2 ≡ CHANGELOG "Stripe Connect reversal" | Connect reversal via `reversal_pending` state + async worker | **M** |
| C | audit P0.3 | Reconciliation sweep for stuck orders / refunds | **M** |
| D | audit P0.4 | `Idempotency-Key` on `CreatePayment` / `CreateRefund` | **XS** |
| E | audit P1.5 | Webhook raw-payload log table + insert | **S** |
| F | audit P1.8 | Ledger-health Prometheus metrics + alerts | **S** |
Dropped from v1.0.7 scope:
- Partial refunds (CHANGELOG-parked) — P2 in audit, feature-class, defer to v1.0.8
- `CloudUploadModal` single-source-of-truth (CHANGELOG-parked) — P2, out of money-movement scope
- Sandbox smoke-test documentation — landed de facto in v1.0.6.1 via the partial-UNIQUE hotfix + the smoke harness artefacts
Effort legend — **XS** ≤ 2h, **S** ≤ 1 day, **M** ≤ 3 days, **L** > 3 days.
## Dependency graph
```
┌──────────────────┐
│ D Idempotency │ independent, can land first as quick win
└──────────────────┘
┌──────────────────┐
│ A transfer_id │ prerequisite for B
└────────┬─────────┘
┌──────────────────┐
│ B reversal │ uses transfer_id persisted by A;
│ worker │ introduces `reversal_pending` status
└──────────────────┘
┌──────────────────┐
│ E webhook log │ independent; prerequisite for P1.6 (disputes)
└──────────────────┘
┌──────────────────┐
│ C reconciler │ independent, but:
│ sweep │ metrics (F) track its effectiveness
└────────┬─────────┘
┌──────────────────┐
│ F metrics │ needs the state-shape C produces; alerts wire
│ + alerts │ to the buckets the reconciler defines
└──────────────────┘
```
**Two parallel tracks:**
- **Track 1 (reversal correctness)** — A → B
- **Track 2 (operational visibility)** — D, E, C → F
Two developers can work in parallel without stepping on each other. A
single developer sequences as ordered: D first (XS quick win, earns
trust + unblocks "pre-open" checklist), then A→B, then E, then C→F.
## Commit sequence (single-developer path)
Each item lands as its own commit in the existing v1.0.6-style cadence
(per-commit tests + CHANGELOG-worthy).
### 1. `fix(hyperswitch): idempotency-key on create-payment and create-refund` — D
Effort: **XS**. Pure header addition. Tests: the 15-case refund suite
already exists; add 2 cases verifying the header is set correctly
(httptest.Server assertion on `r.Header.Get("Idempotency-Key")`).
Acceptance:
- Every outbound `POST /payments` carries `Idempotency-Key: <order.id>`.
- Every outbound `POST /refunds` carries `Idempotency-Key: <refund.id>`.
- No implicit-via-ctx magic: each call site sets the header explicitly,
greppable.
- CHANGELOG entry cross-references P0.4 + its scope note (HTTP retry
only, not app-level replay).
**Ship as v1.0.7-alpha-1 for sandbox testing**, don't wait for B/C.
### 2. `refactor(connect): persist stripe_transfer_id on create + retry` — A
Effort: **S**. Touches the `TransferService` interface (minor breaking
change — but only internal callers). Migration:
`980_seller_transfers_stripe_reversal_id.sql` adds `stripe_reversal_id`
nullable column (prepares ground for B).
Acceptance:
- `TransferService.CreateTransfer(...) (string, error)` — returns the
Stripe transfer id.
- `processSellerTransfers` persists `st.StripeTransferID` before
`tx.Create(&st)`.
- `TransferRetryWorker.retryOne` also persists on retry success.
- Backfill: one-shot migration query that fills known transfer_ids
for past orders by calling Stripe's
`transfers.List(Destination=..., Metadata[order_id]=...)`. Acceptable
to leave `NULL` where Stripe has no match (document: "pre-v1.0.7
transfers cannot be reversed automatically; use admin API P2.9").
### 3. `feat(marketplace): async stripe connect reversal worker` — B
Effort: **M**. The big one. Introduces `seller_transfers.status =
'reversal_pending'` as a new intermediate terminal-avoidance state.
Migration: `981_seller_transfers_reversal_pending_enum.sql`.
Acceptance:
- `reverseSellerAccounting` transitions `seller_transfers.status` to
`reversal_pending`, sets `next_retry_at = NOW()`. Refund
finalization completes end-to-end — buyer never sees a
Stripe-health-dependent refund UX.
- New `StripeReversalWorker` in `internal/workers/stripe_reversal.go`
runs every `TRANSFER_RETRY_INTERVAL` (reuse existing env var),
processes `reversal_pending` rows. Calls
`transfer.NewReversal(stripeTransferID, &params)`.
- On success → `status='reversed'`, persist `stripe_reversal_id`.
- On Stripe 404 → `status='reversed'` + log INFO (treat as "already
reversed out-of-band").
- On other errors → exponential backoff via `next_retry_at` + `retry_count`,
hit `permanently_failed` ceiling after N retries (mirror
`TransferRetryWorker`).
- Tests: mock the stripe SDK via a local httptest.Server (harder than
the Hyperswitch mock because stripe-go's HTTP layer is less trivial
to re-point, but feasible via `stripe.SetBackend`). 8 cases:
- happy path: pending → reversal_pending → reversed + id persisted
- Stripe 404: reversal_pending → reversed + log
- Stripe 5xx transient: retry_count increments, backoff set
- max retries: → permanently_failed
- concurrent worker (two instances pick same row): lock wins
- StripeTransferID empty (legacy row): skip + ERROR log + counter
- reversal idempotency (Stripe dedupes same transferID): no-op
- worker graceful shutdown mid-reversal (context cancellation)
### 4. `feat(webhooks): persist raw hyperswitch payloads to audit log` — E
Effort: **S**. Migration: `982_hyperswitch_webhook_log.sql`. Insert
before signature verification (so we capture attack attempts too).
Acceptance:
- Every webhook landing on `/webhooks/hyperswitch` produces exactly one
row, regardless of signature-valid or processing outcome.
- `processing_result` field captures `'ok'`, `'error: <msg>'`, or
`'skipped'`.
- Retention: a `CleanupWebhookLog` worker in the same
`internal/jobs/` package as the orphan-tracks cleaner, daily, deletes
rows older than 90 days.
- Tests: 3 cases (valid signature + processing ok; invalid signature;
processing error).
### 5. `feat(workers): hyperswitch reconciliation sweep for stuck pending states` — C
Effort: **M**. New `ReconcileHyperswitchWorker` in `internal/jobs/`.
Hourly by default (`RECONCILE_INTERVAL=1h`), but exposed so ops can
drop to 5m during incident response.
Acceptance:
- Orders in `pending` > 30min: `GET /payments/:id`, call the same
`ProcessPaymentWebhook` internal dispatcher with a synthesised
payload. Idempotent with real webhooks via the existing terminal-state
guard.
- Refunds in `pending` with non-empty `hyperswitch_refund_id` > 30min:
`GET /refunds/:id`, same pattern with `ProcessRefundWebhook`.
- Refunds in `pending` with EMPTY `hyperswitch_refund_id` > 5min:
mark `failed`, roll order back to `completed`, log ERROR
(operator attention needed — something crashed between Phase 1 and
Phase 2 of `RefundOrder`).
- Tests: happy sync, no-op when everything is terminal, the
empty-refund_id auto-fail case.
- Structured log on every action taken so `grep reconcile` tells the
ops story.
### 6. `feat(metrics): ledger-health gauges + alert rules` — F
Effort: **S**. New file `internal/metrics/ledger_health.go` with a
60s sampler. Grafana dashboard JSON in `config/grafana/ledger.json`.
Prometheus alert rules in `config/alertmanager/ledger.yml`.
Acceptance:
- 5 gauge metrics (listed in P1.8 action).
- 2 alert rules (`stuck_orders > 0 for 10m`, `orphan_refunds > 0 for 5m`).
- Sampler queries are cheap: `SELECT COUNT(*) WHERE ... AND created_at < NOW() - INTERVAL '30 min'` per metric, indexed by `status + created_at`.
- If indexes are missing: migration `983_ledger_health_indexes.sql`.
## Release gating
**Before tagging v1.0.7:**
- All six items landed as separate commits.
- Refund smoke test (see `docs/audit-2026-04/smoke/refund.md` — TBD)
re-run against sandbox with the reversal path: assert Stripe side
shows the reversal as well as the refund.
- Manual reconciliation pass: query the live DB against Hyperswitch
dashboard for all `completed` orders in the past week → zero drift.
Document the script used (`scripts/reconcile-check.sh`), keep for
periodic re-use.
- CHANGELOG entry lists each commit + criticity it resolves.
**Cut v1.0.7 as a minor release** (not patch), because the interface
change in A is technically breaking for any external caller of
`TransferService` (there are none today, but naming it 1.0.7 signals
the API change).
## Effort total
Worst-case single-developer sequential: **~9-10 working days**
(XS + S + M + S + M + S = 2h + 1d + 3d + 1d + 3d + 1d).
With two devs on the parallel tracks: **~5 working days** end-to-end,
because track 1 (A→B, 4d) and track 2 (D, E, C→F done serially = 5d)
overlap.
## Unknowns to resolve before starting
1. **Volume order of magnitude** — informs whether the reconciliation
sweep's default interval of 1h is appropriate. If we're processing
> 1000 orders/day, drop to 15m. Ops confirmation needed.
2. **Stripe Connect account state at time of v1.0.7 deploy** — any
pre-v1.0.7 transfers lack `stripe_transfer_id` in DB. The backfill
migration in A needs to know how many rows to probe and how many
will successfully match. Acceptable ceiling: if > 5% of historical
transfers fail to backfill, escalate to manual reconciliation
session.
3. **Hyperswitch sandbox vs prod webhook secret rotation** — item D
changes outbound request headers only, no impact. Items B+C call
Hyperswitch read endpoints, which use the same `HYPERSWITCH_API_KEY`.
No secret-rotation concern.