docs(audit): 2026-04 correctness/accounting findings (axis 1)
Axis 1 of the 5-axis VEZA audit, scoped to money-movement correctness
and ledger↔PSP reconciliation. Layout: one file per axis under
docs/audit-2026-04/, README index, v107-plan.md derived.
P0 findings (block v1.0.7 "ready-to-show" gate):
* P0.1 — SellerTransfer.StripeTransferID declared but never populated.
stripe_connect_service.CreateTransfer discards the *stripe.Transfer
return value (`_, err := transfer.New(params)`), so the column in
models.go:237 is dead. Structural blocker for the CHANGELOG-parked
v1.0.7 "Stripe Connect reversal" item.
* P0.2 — No Stripe Connect reversal on refund.succeeded. Every refund
today creates a permanent VEZA↔Stripe ledger gap. Action reworked
to decouple via a new `seller_transfers.status = 'reversal_pending'`
state + async worker, so Stripe flaps never block buyer-facing
refund UX.
* P0.3 — No reconciliation sweep for stuck orders / refunds / refund
rows with empty hyperswitch_refund_id. Hourly worker recommended,
same pattern as v1.0.5 Fix 6 orphan-tracks cleaner.
* P0.4 — No Idempotency-Key on outbound Hyperswitch POST /payments and
POST /refunds. Action includes an explicit scope note: the header
covers HTTP-transport retry only, NOT application-level replay (for
which the fix is a state-machine precondition).
P1 findings:
* P1.5 — Webhook raw payloads not persisted (blocks dispute forensics)
* P1.6 — Disputes / chargebacks silently dropped (new, surfaced during
review; dispute.* webhooks fall through the default case)
* P1.7 — Subscription money-movement not covered by v1.0.6 hardening
* P1.8 — No ledger-health Prometheus metrics
P2 findings:
* P2.9 — No admin API for manual override
* P2.10 — Partial refund latent compromise (amount *int64 always nil)
wontfix:
* wontfix.11 — Per-seller retry interval (re-evaluate at 10× load)
Derived deliverable: v107-plan.md sequences the 6 de-duplicated items
(4 P0 + 2 P1) with a dependency graph, two parallel tracks, per-commit
effort estimates (D→A→B; E→C→F), release gating and open questions
(volume magnitude, Connect backfill %).
Info needed from ops (tracked in axis-1 doc, not determinable from
code): last manual reconciliation date, whether subscriptions are
currently sold, current order/refund volume.
Axes 2-5 deferred: README.md marks axis 2 (state machines) as gated
on v1.0.7 landing first, otherwise the transition matrix captures a
v1.0.6.1 snapshot that's immediately stale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
5e3964b989
commit
6b345ede9f
3 changed files with 811 additions and 0 deletions
52
docs/audit-2026-04/README.md
Normal file
52
docs/audit-2026-04/README.md
Normal file
|
|
@ -0,0 +1,52 @@
|
|||
# VEZA Audit — 2026-04
|
||||
|
||||
> **Scope** — VEZA backend (Go) + web (TypeScript).
|
||||
> TALAS software (firmware, PCB reverse-engineering pipeline) is **out of scope**
|
||||
> and will be audited separately when its phase stabilises.
|
||||
>
|
||||
> **Source state** — commits up to `a57bb6f78` (v1.0.6.1, 2026-04-17).
|
||||
>
|
||||
> **Auditor** — Claude Opus 4.7 (1M context).
|
||||
|
||||
## Axes
|
||||
|
||||
| # | File | Status |
|
||||
|---|---|---|
|
||||
| 1 | [`axis-1-correctness.md`](./axis-1-correctness.md) — correctness / accounting | ✅ delivered |
|
||||
| 2 | `axis-2-state-machines.md` — transition matrix + illegal-transition tests | 🔲 pending v1.0.7 |
|
||||
| 3 | `axis-3-security.md` — attack surface (signatures, rate limits, authz, secrets) | 🔲 pending |
|
||||
| 4 | `axis-4-tests.md` — coverage vs reality, failure-injection gap | 🔲 pending |
|
||||
| 5 | `axis-5-debt.md` — documented debt vs hidden debt (TODO/FIXME inventory) | 🔲 pending |
|
||||
|
||||
Axis 2 is gated on v1.0.7 landing first — otherwise the transition matrix
|
||||
captures a v1.0.6.1 snapshot that's immediately stale. See
|
||||
[`v107-plan.md`](./v107-plan.md) for the sequencing.
|
||||
|
||||
## Reading conventions
|
||||
|
||||
Every finding cites `file:line` evidence. Structure:
|
||||
|
||||
```
|
||||
### P{0|1|2}.N — short title
|
||||
**Evidence** — concrete cites
|
||||
**Consequence** — what breaks today / tomorrow
|
||||
**Action** — what to do, with enough detail that an implementer can start
|
||||
**Criticity** — P0 / P1 / P2 / wontfix (with justification)
|
||||
```
|
||||
|
||||
**P0** = fix within v1.0.7 or earlier (ledger diverges today, or a v1.0.7
|
||||
commitment is structurally blocked).
|
||||
**P1** = v1.0.7 target. Operational visibility / correctness hardening.
|
||||
**P2** = v1.0.8+. Nice-to-have.
|
||||
**wontfix** = justified non-action.
|
||||
|
||||
## Info needed from ops (not determinable from code)
|
||||
|
||||
Tracked in [`axis-1-correctness.md`](./axis-1-correctness.md#info-needed-from-ops).
|
||||
Absence of answers becomes a finding in its own right.
|
||||
|
||||
## Derived deliverables
|
||||
|
||||
- [`v107-plan.md`](./v107-plan.md) — sequencing, dependencies and relative
|
||||
effort for the axis-1 P0 findings + the CHANGELOG-parked v1.0.7 items.
|
||||
Read this before picking up v1.0.7 work.
|
||||
529
docs/audit-2026-04/axis-1-correctness.md
Normal file
529
docs/audit-2026-04/axis-1-correctness.md
Normal file
|
|
@ -0,0 +1,529 @@
|
|||
# Axis 1 — Correctness / accounting
|
||||
|
||||
> See [`README.md`](./README.md) for scope, auditor, source state and
|
||||
> reading conventions.
|
||||
|
||||
## Surface summary
|
||||
|
||||
| PSP | Package | Endpoints called | Env gate |
|
||||
|---|---|---|---|
|
||||
| Hyperswitch (payments + refunds) | `internal/services/hyperswitch/` | `POST /payments`, `GET /payments/:id`, `POST /refunds` | `HYPERSWITCH_ENABLED` |
|
||||
| Stripe Connect (seller payouts + transfers) | `internal/services/stripe_connect_service.go` | `POST /v1/account_links`, `GET /v1/accounts/:id`, `GET /v1/balance`, `POST /v1/transfers` | `STRIPE_CONNECT_ENABLED` |
|
||||
|
||||
| Table | Status column | Terminal states | Non-terminal |
|
||||
|---|---|---|---|
|
||||
| `orders` | `status` | completed, failed, cancelled, refunded | pending, refund_pending |
|
||||
| `refunds` (v1.0.6) | `status` | succeeded, failed | pending |
|
||||
| `seller_transfers` | `status` | completed, permanently_failed, reversed | pending, failed |
|
||||
| `seller_payouts` | `status` | completed, failed | pending, processing |
|
||||
| `licenses` | `revoked_at` (nullable) | NULL (active) / timestamp (revoked) | — |
|
||||
| `user_subscriptions` | `status` (enum) | canceled, expired | active, trialing, past_due |
|
||||
|
||||
Webhook entry: `POST /api/v1/webhooks/hyperswitch`
|
||||
(`routes_webhooks.go:55`), HMAC-SHA512 verified, dispatches to
|
||||
`ProcessPaymentWebhook` or `ProcessRefundWebhook` via
|
||||
`HyperswitchWebhookPayload.IsRefundEvent()` (`service.go:700`).
|
||||
|
||||
Existing reconciliation: **only** the `TransferRetryWorker` for
|
||||
`seller_transfers.status='failed'` (`transfer_retry.go:57`), interval
|
||||
`TRANSFER_RETRY_INTERVAL` default `5m` (`config.go:399`).
|
||||
|
||||
---
|
||||
|
||||
## P0.1 — `SellerTransfer.StripeTransferID` declared but never populated
|
||||
|
||||
**Evidence**
|
||||
- Column declared: `internal/core/marketplace/models.go:237`
|
||||
`StripeTransferID string \`gorm:"size:255" json:"stripe_transfer_id,omitempty"\``
|
||||
- Zero write-sites (`grep 'StripeTransferID\s*='` returns 0 hits in production code).
|
||||
- Upstream returns the transfer but we discard it:
|
||||
`internal/services/stripe_connect_service.go:205`
|
||||
`_, err := transfer.New(params)` — the `*stripe.Transfer` pointer carrying
|
||||
`.ID` is underscored away.
|
||||
- `CreateTransfer` signature returns only `error`
|
||||
(`internal/services/stripe_connect_service.go:179`), so callers physically
|
||||
cannot persist the id even if they wanted to.
|
||||
|
||||
**Consequence**
|
||||
Direct blocker for the v1.0.6 CHANGELOG-documented "parked v1.0.7 —
|
||||
Stripe Connect Transfers:reversal". When anyone starts that work, there
|
||||
is no `stripe_transfer_id` in the DB to pass to
|
||||
`transfers.CreateReversal(...)`. The only fallbacks would be:
|
||||
- Query Stripe with `transfers.List(Destination=account_id, Metadata[order_id]=...)` — fragile, rate-limited, no unique guarantee.
|
||||
- Do a full upstream refactor of the transfer worker — larger scope than the TODO implies.
|
||||
|
||||
This is exactly the "temporary compromise frozen into permanent" pattern
|
||||
the audit charter targets.
|
||||
|
||||
**Action (prerequisite for any v1.0.7 reversal work)**
|
||||
1. Change `TransferService.CreateTransfer` signature:
|
||||
`(ctx, sellerID, amount, currency, orderID string) (stripeTransferID string, err error)`.
|
||||
2. Update `processSellerTransfers` (`service.go:762`) to capture the id
|
||||
and set `st.StripeTransferID = stripeTransferID` before `tx.Create(&st)`.
|
||||
3. Update `TransferRetryWorker.retryOne` (`transfer_retry.go:116`) to
|
||||
persist the id on successful retry too.
|
||||
|
||||
**Criticity** — P0. Must land before the reversal commit, not with it.
|
||||
|
||||
---
|
||||
|
||||
## P0.2 — No Stripe Connect reversal on refund.succeeded
|
||||
|
||||
**Evidence**
|
||||
- `internal/core/marketplace/service.go:1529` `reverseSellerAccounting`
|
||||
does three things on refund webhook succeeded:
|
||||
1. `DebitSellerBalance` (in-DB balance row decrement)
|
||||
2. `seller_transfers.status = "reversed"` (in-DB ledger flag)
|
||||
3. **Nothing else.**
|
||||
- The CHANGELOG v1.0.6 entry explicitly flags this: "Stripe Connect
|
||||
Transfers:reversal call flagged TODO(v1.0.7). [...] the missing piece
|
||||
is the money-movement round-trip at Stripe."
|
||||
|
||||
**Consequence**
|
||||
For every refunded order, the buyer is credited by Hyperswitch (verified
|
||||
via webhook) and the VEZA internal seller balance is debited correctly.
|
||||
But the Stripe-Connect-side transfer that moved the seller's share into
|
||||
their connected account is **not** reversed. The seller's Stripe account
|
||||
still holds that money. VEZA's books say "debited", Stripe says "paid
|
||||
out". On every refund this creates a real reconciliation gap that a
|
||||
seller's monthly Stripe statement will show.
|
||||
|
||||
This isn't a ledger-consistency issue in VEZA's own DB — it's a
|
||||
VEZA-vs-Stripe divergence. It grows linearly with refund volume.
|
||||
|
||||
**Action — decoupled from the buyer-refund critical path**
|
||||
The intuitive fix ("call `transfers.NewReversal` inside
|
||||
`reverseSellerAccounting` and return on error to let the webhook retry")
|
||||
is wrong: Hyperswitch has already credited the buyer, the buyer is
|
||||
watching their order page, and a Stripe flap would leave the buyer in
|
||||
`refund_pending` until Stripe comes back. That couples buyer UX to the
|
||||
weakest dependency in the whole flow.
|
||||
|
||||
Correct pattern — split the reversal into two concerns, with an
|
||||
intermediate state that says "VEZA books updated, PSP reversal still
|
||||
owed":
|
||||
|
||||
1. **Migration** — extend `seller_transfers.status` to include a new
|
||||
value `reversal_pending`. Update the state-value documentation
|
||||
comment in `internal/core/marketplace/models.go:242`.
|
||||
2. **Refund finalization** (`service.go:1466 finalizeSuccessfulRefund`)
|
||||
— no change to its flow except: instead of transitioning
|
||||
`seller_transfers.status` directly to `reversed` in
|
||||
`reverseSellerAccounting`, transition to `reversal_pending` and set
|
||||
`next_retry_at = NOW()`. The buyer-facing flow finalizes immediately:
|
||||
order=refunded, licenses revoked, webhook returns 200.
|
||||
3. **New worker** `StripeReversalWorker` — scans
|
||||
`seller_transfers WHERE status='reversal_pending' AND (next_retry_at IS NULL OR next_retry_at <= NOW())`.
|
||||
For each row:
|
||||
- if `StripeTransferID == ""` (see P0.1), skip + log ERROR "cannot
|
||||
reverse — transfer id missing", alert ops.
|
||||
- else call
|
||||
`transfer.NewReversal(stripeTransferID, &stripe.TransferReversalParams{Amount: amountCents})`.
|
||||
- on success: `status='reversed'`, persist the reversal id on a new
|
||||
`stripe_reversal_id` column.
|
||||
- on Stripe 404 (transfer not found): assume already reversed
|
||||
out-of-band, set `status='reversed'`, log INFO.
|
||||
- on any other error: exponential backoff on `next_retry_at`,
|
||||
increment `retry_count`. Hit the same `permanently_failed` ceiling
|
||||
the TransferRetryWorker uses today.
|
||||
|
||||
This gives us three properties:
|
||||
- Buyer refund never blocks on Stripe health.
|
||||
- VEZA accounting is consistent with Hyperswitch refund state at all times.
|
||||
- The VEZA↔Stripe gap is bounded (every reversal either completes or
|
||||
ends in `permanently_failed` with a clear ops action).
|
||||
|
||||
**Criticity** — P0. Pre-public-opening of marketplace.
|
||||
|
||||
---
|
||||
|
||||
## P0.3 — No reconciliation sweep for stuck pending states
|
||||
|
||||
**Evidence**
|
||||
- Inventory search returned zero scheduled jobs that compare DB state
|
||||
to PSP state for payments or refunds.
|
||||
- `TransferRetryWorker` exists (`transfer_retry.go`) but only reconciles
|
||||
`seller_transfers` — not orders, not refunds.
|
||||
- Agent inventory: "No startup sanity checks comparing DB to PSP for
|
||||
orders/refunds. No admin endpoints for manual order/refund/transfer
|
||||
verification. No sweeps for stuck orders in `refund_pending` or orders
|
||||
exceeding 14-day deadline."
|
||||
|
||||
**Consequence**
|
||||
If a Hyperswitch webhook is never delivered — they've been down, our
|
||||
`/webhooks/hyperswitch` was down during their retry window, or their
|
||||
delivery system has a bug — then:
|
||||
- An order stuck in `pending` stays forever. Customer paid, got no
|
||||
license. Only surfaces via customer complaint.
|
||||
- A refund stuck in `refund_pending` stays forever. Customer asked for
|
||||
refund, got nothing. Order stuck in `refund_pending` also blocks
|
||||
re-attempts (`ErrRefundAlreadyRequested` at `service.go:1303`).
|
||||
- A refund row with empty `hyperswitch_refund_id` (Phase 1 completed,
|
||||
Phase 2 crashed between PSP call and DB UPDATE) has no way to
|
||||
correlate to the PSP. Nothing queries them.
|
||||
|
||||
**Action**
|
||||
Hourly worker (same pattern as `CleanupOrphanTracks` from v1.0.5 Fix 6):
|
||||
|
||||
```go
|
||||
// internal/jobs/reconcile_hyperswitch.go
|
||||
func ReconcilePendingOrders(ctx context.Context, db *gorm.DB, hs *hyperswitch.Client) error {
|
||||
// Orders in 'pending' older than 30m: GET /payments/:id, sync
|
||||
// Refunds in 'pending' > 30m with non-empty refund_id: GET /refunds/:id, sync
|
||||
// Refunds in 'pending' > 5m with EMPTY refund_id: mark failed, roll back order
|
||||
}
|
||||
```
|
||||
|
||||
Wire in `cmd/api/main.go` next to `ScheduleOrphanTracksCleanup`.
|
||||
|
||||
**Criticity** — P0. The gap is open today on every deployed instance.
|
||||
|
||||
---
|
||||
|
||||
## P0.4 — `CreatePayment` / `CreateRefund` not idempotent at the HTTP layer
|
||||
|
||||
**Evidence**
|
||||
- `internal/services/hyperswitch/client.go:73-78` and
|
||||
`internal/services/hyperswitch/client.go:170-175` — only
|
||||
`Content-Type` and `api-key` headers are set. No `Idempotency-Key`.
|
||||
- Hyperswitch's API supports an `Idempotency-Key` header (standard PSP
|
||||
convention).
|
||||
|
||||
**Consequence**
|
||||
If the HTTP call from VEZA to Hyperswitch retries at the transport
|
||||
layer (curl-level / net/http Transport-level retry, proxy retry, TLS
|
||||
reconnect after a flap) during a single logical call, a second payment
|
||||
or refund object can be created for the same intent. Customer
|
||||
double-charged, or two refunds issued. The order row's
|
||||
`hyperswitch_payment_id` column is set to the *first* payment_id we
|
||||
hear back about, so the second shadow payment is orphaned in Hyperswitch
|
||||
with no VEZA row pointing at it.
|
||||
|
||||
**Scope note — what Idempotency-Key does and does NOT fix**
|
||||
This header protects against **HTTP-transport retry** only. It does
|
||||
**not** fix, and should not be conflated with:
|
||||
- **Application-level replay** — buyer clicks "Pay" twice, frontend
|
||||
double-submits a form, VEZA application code retries after a DB
|
||||
transaction aborts. These paths create distinct `order.id` values, so
|
||||
each call sees a different natural key and the PSP legitimately
|
||||
creates different payments. The fix there is a state-machine
|
||||
precondition (order dedup by `(buyer_id, cart_hash)` within a window,
|
||||
frontend debounce, CSRF single-use token), not the header.
|
||||
- **Crash-between-call-and-save** — VEZA POST succeeds, Hyperswitch
|
||||
returns payment_id, VEZA crashes before persisting it. A subsequent
|
||||
application retry with the same `order.id` *does* recover thanks to
|
||||
the Idempotency-Key (Hyperswitch returns the cached response), but
|
||||
this only works if the application retry re-uses the same order row.
|
||||
That requires the order row to exist in a `pending_payment_submitted`
|
||||
intermediate state before the PSP call — which the current code
|
||||
doesn't have. Documenting as a follow-up.
|
||||
|
||||
**Action**
|
||||
Add `Idempotency-Key: <natural key>` header on every mutating call:
|
||||
- `CreatePayment` → `order_id` (UUID, guaranteed unique per order row).
|
||||
- `CreateRefund` → `refund.ID` (UUID from the pending `refunds` row —
|
||||
populated before the PSP call in Phase 1, so the same key is used
|
||||
across any HTTP-level retry within that call).
|
||||
|
||||
The `http.Client` in `hyperswitch.Client` should NOT set this
|
||||
implicitly; callers must pass the key explicitly so the semantic is
|
||||
auditable at each call site. Two changes, each ~3 lines.
|
||||
|
||||
The state-machine-level fix for application retry is out of scope here;
|
||||
it goes on the v1.0.8+ roadmap (documented in `v107-plan.md`).
|
||||
|
||||
**Criticity** — P0. Low probability per call but unbounded blast radius
|
||||
(customer double-charge).
|
||||
|
||||
---
|
||||
|
||||
## P1.5 — Webhook raw payloads not persisted
|
||||
|
||||
**Evidence**
|
||||
- `routes_webhooks.go:59-82` reads body, verifies signature, dispatches.
|
||||
- No table or log file captures the raw webhook body, signature header,
|
||||
or processing result for later audit.
|
||||
|
||||
**Consequence**
|
||||
When a customer disputes ("I never got refunded"), we cannot reproduce
|
||||
what Hyperswitch told us. The only evidence is in the backend `log` at
|
||||
whatever retention the host keeps — and structured logs don't carry the
|
||||
raw body (we explicitly redact some fields for security).
|
||||
|
||||
**Action**
|
||||
New table `hyperswitch_webhook_log`:
|
||||
|
||||
```sql
|
||||
CREATE TABLE hyperswitch_webhook_log (
|
||||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||
received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
signature_header TEXT NOT NULL,
|
||||
signature_valid BOOLEAN NOT NULL,
|
||||
event_type TEXT,
|
||||
payment_id TEXT,
|
||||
refund_id TEXT,
|
||||
payload BYTEA NOT NULL,
|
||||
processing_result TEXT NOT NULL, -- 'ok', 'error: <msg>', 'skipped'
|
||||
processed_at TIMESTAMPTZ
|
||||
);
|
||||
CREATE INDEX ON hyperswitch_webhook_log (payment_id) WHERE payment_id IS NOT NULL;
|
||||
CREATE INDEX ON hyperswitch_webhook_log (refund_id) WHERE refund_id IS NOT NULL;
|
||||
```
|
||||
|
||||
Insert in `routes_webhooks.go:hyperswitchWebhookHandler` **before**
|
||||
signature verification — so we capture attack attempts too. Retention
|
||||
90 days via a separate sweep job.
|
||||
|
||||
**Criticity** — P1. Not ledger-diverging, but the first ops question
|
||||
on any refund/dispute claim is "what did the webhook say", and today we
|
||||
can't answer it.
|
||||
|
||||
---
|
||||
|
||||
## P1.6 — Disputes / chargebacks silently dropped
|
||||
|
||||
**Evidence**
|
||||
- `HyperswitchWebhookPayload.IsRefundEvent()` (`service.go:700`) only
|
||||
distinguishes refund-vs-non-refund. No dispute discriminator.
|
||||
- Zero grep hits for `dispute|chargeback|disputed` in
|
||||
`internal/core/marketplace/`, `internal/services/hyperswitch/`, or
|
||||
anywhere in the webhook handler path.
|
||||
- Hyperswitch emits `dispute.opened`, `dispute.evidence_required`,
|
||||
`dispute.won`, `dispute.lost`, `dispute.expired` — all land on
|
||||
`POST /webhooks/hyperswitch` and currently fall through
|
||||
`ProcessPaymentWebhook`'s switch to the `default` case
|
||||
(`service.go:718`), which logs Debug "ignoring status" and returns 200.
|
||||
|
||||
**Consequence**
|
||||
A dispute/chargeback is the case where the ledger diverges most
|
||||
violently, because:
|
||||
1. It's **not initiated by VEZA** — we learn about it passively when
|
||||
the buyer's bank pushes a reversal through the card network.
|
||||
2. On `dispute.lost` (the common case if we don't respond in time),
|
||||
Hyperswitch debits our acquirer balance for the original amount
|
||||
**plus** a chargeback fee (typically 15-25€ depending on acquirer).
|
||||
3. The VEZA `orders` row stays `completed`, the license stays active,
|
||||
the seller stays paid. Buyer has their money back via the card
|
||||
network. Every table in VEZA thinks everything is fine. The only
|
||||
signal is the acquirer statement, and it's weeks delayed.
|
||||
4. Chargeback rate is a seller health metric: an account with 1% CB
|
||||
rate gets flagged by Visa/MC and the whole VEZA merchant account
|
||||
can be put under review. Without counting disputes per-seller, we
|
||||
have no early-warning system.
|
||||
|
||||
**Action (staged)**
|
||||
Minimum (ships with P1.5 webhook log): every `dispute.*` event lands in
|
||||
`hyperswitch_webhook_log` with the payload captured. Ops can query.
|
||||
|
||||
Recommended (v1.0.8 sprint):
|
||||
1. New `disputes` table:
|
||||
```sql
|
||||
CREATE TABLE disputes (
|
||||
id UUID PRIMARY KEY,
|
||||
order_id UUID NOT NULL REFERENCES orders(id),
|
||||
hyperswitch_dispute_id TEXT UNIQUE NOT NULL,
|
||||
status TEXT NOT NULL, -- opened, evidence_required, won, lost, expired
|
||||
amount_cents BIGINT NOT NULL,
|
||||
reason TEXT,
|
||||
evidence_due_at TIMESTAMPTZ,
|
||||
opened_at TIMESTAMPTZ NOT NULL,
|
||||
resolved_at TIMESTAMPTZ,
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
|
||||
);
|
||||
```
|
||||
2. Extend `IsRefundEvent()` discriminator to a three-way
|
||||
`EventKind()` that returns `payment | refund | dispute`.
|
||||
3. New `ProcessDisputeWebhook` — mirrors `ProcessRefundWebhook`'s
|
||||
idempotency pattern (unique `hyperswitch_dispute_id` + row lock).
|
||||
4. On `dispute.lost`: same seller-side reversal machinery as refund
|
||||
(P0.2), because the money effect is identical.
|
||||
5. Per-seller chargeback counter on `seller_balances` or a new
|
||||
`seller_risk_metrics` table, exposed to admin UI.
|
||||
|
||||
**Criticity** — P1 minimum (capture + log). P0 if volume justifies it
|
||||
once public-opened. At current volume (pre-opening), accepting the gap
|
||||
is defensible; post-opening, a single dispute we miss is worse than
|
||||
the engineering cost of handling it.
|
||||
|
||||
---
|
||||
|
||||
## P1.7 — Subscription money-movement not covered by v1.0.6 hardening
|
||||
|
||||
**Evidence**
|
||||
- Subscription model has Hyperswitch keys:
|
||||
`subscription/models.go:95` `HyperswitchSubscriptionID`, line 96
|
||||
`HyperswitchCustomerID`, line 136 `HyperswitchPaymentID`.
|
||||
- Status enum: `active, trialing, canceled, past_due, expired`
|
||||
(`subscription/models.go:20-28`).
|
||||
- `Service.Subscribe` (`subscription/service.go:149`) and
|
||||
`Service.CancelSubscription` (line 390) exist.
|
||||
- **Zero refund code in the subscription package.** Cancellation marks
|
||||
the row `canceled` but does not initiate any PSP reimbursement.
|
||||
- No visible webhook handler branch for `subscription.*` events — the
|
||||
dispatcher only knows payment vs refund.
|
||||
- No code path was found that transitions a subscription INTO
|
||||
`past_due` (which implies it's driven by an external event the
|
||||
webhook isn't consuming, or it's dead state).
|
||||
|
||||
**Consequence**
|
||||
The marketplace refund hardening only covers one-shot purchases. A
|
||||
subscription cancellation inside the paid period currently doesn't
|
||||
refund anything, and a failed recurring charge doesn't move the
|
||||
subscription to `past_due` because nothing wires it up.
|
||||
|
||||
**Action**
|
||||
Dedicated sprint for subscription state-machine hardening. Scope for
|
||||
v1.0.8. Breakdown:
|
||||
1. Inventory every state in `user_subscriptions.status` with evidence
|
||||
of which code path lands there.
|
||||
2. Handle `subscription.payment_succeeded`, `subscription.payment_failed`,
|
||||
`subscription.canceled` webhook events in a
|
||||
`ProcessSubscriptionWebhook`.
|
||||
3. Decide product policy on mid-period cancellation (pro-rata refund?
|
||||
honour until end of period? forfeit?).
|
||||
|
||||
**Criticity** — P1 for the hardening, P0 only if subscriptions are
|
||||
currently sold. Status: needs confirmation from ops — not determinable
|
||||
from code alone. See [Info needed from ops](#info-needed-from-ops) Q2.
|
||||
|
||||
---
|
||||
|
||||
## P1.8 — No ledger-health metrics
|
||||
|
||||
**Evidence**
|
||||
No grep hit for `prometheus.*counter.*order|refund|transfer|payout` in
|
||||
`internal/metrics/`. Agent confirms no dashboards.
|
||||
|
||||
**Consequence**
|
||||
Ledger state is invisible until someone complains. Specifically:
|
||||
- Count of orders in `pending` older than 30 min (webhook never arrived)
|
||||
- Count of orders in `refund_pending` older than 30 min
|
||||
- Count of seller_transfers in `failed` at max `retry_count`
|
||||
- Count of refunds with empty `hyperswitch_refund_id` older than 5 min
|
||||
- Count of seller_transfers in `reversal_pending` older than 30 min (new state from P0.2)
|
||||
- Oldest timestamp in each of the above
|
||||
|
||||
**Action**
|
||||
Prometheus gauge metrics updated by a 60s sampler (cheap query each):
|
||||
```
|
||||
ledger_stuck_orders_pending_total{age_bucket="30m_1h|1h_1d|1d+"}
|
||||
ledger_stuck_refunds_pending_total{age_bucket=...}
|
||||
ledger_failed_transfers_at_max_retry_total
|
||||
ledger_orphan_refund_rows_total // empty hyperswitch_refund_id
|
||||
ledger_reversal_pending_transfers_total
|
||||
```
|
||||
|
||||
Grafana panel with 5 number panels + 5 time-series. Alert rules:
|
||||
- `ledger_stuck_orders_pending_total > 0 for 10m` → page
|
||||
- `ledger_orphan_refund_rows_total > 0 for 5m` → page (= bug in the
|
||||
two-phase commit between DB and PSP)
|
||||
|
||||
**Criticity** — P1. Ships with P0.3 (the reconciliation sweep) so the
|
||||
metrics track the sweep's effectiveness.
|
||||
|
||||
---
|
||||
|
||||
## P2.9 — No admin API for manual refund / order override
|
||||
|
||||
**Evidence**
|
||||
No admin endpoint found that can force an order or refund into a
|
||||
specific terminal state. Ops would have to issue raw SQL in prod.
|
||||
|
||||
**Consequence**
|
||||
When something genuinely gets stuck (Hyperswitch confirms out-of-band
|
||||
that a refund succeeded but their webhook is never going to fire for
|
||||
some freak reason), there's no sanctioned path to resolve without raw
|
||||
DB access.
|
||||
|
||||
**Action**
|
||||
`POST /api/v1/admin/orders/:id/force-status` +
|
||||
`POST /api/v1/admin/refunds/:id/force-status`, gated by
|
||||
`AuthMiddleware.RequireAdminRole`, with an audit-log entry. Body:
|
||||
`{ new_status: "...", reason: "..." }`. Validates the target state is
|
||||
reachable from the current one per the documented state machine
|
||||
(depends on Axis 2).
|
||||
|
||||
**Criticity** — P2. Not urgent; raw SQL works until it doesn't.
|
||||
|
||||
---
|
||||
|
||||
## P2.10 — Partial refund latent compromise
|
||||
|
||||
**Evidence**
|
||||
`client.go:142` `CreateRefundRequest.Amount *int64 // nil = full refund`.
|
||||
`service.go:1330` passes `nil` as amount on every call.
|
||||
CHANGELOG v1.0.6: "full-only for v1.0.6 — partial refund UX is deferred
|
||||
to v1.0.7."
|
||||
|
||||
**Consequence**
|
||||
The `amount *int64` parameter is a permanent half-wired compromise. When
|
||||
partial refunds land, the implementer must remember:
|
||||
- Multiple refunds per order must be allowed (change the double-submit
|
||||
guard at `service.go:1303`).
|
||||
- `SellerBalance` debit must be proportional to the partial amount, not
|
||||
full transfer amount.
|
||||
- The partial UNIQUE index already tolerates the multi-refund case
|
||||
(v1.0.6.1 migration 979) — but only if each refund has its own
|
||||
PSP-assigned refund_id, which Hyperswitch does.
|
||||
- Webhook lookup already uses `hyperswitch_refund_id` so no change.
|
||||
|
||||
**Action**
|
||||
When partial lands: extensive state-table tests (can the same order
|
||||
reach (`status=refunded`, amount < total)? should it be a new
|
||||
`partially_refunded` status?). Flag this item in the v1.0.7 plan
|
||||
explicitly so it's scoped, not drive-by.
|
||||
|
||||
**Criticity** — P2. No active bug; future hazard only.
|
||||
|
||||
---
|
||||
|
||||
## wontfix.11 — `TransferRetryInterval` not per-seller configurable
|
||||
|
||||
`config.go:399` default `5m`. A single global interval works fine at
|
||||
current traffic. Re-evaluate at 10× load.
|
||||
|
||||
---
|
||||
|
||||
## Axis 1 summary
|
||||
|
||||
| # | Finding | Criticity | Gate |
|
||||
|---|---|---|---|
|
||||
| 1 | `SellerTransfer.StripeTransferID` never set | P0 | prerequisite for v1.0.7 reversal |
|
||||
| 2 | No Stripe Connect reversal on refund.succeeded | P0 | v1.0.7, decoupled via `reversal_pending` state |
|
||||
| 3 | No reconciliation sweep for stuck pending states | P0 | v1.0.7 (hourly job) |
|
||||
| 4 | No `Idempotency-Key` on Hyperswitch calls | P0 | v1.0.7 (2 header lines) |
|
||||
| 5 | Webhook raw payloads not persisted | P1 | v1.0.7 (migration + insert) |
|
||||
| 6 | Disputes / chargebacks silently dropped | P1 | v1.0.8 (min: capture in P1.5 log) |
|
||||
| 7 | Subscription money-movement surface | P1 | v1.0.8 sprint |
|
||||
| 8 | No ledger-health metrics | P1 | ships with #3 |
|
||||
| 9 | No admin API for manual override | P2 | v1.0.8+ |
|
||||
| 10 | Partial refund latent compromise | P2 | flag in v1.0.7 plan |
|
||||
| 11 | Per-seller retry interval | wontfix | — |
|
||||
|
||||
**Blockers for "ready to show"** — 1, 2, 3, 4 must land before the
|
||||
marketplace is used by anyone external (grants demo, beta, DIY
|
||||
community). After those four, VEZA's accounting is self-consistent
|
||||
AND consistent with Stripe+Hyperswitch, which is the real bar.
|
||||
|
||||
See [`v107-plan.md`](./v107-plan.md) for sequencing and effort.
|
||||
|
||||
---
|
||||
|
||||
## Info needed from ops
|
||||
|
||||
Not determinable from code. Answers become part of the audit:
|
||||
|
||||
1. **Last manual reconciliation between VEZA DB and Stripe/Hyperswitch
|
||||
dashboards.** Date + scope + outcome.
|
||||
_If the answer is "jamais" — that's itself a P1 process finding:
|
||||
no reconciliation cadence established. Absence of documented debt
|
||||
= hidden debt._
|
||||
2. **Are subscriptions currently sold** (i.e., is the paid-subscription
|
||||
feature turned on for end-users), or is the code live but the UI
|
||||
hides the entry point? If yes, P1.7 → P0. If hidden-by-UI,
|
||||
additional Axis-3 check needed: is the API reachable despite the
|
||||
UI hiding? Front-flag-only gating is a security finding.
|
||||
3. **Current volume order of magnitude** — number of orders/refunds
|
||||
processed per week — so criticity can be re-checked against blast
|
||||
radius.
|
||||
230
docs/audit-2026-04/v107-plan.md
Normal file
230
docs/audit-2026-04/v107-plan.md
Normal file
|
|
@ -0,0 +1,230 @@
|
|||
# v1.0.7 — plan structuré
|
||||
|
||||
> Derived from [`axis-1-correctness.md`](./axis-1-correctness.md).
|
||||
> The v1.0.6 CHANGELOG listed 4 "parked v1.0.7" items; the axis-1 audit
|
||||
> added 4 P0 findings. De-duplicated and sequenced below.
|
||||
|
||||
## The 8 items, de-duplicated → 6
|
||||
|
||||
| # | From | Title | Effort |
|
||||
|---|---|---|---|
|
||||
| A | audit P0.1 | Persist `stripe_transfer_id` in `seller_transfers` | **S** |
|
||||
| B | audit P0.2 ≡ CHANGELOG "Stripe Connect reversal" | Connect reversal via `reversal_pending` state + async worker | **M** |
|
||||
| C | audit P0.3 | Reconciliation sweep for stuck orders / refunds | **M** |
|
||||
| D | audit P0.4 | `Idempotency-Key` on `CreatePayment` / `CreateRefund` | **XS** |
|
||||
| E | audit P1.5 | Webhook raw-payload log table + insert | **S** |
|
||||
| F | audit P1.8 | Ledger-health Prometheus metrics + alerts | **S** |
|
||||
|
||||
Dropped from v1.0.7 scope:
|
||||
- Partial refunds (CHANGELOG-parked) — P2 in audit, feature-class, defer to v1.0.8
|
||||
- `CloudUploadModal` single-source-of-truth (CHANGELOG-parked) — P2, out of money-movement scope
|
||||
- Sandbox smoke-test documentation — landed de facto in v1.0.6.1 via the partial-UNIQUE hotfix + the smoke harness artefacts
|
||||
|
||||
Effort legend — **XS** ≤ 2h, **S** ≤ 1 day, **M** ≤ 3 days, **L** > 3 days.
|
||||
|
||||
## Dependency graph
|
||||
|
||||
```
|
||||
┌──────────────────┐
|
||||
│ D Idempotency │ independent, can land first as quick win
|
||||
└──────────────────┘
|
||||
|
||||
┌──────────────────┐
|
||||
│ A transfer_id │ prerequisite for B
|
||||
└────────┬─────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────┐
|
||||
│ B reversal │ uses transfer_id persisted by A;
|
||||
│ worker │ introduces `reversal_pending` status
|
||||
└──────────────────┘
|
||||
|
||||
┌──────────────────┐
|
||||
│ E webhook log │ independent; prerequisite for P1.6 (disputes)
|
||||
└──────────────────┘
|
||||
|
||||
┌──────────────────┐
|
||||
│ C reconciler │ independent, but:
|
||||
│ sweep │ metrics (F) track its effectiveness
|
||||
└────────┬─────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────┐
|
||||
│ F metrics │ needs the state-shape C produces; alerts wire
|
||||
│ + alerts │ to the buckets the reconciler defines
|
||||
└──────────────────┘
|
||||
```
|
||||
|
||||
**Two parallel tracks:**
|
||||
- **Track 1 (reversal correctness)** — A → B
|
||||
- **Track 2 (operational visibility)** — D, E, C → F
|
||||
|
||||
Two developers can work in parallel without stepping on each other. A
|
||||
single developer sequences as ordered: D first (XS quick win, earns
|
||||
trust + unblocks "pre-open" checklist), then A→B, then E, then C→F.
|
||||
|
||||
## Commit sequence (single-developer path)
|
||||
|
||||
Each item lands as its own commit in the existing v1.0.6-style cadence
|
||||
(per-commit tests + CHANGELOG-worthy).
|
||||
|
||||
### 1. `fix(hyperswitch): idempotency-key on create-payment and create-refund` — D
|
||||
|
||||
Effort: **XS**. Pure header addition. Tests: the 15-case refund suite
|
||||
already exists; add 2 cases verifying the header is set correctly
|
||||
(httptest.Server assertion on `r.Header.Get("Idempotency-Key")`).
|
||||
|
||||
Acceptance:
|
||||
- Every outbound `POST /payments` carries `Idempotency-Key: <order.id>`.
|
||||
- Every outbound `POST /refunds` carries `Idempotency-Key: <refund.id>`.
|
||||
- No implicit-via-ctx magic: each call site sets the header explicitly,
|
||||
greppable.
|
||||
- CHANGELOG entry cross-references P0.4 + its scope note (HTTP retry
|
||||
only, not app-level replay).
|
||||
|
||||
**Ship as v1.0.7-alpha-1 for sandbox testing**, don't wait for B/C.
|
||||
|
||||
### 2. `refactor(connect): persist stripe_transfer_id on create + retry` — A
|
||||
|
||||
Effort: **S**. Touches the `TransferService` interface (minor breaking
|
||||
change — but only internal callers). Migration:
|
||||
`980_seller_transfers_stripe_reversal_id.sql` adds `stripe_reversal_id`
|
||||
nullable column (prepares ground for B).
|
||||
|
||||
Acceptance:
|
||||
- `TransferService.CreateTransfer(...) (string, error)` — returns the
|
||||
Stripe transfer id.
|
||||
- `processSellerTransfers` persists `st.StripeTransferID` before
|
||||
`tx.Create(&st)`.
|
||||
- `TransferRetryWorker.retryOne` also persists on retry success.
|
||||
- Backfill: one-shot migration query that fills known transfer_ids
|
||||
for past orders by calling Stripe's
|
||||
`transfers.List(Destination=..., Metadata[order_id]=...)`. Acceptable
|
||||
to leave `NULL` where Stripe has no match (document: "pre-v1.0.7
|
||||
transfers cannot be reversed automatically; use admin API P2.9").
|
||||
|
||||
### 3. `feat(marketplace): async stripe connect reversal worker` — B
|
||||
|
||||
Effort: **M**. The big one. Introduces `seller_transfers.status =
|
||||
'reversal_pending'` as a new intermediate terminal-avoidance state.
|
||||
Migration: `981_seller_transfers_reversal_pending_enum.sql`.
|
||||
|
||||
Acceptance:
|
||||
- `reverseSellerAccounting` transitions `seller_transfers.status` to
|
||||
`reversal_pending`, sets `next_retry_at = NOW()`. Refund
|
||||
finalization completes end-to-end — buyer never sees a
|
||||
Stripe-health-dependent refund UX.
|
||||
- New `StripeReversalWorker` in `internal/workers/stripe_reversal.go`
|
||||
runs every `TRANSFER_RETRY_INTERVAL` (reuse existing env var),
|
||||
processes `reversal_pending` rows. Calls
|
||||
`transfer.NewReversal(stripeTransferID, ¶ms)`.
|
||||
- On success → `status='reversed'`, persist `stripe_reversal_id`.
|
||||
- On Stripe 404 → `status='reversed'` + log INFO (treat as "already
|
||||
reversed out-of-band").
|
||||
- On other errors → exponential backoff via `next_retry_at` + `retry_count`,
|
||||
hit `permanently_failed` ceiling after N retries (mirror
|
||||
`TransferRetryWorker`).
|
||||
- Tests: mock the stripe SDK via a local httptest.Server (harder than
|
||||
the Hyperswitch mock because stripe-go's HTTP layer is less trivial
|
||||
to re-point, but feasible via `stripe.SetBackend`). 8 cases:
|
||||
- happy path: pending → reversal_pending → reversed + id persisted
|
||||
- Stripe 404: reversal_pending → reversed + log
|
||||
- Stripe 5xx transient: retry_count increments, backoff set
|
||||
- max retries: → permanently_failed
|
||||
- concurrent worker (two instances pick same row): lock wins
|
||||
- StripeTransferID empty (legacy row): skip + ERROR log + counter
|
||||
- reversal idempotency (Stripe dedupes same transferID): no-op
|
||||
- worker graceful shutdown mid-reversal (context cancellation)
|
||||
|
||||
### 4. `feat(webhooks): persist raw hyperswitch payloads to audit log` — E
|
||||
|
||||
Effort: **S**. Migration: `982_hyperswitch_webhook_log.sql`. Insert
|
||||
before signature verification (so we capture attack attempts too).
|
||||
|
||||
Acceptance:
|
||||
- Every webhook landing on `/webhooks/hyperswitch` produces exactly one
|
||||
row, regardless of signature-valid or processing outcome.
|
||||
- `processing_result` field captures `'ok'`, `'error: <msg>'`, or
|
||||
`'skipped'`.
|
||||
- Retention: a `CleanupWebhookLog` worker in the same
|
||||
`internal/jobs/` package as the orphan-tracks cleaner, daily, deletes
|
||||
rows older than 90 days.
|
||||
- Tests: 3 cases (valid signature + processing ok; invalid signature;
|
||||
processing error).
|
||||
|
||||
### 5. `feat(workers): hyperswitch reconciliation sweep for stuck pending states` — C
|
||||
|
||||
Effort: **M**. New `ReconcileHyperswitchWorker` in `internal/jobs/`.
|
||||
Hourly by default (`RECONCILE_INTERVAL=1h`), but exposed so ops can
|
||||
drop to 5m during incident response.
|
||||
|
||||
Acceptance:
|
||||
- Orders in `pending` > 30min: `GET /payments/:id`, call the same
|
||||
`ProcessPaymentWebhook` internal dispatcher with a synthesised
|
||||
payload. Idempotent with real webhooks via the existing terminal-state
|
||||
guard.
|
||||
- Refunds in `pending` with non-empty `hyperswitch_refund_id` > 30min:
|
||||
`GET /refunds/:id`, same pattern with `ProcessRefundWebhook`.
|
||||
- Refunds in `pending` with EMPTY `hyperswitch_refund_id` > 5min:
|
||||
mark `failed`, roll order back to `completed`, log ERROR
|
||||
(operator attention needed — something crashed between Phase 1 and
|
||||
Phase 2 of `RefundOrder`).
|
||||
- Tests: happy sync, no-op when everything is terminal, the
|
||||
empty-refund_id auto-fail case.
|
||||
- Structured log on every action taken so `grep reconcile` tells the
|
||||
ops story.
|
||||
|
||||
### 6. `feat(metrics): ledger-health gauges + alert rules` — F
|
||||
|
||||
Effort: **S**. New file `internal/metrics/ledger_health.go` with a
|
||||
60s sampler. Grafana dashboard JSON in `config/grafana/ledger.json`.
|
||||
Prometheus alert rules in `config/alertmanager/ledger.yml`.
|
||||
|
||||
Acceptance:
|
||||
- 5 gauge metrics (listed in P1.8 action).
|
||||
- 2 alert rules (`stuck_orders > 0 for 10m`, `orphan_refunds > 0 for 5m`).
|
||||
- Sampler queries are cheap: `SELECT COUNT(*) WHERE ... AND created_at < NOW() - INTERVAL '30 min'` per metric, indexed by `status + created_at`.
|
||||
- If indexes are missing: migration `983_ledger_health_indexes.sql`.
|
||||
|
||||
## Release gating
|
||||
|
||||
**Before tagging v1.0.7:**
|
||||
- All six items landed as separate commits.
|
||||
- Refund smoke test (see `docs/audit-2026-04/smoke/refund.md` — TBD)
|
||||
re-run against sandbox with the reversal path: assert Stripe side
|
||||
shows the reversal as well as the refund.
|
||||
- Manual reconciliation pass: query the live DB against Hyperswitch
|
||||
dashboard for all `completed` orders in the past week → zero drift.
|
||||
Document the script used (`scripts/reconcile-check.sh`), keep for
|
||||
periodic re-use.
|
||||
- CHANGELOG entry lists each commit + criticity it resolves.
|
||||
|
||||
**Cut v1.0.7 as a minor release** (not patch), because the interface
|
||||
change in A is technically breaking for any external caller of
|
||||
`TransferService` (there are none today, but naming it 1.0.7 signals
|
||||
the API change).
|
||||
|
||||
## Effort total
|
||||
|
||||
Worst-case single-developer sequential: **~9-10 working days**
|
||||
(XS + S + M + S + M + S = 2h + 1d + 3d + 1d + 3d + 1d).
|
||||
|
||||
With two devs on the parallel tracks: **~5 working days** end-to-end,
|
||||
because track 1 (A→B, 4d) and track 2 (D, E, C→F done serially = 5d)
|
||||
overlap.
|
||||
|
||||
## Unknowns to resolve before starting
|
||||
|
||||
1. **Volume order of magnitude** — informs whether the reconciliation
|
||||
sweep's default interval of 1h is appropriate. If we're processing
|
||||
> 1000 orders/day, drop to 15m. Ops confirmation needed.
|
||||
2. **Stripe Connect account state at time of v1.0.7 deploy** — any
|
||||
pre-v1.0.7 transfers lack `stripe_transfer_id` in DB. The backfill
|
||||
migration in A needs to know how many rows to probe and how many
|
||||
will successfully match. Acceptable ceiling: if > 5% of historical
|
||||
transfers fail to backfill, escalate to manual reconciliation
|
||||
session.
|
||||
3. **Hyperswitch sandbox vs prod webhook secret rotation** — item D
|
||||
changes outbound request headers only, no impact. Items B+C call
|
||||
Hyperswitch read endpoints, which use the same `HYPERSWITCH_API_KEY`.
|
||||
No secret-rotation concern.
|
||||
Loading…
Reference in a new issue