Self-review of the v1.0.6.2 hotfix surfaced that
distribution.checkEligibility silently swallowed
subscription.ErrSubscriptionNoPayment as "ineligible, no extra info",
so a user with a fantôme subscription trying to submit a distribution
got "Distribution requires Creator or Premium plan" — misleading, the
user has a plan but no payment. checkEligibility now propagates the
error so the handler can surface "Your subscription is not linked to
a payment. Complete payment to enable distribution."
Security is unchanged — the gate still refuses. This is a UX clarity
fix for honest-path users who landed in the fantôme state via a
broken payment flow.
Also:
- Closure timestamp added to axis-1 P0.12 ("closed 2026-04-17 in
v1.0.6.2 (commit 9a8d2a4e7)") so future readers know the finding's
lifecycle without re-grepping the CHANGELOG.
- Item G in v107-plan.md gains an explicit E2E Playwright @critical
acceptance — the shell probe + Go unit tests validate the fix
today but don't run on every commit, so a refactor of Subscribe or
checkEligibility could silently re-open the bypass. The E2E test
makes regression coverage automatic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
609 lines
27 KiB
Markdown
609 lines
27 KiB
Markdown
# Axis 1 — Correctness / accounting
|
||
|
||
> See [`README.md`](./README.md) for scope, auditor, source state and
|
||
> reading conventions.
|
||
|
||
## Surface summary
|
||
|
||
| PSP | Package | Endpoints called | Env gate |
|
||
|---|---|---|---|
|
||
| Hyperswitch (payments + refunds) | `internal/services/hyperswitch/` | `POST /payments`, `GET /payments/:id`, `POST /refunds` | `HYPERSWITCH_ENABLED` |
|
||
| Stripe Connect (seller payouts + transfers) | `internal/services/stripe_connect_service.go` | `POST /v1/account_links`, `GET /v1/accounts/:id`, `GET /v1/balance`, `POST /v1/transfers` | `STRIPE_CONNECT_ENABLED` |
|
||
|
||
| Table | Status column | Terminal states | Non-terminal |
|
||
|---|---|---|---|
|
||
| `orders` | `status` | completed, failed, cancelled, refunded | pending, refund_pending |
|
||
| `refunds` (v1.0.6) | `status` | succeeded, failed | pending |
|
||
| `seller_transfers` | `status` | completed, permanently_failed, reversed | pending, failed |
|
||
| `seller_payouts` | `status` | completed, failed | pending, processing |
|
||
| `licenses` | `revoked_at` (nullable) | NULL (active) / timestamp (revoked) | — |
|
||
| `user_subscriptions` | `status` (enum) | canceled, expired | active, trialing, past_due |
|
||
|
||
Webhook entry: `POST /api/v1/webhooks/hyperswitch`
|
||
(`routes_webhooks.go:55`), HMAC-SHA512 verified, dispatches to
|
||
`ProcessPaymentWebhook` or `ProcessRefundWebhook` via
|
||
`HyperswitchWebhookPayload.IsRefundEvent()` (`service.go:700`).
|
||
|
||
Existing reconciliation: **only** the `TransferRetryWorker` for
|
||
`seller_transfers.status='failed'` (`transfer_retry.go:57`), interval
|
||
`TRANSFER_RETRY_INTERVAL` default `5m` (`config.go:399`).
|
||
|
||
---
|
||
|
||
## P0.1 — `SellerTransfer.StripeTransferID` declared but never populated
|
||
|
||
**Evidence**
|
||
- Column declared: `internal/core/marketplace/models.go:237`
|
||
`StripeTransferID string \`gorm:"size:255" json:"stripe_transfer_id,omitempty"\``
|
||
- Zero write-sites (`grep 'StripeTransferID\s*='` returns 0 hits in production code).
|
||
- Upstream returns the transfer but we discard it:
|
||
`internal/services/stripe_connect_service.go:205`
|
||
`_, err := transfer.New(params)` — the `*stripe.Transfer` pointer carrying
|
||
`.ID` is underscored away.
|
||
- `CreateTransfer` signature returns only `error`
|
||
(`internal/services/stripe_connect_service.go:179`), so callers physically
|
||
cannot persist the id even if they wanted to.
|
||
|
||
**Consequence**
|
||
Direct blocker for the v1.0.6 CHANGELOG-documented "parked v1.0.7 —
|
||
Stripe Connect Transfers:reversal". When anyone starts that work, there
|
||
is no `stripe_transfer_id` in the DB to pass to
|
||
`transfers.CreateReversal(...)`. The only fallbacks would be:
|
||
- Query Stripe with `transfers.List(Destination=account_id, Metadata[order_id]=...)` — fragile, rate-limited, no unique guarantee.
|
||
- Do a full upstream refactor of the transfer worker — larger scope than the TODO implies.
|
||
|
||
This is exactly the "temporary compromise frozen into permanent" pattern
|
||
the audit charter targets.
|
||
|
||
**Action (prerequisite for any v1.0.7 reversal work)**
|
||
1. Change `TransferService.CreateTransfer` signature:
|
||
`(ctx, sellerID, amount, currency, orderID string) (stripeTransferID string, err error)`.
|
||
2. Update `processSellerTransfers` (`service.go:762`) to capture the id
|
||
and set `st.StripeTransferID = stripeTransferID` before `tx.Create(&st)`.
|
||
3. Update `TransferRetryWorker.retryOne` (`transfer_retry.go:116`) to
|
||
persist the id on successful retry too.
|
||
|
||
**Criticity** — P0. Must land before the reversal commit, not with it.
|
||
|
||
---
|
||
|
||
## P0.2 — No Stripe Connect reversal on refund.succeeded
|
||
|
||
**Evidence**
|
||
- `internal/core/marketplace/service.go:1529` `reverseSellerAccounting`
|
||
does three things on refund webhook succeeded:
|
||
1. `DebitSellerBalance` (in-DB balance row decrement)
|
||
2. `seller_transfers.status = "reversed"` (in-DB ledger flag)
|
||
3. **Nothing else.**
|
||
- The CHANGELOG v1.0.6 entry explicitly flags this: "Stripe Connect
|
||
Transfers:reversal call flagged TODO(v1.0.7). [...] the missing piece
|
||
is the money-movement round-trip at Stripe."
|
||
|
||
**Consequence**
|
||
For every refunded order, the buyer is credited by Hyperswitch (verified
|
||
via webhook) and the VEZA internal seller balance is debited correctly.
|
||
But the Stripe-Connect-side transfer that moved the seller's share into
|
||
their connected account is **not** reversed. The seller's Stripe account
|
||
still holds that money. VEZA's books say "debited", Stripe says "paid
|
||
out". On every refund this creates a real reconciliation gap that a
|
||
seller's monthly Stripe statement will show.
|
||
|
||
This isn't a ledger-consistency issue in VEZA's own DB — it's a
|
||
VEZA-vs-Stripe divergence. It grows linearly with refund volume.
|
||
|
||
**Action — decoupled from the buyer-refund critical path**
|
||
The intuitive fix ("call `transfers.NewReversal` inside
|
||
`reverseSellerAccounting` and return on error to let the webhook retry")
|
||
is wrong: Hyperswitch has already credited the buyer, the buyer is
|
||
watching their order page, and a Stripe flap would leave the buyer in
|
||
`refund_pending` until Stripe comes back. That couples buyer UX to the
|
||
weakest dependency in the whole flow.
|
||
|
||
Correct pattern — split the reversal into two concerns, with an
|
||
intermediate state that says "VEZA books updated, PSP reversal still
|
||
owed":
|
||
|
||
1. **Migration** — extend `seller_transfers.status` to include a new
|
||
value `reversal_pending`. Update the state-value documentation
|
||
comment in `internal/core/marketplace/models.go:242`.
|
||
2. **Refund finalization** (`service.go:1466 finalizeSuccessfulRefund`)
|
||
— no change to its flow except: instead of transitioning
|
||
`seller_transfers.status` directly to `reversed` in
|
||
`reverseSellerAccounting`, transition to `reversal_pending` and set
|
||
`next_retry_at = NOW()`. The buyer-facing flow finalizes immediately:
|
||
order=refunded, licenses revoked, webhook returns 200.
|
||
3. **New worker** `StripeReversalWorker` — scans
|
||
`seller_transfers WHERE status='reversal_pending' AND (next_retry_at IS NULL OR next_retry_at <= NOW())`.
|
||
For each row:
|
||
- if `StripeTransferID == ""` (see P0.1), skip + log ERROR "cannot
|
||
reverse — transfer id missing", alert ops.
|
||
- else call
|
||
`transfer.NewReversal(stripeTransferID, &stripe.TransferReversalParams{Amount: amountCents})`.
|
||
- on success: `status='reversed'`, persist the reversal id on a new
|
||
`stripe_reversal_id` column.
|
||
- on Stripe 404 (transfer not found): assume already reversed
|
||
out-of-band, set `status='reversed'`, log INFO.
|
||
- on any other error: exponential backoff on `next_retry_at`,
|
||
increment `retry_count`. Hit the same `permanently_failed` ceiling
|
||
the TransferRetryWorker uses today.
|
||
|
||
This gives us three properties:
|
||
- Buyer refund never blocks on Stripe health.
|
||
- VEZA accounting is consistent with Hyperswitch refund state at all times.
|
||
- The VEZA↔Stripe gap is bounded (every reversal either completes or
|
||
ends in `permanently_failed` with a clear ops action).
|
||
|
||
**Criticity** — P0. Pre-public-opening of marketplace.
|
||
|
||
---
|
||
|
||
## P0.3 — No reconciliation sweep for stuck pending states
|
||
|
||
**Evidence**
|
||
- Inventory search returned zero scheduled jobs that compare DB state
|
||
to PSP state for payments or refunds.
|
||
- `TransferRetryWorker` exists (`transfer_retry.go`) but only reconciles
|
||
`seller_transfers` — not orders, not refunds.
|
||
- Agent inventory: "No startup sanity checks comparing DB to PSP for
|
||
orders/refunds. No admin endpoints for manual order/refund/transfer
|
||
verification. No sweeps for stuck orders in `refund_pending` or orders
|
||
exceeding 14-day deadline."
|
||
|
||
**Consequence**
|
||
If a Hyperswitch webhook is never delivered — they've been down, our
|
||
`/webhooks/hyperswitch` was down during their retry window, or their
|
||
delivery system has a bug — then:
|
||
- An order stuck in `pending` stays forever. Customer paid, got no
|
||
license. Only surfaces via customer complaint.
|
||
- A refund stuck in `refund_pending` stays forever. Customer asked for
|
||
refund, got nothing. Order stuck in `refund_pending` also blocks
|
||
re-attempts (`ErrRefundAlreadyRequested` at `service.go:1303`).
|
||
- A refund row with empty `hyperswitch_refund_id` (Phase 1 completed,
|
||
Phase 2 crashed between PSP call and DB UPDATE) has no way to
|
||
correlate to the PSP. Nothing queries them.
|
||
|
||
**Action**
|
||
Hourly worker (same pattern as `CleanupOrphanTracks` from v1.0.5 Fix 6):
|
||
|
||
```go
|
||
// internal/jobs/reconcile_hyperswitch.go
|
||
func ReconcilePendingOrders(ctx context.Context, db *gorm.DB, hs *hyperswitch.Client) error {
|
||
// Orders in 'pending' older than 30m: GET /payments/:id, sync
|
||
// Refunds in 'pending' > 30m with non-empty refund_id: GET /refunds/:id, sync
|
||
// Refunds in 'pending' > 5m with EMPTY refund_id: mark failed, roll back order
|
||
}
|
||
```
|
||
|
||
Wire in `cmd/api/main.go` next to `ScheduleOrphanTracksCleanup`.
|
||
|
||
**Criticity** — P0. The gap is open today on every deployed instance.
|
||
|
||
---
|
||
|
||
## P0.4 — `CreatePayment` / `CreateRefund` not idempotent at the HTTP layer
|
||
|
||
**Evidence**
|
||
- `internal/services/hyperswitch/client.go:73-78` and
|
||
`internal/services/hyperswitch/client.go:170-175` — only
|
||
`Content-Type` and `api-key` headers are set. No `Idempotency-Key`.
|
||
- Hyperswitch's API supports an `Idempotency-Key` header (standard PSP
|
||
convention).
|
||
|
||
**Consequence**
|
||
If the HTTP call from VEZA to Hyperswitch retries at the transport
|
||
layer (curl-level / net/http Transport-level retry, proxy retry, TLS
|
||
reconnect after a flap) during a single logical call, a second payment
|
||
or refund object can be created for the same intent. Customer
|
||
double-charged, or two refunds issued. The order row's
|
||
`hyperswitch_payment_id` column is set to the *first* payment_id we
|
||
hear back about, so the second shadow payment is orphaned in Hyperswitch
|
||
with no VEZA row pointing at it.
|
||
|
||
**Scope note — what Idempotency-Key does and does NOT fix**
|
||
This header protects against **HTTP-transport retry** only. It does
|
||
**not** fix, and should not be conflated with:
|
||
- **Application-level replay** — buyer clicks "Pay" twice, frontend
|
||
double-submits a form, VEZA application code retries after a DB
|
||
transaction aborts. These paths create distinct `order.id` values, so
|
||
each call sees a different natural key and the PSP legitimately
|
||
creates different payments. The fix there is a state-machine
|
||
precondition (order dedup by `(buyer_id, cart_hash)` within a window,
|
||
frontend debounce, CSRF single-use token), not the header.
|
||
- **Crash-between-call-and-save** — VEZA POST succeeds, Hyperswitch
|
||
returns payment_id, VEZA crashes before persisting it. A subsequent
|
||
application retry with the same `order.id` *does* recover thanks to
|
||
the Idempotency-Key (Hyperswitch returns the cached response), but
|
||
this only works if the application retry re-uses the same order row.
|
||
That requires the order row to exist in a `pending_payment_submitted`
|
||
intermediate state before the PSP call — which the current code
|
||
doesn't have. Documenting as a follow-up.
|
||
|
||
**Action**
|
||
Add `Idempotency-Key: <natural key>` header on every mutating call:
|
||
- `CreatePayment` → `order_id` (UUID, guaranteed unique per order row).
|
||
- `CreateRefund` → `refund.ID` (UUID from the pending `refunds` row —
|
||
populated before the PSP call in Phase 1, so the same key is used
|
||
across any HTTP-level retry within that call).
|
||
|
||
The `http.Client` in `hyperswitch.Client` should NOT set this
|
||
implicitly; callers must pass the key explicitly so the semantic is
|
||
auditable at each call site. Two changes, each ~3 lines.
|
||
|
||
The state-machine-level fix for application retry is out of scope here;
|
||
it goes on the v1.0.8+ roadmap (documented in `v107-plan.md`).
|
||
|
||
**Criticity** — P0. Low probability per call but unbounded blast radius
|
||
(customer double-charge).
|
||
|
||
---
|
||
|
||
## P1.5 — Webhook raw payloads not persisted
|
||
|
||
**Evidence**
|
||
- `routes_webhooks.go:59-82` reads body, verifies signature, dispatches.
|
||
- No table or log file captures the raw webhook body, signature header,
|
||
or processing result for later audit.
|
||
|
||
**Consequence**
|
||
When a customer disputes ("I never got refunded"), we cannot reproduce
|
||
what Hyperswitch told us. The only evidence is in the backend `log` at
|
||
whatever retention the host keeps — and structured logs don't carry the
|
||
raw body (we explicitly redact some fields for security).
|
||
|
||
**Action**
|
||
New table `hyperswitch_webhook_log`:
|
||
|
||
```sql
|
||
CREATE TABLE hyperswitch_webhook_log (
|
||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||
received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||
signature_header TEXT NOT NULL,
|
||
signature_valid BOOLEAN NOT NULL,
|
||
event_type TEXT,
|
||
payment_id TEXT,
|
||
refund_id TEXT,
|
||
payload BYTEA NOT NULL,
|
||
processing_result TEXT NOT NULL, -- 'ok', 'error: <msg>', 'skipped'
|
||
processed_at TIMESTAMPTZ
|
||
);
|
||
CREATE INDEX ON hyperswitch_webhook_log (payment_id) WHERE payment_id IS NOT NULL;
|
||
CREATE INDEX ON hyperswitch_webhook_log (refund_id) WHERE refund_id IS NOT NULL;
|
||
```
|
||
|
||
Insert in `routes_webhooks.go:hyperswitchWebhookHandler` **before**
|
||
signature verification — so we capture attack attempts too. Retention
|
||
90 days via a separate sweep job.
|
||
|
||
**Criticity** — P1. Not ledger-diverging, but the first ops question
|
||
on any refund/dispute claim is "what did the webhook say", and today we
|
||
can't answer it.
|
||
|
||
---
|
||
|
||
## P1.6 — Disputes / chargebacks silently dropped
|
||
|
||
**Evidence**
|
||
- `HyperswitchWebhookPayload.IsRefundEvent()` (`service.go:700`) only
|
||
distinguishes refund-vs-non-refund. No dispute discriminator.
|
||
- Zero grep hits for `dispute|chargeback|disputed` in
|
||
`internal/core/marketplace/`, `internal/services/hyperswitch/`, or
|
||
anywhere in the webhook handler path.
|
||
- Hyperswitch emits `dispute.opened`, `dispute.evidence_required`,
|
||
`dispute.won`, `dispute.lost`, `dispute.expired` — all land on
|
||
`POST /webhooks/hyperswitch` and currently fall through
|
||
`ProcessPaymentWebhook`'s switch to the `default` case
|
||
(`service.go:718`), which logs Debug "ignoring status" and returns 200.
|
||
|
||
**Consequence**
|
||
A dispute/chargeback is the case where the ledger diverges most
|
||
violently, because:
|
||
1. It's **not initiated by VEZA** — we learn about it passively when
|
||
the buyer's bank pushes a reversal through the card network.
|
||
2. On `dispute.lost` (the common case if we don't respond in time),
|
||
Hyperswitch debits our acquirer balance for the original amount
|
||
**plus** a chargeback fee (typically 15-25€ depending on acquirer).
|
||
3. The VEZA `orders` row stays `completed`, the license stays active,
|
||
the seller stays paid. Buyer has their money back via the card
|
||
network. Every table in VEZA thinks everything is fine. The only
|
||
signal is the acquirer statement, and it's weeks delayed.
|
||
4. Chargeback rate is a seller health metric: an account with 1% CB
|
||
rate gets flagged by Visa/MC and the whole VEZA merchant account
|
||
can be put under review. Without counting disputes per-seller, we
|
||
have no early-warning system.
|
||
|
||
**Action (staged)**
|
||
Minimum (ships with P1.5 webhook log): every `dispute.*` event lands in
|
||
`hyperswitch_webhook_log` with the payload captured. Ops can query.
|
||
|
||
Recommended (v1.0.8 sprint):
|
||
1. New `disputes` table:
|
||
```sql
|
||
CREATE TABLE disputes (
|
||
id UUID PRIMARY KEY,
|
||
order_id UUID NOT NULL REFERENCES orders(id),
|
||
hyperswitch_dispute_id TEXT UNIQUE NOT NULL,
|
||
status TEXT NOT NULL, -- opened, evidence_required, won, lost, expired
|
||
amount_cents BIGINT NOT NULL,
|
||
reason TEXT,
|
||
evidence_due_at TIMESTAMPTZ,
|
||
opened_at TIMESTAMPTZ NOT NULL,
|
||
resolved_at TIMESTAMPTZ,
|
||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
|
||
);
|
||
```
|
||
2. Extend `IsRefundEvent()` discriminator to a three-way
|
||
`EventKind()` that returns `payment | refund | dispute`.
|
||
3. New `ProcessDisputeWebhook` — mirrors `ProcessRefundWebhook`'s
|
||
idempotency pattern (unique `hyperswitch_dispute_id` + row lock).
|
||
4. On `dispute.lost`: same seller-side reversal machinery as refund
|
||
(P0.2), because the money effect is identical.
|
||
5. Per-seller chargeback counter on `seller_balances` or a new
|
||
`seller_risk_metrics` table, exposed to admin UI.
|
||
|
||
**Criticity** — P1 minimum (capture + log). P0 if volume justifies it
|
||
once public-opened. At current volume (pre-opening), accepting the gap
|
||
is defensible; post-opening, a single dispute we miss is worse than
|
||
the engineering cost of handling it.
|
||
|
||
---
|
||
|
||
## P1.7 → P0.12 — Subscription payment-gate bypass (escalated 2026-04-17)
|
||
|
||
> **Re-classified from P1.7 to P0 on 2026-04-17** after probing ops Q2.
|
||
> The probe did not confirm "subscriptions are sold" — it confirmed
|
||
> "subscriptions are activated server-side without payment, and that
|
||
> state satisfies a paid-feature gate". Criticity moved from "P1
|
||
> hardening if subs are live" to "P0 security — bypass is live in any
|
||
> environment where `HYPERSWITCH_ENABLED=false` (which is the current
|
||
> dev/staging default and was the prod default of every env variable
|
||
> default before v1.0.6.2)". Hotfix v1.0.6.2 ships the gate fix
|
||
> (`GetUserSubscription` filter + migration 980 cleanup). The
|
||
> subscription-refund / webhook work (original P1.7 scope) remains and
|
||
> splits out below as P1.7' for v1.0.8.
|
||
|
||
**Evidence (bypass chain, confirmed by probe)**
|
||
- `POST /api/v1/subscriptions/subscribe` with `plan_id=<Creator>` from
|
||
any authenticated user (role=`user`, no KYC, no email verification
|
||
re-check beyond initial signup) returns **HTTP 201 `status=active`**.
|
||
- `subscription/service.go:256` `sub.Status = StatusActive` is assigned
|
||
unconditionally before the payment call.
|
||
- `subscription/service.go:279` `if s.paymentProvider != nil` —
|
||
payment call is conditional; when Hyperswitch is disabled (the
|
||
default for `HYPERSWITCH_ENABLED=false`), the row persists `active`
|
||
with no PSP reference written.
|
||
- `subscription/service.go:101` (pre-hotfix) `WHERE status IN
|
||
('active', 'trialing')` — the fantôme row satisfies this filter,
|
||
`GetUserSubscription` returns it to callers.
|
||
- `distribution/service.go:253` gate returns
|
||
`sub.Plan.HasDistribution || sub.Plan.CanSellOnMarketplace`. Creator
|
||
plan has `can_sell_on_marketplace=true` (migration 949 line 79) →
|
||
gate passes → user reaches `POST /distribution/submit`.
|
||
- `user_subscriptions.hyperswitch_subscription_id` declared
|
||
(`models.go:95`) but **never written anywhere** in the codebase —
|
||
no code path populates it, so it cannot be used as a post-hoc
|
||
"payment confirmed" signal either.
|
||
- Probe artefact (captured, cleaned): row
|
||
`906c585d-1afb-49e5-ac96-5eb5411cda99` created 2026-04-17 via
|
||
authenticated `POST /subscribe` from `smoke.0416@veza.local`
|
||
(`role=user`), HS IDs empty. Voided by migration 980 at v1.0.6.2
|
||
deploy.
|
||
|
||
**Consequence**
|
||
Any authenticated user gains access to `POST /distribution/submit`,
|
||
which dispatches to external distribution partners. Three harms:
|
||
1. Reputational — external partners receive distribution submissions
|
||
from non-paying VEZA accounts.
|
||
2. Financial — if the distribution workflow triggers billable API
|
||
calls to the partner platform (depends on workflow, worth auditing
|
||
as a separate finding under axis 3).
|
||
3. Contractual — VEZA's agreements with partners likely assume
|
||
"paid-subscriber submissions only".
|
||
The `role` column is not affected (user stays `user`, no bypass of
|
||
v1.0.6 creator self-service gate), and free-tier upload is unaffected
|
||
— the specific gate bypassed is the `has_distribution ||
|
||
can_sell_on_marketplace` check on paid plans.
|
||
|
||
**Action — shipped in v1.0.6.2 (commit d31f5733d)**
|
||
1. `subscription/service.go` `GetUserSubscription` now routes through
|
||
`hasEffectivePayment` — effective payment = free plan OR unexpired
|
||
trial OR invoice with non-empty `hyperswitch_payment_id`.
|
||
2. `ErrSubscriptionNoPayment` added; `distribution.checkEligibility`
|
||
and the four subscription handlers map it to "no active
|
||
subscription" (with `needs_payment=true` on `GET /me/subscription`
|
||
so honest-path users get actionable info).
|
||
3. Migration `980_void_unpaid_subscriptions.sql` sweeps pre-existing
|
||
fantôme rows to `status='expired'` with audit table
|
||
`voided_subscriptions_20260417`.
|
||
4. Probe script `scripts/probes/subscription-unpaid-activation.sh`
|
||
kept as versioned regression test.
|
||
5. `TODO(v1.0.7-item-G)` annotation on
|
||
`subscription/service.go:createNewSubscription` marks the deferred
|
||
work: replace the `if s.paymentProvider != nil` short-circuit with
|
||
a mandatory `pending_payment` state + webhook-driven activation
|
||
(item G in the v1.0.7 plan).
|
||
|
||
**Criticity** — **P0, closed 2026-04-17 in v1.0.6.2 (commit
|
||
d31f5733d).** Item G in v1.0.7 hardens the creation path end-to-end.
|
||
|
||
---
|
||
|
||
## P1.7' — Subscription money-movement not covered by v1.0.6 hardening (original scope, deferred to v1.0.8)
|
||
|
||
**Evidence**
|
||
- Subscription model has Hyperswitch keys:
|
||
`subscription/models.go:95` `HyperswitchSubscriptionID`, line 96
|
||
`HyperswitchCustomerID`, line 136 `HyperswitchPaymentID`.
|
||
- Status enum: `active, trialing, canceled, past_due, expired`
|
||
(`subscription/models.go:20-28`).
|
||
- `Service.Subscribe` (`subscription/service.go:149`) and
|
||
`Service.CancelSubscription` (line 390) exist.
|
||
- **Zero refund code in the subscription package.** Cancellation marks
|
||
the row `canceled` but does not initiate any PSP reimbursement.
|
||
- No visible webhook handler branch for `subscription.*` events — the
|
||
dispatcher only knows payment vs refund.
|
||
- No code path was found that transitions a subscription INTO
|
||
`past_due` (which implies it's driven by an external event the
|
||
webhook isn't consuming, or it's dead state).
|
||
|
||
**Consequence**
|
||
The marketplace refund hardening only covers one-shot purchases. A
|
||
subscription cancellation inside the paid period currently doesn't
|
||
refund anything, and a failed recurring charge doesn't move the
|
||
subscription to `past_due` because nothing wires it up.
|
||
|
||
**Action**
|
||
Dedicated sprint for subscription state-machine hardening. Scope for
|
||
v1.0.8. Breakdown:
|
||
1. Inventory every state in `user_subscriptions.status` with evidence
|
||
of which code path lands there.
|
||
2. Handle `subscription.payment_succeeded`, `subscription.payment_failed`,
|
||
`subscription.canceled` webhook events in a
|
||
`ProcessSubscriptionWebhook`.
|
||
3. Decide product policy on mid-period cancellation (pro-rata refund?
|
||
honour until end of period? forfeit?).
|
||
|
||
**Criticity** — P1 for the hardening. v1.0.6.2 closes the bypass half;
|
||
this half (refund / webhook completeness) is legitimately v1.0.8 scope
|
||
and does not produce a bypass on its own.
|
||
|
||
---
|
||
|
||
## P1.8 — No ledger-health metrics
|
||
|
||
**Evidence**
|
||
No grep hit for `prometheus.*counter.*order|refund|transfer|payout` in
|
||
`internal/metrics/`. Agent confirms no dashboards.
|
||
|
||
**Consequence**
|
||
Ledger state is invisible until someone complains. Specifically:
|
||
- Count of orders in `pending` older than 30 min (webhook never arrived)
|
||
- Count of orders in `refund_pending` older than 30 min
|
||
- Count of seller_transfers in `failed` at max `retry_count`
|
||
- Count of refunds with empty `hyperswitch_refund_id` older than 5 min
|
||
- Count of seller_transfers in `reversal_pending` older than 30 min (new state from P0.2)
|
||
- Oldest timestamp in each of the above
|
||
|
||
**Action**
|
||
Prometheus gauge metrics updated by a 60s sampler (cheap query each):
|
||
```
|
||
ledger_stuck_orders_pending_total{age_bucket="30m_1h|1h_1d|1d+"}
|
||
ledger_stuck_refunds_pending_total{age_bucket=...}
|
||
ledger_failed_transfers_at_max_retry_total
|
||
ledger_orphan_refund_rows_total // empty hyperswitch_refund_id
|
||
ledger_reversal_pending_transfers_total
|
||
```
|
||
|
||
Grafana panel with 5 number panels + 5 time-series. Alert rules:
|
||
- `ledger_stuck_orders_pending_total > 0 for 10m` → page
|
||
- `ledger_orphan_refund_rows_total > 0 for 5m` → page (= bug in the
|
||
two-phase commit between DB and PSP)
|
||
|
||
**Criticity** — P1. Ships with P0.3 (the reconciliation sweep) so the
|
||
metrics track the sweep's effectiveness.
|
||
|
||
---
|
||
|
||
## P2.9 — No admin API for manual refund / order override
|
||
|
||
**Evidence**
|
||
No admin endpoint found that can force an order or refund into a
|
||
specific terminal state. Ops would have to issue raw SQL in prod.
|
||
|
||
**Consequence**
|
||
When something genuinely gets stuck (Hyperswitch confirms out-of-band
|
||
that a refund succeeded but their webhook is never going to fire for
|
||
some freak reason), there's no sanctioned path to resolve without raw
|
||
DB access.
|
||
|
||
**Action**
|
||
`POST /api/v1/admin/orders/:id/force-status` +
|
||
`POST /api/v1/admin/refunds/:id/force-status`, gated by
|
||
`AuthMiddleware.RequireAdminRole`, with an audit-log entry. Body:
|
||
`{ new_status: "...", reason: "..." }`. Validates the target state is
|
||
reachable from the current one per the documented state machine
|
||
(depends on Axis 2).
|
||
|
||
**Criticity** — P2. Not urgent; raw SQL works until it doesn't.
|
||
|
||
---
|
||
|
||
## P2.10 — Partial refund latent compromise
|
||
|
||
**Evidence**
|
||
`client.go:142` `CreateRefundRequest.Amount *int64 // nil = full refund`.
|
||
`service.go:1330` passes `nil` as amount on every call.
|
||
CHANGELOG v1.0.6: "full-only for v1.0.6 — partial refund UX is deferred
|
||
to v1.0.7."
|
||
|
||
**Consequence**
|
||
The `amount *int64` parameter is a permanent half-wired compromise. When
|
||
partial refunds land, the implementer must remember:
|
||
- Multiple refunds per order must be allowed (change the double-submit
|
||
guard at `service.go:1303`).
|
||
- `SellerBalance` debit must be proportional to the partial amount, not
|
||
full transfer amount.
|
||
- The partial UNIQUE index already tolerates the multi-refund case
|
||
(v1.0.6.1 migration 979) — but only if each refund has its own
|
||
PSP-assigned refund_id, which Hyperswitch does.
|
||
- Webhook lookup already uses `hyperswitch_refund_id` so no change.
|
||
|
||
**Action**
|
||
When partial lands: extensive state-table tests (can the same order
|
||
reach (`status=refunded`, amount < total)? should it be a new
|
||
`partially_refunded` status?). Flag this item in the v1.0.7 plan
|
||
explicitly so it's scoped, not drive-by.
|
||
|
||
**Criticity** — P2. No active bug; future hazard only.
|
||
|
||
---
|
||
|
||
## wontfix.11 — `TransferRetryInterval` not per-seller configurable
|
||
|
||
`config.go:399` default `5m`. A single global interval works fine at
|
||
current traffic. Re-evaluate at 10× load.
|
||
|
||
---
|
||
|
||
## Axis 1 summary
|
||
|
||
| # | Finding | Criticity | Gate |
|
||
|---|---|---|---|
|
||
| 1 | `SellerTransfer.StripeTransferID` never set | P0 | prerequisite for v1.0.7 reversal |
|
||
| 2 | No Stripe Connect reversal on refund.succeeded | P0 | v1.0.7, decoupled via `reversal_pending` state |
|
||
| 3 | No reconciliation sweep for stuck pending states | P0 | v1.0.7 (hourly job) |
|
||
| 4 | No `Idempotency-Key` on Hyperswitch calls | P0 | v1.0.7 (2 header lines) |
|
||
| 5 | Webhook raw payloads not persisted | P1 | v1.0.7 (migration + insert) |
|
||
| 6 | Disputes / chargebacks silently dropped | P1 | v1.0.8 (min: capture in P1.5 log) |
|
||
| 7 | Subscription money-movement surface | P1 | v1.0.8 sprint |
|
||
| 8 | No ledger-health metrics | P1 | ships with #3 |
|
||
| 9 | No admin API for manual override | P2 | v1.0.8+ |
|
||
| 10 | Partial refund latent compromise | P2 | flag in v1.0.7 plan |
|
||
| 11 | Per-seller retry interval | wontfix | — |
|
||
|
||
**Blockers for "ready to show"** — 1, 2, 3, 4 must land before the
|
||
marketplace is used by anyone external (grants demo, beta, DIY
|
||
community). After those four, VEZA's accounting is self-consistent
|
||
AND consistent with Stripe+Hyperswitch, which is the real bar.
|
||
|
||
See [`v107-plan.md`](./v107-plan.md) for sequencing and effort.
|
||
|
||
---
|
||
|
||
## Info needed from ops
|
||
|
||
Not determinable from code. Answers become part of the audit:
|
||
|
||
1. **Last manual reconciliation between VEZA DB and Stripe/Hyperswitch
|
||
dashboards.** Date + scope + outcome.
|
||
_If the answer is "jamais" — that's itself a P1 process finding:
|
||
no reconciliation cadence established. Absence of documented debt
|
||
= hidden debt._
|
||
2. **Are subscriptions currently sold** (i.e., is the paid-subscription
|
||
feature turned on for end-users), or is the code live but the UI
|
||
hides the entry point? If yes, P1.7 → P0. If hidden-by-UI,
|
||
additional Axis-3 check needed: is the API reachable despite the
|
||
UI hiding? Front-flag-only gating is a security finding.
|
||
3. **Current volume order of magnitude** — number of orders/refunds
|
||
processed per week — so criticity can be re-checked against blast
|
||
radius.
|