feat(marketplace): stripe reversal error disambiguation + CHECK constraint + E2E — v1.0.7 item B day 3
Day-3 closure of item B. The three things day 2 deferred are now done:
1. Stripe error disambiguation.
ReverseTransfer in StripeConnectService now parses
stripe.Error.Code + HTTPStatusCode + Msg to emit the sentinels
the worker routes on. Pre-day-3 the sentinels were declared but
the service wrapped every error opaquely, making this the exact
"temporary compromise frozen into permanent" pattern the audit
was meant to prevent — flagged during review and fixed same day.
Mapping:
* 404 + code=resource_missing → ErrTransferNotFound
* 400 + msg matches "already" + "reverse" → ErrTransferAlreadyReversed
* any other → transient (wrapped raw, retry)
The "already reversed" case has no machine-readable code in
stripe-go (unlike ChargeAlreadyRefunded for charges — the SDK
doesn't enumerate the equivalent for transfers), so it's
message-parsed. Fragility documented at the call site: if Stripe
changes the wording, the worker treats the response as transient
and eventually surfaces the row to permanently_failed after max
retries. Worst-case regression is "benign case gets noisier",
not data loss.
2. Migration 983: CHECK constraint chk_reversal_pending_has_next_
retry_at CHECK (status != 'reversal_pending' OR next_retry_at
IS NOT NULL). Added NOT VALID so the constraint is enforced on
new writes without scanning existing rows; a follow-up VALIDATE
can run once the table is known to be clean. Prevents the
"invisible orphan" failure mode where a reversal_pending row
with NULL next_retry_at would be skipped by any future stricter
worker query.
3. End-to-end reversal flow test (reversal_e2e_test.go) chains
three sub-scenarios: (a) happy path — refund.succeeded →
reversal_pending → worker → reversed with stripe_reversal_id
persisted; (b) invalid stripe_transfer_id → worker terminates
rapidly to permanently_failed with single Stripe call, no
retries (the highest-value coverage per day-3 review); (c)
already-reversed out-of-band → worker flips to reversed with
informative message.
Architecture note — the sentinels were moved to a new leaf
package `internal/core/connecterrors` because both marketplace
(needs them for the worker's errors.Is checks) and services (needs
them to emit) import them, and an import cycle
(marketplace → monitoring → services) would form if either owned
them directly. marketplace re-exports them as type aliases so the
worker code reads naturally against the marketplace namespace.
New tests:
* services/stripe_connect_service_test.go — 7 cases on
isAlreadyReversedMessage (pins Stripe's wording), 1 case on
the error-classification shape. Doesn't invoke stripe.SetBackend
— the translation logic is tested via a crafted *stripe.Error,
the emission is trusted on the read of `errors.As` + the known
shape of stripe.Error.
* marketplace/reversal_e2e_test.go — 3 end-to-end sub-tests
chaining refund → worker against a dual-role mock. The
invalid-id case asserts single-call-no-retries termination.
* Migration 983 applied cleanly to the local Postgres; constraint
visible in \d seller_transfers as NOT VALID (behavior correct
for future writes, existing rows grandfathered).
Self-assessment on day-2's struct-literal refactor of
processSellerTransfers (deferred from day 2):
The refactor is borderline — neither clearer nor confusing than the
original mutation-after-construct pattern. Logged in the v1.0.7-rc1
CHANGELOG as a post-v1.0.7 consideration: if GORM BeforeUpdate
hooks prove cleaner on other state machines (axis 2), revisit the
anti-mutation test approach.
CHANGELOG v1.0.7-rc1 entry added documenting items A + B end-to-end.
Tag not yet applied — items C, D, E, F remain on the v1.0.7 plan.
The rc1 tag lands when those four items close + the smoke probe
validates the full cadence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:12:03 +00:00
|
|
|
-- v1.0.7 item B (day 3): CHECK constraint guaranteeing that every
|
|
|
|
|
-- row in status='reversal_pending' has next_retry_at set.
|
|
|
|
|
--
|
|
|
|
|
-- Failure mode this prevents: a reversal_pending row with
|
|
|
|
|
-- next_retry_at=NULL is invisible to the worker's select
|
|
|
|
|
-- WHERE status='reversal_pending' AND (next_retry_at IS NULL OR next_retry_at <= NOW())
|
|
|
|
|
-- (actually the `IS NULL` branch IS included in the worker query,
|
|
|
|
|
-- so the row would be picked up — but the pattern is fragile:
|
|
|
|
|
-- anyone who later writes a stricter WHERE clause expecting
|
|
|
|
|
-- next_retry_at NOT NULL would silently orphan these rows).
|
|
|
|
|
--
|
|
|
|
|
-- Belt-and-suspenders: the constraint makes the invariant structural.
|
|
|
|
|
-- reverseSellerAccounting always sets next_retry_at=NOW() on
|
|
|
|
|
-- transition; the worker always sets a retry timestamp on
|
|
|
|
|
-- same-state retry; terminal transitions (reversed,
|
|
|
|
|
-- permanently_failed) clear next_retry_at back to NULL. No code
|
|
|
|
|
-- path produces NULL in combination with reversal_pending — this
|
|
|
|
|
-- constraint codifies that.
|
|
|
|
|
--
|
|
|
|
|
-- The constraint is a CHECK not a NOT NULL because the column is
|
|
|
|
|
-- legitimately NULL in the other statuses (pending, completed,
|
|
|
|
|
-- failed without pending retry, reversed, permanently_failed).
|
|
|
|
|
|
2026-04-18 02:08:14 +00:00
|
|
|
-- ADD CONSTRAINT is not natively IF NOT EXISTS in Postgres — the
|
|
|
|
|
-- DO block catches `duplicate_object` so re-running the migration
|
|
|
|
|
-- (test runs, manual psql apply, re-deploys) is a no-op. Same shape
|
|
|
|
|
-- the runner expects from other idempotent migrations in this
|
|
|
|
|
-- directory.
|
|
|
|
|
DO $$
|
|
|
|
|
BEGIN
|
|
|
|
|
ALTER TABLE seller_transfers
|
|
|
|
|
ADD CONSTRAINT chk_reversal_pending_has_next_retry_at
|
|
|
|
|
CHECK (status <> 'reversal_pending' OR next_retry_at IS NOT NULL)
|
|
|
|
|
NOT VALID;
|
|
|
|
|
EXCEPTION
|
|
|
|
|
WHEN duplicate_object THEN
|
|
|
|
|
-- constraint already present (prior manual apply or prior run)
|
|
|
|
|
NULL;
|
|
|
|
|
END $$;
|
feat(marketplace): stripe reversal error disambiguation + CHECK constraint + E2E — v1.0.7 item B day 3
Day-3 closure of item B. The three things day 2 deferred are now done:
1. Stripe error disambiguation.
ReverseTransfer in StripeConnectService now parses
stripe.Error.Code + HTTPStatusCode + Msg to emit the sentinels
the worker routes on. Pre-day-3 the sentinels were declared but
the service wrapped every error opaquely, making this the exact
"temporary compromise frozen into permanent" pattern the audit
was meant to prevent — flagged during review and fixed same day.
Mapping:
* 404 + code=resource_missing → ErrTransferNotFound
* 400 + msg matches "already" + "reverse" → ErrTransferAlreadyReversed
* any other → transient (wrapped raw, retry)
The "already reversed" case has no machine-readable code in
stripe-go (unlike ChargeAlreadyRefunded for charges — the SDK
doesn't enumerate the equivalent for transfers), so it's
message-parsed. Fragility documented at the call site: if Stripe
changes the wording, the worker treats the response as transient
and eventually surfaces the row to permanently_failed after max
retries. Worst-case regression is "benign case gets noisier",
not data loss.
2. Migration 983: CHECK constraint chk_reversal_pending_has_next_
retry_at CHECK (status != 'reversal_pending' OR next_retry_at
IS NOT NULL). Added NOT VALID so the constraint is enforced on
new writes without scanning existing rows; a follow-up VALIDATE
can run once the table is known to be clean. Prevents the
"invisible orphan" failure mode where a reversal_pending row
with NULL next_retry_at would be skipped by any future stricter
worker query.
3. End-to-end reversal flow test (reversal_e2e_test.go) chains
three sub-scenarios: (a) happy path — refund.succeeded →
reversal_pending → worker → reversed with stripe_reversal_id
persisted; (b) invalid stripe_transfer_id → worker terminates
rapidly to permanently_failed with single Stripe call, no
retries (the highest-value coverage per day-3 review); (c)
already-reversed out-of-band → worker flips to reversed with
informative message.
Architecture note — the sentinels were moved to a new leaf
package `internal/core/connecterrors` because both marketplace
(needs them for the worker's errors.Is checks) and services (needs
them to emit) import them, and an import cycle
(marketplace → monitoring → services) would form if either owned
them directly. marketplace re-exports them as type aliases so the
worker code reads naturally against the marketplace namespace.
New tests:
* services/stripe_connect_service_test.go — 7 cases on
isAlreadyReversedMessage (pins Stripe's wording), 1 case on
the error-classification shape. Doesn't invoke stripe.SetBackend
— the translation logic is tested via a crafted *stripe.Error,
the emission is trusted on the read of `errors.As` + the known
shape of stripe.Error.
* marketplace/reversal_e2e_test.go — 3 end-to-end sub-tests
chaining refund → worker against a dual-role mock. The
invalid-id case asserts single-call-no-retries termination.
* Migration 983 applied cleanly to the local Postgres; constraint
visible in \d seller_transfers as NOT VALID (behavior correct
for future writes, existing rows grandfathered).
Self-assessment on day-2's struct-literal refactor of
processSellerTransfers (deferred from day 2):
The refactor is borderline — neither clearer nor confusing than the
original mutation-after-construct pattern. Logged in the v1.0.7-rc1
CHANGELOG as a post-v1.0.7 consideration: if GORM BeforeUpdate
hooks prove cleaner on other state machines (axis 2), revisit the
anti-mutation test approach.
CHANGELOG v1.0.7-rc1 entry added documenting items A + B end-to-end.
Tag not yet applied — items C, D, E, F remain on the v1.0.7 plan.
The rc1 tag lands when those four items close + the smoke probe
validates the full cadence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:12:03 +00:00
|
|
|
|
|
|
|
|
-- NOT VALID creates the constraint without scanning existing rows.
|
|
|
|
|
-- Pre-v1.0.7.2 rows are grandfathered — item A + day 2 of B don't
|
|
|
|
|
-- produce any reversal_pending rows with NULL next_retry_at, but
|
|
|
|
|
-- if any slipped in via tests or manual ops writes, validation
|
|
|
|
|
-- would fail the migration. Use VALIDATE CONSTRAINT in a follow-up
|
|
|
|
|
-- once the table is known to be clean:
|
|
|
|
|
--
|
|
|
|
|
-- ALTER TABLE seller_transfers VALIDATE CONSTRAINT chk_reversal_pending_has_next_retry_at;
|
|
|
|
|
--
|
|
|
|
|
-- Running VALIDATE requires a SHARE UPDATE EXCLUSIVE lock (not
|
|
|
|
|
-- AccessExclusive), which is cheap and non-blocking for readers.
|
|
|
|
|
-- Left as a separate one-shot so the deploy is independent of
|
|
|
|
|
-- table cleanup timing.
|
|
|
|
|
--
|
|
|
|
|
-- Post-validate, future INSERTs / UPDATEs that violate the
|
|
|
|
|
-- invariant will fail synchronously — the constraint is enforced
|
|
|
|
|
-- regardless of NOT VALID / VALIDATED state.
|