Commit graph

7 commits

Author SHA1 Message Date
senke
3c4d0148be feat(webhooks): persist raw hyperswitch payloads to audit log — v1.0.7 item E
Every POST /webhooks/hyperswitch delivery now writes a row to
`hyperswitch_webhook_log` regardless of signature-valid or
processing outcome. Captures both legitimate deliveries and attack
probes — a forensics query now has the actual bytes to read, not
just a "webhook rejected" log line. Disputes (axis-1 P1.6) ride
along: the log captures dispute.* events alongside payment and
refund events, ready for when disputes get a handler.

Table shape (migration 984):
  * payload TEXT — readable in psql, invalid UTF-8 replaced with
    empty (forensics value is in headers + ip + timing for those
    attacks, not the binary body).
  * signature_valid BOOLEAN + partial index for "show me attack
    attempts" being instantaneous.
  * processing_result TEXT — 'ok' / 'error: <msg>' /
    'signature_invalid' / 'skipped'. Matches the P1.5 action
    semantic exactly.
  * source_ip, user_agent, request_id — forensics essentials.
    request_id is captured from Hyperswitch's X-Request-Id header
    when present, else a server-side UUID so every row correlates
    to VEZA's structured logs.
  * event_type — best-effort extract from the JSON payload, NULL
    on malformed input.

Hardening:
  * 64KB body cap via io.LimitReader rejects oversize with 413
    before any INSERT — prevents log-spam DoS.
  * Single INSERT per delivery with final state; no two-phase
    update race on signature-failure path. signature_invalid and
    processing-error rows both land.
  * DB persistence failures are logged but swallowed — the
    endpoint's contract is to ack Hyperswitch, not perfect audit.

Retention sweep:
  * CleanupHyperswitchWebhookLog in internal/jobs, daily tick,
    batched DELETE (10k rows + 100ms pause) so a large backlog
    doesn't lock the table.
  * HYPERSWITCH_WEBHOOK_LOG_RETENTION_DAYS (default 90).
  * Same goroutine-ticker pattern as ScheduleOrphanTracksCleanup.
  * Wired in cmd/api/main.go alongside the existing cleanup jobs.

Tests: 5 in webhook_log_test.go (persistence, request_id auto-gen,
invalid-JSON leaves event_type empty, invalid-signature capture,
extractEventType 5 sub-cases) + 4 in cleanup_hyperswitch_webhook_
log_test.go (deletes-older-than, noop, default-on-zero,
context-cancel). Migration 984 applied cleanly to local Postgres;
all indexes present.

Also (v107-plan.md):
  * Item G acceptance gains an explicit Idempotency-Key threading
    requirement with an empty-key loud-fail test — "literally
    copy-paste D's 4-line test skeleton". Closes the risk that
    item G silently reopens the HTTP-retry duplicate-charge
    exposure D closed.

Out of scope for E (noted in CHANGELOG):
  * Rate limit on the endpoint — pre-existing middleware covers
    it at the router level; adding a per-endpoint limit is
    separate scope.
  * Readable-payload SQL view — deferred, the TEXT column is
    already human-readable; a convenience view is a nice-to-have
    not a ship-blocker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 02:44:58 +02:00
senke
3cd82ba5be fix(hyperswitch): idempotency-key on create-payment and create-refund — v1.0.7 item D
Every outbound POST /payments and POST /refunds from the Hyperswitch
client now carries an Idempotency-Key HTTP header. Key values are
explicit parameters at every call site — no context-carrier magic,
no auto-generation. An empty key is a loud error from the client
(not silent header omission) so a future new call site that forgets
to supply one fails immediately, not months later under an obscure
replay scenario.

Key choices, both stable across HTTP retries of the same logical
call:
  * CreatePayment → order.ID.String() (GORM BeforeCreate populates
    order.ID before the PSP call in ConfirmOrder).
  * CreateRefund → pendingRefund.ID.String() (populated by the
    Phase 1 tx.Create in RefundOrder, available for the Phase 2 PSP
    call).

Scope note (reproduced here for the next reader who grep-s the
commit log for "Idempotency-Key"):

  Idempotency-Key covers HTTP-transport retry (TLS reconnect,
  proxy retry, DNS flap) within a single CreatePayment /
  CreateRefund invocation. It does NOT cover application-level
  replay (user double-click, form double-submit, retry after crash
  before DB write). That class of bug requires state-machine
  preconditions on VEZA side — already addressed by the order
  state machine + the handler-level guards on POST
  /api/v1/payments (for payments) and the partial UNIQUE on
  `refunds.hyperswitch_refund_id` landed in v1.0.6.1 (for refunds).

  Hyperswitch TTL on Idempotency-Key: typically 24h-7d server-side
  (verify against current PSP docs). Beyond TTL, a retry with the
  same key is treated as a new request. Not a concern at current
  volumes; document if retry logic ever extends beyond 1 hour.

Explicitly out of scope: item D does NOT add application-level
retry logic. The current "try once, fail loudly" behavior on PSP
errors is preserved. Adding retries is a separate design exercise
(backoff, max attempts, circuit breaker) not part of this commit.

Interfaces changed:
  * hyperswitch.Client.CreatePayment(ctx, idempotencyKey, ...)
  * hyperswitch.Client.CreatePaymentSimple(...) convenience wrapper
  * hyperswitch.Client.CreateRefund(ctx, idempotencyKey, ...)
  * hyperswitch.Provider.CreatePayment threads through
  * hyperswitch.Provider.CreateRefund threads through
  * marketplace.PaymentProvider interface — first param after ctx
  * marketplace.refundProvider interface — first param after ctx

Removed:
  * hyperswitch.Provider.Refund (zero callers, superseded by
    CreateRefund which returns (refund_id, status, err) and is the
    only method marketplace's refundProvider cares about).

Tests:
  * Two new httptest.Server-backed tests (client_test.go) pin the
    Idempotency-Key header value for CreatePayment and CreateRefund.
  * Two new empty-key tests confirm the client errors rather than
    silently sending no header.
  * TestRefundOrder_OpensPendingRefund gains an assertion that
    f.provider.lastIdempotencyKey == refund.ID.String() — if a
    future refactor threads the key from somewhere else (paymentID,
    uuid.New() per call, etc.) the test fails loudly.
  * Four pre-existing test mocks updated for the new signature
    (mockRefundPaymentProvider in marketplace, mockPaymentProvider
    in tests/integration and tests/contract, mockRefundPayment
    Provider in tests/integration/refund_flow).

Subscription's CreateSubscriptionPayment interface declares its own
shape and has no live Hyperswitch-backed implementation today —
v1.0.6.2 noted this as the payment-gate bypass surface, v1.0.7
item G will ship the real provider. When that lands, item G's
implementation threads the idempotency key through in the same
pattern (documented in v107-plan.md item G acceptance).

CHANGELOG v1.0.7-rc1 entry updated with the full item D scope note
and the "out of scope: retries" caveat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 02:30:02 +02:00
senke
eedaad9f83 refactor(connect): persist stripe_transfer_id on create + retry — v1.0.7 item A
TransferService.CreateTransfer signature changes from (...) error to
(...) (string, error) — the caller now captures the Stripe transfer
identifier and persists it on the SellerTransfer row. Pre-v1.0.7 the
stripe_transfer_id column was declared on the model and table but
never written to, which blocked the reversal worker (v1.0.7 item B)
from identifying which transfer to reverse on refund.

Changes:
  * `TransferService` interface and `StripeConnectService.CreateTransfer`
    both return the Stripe transfer id alongside the error.
  * `processSellerTransfers` (marketplace service) persists the id on
    success before `tx.Create(&st)` so a crash between Stripe ACK and
    DB commit leaves no inconsistency.
  * `TransferRetryWorker.retryOne` persists on retry success — a row
    that failed on first attempt and succeeded via the worker is
    reversal-ready all the same.
  * `admin_transfer_handler.RetryTransfer` (manual retry) persists too.
  * `SellerPayout.ExternalPayoutID` is populated by the Connect payout
    flow (`payout.go`) — the field existed but was never written.
  * Four test mocks updated; two tests assert the id is persisted on
    the happy path, one on the failure path confirms we don't write a
    fake id when the provider errors.

Migration `981_seller_transfers_stripe_reversal_id.sql`:
  * Adds nullable `stripe_reversal_id` column for item B.
  * Partial UNIQUE indexes on both stripe_transfer_id and
    stripe_reversal_id (WHERE IS NOT NULL AND <> ''), mirroring the
    v1.0.6.1 pattern for refunds.hyperswitch_refund_id.
  * Logs a count of historical completed transfers that lack an id —
    these are candidates for the backfill CLI follow-up task.

Backfill for historical rows is a separate follow-up (cmd/tools/
backfill_stripe_transfer_ids, calling Stripe's transfers.List with
Destination + Metadata[order_id]). Pre-v1.0.7 transfers without a
backfilled id cannot be auto-reversed on refund — document in P2.9
admin-recovery when it lands. Acceptable scope per v107-plan.

Migration number bumped 980 → 981 because v1.0.6.2 used 980 for the
unpaid-subscription cleanup; v107-plan updated with the note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 13:08:39 +02:00
senke
149f76ccc7 docs: amend v1.0.6.2 CHANGELOG + item G recovery endpoint
CHANGELOG v1.0.6.2 block now documents the distribution-handler
propagate fix as part of the release (applied in commit 26cb52333
before re-tagging). v1.0.7 item G acceptance gains a recovery
endpoint requirement so the "complete payment" error message has a
real target rather than leaving users stuck.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 12:53:43 +02:00
senke
26cb523334 fix(distribution,audit): propagate ErrSubscriptionNoPayment to handler + P0.12 closure date + E2E regression TODO
Self-review of the v1.0.6.2 hotfix surfaced that
distribution.checkEligibility silently swallowed
subscription.ErrSubscriptionNoPayment as "ineligible, no extra info",
so a user with a fantôme subscription trying to submit a distribution
got "Distribution requires Creator or Premium plan" — misleading, the
user has a plan but no payment. checkEligibility now propagates the
error so the handler can surface "Your subscription is not linked to
a payment. Complete payment to enable distribution."

Security is unchanged — the gate still refuses. This is a UX clarity
fix for honest-path users who landed in the fantôme state via a
broken payment flow.

Also:
- Closure timestamp added to axis-1 P0.12 ("closed 2026-04-17 in
  v1.0.6.2 (commit 9a8d2a4e7)") so future readers know the finding's
  lifecycle without re-grepping the CHANGELOG.
- Item G in v107-plan.md gains an explicit E2E Playwright @critical
  acceptance — the shell probe + Go unit tests validate the fix
  today but don't run on every commit, so a refactor of Subscribe or
  checkEligibility could silently re-open the bypass. The E2E test
  makes regression coverage automatic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 12:43:21 +02:00
senke
68a0d390e2 docs(audit): P1.7 → P0.12 post-probe; add v1.0.7 item G + Idempotency-Key TTL note
2026-04-17 Q2 probe confirmed the subscription money-movement finding
wasn't a "needs confirmation from ops" P1 — it was a live P0 bypass.
An authenticated user could POST /api/v1/subscriptions/subscribe,
receive 201 active without payment, and satisfy the distribution
eligibility gate. v1.0.6.2 (commit 9a8d2a4e7) closed the bypass at
the consumption site via GetUserSubscription filter + migration 980
cleanup.

axis-1-correctness.md:
  * P1.7 renamed to P0.12 with the bypass chain, probe evidence, and
    v1.0.6.2 closure cross-reference.
  * Residual subscription-refund / webhook completeness work split out
    as P1.7' (original scope, still v1.0.8).

v107-plan.md:
  * Item G added (M effort) — replaces the v1.0.6.2 filter with a
    mandatory pending_payment state + webhook-driven activation,
    closing the creation path rather than compensating at the gate.
  * Dependency graph gains a third track (independent of A/B/C/D/E/F).
  * Effort total revised from 9-10d to 12-13d single-dev, 5d to 7d
    two-dev parallel.
  * Item D acceptance gains a TTL caveat section — Hyperswitch
    Idempotency-Key has a 24h-7d server-side TTL; app-level
    idempotency (order.id / partial UNIQUE) remains the load-bearing
    guard beyond that window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 12:31:07 +02:00
senke
6b345ede9f docs(audit): 2026-04 correctness/accounting findings (axis 1)
Axis 1 of the 5-axis VEZA audit, scoped to money-movement correctness
and ledger↔PSP reconciliation. Layout: one file per axis under
docs/audit-2026-04/, README index, v107-plan.md derived.

P0 findings (block v1.0.7 "ready-to-show" gate):
  * P0.1 — SellerTransfer.StripeTransferID declared but never populated.
    stripe_connect_service.CreateTransfer discards the *stripe.Transfer
    return value (`_, err := transfer.New(params)`), so the column in
    models.go:237 is dead. Structural blocker for the CHANGELOG-parked
    v1.0.7 "Stripe Connect reversal" item.
  * P0.2 — No Stripe Connect reversal on refund.succeeded. Every refund
    today creates a permanent VEZA↔Stripe ledger gap. Action reworked
    to decouple via a new `seller_transfers.status = 'reversal_pending'`
    state + async worker, so Stripe flaps never block buyer-facing
    refund UX.
  * P0.3 — No reconciliation sweep for stuck orders / refunds / refund
    rows with empty hyperswitch_refund_id. Hourly worker recommended,
    same pattern as v1.0.5 Fix 6 orphan-tracks cleaner.
  * P0.4 — No Idempotency-Key on outbound Hyperswitch POST /payments and
    POST /refunds. Action includes an explicit scope note: the header
    covers HTTP-transport retry only, NOT application-level replay (for
    which the fix is a state-machine precondition).

P1 findings:
  * P1.5 — Webhook raw payloads not persisted (blocks dispute forensics)
  * P1.6 — Disputes / chargebacks silently dropped (new, surfaced during
    review; dispute.* webhooks fall through the default case)
  * P1.7 — Subscription money-movement not covered by v1.0.6 hardening
  * P1.8 — No ledger-health Prometheus metrics

P2 findings:
  * P2.9 — No admin API for manual override
  * P2.10 — Partial refund latent compromise (amount *int64 always nil)

wontfix:
  * wontfix.11 — Per-seller retry interval (re-evaluate at 10× load)

Derived deliverable: v107-plan.md sequences the 6 de-duplicated items
(4 P0 + 2 P1) with a dependency graph, two parallel tracks, per-commit
effort estimates (D→A→B; E→C→F), release gating and open questions
(volume magnitude, Connect backfill %).

Info needed from ops (tracked in axis-1 doc, not determinable from
code): last manual reconciliation date, whether subscriptions are
currently sold, current order/refund volume.

Axes 2-5 deferred: README.md marks axis 2 (state machines) as gated
on v1.0.7 landing first, otherwise the transition matrix captures a
v1.0.6.1 snapshot that's immediately stale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 03:21:33 +02:00