Commit graph

8 commits

Author SHA1 Message Date
senke
2a96766ae3 feat(subscription): pending_payment state machine + mandatory provider (v1.0.9 item G — Phase 1)
First instalment of Item G from docs/audit-2026-04/v107-plan.md §G.
This commit lands the state machine + create-flow change. Phase 2
(webhook handler + recovery endpoint + reconciler sweep) follows.

What changes :
  - **`models.go`** — adds `StatusPendingPayment` to the
    SubscriptionStatus enum. Free-text VARCHAR(30) so no DDL needed
    for the value itself; Phase 2's reconciler index lives in
    migration 986 (additive, partial index on `created_at` WHERE
    status='pending_payment').
  - **`service.go`** — `PaymentProvider.CreateSubscriptionPayment`
    interface gains an `idempotencyKey string` parameter, mirroring
    the marketplace.refundProvider contract added in v1.0.7 item D.
    Callers pass the new subscription row's UUID so a retried HTTP
    request collapses to one PSP charge instead of duplicating it.
  - **`createNewSubscription`** — refactored state machine :
      * Free plan → StatusActive (unchanged, in subscribeToFreePlan).
      * Paid plan, trial available, first-time user → StatusTrialing,
        no PSP call (no invoice either — Phase 2 will create the
        first paid invoice on trial expiry).
      * Paid plan, no trial / repeat user → **StatusPendingPayment**
        + invoice + PSP CreateSubscriptionPayment with idempotency
        key = subscription.ID.String(). Webhook
        subscription.payment_succeeded (Phase 2) flips to active;
        subscription.payment_failed flips to expired.
  - **`if s.paymentProvider != nil` short-circuit removed**. Paid
    plans now require a configured PaymentProvider — without one,
    `createNewSubscription` returns ErrPaymentProviderRequired. The
    handler maps this to HTTP 503 "Payment provider not configured —
    paid plans temporarily unavailable", surfacing env misconfig to
    ops instead of silently giving away paid plans (the v1.0.6.2
    fantôme bug class).
  - **`GetUserSubscription` query unchanged** — already filters on
    `status IN ('active','trialing')`, so pending_payment rows
    correctly read as "no active subscription" for feature-gate
    purposes. The v1.0.6.2 hasEffectivePayment filter is kept as
    defence-in-depth for legacy rows.
  - **`hyperswitch.Provider`** — implements
    `subscription.PaymentProvider` by delegating to the existing
    `CreatePaymentSimple`. Compile-time interface assertion added
    (`var _ subscription.PaymentProvider = (*Provider)(nil)`).
  - **`routes_subscription.go`** — wires the Hyperswitch provider
    into `subscription.NewService` when HyperswitchEnabled +
    HyperswitchAPIKey + HyperswitchURL are all set. Without those,
    the service falls back to no-provider mode (paid subscribes
    return 503).
  - **Tests** : new TestSubscribe_PendingPaymentStateMachine in
    gate_test.go covers all five visible outcomes (free / paid+
    provider / paid+no-provider / first-trial / repeat-trial) with a
    fakePaymentProvider that records calls. Asserts on idempotency
    key = subscription.ID.String(), PSP call counts, and the
    Subscribe response shape (client_secret + payment_id surfaced).
    5/5 green, sqlite :memory:.

Phase 2 backlog (next session) :
  - `ProcessSubscriptionWebhook(ctx, payload)` — flip pending_payment
    → active on success / expired on failure, idempotent against
    replays.
  - Recovery endpoint `POST /api/v1/subscriptions/complete/:id` —
    return the existing client_secret to resume a stalled flow.
  - Reconciliation sweep for rows stuck in pending_payment past the
    webhook-arrival window (uses the new partial index from
    migration 986).
  - Distribution.checkEligibility explicit pending_payment branch
    (today it's already handled implicitly via the active/trialing
    filter).
  - E2E @critical : POST /subscribe → POST /distribution/submit
    asserts 403 with "complete payment" until webhook fires.

Backward compat : clients on the previous flow that called
/subscribe expecting an immediately-active row will now see
status=pending_payment + a client_secret. They must drive the PSP
confirm step before the row is granted feature access. The
v1.0.6.2 voided_subscriptions cleanup migration (980) handles
pre-existing fantôme rows.

go build ./... clean. Subscription + handlers test suites green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 10:02:00 +02:00
senke
d03232c85c feat(storage): add track storage_backend column + config prep (v1.0.8 P0)
Some checks failed
Veza CI / Backend (Go) (push) Failing after 0s
Veza CI / Frontend (Web) (push) Failing after 0s
Veza CI / Rust (Stream Server) (push) Failing after 0s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 0s
Veza CI / Notify on failure (push) Failing after 0s
Phase 0 of the MinIO upload migration (FUNCTIONAL_AUDIT §4 item 2).
Schema + config only — Phase 1 will wire TrackService.UploadTrack()
to actually route writes to S3 when the flag is flipped.

Schema (migration 985):
- tracks.storage_backend VARCHAR(16) NOT NULL DEFAULT 'local'
  CHECK in ('local', 's3')
- tracks.storage_key VARCHAR(512) NULL (S3 object key when backend=s3)
- Partial index on storage_backend = 's3' (migration progress queries)
- Rollback drops both columns + index; safe only while all rows are
  still 'local' (guard query in the rollback comment)

Go model (internal/models/track.go):
- StorageBackend string (default 'local', not null)
- StorageKey *string (nullable)
- Both tagged json:"-" — internal plumbing, never exposed publicly

Config (internal/config/config.go):
- New field Config.TrackStorageBackend
- Read from TRACK_STORAGE_BACKEND env var (default 'local')
- Production validation rule #11 (ValidateForEnvironment):
  - Must be 'local' or 's3' (reject typos like 'S3' or 'minio')
  - If 's3', requires AWS_S3_ENABLED=true (fail fast, do not boot with
    TrackStorageBackend=s3 while S3StorageService is nil)
- Dev/staging warns and falls back to 'local' instead of fail — keeps
  iteration fast while still flagging misconfig.

Docs:
- docs/ENV_VARIABLES.md §13 restructured as "HLS + track storage backend"
  with a migration playbook (local → s3 → migrate-storage CLI)
- docs/ENV_VARIABLES.md §28 validation rules: +2 entries for new rules
- docs/ENV_VARIABLES.md §29 drift findings: TRACK_STORAGE_BACKEND added
  to "missing from template" list before it was fixed
- veza-backend-api/.env.template: TRACK_STORAGE_BACKEND=local with
  comment pointing at Phase 1/2/3 plans

No behavior change yet — TrackService.UploadTrack() still hardcodes the
local path via copyFileAsync(). Phase 1 wires it.

Refs:
- AUDIT_REPORT.md §9 item (deferrals v1.0.8)
- FUNCTIONAL_AUDIT.md §4 item 2 "Stockage local disque only"
- /home/senke/.claude/plans/audit-fonctionnel-wild-hickey.md Item 3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:54:28 +02:00
senke
3c4d0148be feat(webhooks): persist raw hyperswitch payloads to audit log — v1.0.7 item E
Every POST /webhooks/hyperswitch delivery now writes a row to
`hyperswitch_webhook_log` regardless of signature-valid or
processing outcome. Captures both legitimate deliveries and attack
probes — a forensics query now has the actual bytes to read, not
just a "webhook rejected" log line. Disputes (axis-1 P1.6) ride
along: the log captures dispute.* events alongside payment and
refund events, ready for when disputes get a handler.

Table shape (migration 984):
  * payload TEXT — readable in psql, invalid UTF-8 replaced with
    empty (forensics value is in headers + ip + timing for those
    attacks, not the binary body).
  * signature_valid BOOLEAN + partial index for "show me attack
    attempts" being instantaneous.
  * processing_result TEXT — 'ok' / 'error: <msg>' /
    'signature_invalid' / 'skipped'. Matches the P1.5 action
    semantic exactly.
  * source_ip, user_agent, request_id — forensics essentials.
    request_id is captured from Hyperswitch's X-Request-Id header
    when present, else a server-side UUID so every row correlates
    to VEZA's structured logs.
  * event_type — best-effort extract from the JSON payload, NULL
    on malformed input.

Hardening:
  * 64KB body cap via io.LimitReader rejects oversize with 413
    before any INSERT — prevents log-spam DoS.
  * Single INSERT per delivery with final state; no two-phase
    update race on signature-failure path. signature_invalid and
    processing-error rows both land.
  * DB persistence failures are logged but swallowed — the
    endpoint's contract is to ack Hyperswitch, not perfect audit.

Retention sweep:
  * CleanupHyperswitchWebhookLog in internal/jobs, daily tick,
    batched DELETE (10k rows + 100ms pause) so a large backlog
    doesn't lock the table.
  * HYPERSWITCH_WEBHOOK_LOG_RETENTION_DAYS (default 90).
  * Same goroutine-ticker pattern as ScheduleOrphanTracksCleanup.
  * Wired in cmd/api/main.go alongside the existing cleanup jobs.

Tests: 5 in webhook_log_test.go (persistence, request_id auto-gen,
invalid-JSON leaves event_type empty, invalid-signature capture,
extractEventType 5 sub-cases) + 4 in cleanup_hyperswitch_webhook_
log_test.go (deletes-older-than, noop, default-on-zero,
context-cancel). Migration 984 applied cleanly to local Postgres;
all indexes present.

Also (v107-plan.md):
  * Item G acceptance gains an explicit Idempotency-Key threading
    requirement with an empty-key loud-fail test — "literally
    copy-paste D's 4-line test skeleton". Closes the risk that
    item G silently reopens the HTTP-retry duplicate-charge
    exposure D closed.

Out of scope for E (noted in CHANGELOG):
  * Rate limit on the endpoint — pre-existing middleware covers
    it at the router level; adding a per-endpoint limit is
    separate scope.
  * Readable-payload SQL view — deferred, the TEXT column is
    already human-readable; a convenience view is a nice-to-have
    not a ship-blocker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 02:44:58 +02:00
senke
1a133af9ac feat(marketplace): stripe reversal error disambiguation + CHECK constraint + E2E — v1.0.7 item B day 3
Day-3 closure of item B. The three things day 2 deferred are now done:

1. Stripe error disambiguation.
   ReverseTransfer in StripeConnectService now parses
   stripe.Error.Code + HTTPStatusCode + Msg to emit the sentinels
   the worker routes on. Pre-day-3 the sentinels were declared but
   the service wrapped every error opaquely, making this the exact
   "temporary compromise frozen into permanent" pattern the audit
   was meant to prevent — flagged during review and fixed same day.

   Mapping:
     * 404 + code=resource_missing  → ErrTransferNotFound
     * 400 + msg matches "already" + "reverse" → ErrTransferAlreadyReversed
     * any other                    → transient (wrapped raw, retry)

   The "already reversed" case has no machine-readable code in
   stripe-go (unlike ChargeAlreadyRefunded for charges — the SDK
   doesn't enumerate the equivalent for transfers), so it's
   message-parsed. Fragility documented at the call site: if Stripe
   changes the wording, the worker treats the response as transient
   and eventually surfaces the row to permanently_failed after max
   retries. Worst-case regression is "benign case gets noisier",
   not data loss.

2. Migration 983: CHECK constraint chk_reversal_pending_has_next_
   retry_at CHECK (status != 'reversal_pending' OR next_retry_at
   IS NOT NULL). Added NOT VALID so the constraint is enforced on
   new writes without scanning existing rows; a follow-up VALIDATE
   can run once the table is known to be clean. Prevents the
   "invisible orphan" failure mode where a reversal_pending row
   with NULL next_retry_at would be skipped by any future stricter
   worker query.

3. End-to-end reversal flow test (reversal_e2e_test.go) chains
   three sub-scenarios: (a) happy path — refund.succeeded →
   reversal_pending → worker → reversed with stripe_reversal_id
   persisted; (b) invalid stripe_transfer_id → worker terminates
   rapidly to permanently_failed with single Stripe call, no
   retries (the highest-value coverage per day-3 review); (c)
   already-reversed out-of-band → worker flips to reversed with
   informative message.

Architecture note — the sentinels were moved to a new leaf
package `internal/core/connecterrors` because both marketplace
(needs them for the worker's errors.Is checks) and services (needs
them to emit) import them, and an import cycle
(marketplace → monitoring → services) would form if either owned
them directly. marketplace re-exports them as type aliases so the
worker code reads naturally against the marketplace namespace.

New tests:
  * services/stripe_connect_service_test.go — 7 cases on
    isAlreadyReversedMessage (pins Stripe's wording), 1 case on
    the error-classification shape. Doesn't invoke stripe.SetBackend
    — the translation logic is tested via a crafted *stripe.Error,
    the emission is trusted on the read of `errors.As` + the known
    shape of stripe.Error.
  * marketplace/reversal_e2e_test.go — 3 end-to-end sub-tests
    chaining refund → worker against a dual-role mock. The
    invalid-id case asserts single-call-no-retries termination.
  * Migration 983 applied cleanly to the local Postgres; constraint
    visible in \d seller_transfers as NOT VALID (behavior correct
    for future writes, existing rows grandfathered).

Self-assessment on day-2's struct-literal refactor of
processSellerTransfers (deferred from day 2):
The refactor is borderline — neither clearer nor confusing than the
original mutation-after-construct pattern. Logged in the v1.0.7-rc1
CHANGELOG as a post-v1.0.7 consideration: if GORM BeforeUpdate
hooks prove cleaner on other state machines (axis 2), revisit the
anti-mutation test approach.

CHANGELOG v1.0.7-rc1 entry added documenting items A + B end-to-end.
Tag not yet applied — items C, D, E, F remain on the v1.0.7 plan.
The rc1 tag lands when those four items close + the smoke probe
validates the full cadence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 02:12:03 +02:00
senke
8d6f798f2d feat(marketplace): seller transfer state machine matrix — v1.0.7 item B day 1
Day-1 foundation for item B (async Stripe Connect reversal worker).
No worker code, no runtime enforcement yet — just the authoritative
state machine that day 2's code will route through. Before writing
the worker we want a single place where the legal transitions are
defined and tested, so the worker's behavior can be argued against
the matrix rather than implicitly codified across call sites.

transfer_transitions.go:
  * SellerTransferStatus constants (Pending, Completed, Failed,
    ReversalPending [new], Reversed [new], PermanentlyFailed).
  * AllowedTransferTransitions map: pending → {completed, failed};
    completed → {reversal_pending}; failed → {completed,
    permanently_failed}; reversal_pending → {reversed,
    permanently_failed}; reversed and permanently_failed as dead ends.
  * CanTransitionTransferStatus(from, to) — same-state always OK
    (idempotent bumps of retry_count / next_retry_at); unknown from
    fails conservatively (typos in call sites become visible).

transfer_transitions_test.go:
  * TestTransferStateTransitions iterates the full 6×6 matrix (36
    pairs) and asserts every pair against the expected outcome.
  * TestTransferStateTransitions_TerminalStatesHaveNoOutgoing
    double-locks Reversed + PermanentlyFailed as dead ends at the
    map level (not just at the caller level).
  * TestTransferStateTransitions_MatrixKeysAreAccountedFor keeps the
    canonical status list in sync with the map; a new status added
    to one but not the other fails the test.
  * TestCanTransitionTransferStatus_UnknownFromIsConservative
    documents the "unknown from → always false" policy so a future
    reader sees the intent.

Migration 982 adds a partial composite index on (status,
next_retry_at) WHERE status='reversal_pending', sibling to the
existing idx_seller_transfers_retry (scoped to failed). Two parallel
partial indexes cost less than widening the existing one (which
would need a table-level lock) and keep the worker query planner-
friendly.

Day 2 routes processSellerTransfers, TransferRetryWorker,
reverseSellerAccounting, admin_transfer_handler through
CanTransitionTransferStatus at every Status mutation, and writes
StripeReversalWorker. Day 3 exercises the end-to-end flow
(refund → reversal_pending → worker → reversed) in a smoke probe.

Checkpoint: ping user at end of day 1 before day 2 per discipline
agreed upfront.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:13:02 +02:00
senke
eedaad9f83 refactor(connect): persist stripe_transfer_id on create + retry — v1.0.7 item A
TransferService.CreateTransfer signature changes from (...) error to
(...) (string, error) — the caller now captures the Stripe transfer
identifier and persists it on the SellerTransfer row. Pre-v1.0.7 the
stripe_transfer_id column was declared on the model and table but
never written to, which blocked the reversal worker (v1.0.7 item B)
from identifying which transfer to reverse on refund.

Changes:
  * `TransferService` interface and `StripeConnectService.CreateTransfer`
    both return the Stripe transfer id alongside the error.
  * `processSellerTransfers` (marketplace service) persists the id on
    success before `tx.Create(&st)` so a crash between Stripe ACK and
    DB commit leaves no inconsistency.
  * `TransferRetryWorker.retryOne` persists on retry success — a row
    that failed on first attempt and succeeded via the worker is
    reversal-ready all the same.
  * `admin_transfer_handler.RetryTransfer` (manual retry) persists too.
  * `SellerPayout.ExternalPayoutID` is populated by the Connect payout
    flow (`payout.go`) — the field existed but was never written.
  * Four test mocks updated; two tests assert the id is persisted on
    the happy path, one on the failure path confirms we don't write a
    fake id when the provider errors.

Migration `981_seller_transfers_stripe_reversal_id.sql`:
  * Adds nullable `stripe_reversal_id` column for item B.
  * Partial UNIQUE indexes on both stripe_transfer_id and
    stripe_reversal_id (WHERE IS NOT NULL AND <> ''), mirroring the
    v1.0.6.1 pattern for refunds.hyperswitch_refund_id.
  * Logs a count of historical completed transfers that lack an id —
    these are candidates for the backfill CLI follow-up task.

Backfill for historical rows is a separate follow-up (cmd/tools/
backfill_stripe_transfer_ids, calling Stripe's transfers.List with
Destination + Metadata[order_id]). Pre-v1.0.7 transfers without a
backfilled id cannot be auto-reversed on refund — document in P2.9
admin-recovery when it lands. Acceptable scope per v107-plan.

Migration number bumped 980 → 981 because v1.0.6.2 used 980 for the
unpaid-subscription cleanup; v107-plan updated with the note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 13:08:39 +02:00
senke
9a8d2a4e73 chore(release): v1.0.6.2 — subscription payment-gate bypass hotfix
Closes a bypass surfaced by the 2026-04 audit probe (axis-1 Q2): any
authenticated user could POST /api/v1/subscriptions/subscribe on a paid
plan and receive 201 active without the payment provider ever being
invoked. The resulting row satisfied `checkEligibility()` in the
distribution service via `can_sell_on_marketplace=true` on the Creator
plan — effectively free access to /api/v1/distribution/submit, which
dispatches to external partners.

Fix is centralised in `GetUserSubscription` so there is no code path
that can grant subscription-gated access without routing through the
payment check. Effective-payment = free plan OR unexpired trial OR
invoice with non-empty hyperswitch_payment_id. Migration 980 sweeps
pre-existing fantôme rows into `expired`, preserving the tuple in a
dated audit table for support outreach.

Subscribe and subscribeToFreePlan treat the new ErrSubscriptionNoPayment
as equivalent to ErrNoActiveSubscription so re-subscription works
cleanly post-cleanup. GET /me/subscription surfaces needs_payment=true
with a support-contact message rather than a misleading "you're on
free" or an opaque 500. TODO(v1.0.7-item-G) annotation marks where the
`if s.paymentProvider != nil` short-circuit needs to become a mandatory
pending_payment state.

Probe script `scripts/probes/subscription-unpaid-activation.sh` kept as
a versioned regression test — dry-run by default, --destructive logs in
and attempts the exploit against a live backend with automatic cleanup.
8-case unit test matrix covers the full hasEffectivePayment predicate.

Smoke validated end-to-end against local v1.0.6.2: POST /subscribe
returns 201 (by design — item G closes the creation path), but
GET /me/subscription returns subscription=null + needs_payment=true,
distribution eligibility returns false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 12:21:53 +02:00
senke
f047276362 chore: cleanup old e2e tests, playwright configs, reorganize down migrations
- Remove old apps/web/e2e/ test suite (replaced by tests/e2e/)
- Remove old playwright configs (smoke, storybook, visual, root)
- Move down migrations to veza-backend-api/migrations/rollback/
- Remove stale test results and playwright report artifacts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:35:26 +01:00