# Changelog - Veza

## [v1.0.7-rc1] - in progress (2026-04-18)

Release-candidate entry: items A and B delivered, items C/D/E/F remain. See `docs/audit-2026-04/v107-plan.md` for the full scope. This CHANGELOG section documents what has landed against main so far; a final v1.0.7 tag requires the remaining items to close.

### Item A - persist stripe_transfer_id on seller_transfers

Pre-v1.0.7, `TransferService.CreateTransfer` returned `error` only: the Stripe transfer id was discarded (the single line `_, err := transfer.New(params)` threw it away) and the `stripe_transfer_id` column sat empty on every row. This blocked item B's reversal worker from identifying which transfer to reverse.

* Interface signature change: `(..., error)` → `(id string, ..., error)`.
* Four call sites capture and persist the id: processSellerTransfers (new sale), TransferRetryWorker (retry recovery), admin_transfer_handler.RetryTransfer (manual admin retry), payout.RequestPayout (writes to SellerPayout.ExternalPayoutID).
* Four test mocks extended. Three assertions added verifying persistence on the happy path; one failure-path test confirms the id is NOT persisted when the provider errors.
* Migration `981_seller_transfers_stripe_reversal_id.sql` adds `stripe_reversal_id` (prep for B) and partial UNIQUE indexes on both id columns (matching the v1.0.6.1 pattern for refunds.hyperswitch_refund_id).
* Defensive guard: `StripeConnectService.CreateTransfer` fails the call if Stripe returns `(tr, nil)` with `tr.ID == ""`. A non-empty id is the SDK's invariant, but a violation would leave the row permanently un-reversible, so it is better to fail loudly.

Backfill for historical rows where the id is empty (ops task #38) is tracked separately: pre-v1.0.7 transfers cannot be auto-reversed; the backfill CLI queries Stripe's transfers.List by metadata[order_id] to populate missing ids, and leaving them NULL is acceptable per v107-plan.
### Item D - Idempotency-Key on CreatePayment / CreateRefund

The Hyperswitch client now sends an `Idempotency-Key` HTTP header on every outbound POST /payments and POST /refunds. The header value is an explicit parameter at every call site (no context-carrier magic, no auto-generation), so the contract is visible in every call and impossible to forget (empty keys cause a loud error, not silent header omission).

Key values:

* CreatePayment → `order.ID.String()` (UUID generated by GORM BeforeCreate before the HTTP call).
* CreateRefund → `pendingRefund.ID.String()` (same pattern: UUID populated by the Phase 1 tx.Create in RefundOrder, available and stable for the Phase 2 PSP call).

Scope (load-bearing note for future readers): `Idempotency-Key` covers HTTP-transport retry (TLS reconnect, proxy retry, DNS flap) within a single CreatePayment / CreateRefund invocation. It does NOT cover application-level replay (user double-click, form double-submit, retry after crash before DB write). That class of bug requires state-machine preconditions on the VEZA side, already addressed by the order state machine + checkout handler guards (for payments) and the partial UNIQUE on `refunds.hyperswitch_refund_id` landed in v1.0.6.1 (for refunds).

Hyperswitch TTL on Idempotency-Key is typically 24h–7d server-side (verify against current PSP docs). Beyond TTL, a retry with the same key is treated as a new request. Not a concern at current volumes; document if retry logic ever extends beyond 1 hour.

What stays unchanged: this commit does NOT add application-level retry logic. The current "try once, fail loudly" behavior on PSP errors is preserved. Adding retries is a separate design exercise (backoff, max attempts, circuit breaker) explicitly out of scope for item D.

Tests:

* Two httptest.Server-backed tests in client_test.go pin the header value emitted for CreatePayment and CreateRefund, plus two tests asserting empty keys cause a loud error.
* TestRefundOrder_OpensPendingRefund now pins the `refund.ID.String() == lastIdempotencyKey` contract so a future refactor that drops or reshapes the key fails the test.
* Four existing test mocks updated for the new signature.

Subscription's CreateSubscriptionPayment interface also takes a payment provider but no implementation is wired in today (v1.0.6.2 noted this as the bypass surface, v1.0.7 item G is the full fix). When item G lands its Hyperswitch-backed subscription provider, it will need to thread the idempotency key through the same way; this is noted in item G's acceptance in v107-plan.md.

### Item B - async Stripe Connect reversal worker

`reverseSellerAccounting` moved from synchronous "mark row reversed locally without calling Stripe" to asynchronous "mark row reversal_pending, let the worker reconcile out-of-band". This decouples buyer-facing refund UX (completes immediately) from Stripe settlement health (may retry, may 404 if already reversed, may permanently fail and need ops attention).

State machine (single source of truth in `internal/core/marketplace/transfer_transitions.go`):

    pending            → {completed, failed}
    completed          → {reversal_pending}              (item B)
    failed             → {completed, permanently_failed}
    reversal_pending   → {reversed, permanently_failed}  (item B)
    reversed           → {}                              (terminal)
    permanently_failed → {}                              (terminal)

`SellerTransfer.TransitionStatus(tx, to, extras)` validates against the matrix and performs a conditional UPDATE guarded by the expected `from` (optimistic lock semantics: concurrent workers racing on the same row find RowsAffected=0 and log a conflict). `TestNoDirectTransferStatusMutation` greps the marketplace package for raw `.Status = "..."` or `Model(&SellerTransfer{}).Update("status"...)` outside a minimal allowlist and fails if found; validated against an injected violation during development.

StripeReversalWorker (`internal/core/marketplace/reversal_worker.go`):

* Tick interval: `REVERSAL_CHECK_INTERVAL` (default 1m).
* Batch limit 20 per tick, indexed on partial composite `(status, next_retry_at) WHERE status='reversal_pending'` (migration 982).
* Exponential backoff: `REVERSAL_BACKOFF_BASE` * 2^retry_count, capped at `REVERSAL_BACKOFF_MAX` (defaults 1m and 1h).
* `REVERSAL_MAX_RETRIES` (default 5) transitions the row to permanently_failed.
* Legacy rows with empty stripe_transfer_id → permanently_failed immediately with a distinctive error_message, so ops can find them via grep once the backfill CLI (task #38) lands.

Stripe error disambiguation (day 3 closure of the day-2 dead-code gap):

* 404 + `resource_missing` → `ErrTransferNotFound` → worker transitions to permanently_failed (data-integrity signal; never retry, since retrying would amplify the inconsistency).
* 400 + message contains "already" + "reversal/reversed" → `ErrTransferAlreadyReversed` → worker treats as success (someone reversed out-of-band via Dashboard or another instance; idempotent).
* Any other error is transient → retry with backoff.
* Sentinels live in `internal/core/connecterrors` as a leaf package because marketplace and services both need them and an import cycle (marketplace → monitoring → services) would form if either owned them directly.

Migration `982` adds the partial composite index for the worker's hot path. Migration `983` adds a CHECK constraint (`status != 'reversal_pending' OR next_retry_at IS NOT NULL`) so the invariant that every reversal_pending row carries a retry timestamp is structural: a bug that ever writes NULL next_retry_at on a reversal_pending row fails the INSERT/UPDATE at the DB rather than silently orphaning the row.

The worker is covered by 9 unit-test cases plus 3 end-to-end scenarios (refund → worker → reversed, including the invalid-stripe_transfer_id terminal path). Integration smoke against local Postgres confirmed migrations 981/982/983 apply cleanly.

Behavior change visible to tests: the refund.succeeded webhook now leaves the seller_transfer at reversal_pending rather than reversed directly.
`TestProcessRefundWebhook_SucceededFinalizesState` updated to assert the new expected state and the presence of next_retry_at.

Worker wired in `cmd/api/main.go` alongside TransferRetryWorker, sharing the same StripeConnectService instance. Gated on `StripeConnectEnabled && StripeConnectSecretKey != ""` (same as TransferRetryWorker): in dev without Stripe configured, the worker never starts.

### Notes

* `REVERSAL_*` env vars documented in `.env.template` so ops can tune without source-diving.
* Anti-mutation test decision (grep-based rather than GORM BeforeUpdate hook) forced a minor refactor of `processSellerTransfers` to construct SellerTransfer rows in a single struct literal rather than mutating Status in place after construction. The refactor is neither clearer nor more confusing than the original; it is borderline stylistic. Logged as a post-v1.0.7 consideration: if the GORM hook approach proves cleaner in axis 2 (state-machine transitions for other entities), revisit and potentially retire the grep test in favor of a hook.
* Item A unknown #2 (backfill coverage on historical transfers) tracked as task #38; item B unknown: none surfaced during implementation.

## [v1.0.6.2] - 2026-04-17

### Hotfix - subscription payment-gate bypass

Discovered during the 2026-04 audit probe (ops question Q2, "are paid subscriptions actually gated server-side?"). An authenticated user could POST `/api/v1/subscriptions/subscribe` with a paid plan and receive HTTP 201 with `status=active`, with the payment provider never invoked when `HYPERSWITCH_ENABLED=false` (or unset). The resulting row satisfied `checkEligibility()` in the distribution service, which returns `sub.Plan.HasDistribution || sub.Plan.CanSellOnMarketplace`. The Creator plan carries `can_sell_on_marketplace=true`, so any user could reach `/api/v1/distribution/submit` (a paid feature that dispatches to external distribution partners) without paying.

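To make the bypass concrete, here is a minimal sketch of why a never-paid row passed the eligibility check. The `Plan` and `Subscription` shapes and the `eligiblePreFix` name are hypothetical simplifications; only the boolean expression comes from the description above.

```go
package main

import "fmt"

// Plan and Subscription are simplified, hypothetical shapes for this sketch.
type Plan struct {
	Name                 string
	HasDistribution      bool
	CanSellOnMarketplace bool
}

type Subscription struct {
	Status               string
	Plan                 Plan
	HyperswitchPaymentID string // empty when the PSP was never invoked
}

// eligiblePreFix mirrors the plan-flag expression quoted above. It never
// looks at payment linkage, which is the gap this hotfix closes.
func eligiblePreFix(sub Subscription) bool {
	return sub.Plan.HasDistribution || sub.Plan.CanSellOnMarketplace
}

func main() {
	// An "active" Creator row created while the payment provider was off.
	unpaid := Subscription{
		Status: "active",
		Plan:   Plan{Name: "Creator", CanSellOnMarketplace: true},
	}
	fmt.Println(eligiblePreFix(unpaid)) // true: eligible despite no payment id
}
```

The check depends only on plan flags, so any row that reaches `active` inherits paid features regardless of how it got there.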
Fix: `GetUserSubscription` now filters out active/trialing rows that lack an effective payment linkage. "Effective" means: on a free plan, or in an unexpired trial, or at least one attached invoice carries a PSP payment intent (`hyperswitch_payment_id` non-empty). This is the sole centralised gate; all paid-feature eligibility paths (distribution and anything added later) route through it.

* `ErrSubscriptionNoPayment` added to `internal/core/subscription`. `GetUserSubscription` returns it when a row sits in active/trialing but fails the payment-effective predicate. Callers treat it as ineligible (distribution returns `false, nil`; subscription HTTP handlers return 404 "Active subscription" for cancel/reactivate/billing-cycle paths; `GET /me/subscription` returns an explicit `needs_payment=true` payload so honest-path users who landed here via a broken flow get actionable information, not a misleading "you're on free" or an opaque 500).
* `Subscribe` and `subscribeToFreePlan` also treat the new error as "no existing active subscription" so a user can re-subscribe cleanly once migration 980 has voided their phantom row.
* `distribution.checkEligibility` propagates `ErrSubscriptionNoPayment` instead of swallowing it as a generic ineligible; the distribution handler surfaces a specific 403 message ("Your subscription is not linked to a payment. Complete payment to enable distribution.") so an honest-path user isn't told to "upgrade their plan" when they already have one.
* Migration `980_void_unpaid_subscriptions.sql` sweeps all pre-v1.0.6.2 phantom rows into `status='expired'`, capturing the `(subscription_id, user_id, plan_id, previous_status)` tuple in a dated audit table (`voided_subscriptions_20260417`) so support can notify any honest-path user who landed there by mistake.
* Probe script `scripts/probes/subscription-unpaid-activation.sh` kept as a versioned regression test.
  `--dry-run` lists plans; `--destructive` logs in and attempts the exploit, cleaning up after itself. Exit 0 = no bypass; exit 1 = bypass detected.
* Unit test `gate_test.go` covers the 8-branch matrix of the `hasEffectivePayment` predicate (free pass, paid with/without invoice, paid with empty vs populated `hyperswitch_payment_id`, trial variants with future/past/nil `trial_end`, no row at all).
* `TODO(v1.0.7-item-G)` annotation on the `if s.paymentProvider != nil` short-circuit in `createNewSubscription` so the v1.0.7 work that replaces it with a mandatory `pending_payment` state retains the audit trail.

### Security

Closes a subscription-gate bypass affecting distribution eligibility. Internal audit finding; no external report. Axis-1 correctness item P1.7 will be reclassified to P0 and item G added to the v1.0.7 plan in a follow-up commit.

## [v1.0.6.1] - 2026-04-17

### Hotfix - partial UNIQUE on refunds.hyperswitch_refund_id

Surfaced by the v1.0.6 refund smoke test (scenario S4, triggered after S3 left a failed refund in its post-Phase-1 / pre-Phase-2 state): the plain UNIQUE constraint from migration 978 rejected a second refund attempt on a *different* order because both rows had `hyperswitch_refund_id=''` (Go's zero-value string → empty string, not NULL). Postgres treats two empty strings as colliding under a regular UNIQUE; it only skips NULLs.

* Migration `979_refunds_unique_partial.sql` drops the original constraint and replaces it with a partial UNIQUE that only enforces uniqueness when `hyperswitch_refund_id IS NOT NULL AND <> ''`.
* Preserves the load-bearing idempotency guarantee for successful refunds (duplicate webhook lands on the same row because the PSP refund_id is set).
* No Go code change: the model and service logic were already correct; only the DB constraint shape needed fixing.

Smoke coverage that caught it + re-validates the fix:

* S1 happy path: refund + order + license + seller_transfer + seller_balance all reconciled end-to-end
* S2 idempotent replay: succeeded_at + transfer.updated_at + available_cents strictly unchanged across 2 webhook deliveries (THE critical proof: duplicate Hyperswitch retries are no-ops at the row level, not at the handler level)
* S3 PSP error rollback: order reverts to completed, refund persisted as failed, no seller debit
* S4 webhook refund.failed: order reverts, license intact, seller balance intact. **This is the scenario that surfaced the bug**
* S5 double-submit: second POST returns 400 ErrRefundAlreadyRequested, only 1 refund row persisted

## [v1.0.6] - 2026-04-17

### Ergonomics + operational hardening - six items from the v1.0.5 backlog

Follow-up to the hardening sprint. v1.0.5 validated the `register → verify → play` critical path end-to-end; v1.0.6 addresses the next layer: the UX friction and operational blindspots that a first-day public user (or a first-day on-call) would hit. Six targeted commits.

#### Fix 1 - Self-service creator role (`c32278dc1`)

New `POST /api/v1/users/me/upgrade-creator`. Verified users click a "Become an artist" button in `/settings → Account` and their role flips from `user` to `creator` on one conscious click: no KYC, no cooldown, no admin round-trip. One-way by design (downgrade = support ticket) so we don't have to handle the "my uploads orphaned" edge case.

* Gated strictly on `is_verified=true` (403 `EMAIL_NOT_VERIFIED` otherwise).
* Idempotent 200 for anyone already creator-tier, so no clutter.
* UPDATE scoped `WHERE role='user'` so a concurrent admin assignment can't be silently overwritten.
* Audit trail: `user.upgrade_creator` action logged with the full role transition metadata.
* Migration `977_users_promoted_to_creator_at.sql` adds a nullable `promoted_to_creator_at TIMESTAMPTZ` column that distinguishes organic self-promotions from admin-assigned roles for analytics.
* Tests: 6 Go (happy path, unverified, already-creator, admin idempotent, 404, no-auth) + 7 Vitest (verified button, unverified state, hidden for creator, hidden for admin, refetch on success, idempotent message, server error toast).

#### Fix 2 - Upload size limits from a single source (`5848c2e40`)

The v1.0.5 audit flagged a "front 500MB vs back 100MB" mismatch. In reality every live pair was aligned (tracks 100/100, cloud 500/500, video 500/500); the real architectural bug was **five duplicated hardcoded values** that could drift silently as soon as anyone tuned one.

* `internal/config/upload_limits.go`: `AudioLimit`, `ImageLimit`, `VideoLimit` expose `Bytes()`, `MB()`, `HumanReadable()`, `AllowedMIMEs`. Read lazily from env (`MAX_UPLOAD_AUDIO_MB`, `MAX_UPLOAD_IMAGE_MB`, `MAX_UPLOAD_VIDEO_MB`, defaults 100/10/500). Invalid/negative/zero env values fall back to default.
* `track/service.go`, `track_upload_handler.go`, `education_handler.go`, `upload.go:GetUploadLimits` all consume the single source. Changing one env retunes every path.
* Frontend `useUploadLimits()` hook: react-query with 5 min stale, 30 min gc, 1 retry then optimistic fallback to baked-in defaults so the dropzone stays responsive even without the network round trip. `useUploadModal` replaces `MAX_FILE_SIZE` constant with the live value; `UploadModal` forwards `audioMaxHuman` to `UploadModalDropzone` so the label and error toast track the env.
* Out of scope (tracked for later): `CloudUploadModal.tsx` still hardcodes 500MB, because cloud uploads accept audio+zip+midi with a different category semantic than the three in `/upload/limits`. Unifying deserves its own design pass.
* Tests: 4 Go (defaults, env override, invalid env fallback, MIME lists) + 4 Vitest (sync fallback, typed mapping, partial-payload fallback per category, network failure keeps fallback).

#### Fix 3 - Unified SMTP env schema (`066144352`)

Two email services in-tree read *different* env vars for the same fields, surfaced during the v1.0.5.1 hotfix:

    internal/email/sender.go    internal/services/email_service.go
    SMTP_USERNAME               SMTP_USER
    SMTP_FROM                   FROM_EMAIL
    SMTP_FROM_NAME              FROM_NAME

v1.0.6 reconciles both onto canonical `SMTP_*` names, with a migration fallback to the legacy names that logs a structured deprecation warning (`remove_in: v1.1.0`).

* `internal/email/sender.go` is the single loader; both services delegate to it via `LoadSMTPConfigFromEnvWithLogger(*zap.Logger)`. Canonical wins over deprecated; no precedence surprise.
* `docker-compose.yml` backend-api env: `FROM_EMAIL` / `FROM_NAME` → `SMTP_FROM` / `SMTP_FROM_NAME` to match the canonical schema.
* `.env.template` trimmed: only canonical vars ship, old ones removed (still accepted in running env for zero-downtime rollover).
* No default injected for Host/Port in the loader. `Host==""` → callers go log-only (matches historic dev behavior). Dev defaults stay in `.env.template`, so prod fails fast instead of silently dialing localhost.
* Tests: 5 Go (empty env, canonical direct, deprecated fallback + warning emission, canonical silently wins over deprecated, nil logger allowed).

#### Fix 4 - Refund reverse-charge with idempotent webhook (`959031667`)

The structural one. Before v1.0.6, `RefundOrder` wrote `status='refunded'` to the DB and called Hyperswitch synchronously, treating the API ack as terminal. In reality Hyperswitch returns `pending` and only finalizes via webhook. Customers could see "refunded" while their bank was still uncredited, and the seller balance kept its credit even on successful refunds.

* Two-phase flow, then webhook finalization:
  1. **Open pending refund** (short row-locked tx): validate permissions + 14-day window + double-submit guard; persist `Refund{status=pending}`; flip order to `refund_pending` (not `refunded`; that's the webhook's job).
  2. **PSP call outside the tx**: `Provider.CreateRefund` returns `(refund_id, status, err)`. On error, mark refund failed + roll order back to `completed`. On success, capture the `hyperswitch_refund_id` as the idempotency key and stay in `pending` even if the sync status is "succeeded" (per customer guidance: never trust the sync ack, always wait for the webhook).
  3. **`ProcessRefundWebhook`** drives terminal state. Row-lock + `IsTerminal()` short-circuit: any duplicate Hyperswitch retry is a no-op 200. On `refund.succeeded`: flip refund + order to succeeded/refunded, revoke licenses, debit seller balance, mark every `SellerTransfer` for the order as `reversed`.
* Migration `978_refunds_table.sql` with `UNIQUE(hyperswitch_refund_id)`; this is the load-bearing idempotency guarantee.
* Webhook routing: `HyperswitchWebhookPayload.IsRefundEvent()` dispatches `refund.*` events to `ProcessRefundWebhook`; payment events keep flowing through the existing `ProcessPaymentWebhook`.
* `DebitSellerBalance` ported off Postgres-only `GREATEST()` to portable `CASE WHEN`; the path wasn't exercised before v1.0.6, so this is a quality fix, not a regression.
* Partial refunds: signature carries `amount *int64` (nil = full) but service call-site passes nil, so v1.0.6 is full-only. Partial-refund UX is deferred to v1.0.7.
* Stripe Connect Transfers:reversal call flagged TODO(v1.0.7). Internal balance + transfer-status are corrected here so buyer and seller views match the moment the PSP confirms; the missing piece is the money-movement round-trip at Stripe. Internal accounting is consistent; external settlement catches up with v1.0.7.
* Tests: 15 Go cases covering Phase 1 (pending state, PSP error rollback, double-submit, permissions, window), webhook finalization (succeeded, failed, idempotent replay with `succeeded_at` timestamp invariant, unknown refund_id, missing refund_id, non-terminal ignored), and dispatcher logic (6 `IsRefundEvent` cases across flat/nested/event_type shapes).

#### Fix 5 - RTMP ingest health banner on Go Live (`64fa0c9ac`)

"Go Live" was silent when `nginx-rtmp` wasn't running. An artist could copy the RTMP URL + stream key, fire OBS, and broadcast into the void with no in-UI signal.

* `GET /api/v1/live/health` TCP-dials `NGINX_RTMP_ADDR` (default `localhost:1935`), 2s timeout, 15s TTL cache protected by a mutex so a burst of page loads can't hammer the ingest. Returns UI-safe `error` string (no raw hostname leak) and `Cache-Control: private, max-age=15` so browsers honor the same window.
* Unreachable path emits a WARN log so operators see the outage before users do.
* Frontend `useLiveHealth()` hook: react-query 15s stale, 1 retry, then optimistic `{ rtmpReachable: true }`. Better to miss a banner than flash a false negative on a transient health-endpoint blip.
* `LiveRtmpHealthBanner` at the top of `GoLivePage`: amber, non-blocking, copy explicitly tells the artist the stream key is still valid but broadcasting won't reach anyone, with a Retry button that invalidates the health query.
* Tests: 3 Go (listener reachable + Cache-Control; dead port unreachable + UI-safe error asserting no `127.0.0.1` leak; TTL cache survives listener teardown) + 3 Vitest (hidden when reachable, visible with Retry when unreachable, Retry invalidates the right query key).

#### Fix 6 - RabbitMQ publish failures no longer silent (`bf688af35`)

`RabbitMQEventBus.Publish` returned the broker error but did not log it. Callers that wrapped `Publish` in fire-and-forget (`_ = eb.Publish(...)`) lost events with zero trace during RMQ outages.
* `Publish` now emits a structured ERROR on broker failure with the exchange, routing_key, payload_bytes, content_type, and message_id context. Function still returns the error so call-sites that actually check it keep working.
* `EventBus disabled` warning kept but upgraded with `payload_bytes` so dashboards can quantify drops when RMQ is intentionally off.
* Aligns the legacy `internal/eventbus` with `infrastructure/eventbus` which already had this pattern.
* Tests: 2 Go (disabled bus emits WARN + returns `EventBusUnavailableError`; nil logger stays panic-free for legacy callers).

### Breaking changes

* `marketplace.MarketplaceService.RefundOrder` now returns `(*Refund, error)` instead of `error`. Callers consuming the service directly need to accept the pending refund row.
* `marketplace.refundProvider` internal interface: `Refund(...) error` → `CreateRefund(...) (refundID, status string, err error)`. `hyperswitch.Provider` implements both; external mocks must be updated.
* Order status machine gains `refund_pending` as an intermediate state. Clients reading `orders.status` should treat it as "in-flight refund, don't show as refunded yet".

### Known gaps (parked for v1.0.7)

* Partial refunds: UX decision + call-site wiring
* Stripe Connect Transfers:reversal: actually move money back at the PSP level (internal accounting is correct today)
* `CloudUploadModal.tsx` hardcoded 500MB: category semantic doesn't map to the three exposed by `/upload/limits`
* Smoke test of refund flow against Hyperswitch sandbox (manual, outside CI)

## [v1.0.5.1] - 2026-04-16

### Hotfix - dev SMTP ergonomics

Follow-up to the v1.0.5 smoke test: a fresh clone + `cp .env.template .env` + `make dev-full` produced a backend with `SMTP_HOST=""`, which silently short-circuits `EmailService.sendEmail` to a log-only path. New contributors hit register → "where's my verification email?" and had no obvious cue that the SMTP hookup was missing.
- `veza-backend-api/.env.template`: `SMTP_HOST` / `SMTP_PORT` now default to the MailHog instance that ships with `make infra-up-dev` (`localhost:1025`, UI on `:8025`). `FROM_EMAIL` / `FROM_NAME` seeded with local-safe values. Comment rewritten to point at both the dev path and the prod override.
- Also exports the duplicate variable names (`SMTP_USERNAME`, `SMTP_FROM`, `SMTP_FROM_NAME`) read by `internal/email/sender.go`; a TODO is flagged for v1.0.6 to reconcile the two email services onto a single env schema. Until then both sets cover every code path.

No code change, no migration, no version bump in the Go module. Pure config hotfix.

## [v1.0.5] - 2026-04-16

### Hardening sprint - seven critical-path fixes before public opening

Audit follow-up on the `register → verify → play` critical path. The app was functional on the surface but broken underneath: the player was silent, emails weren't really sent, the marketplace gave products away in production, the chat silently de-synced across pods, maintenance mode was per-pod only, orphaned tracks accumulated forever in `processing`, and the response cache was corrupting range-aware media responses. Seven targeted fixes, each with its own commit, its own tests, and no behaviour change outside scope.

#### Fix 1 - Silent player (`veza-backend-api` + `apps/web`)

- New `GET /api/v1/tracks/:id/stream` handler in `internal/core/track/track_hls_handler.go`. Serves the raw file via `http.ServeContent`: `Range`, `If-Modified-Since` and `If-None-Match` handled for free, so `