# Changelog - Veza

## [v1.0.7] - 2026-04-23

Final v1.0.7 release. Promotion of `v1.0.7-rc1` after the tier 1/2/3 cleanup session, the BFG history rewrite, and reconciliation of the top-15 priorities.

### Cleanup & repo hygiene

- **BFG history rewrite**: `.git` shrank from 2.3 GB to 66 MB (−97%). Stripped: Go binaries (veza-api, migrate, modern-server, server), vendored kubectl (60 MB), audio uploads (44 .mp3), root-level PNGs (48), `.playwright-mcp/` (36 YML), committed `.env*` files, TLS certs, Incus builds, E2E/lint reports, screenshots. Force-push stages 1 and 2 OK.
- **Branch `chore/v1.0.7-cleanup` merged**: 10 cleanup commits plus 1 audit-reconciliation commit (`778c8550`) merged fast-forward.
- **`.gitignore`**: block J3 added (post-BFG paths), complementing blocks J1 (2026-04-14) and J2 (2026-04-20).

### Backend hardening

- **`b5281bec`**: `core/marketplace/service.go` wraps `UpdateProductImages` and `SetProductLicenses` in a GORM transaction (avoids a partially applied `DELETE`-then-`CREATE` state).
- **`ebf3276d`**: `middleware.UserRateLimiter` wired into `AuthMiddleware.RequireAuth()` (BE-SVC-002). It was previously configured but never called — the per-user rate limit now takes effect after every RequireAuth on every route. Env vars: `USER_RATE_LIMIT_PER_MINUTE` (default 1000), `USER_RATE_LIMIT_BURST` (default 100).

### Orphans removed

- `veza-backend-api/internal/api/handlers/{chat,rbac,rbac_test}.go` (−1142 LOC — deprecated handlers marked DEPRECATED).
- `veza-backend-api/internal/repository/user_repository.go` (orphaned in-memory mock).
- `proto/chat/chat.proto` + `veza-common/src/types/{chat,websocket}.rs` (orphaned since the Rust chat removal on 2026-02-22).
- 19 `.disabled` workflows archived in `docs/archive/workflows/`.

### CI/hook fixes

- **`4310dbb7`**: MinIO + mc pinned to dated `RELEASE.2025-09-07*` tags in 4 compose files (supply-chain).
- **`12f873bd`**: double bug fixed in `.husky/pre-commit`:
  - the recursive `cd apps/web && ...` made typecheck/lint/tests silently no-op;
  - the lint grep for `"error"` matched `"(0 errors, K warnings)"` → strict regex `\([1-9][0-9]* error`.

### Documentation

- **`7d03ee66`**: full rewrite of `docs/ENV_VARIABLES.md` (172 → ~600 lines) covering ~180 env vars surveyed directly from the code. 30 sections, 8 prod validation rules, 14 deprecated vars listed, 11 drift findings.
- `.env.template`: added `HLS_STREAMING=false` + `HLS_STORAGE_DIR`, documenting the `/tracks/:id/stream` Range fallback (FUNCTIONAL_AUDIT §4 item 5).
- Audits (`AUDIT_REPORT.md` + `FUNCTIONAL_AUDIT.md`) updated to reflect the cleanup session: 10/15 items done, 3 false positives classified (#4 context, #5 CSP, #10 RespondWithAppError), 2 deferrals to v1.0.8 (#8 OpenAPI typegen, #14 E2E CI).

### Deferrals to v1.0.8

- Finish OpenAPI typegen (5 d, separate plan required).
- E2E Playwright CI trigger (3 d).
- MinIO/S3 in the upload path (2-3 d, FUNCTIONAL §4 item 2).
- STUN/TURN for WebRTC if calls go public (1-2 d).
- Item G — subscription `pending_payment` (v107-plan §G).

---

## [v1.0.7-rc1] - 2026-04-19

Release-candidate tag for v1.0.7. Items A through F of the post-v1.0.6.2 money-movement hardening plan are complete; item G (subscription pending_payment state) and follow-ups #44 / #52 are deferred to post-rc1. See `docs/audit-2026-04/v107-plan.md` for scope. Four skipped @critical E2E classes (v107-e2e-05/06/08/09) are flagged in `tests/e2e/SKIPPED_TESTS.md` for staging verification before v1.0.7 final.

### Item A — persist stripe_transfer_id on seller_transfers

Pre-v1.0.7, `TransferService.CreateTransfer` returned `error` only — the Stripe transfer id was discarded (the single line `_, err := transfer.New(params)` threw it away) and the `stripe_transfer_id` column sat empty on every row. This blocked item B's reversal worker from identifying which transfer to reverse.

* Interface signature change: `(..., error)` → `(id string, ..., error)`.
* Four call sites capture and persist the id: processSellerTransfers (new sale), TransferRetryWorker (retry recovery), admin_transfer_handler.RetryTransfer (manual admin retry), payout.RequestPayout (writes to SellerPayout.ExternalPayoutID).
* Four test mocks extended. Three assertions added verifying persistence on the happy path; one failure-path test confirms the id is NOT persisted when the provider errors.
* Migration `981_seller_transfers_stripe_reversal_id.sql` adds `stripe_reversal_id` (prep for B) and partial UNIQUE indexes on both id columns (matching the v1.0.6.1 pattern for refunds.hyperswitch_refund_id).
* Defensive guard: `StripeConnectService.CreateTransfer` fails the call if Stripe returns `(tr, nil)` with `tr.ID == ""` — the SDK upholds this invariant, but a violation would leave the row permanently un-reversible, so better to fail loudly.

Backfill for historical rows where the id is empty (ops task #38) is tracked separately: pre-v1.0.7 transfers cannot be auto-reversed. The backfill CLI queries Stripe's transfers.List by metadata[order_id] to populate missing ids; rows it cannot resolve are acceptable to leave NULL per v107-plan.

### Item F — ledger-health metrics + alerts

Five Prometheus gauges expose money-movement pipeline state so ops dashboards and alert rules can spot a stall before a customer does. Paired with counter/histogram metrics for the item-C reconciler so the dashboard tells the whole story at a glance ("we have N stuck orders and the reconciler has resolved M of them today").

Gauges (sampled every 60s via `ScheduleLedgerHealthSampler`):

* `veza_ledger_orphan_refund_rows` — THE alert gauge. Pending refunds with an empty hyperswitch_refund_id older than 5m. Non-zero = Phase 2 crash in RefundOrder. Pages on > 0 for 5m.
* `veza_ledger_stuck_orders_pending` — orders pending > 30m with a non-empty payment_id (webhook never arrived). Pages on > 0 for 10m.
* `veza_ledger_stuck_refunds_pending` — refunds with an hs_id but still pending > 30m.
* `veza_ledger_failed_transfers_at_max_retry` — seller_transfers in permanently_failed.
* `veza_ledger_reversal_pending_transfers` — item B rows stuck in reversal_pending > 30m (worker behind or Stripe down).

Reconciler metrics (item F extends item C observability):

* `veza_reconciler_actions_total{phase}` — counter labelled by phase (stuck_orders | stuck_refunds | orphan_refunds).
* `veza_reconciler_orphan_refunds_total` — dedicated counter for the two-phase-commit-bug canary.
* `veza_reconciler_sweep_duration_seconds` — histogram with 10 exponential buckets (0.1s to ~100s).
* `veza_reconciler_last_run_timestamp` — unix ts of the last tick. Alert fires if `time() - ts > 7200` (2 × default RECONCILE_INTERVAL).

Sampler queries are all indexed on `status + created_at` (or `status + updated_at` for reversal_pending). Query errors set the gauge to -1 — a distinctive value dashboards can filter on ("sampler broken, don't trust the number") instead of leaking a stale value.

Alert rules in `config/alertmanager/ledger.yml`:

* `VezaOrphanRefundRows` — page on > 0 for 5m (two-phase bug)
* `VezaStuckOrdersPending` — page on > 0 for 10m (webhook pipeline stuck)
* `VezaReconcilerStale` — page on last-run > 2h (worker dead, stuck/orphan rows accumulating)

Grafana dashboard `config/grafana/dashboards/ledger-health.json`: 5 stat panels (top row) + stuck-state timeseries + reconciler action rate + sweep duration quantiles + seconds-since-last-tick + orphan refunds cumulative.

Worker instrumentation: ReconcileHyperswitchWorker now emits RecordReconcilerAction / RecordReconcilerOrphanRefund / RecordReconcilerSweepDuration at the right points. Tests cover the sampler's count queries (5 cases, all branches) plus the recorder shape. The sampler is wired in cmd/api/main.go with graceful shutdown; it runs regardless of Hyperswitch enablement (gauges default to 0, which is the correct story for "Hyperswitch not configured").
### Item C — Hyperswitch reconciliation sweep

New `ReconcileHyperswitchWorker` sweeps for pending orders and refunds whose terminal webhook never arrived (network hiccup, our endpoint down, PSP queue stuck). For each stuck row the worker pulls live PSP state and synthesises a webhook payload that feeds the normal `ProcessPaymentWebhook` / `ProcessRefundWebhook` dispatcher. The existing terminal-state guards in those handlers make the reconciliation idempotent against real webhooks — a late webhook arriving after the reconciler has already resolved the row is a no-op.

Covers three stuck-state classes:

1. **Stuck orders** (pending > 30m, non-empty payment_id): we opened an order, called CreatePayment, got back a payment_id, but never received the succeeded/failed webhook. The worker calls GetPaymentStatus and dispatches a synthetic `payment.` webhook.
2. **Stuck refunds with a PSP id** (pending > 30m, non-empty hyperswitch_refund_id): same pattern via GetRefundStatus + a synthetic `refund.` webhook. The PSP's error_message is forwarded into the payload so downstream handlers persist it.
3. **Orphan refunds** (pending > 5m, EMPTY hyperswitch_refund_id): the harder case. We opened a Phase 1 Refund row but crashed before Phase 2 (the PSP call). The row has no PSP id, and the PSP has no record. The worker marks the row `failed` with an explanatory error_message, rolls the order back to `completed` (so the buyer can retry), and logs **ERROR** — this is operator-attention territory: a mid-refund crash happened, and the root cause should be investigated.

Batch-bounded (50 rows per phase per tick) so a 10k-row backlog doesn't hammer Hyperswitch on a single tick. PSP read errors leave the row unchanged — the next tick retries.
Configuration:

* `RECONCILE_WORKER_ENABLED=true` (default)
* `RECONCILE_INTERVAL=1h` (default; ops can drop it to 5m during incident response without a code change)
* `RECONCILE_ORDER_STUCK_AFTER=30m`
* `RECONCILE_REFUND_STUCK_AFTER=30m`
* `RECONCILE_REFUND_ORPHAN_AFTER=5m` (shorter because an orphan is an "app crashed" signal, not a "network hiccup")

Interfaces introduced:

* `marketplace.HyperswitchReadClient` — the worker depends on read-only PSP access (`GetPaymentStatus`, `GetRefundStatus`) without knowing about CreatePayment / CreateRefund. Implemented by `hyperswitch.Provider`.
* `hyperswitch.Client.GetRefund` + a `RefundStatus` struct (mirroring the existing GetPayment / PaymentStatus).

The worker is wired in cmd/api/main.go alongside the other marketplace workers, gated on `HyperswitchEnabled && HyperswitchAPIKey != ""`. A separate scoped `marketplace.NewService` is constructed for the dispatcher side (the webhook handler uses its own via `APIRouter.getMarketplaceService`, with additional storage/checkout opts the reconciler doesn't need).

Tests (10 cases, all green, sqlite :memory:):

* happy-path stuck order → synthetic webhook dispatched with the correct event_type / payment_id / status.
* recent order (under the stuck threshold) → untouched.
* completed order → untouched.
* order with empty payment_id → untouched (pre-PSP-call, nothing to reconcile).
* PSP read error on GetPaymentStatus → row stays pending, worker logs and moves on.
* orphan refund → auto-failed + order rolled back + error logged.
* recent orphan refund (under 5m) → left alone for Phase 2 to complete.
* stuck refund with a PSP id → synthetic webhook dispatched.
* refund with status=failed → the PSP error_message survives into the synthetic payload (downstream relies on it).
* all-terminal-state seed (completed / refunded / succeeded rows) → zero PSP calls, zero dispatches.
### Item E — webhook raw-payload audit log

Every POST /webhooks/hyperswitch delivery is now persisted to `hyperswitch_webhook_log` regardless of signature validity or processing outcome. This captures both legitimate deliveries and attack probes — a forensics query like "what did we actually receive from this IP last Tuesday" now has the actual bytes to read, not just a grep-able "webhook rejected: invalid signature" log line.

Table shape (migration 984):

* `payload TEXT` — Hyperswitch sends JSON, and TEXT is readable in psql without base64-decoding. Invalid UTF-8 is replaced with an empty string before INSERT (the forensics value of a binary blob is zero versus the headers + ip + timing we keep regardless).
* `signature_valid BOOLEAN` — a partial index on `WHERE signature_valid = false` makes "show me attack attempts" queries instantaneous.
* `processing_result TEXT` — 'ok', 'error: ', 'signature_invalid', or 'skipped'. Matches the action semantics exactly.
* `source_ip`, `user_agent`, `request_id` — forensics essentials. request_id is captured from Hyperswitch's `X-Request-Id` header if sent, else a UUID is generated server-side so every row is correlatable to VEZA's structured logs.
* `event_type` — best-effort extract from the JSON payload. NULL when the payload isn't valid JSON or doesn't carry an event_type field. Useful for "how many dispute.* events have we seen this month" without needing a dispute handler implemented yet (the log captures disputes alongside everything else, ready for axis-1 P1.6 when it lands).

Hardening:

* 64KB body cap (via `io.LimitReader`) rejects oversize payloads with 413 before any INSERT — prevents log-spam DoS.
* INSERT-once-at-end-with-final-state pattern: one row per delivery, no two-phase update risk. Signature-invalid and processing-error rows both land.
* DB persistence failures are logged but never fail the webhook response — the endpoint's primary contract is acking Hyperswitch.
Retention sweep (CleanupHyperswitchWebhookLog in internal/jobs):

* Daily tick, batched DELETE (10k rows per batch with a 100ms pause between batches) so a large backlog doesn't lock the table.
* Retention configurable via `HYPERSWITCH_WEBHOOK_LOG_RETENTION_DAYS` (default 90).
* Uses the same goroutine-ticker pattern as ScheduleOrphanTracksCleanup / ScheduleSessionCleanup.

Tests:

* 5 tests in `internal/services/hyperswitch/webhook_log_test.go`: minimal-field persistence, request_id auto-generation on empty input, invalid JSON leaves event_type empty, invalid-signature rows are captured (forensics assert), extractEventType variants (5 sub-cases).
* 4 tests in `internal/jobs/cleanup_hyperswitch_webhook_log_test.go`: deletes-older-than-retention, noop-when-nothing-expired, default-retention-on-zero, context-cancellation-respected.

### Item D — Idempotency-Key on CreatePayment / CreateRefund

The Hyperswitch client now sends an `Idempotency-Key` HTTP header on every outbound POST /payments and POST /refunds. The header value is an explicit parameter at every call site — no context-carrier magic, no auto-generation — so the contract is visible in every call and impossible to forget (empty keys cause a loud error, not silent header omission).

Key values:

* CreatePayment → `order.ID.String()` (a UUID generated by GORM BeforeCreate before the HTTP call).
* CreateRefund → `pendingRefund.ID.String()` (same pattern — the UUID is populated by the Phase 1 tx.Create in RefundOrder, so it is available and stable for the Phase 2 PSP call).

Scope (a load-bearing note for future readers): `Idempotency-Key` covers HTTP-transport retry (TLS reconnect, proxy retry, DNS flap) within a single CreatePayment / CreateRefund invocation. It does NOT cover application-level replay (user double-click, form double-submit, retry after a crash before the DB write).
That class of bug requires state-machine preconditions on the VEZA side — already addressed by the order state machine + checkout handler guards (for payments) and the partial UNIQUE on `refunds.hyperswitch_refund_id` landed in v1.0.6.1 (for refunds).

Hyperswitch's server-side TTL on Idempotency-Key is typically 24h-7d (verify against current PSP docs). Beyond the TTL, a retry with the same key is treated as a new request. Not a concern at current volumes; document it if retry logic ever extends beyond 1 hour.

What stays unchanged: this commit does NOT add application-level retry logic. The current "try once, fail loudly" behavior on PSP errors is preserved. Adding retries is a separate design exercise (backoff, max attempts, circuit breaker) explicitly out of scope for item D.

Tests:

* Two httptest.Server-backed tests in client_test.go pin the header value emitted for CreatePayment and CreateRefund, plus two tests asserting that empty keys cause a loud error.
* TestRefundOrder_OpensPendingRefund now pins the `refund.ID.String() == lastIdempotencyKey` contract, so a future refactor that drops or reshapes the key fails the test.
* Four existing test mocks updated for the new signature.

Subscription's CreateSubscriptionPayment interface also takes a payment provider, but no implementation is wired in today (v1.0.6.2 noted this as the bypass surface; v1.0.7 item G is the full fix). When item G lands its Hyperswitch-backed subscription provider, it will need to thread the idempotency key through the same way — noted in item G's acceptance criteria in v107-plan.md.

### Item B — async Stripe Connect reversal worker

`reverseSellerAccounting` moved from a synchronous "mark the row reversed locally without calling Stripe" to an asynchronous "mark the row reversal_pending, let the worker reconcile out-of-band". This decouples buyer-facing refund UX (completes immediately) from Stripe settlement health (may retry, may 404 if already reversed, may permanently fail and need ops attention).
State machine — single source of truth in `internal/core/marketplace/transfer_transitions.go`:

    pending            → {completed, failed}
    completed          → {reversal_pending}                 (item B)
    failed             → {completed, permanently_failed}
    reversal_pending   → {reversed, permanently_failed}     (item B)
    reversed           → {}                                 (terminal)
    permanently_failed → {}                                 (terminal)

`SellerTransfer.TransitionStatus(tx, to, extras)` validates against the matrix and performs a conditional UPDATE guarded by the expected `from` (optimistic-lock semantics — concurrent workers racing on the same row find RowsAffected=0 and log a conflict). `TestNoDirectTransferStatusMutation` greps the marketplace package for raw `.Status = "..."` or `Model(&SellerTransfer{}).Update("status"...)` outside a minimal allowlist and fails if found; it was validated against an injected violation during development.

StripeReversalWorker (`internal/core/marketplace/reversal_worker.go`):

* Tick interval: `REVERSAL_CHECK_INTERVAL` (default 1m).
* Batch limit 20 per tick, indexed on the partial composite `(status, next_retry_at) WHERE status='reversal_pending'` (migration 982).
* Exponential backoff: `REVERSAL_BACKOFF_BASE` × 2^retry_count, capped at `REVERSAL_BACKOFF_MAX` (defaults 1m and 1h).
* `REVERSAL_MAX_RETRIES` (default 5) transitions the row to permanently_failed.
* Legacy rows with an empty stripe_transfer_id → permanently_failed immediately, with a distinctive error_message so ops can find them via grep once the backfill CLI (task #38) lands.

Stripe error disambiguation (day-3 closure of the day-2 dead-code gap):

* 404 + `resource_missing` → `ErrTransferNotFound` → worker transitions to permanently_failed (a data-integrity signal; never retry — that would amplify the inconsistency).
* 400 + message containing "already" + "reversal/reversed" → `ErrTransferAlreadyReversed` → worker treats it as success (someone reversed out-of-band via the Dashboard or another instance; idempotent).
* Any other error is treated as transient → retry with backoff.
* Sentinels live in `internal/core/connecterrors` as a leaf package, because marketplace and services both need them and an import cycle (marketplace → monitoring → services) would form if either owned them directly.

Migration `982` adds the partial composite index for the worker's hot path. Migration `983` adds a CHECK constraint (`status != 'reversal_pending' OR next_retry_at IS NOT NULL`) so the invariant that every reversal_pending row carries a retry timestamp is structural — a bug that ever writes a NULL next_retry_at on a reversal_pending row fails the INSERT/UPDATE at the DB instead of silently orphaning the row.

The worker is covered by 9 unit-test cases plus 3 end-to-end scenarios (refund → worker → reversed, including the invalid-stripe_transfer_id terminal path). An integration smoke against local Postgres confirmed that migrations 981/982/983 apply cleanly.

Behavior change visible to tests: the refund.succeeded webhook now leaves the seller_transfer at reversal_pending rather than reversed directly. `TestProcessRefundWebhook_SucceededFinalizesState` was updated to assert the new expected state and the presence of next_retry_at.

The worker is wired in `cmd/api/main.go` alongside TransferRetryWorker, sharing the same StripeConnectService instance. Gated on `StripeConnectEnabled && StripeConnectSecretKey != ""` (same as TransferRetryWorker) — in dev without Stripe configured, the worker never starts.

### Notes

* `REVERSAL_*` env vars documented in `.env.template` so ops can tune without source-diving.
* The anti-mutation test decision (grep-based rather than a GORM BeforeUpdate hook) forced a minor refactor of `processSellerTransfers` to construct SellerTransfer rows in a single struct literal rather than mutating Status in place after construction. The refactor is neither clearer nor more confusing than the original — borderline stylistic.
  Logged as a post-v1.0.7 consideration: if the GORM hook approach proves cleaner in axis 2 (state-machine transitions for other entities), revisit and potentially retire the grep test in favor of a hook.
* Item A unknown #2 (backfill coverage on historical transfers) is tracked as task #38; item B unknowns: none surfaced during implementation.

## [v1.0.6.2] - 2026-04-17

### Hotfix — subscription payment-gate bypass

Discovered during the 2026-04 audit probe (ops question Q2, "are paid subscriptions actually gated server-side?"). An authenticated user could POST `/api/v1/subscriptions/subscribe` with a paid plan and receive HTTP 201 with `status=active` — with the payment provider never invoked when `HYPERSWITCH_ENABLED=false` (or unset). The resulting row satisfied `checkEligibility()` in the distribution service, which returns `sub.Plan.HasDistribution || sub.Plan.CanSellOnMarketplace`. The Creator plan carries `can_sell_on_marketplace=true`, so any user could reach `/api/v1/distribution/submit` — a paid feature that dispatches to external distribution partners — without paying.

Fix — `GetUserSubscription` now filters out active/trialing rows that lack an effective payment linkage. "Effective" means: on a free plan, or in an unexpired trial, or at least one attached invoice carries a PSP payment intent (`hyperswitch_payment_id` non-empty). This is the sole centralised gate; all paid-feature eligibility paths (distribution and anything added later) route through it.

* `ErrSubscriptionNoPayment` added to `internal/core/subscription`. `GetUserSubscription` returns it when a row sits in active/trialing but fails the payment-effective predicate.
  Callers treat it as ineligible (distribution returns `false, nil`; subscription HTTP handlers return 404 "Active subscription" for the cancel/reactivate/billing-cycle paths; `GET /me/subscription` returns an explicit `needs_payment=true` payload so honest-path users who landed here via a broken flow get actionable information, not a misleading "you're on free" or an opaque 500).
* `Subscribe` and `subscribeToFreePlan` also treat the new error as "no existing active subscription", so a user can re-subscribe cleanly once migration 980 has voided their phantom row.
* `distribution.checkEligibility` propagates `ErrSubscriptionNoPayment` instead of swallowing it as a generic ineligible; the distribution handler surfaces a specific 403 message ("Your subscription is not linked to a payment. Complete payment to enable distribution.") so an honest-path user isn't told to "upgrade their plan" when they already have one.
* Migration `980_void_unpaid_subscriptions.sql` sweeps all pre-v1.0.6.2 phantom rows into `status='expired'`, capturing the `(subscription_id, user_id, plan_id, previous_status)` tuple in a dated audit table (`voided_subscriptions_20260417`) so support can notify any honest-path user who landed there by mistake.
* Probe script `scripts/probes/subscription-unpaid-activation.sh` kept as a versioned regression test. `--dry-run` lists plans; `--destructive` logs in and attempts the exploit, cleaning up after itself. Exit 0 = no bypass; exit 1 = bypass detected.
* Unit test `gate_test.go` covers the 8-branch matrix of the `hasEffectivePayment` predicate (free pass, paid with/without invoice, paid with empty vs populated `hyperswitch_payment_id`, trial variants with future/past/nil `trial_end`, no row at all).
* `TODO(v1.0.7-item-G)` annotation on the `if s.paymentProvider != nil` short-circuit in `createNewSubscription`, so the v1.0.7 work that replaces it with a mandatory `pending_payment` state retains the audit trail.
### Security

Closes a subscription-gate bypass affecting distribution eligibility. Internal audit finding; no external report. Axis-1 correctness item P1.7 will be reclassified to P0 and item G added to the v1.0.7 plan in a follow-up commit.

## [v1.0.6.1] - 2026-04-17

### Hotfix — partial UNIQUE on refunds.hyperswitch_refund_id

Surfaced by the v1.0.6 refund smoke test (scenario S4, triggered after S3 left a failed refund in its post-Phase-1 / pre-Phase-2 state): the plain UNIQUE constraint from migration 978 rejected a second refund attempt on a *different* order because both rows had `hyperswitch_refund_id=''` (Go's zero-value string → empty string, not NULL). Postgres treats two empty strings as colliding under a regular UNIQUE; it only skips NULLs.

* Migration `979_refunds_unique_partial.sql` drops the original constraint and replaces it with a partial UNIQUE that only enforces uniqueness when `hyperswitch_refund_id IS NOT NULL AND <> ''`.
* Preserves the load-bearing idempotency guarantee for successful refunds (a duplicate webhook lands on the same row because the PSP refund_id is set).
* No Go code change — the model and service logic were already correct; only the DB constraint shape needed fixing.
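A sketch of what such a migration can look like — the constraint and index names here are hypothetical, not necessarily those in the committed file:

```sql
-- 979_refunds_unique_partial.sql (illustrative shape only)
-- Drop the plain UNIQUE that treated two '' values as colliding…
ALTER TABLE refunds
    DROP CONSTRAINT IF EXISTS refunds_hyperswitch_refund_id_key;

-- …and enforce uniqueness only for rows that actually carry a PSP id.
CREATE UNIQUE INDEX refunds_hyperswitch_refund_id_uniq
    ON refunds (hyperswitch_refund_id)
    WHERE hyperswitch_refund_id IS NOT NULL
      AND hyperswitch_refund_id <> '';
```

A partial unique index is the standard Postgres tool here: rows excluded by the `WHERE` clause (empty or NULL ids) never collide, while any populated id remains strictly unique.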
Smoke coverage that caught it and re-validates the fix:

* S1 happy path: refund + order + license + seller_transfer + seller_balance all reconciled end-to-end.
* S2 idempotent replay: succeeded_at + transfer.updated_at + available_cents strictly unchanged across 2 webhook deliveries (THE critical proof — duplicate Hyperswitch retries are no-ops at the row level, not just at the handler level).
* S3 PSP error rollback: order reverts to completed, refund persisted as failed, no seller debit.
* S4 webhook refund.failed: order reverts, license intact, seller balance intact — **this is the scenario that surfaced the bug**.
* S5 double-submit: second POST returns 400 ErrRefundAlreadyRequested; only 1 refund row persisted.

## [v1.0.6] - 2026-04-17

### Ergonomics + operational hardening — six items from the v1.0.5 backlog

Follow-up to the hardening sprint. v1.0.5 validated the `register → verify → play` critical path end-to-end; v1.0.6 addresses the next layer — the UX friction and operational blindspots that a first-day public user (or a first-day on-call) would hit. Six targeted commits.

#### Fix 1 — Self-service creator role (`c32278dc1`)

New `POST /api/v1/users/me/upgrade-creator`. Verified users click a "Become an artist" button in `/settings → Account` and their role flips from `user` to `creator` in one conscious click — no KYC, no cooldown, no admin round-trip. One-way by design (downgrade = support ticket) so we don't have to handle the "my uploads orphaned" edge case.

* Gated strictly on `is_verified=true` (403 `EMAIL_NOT_VERIFIED` otherwise).
* Idempotent 200 for anyone already creator-tier — no clutter.
* UPDATE scoped `WHERE role='user'` so a concurrent admin assignment can't be silently overwritten.
* Audit trail: `user.upgrade_creator` action logged with the full role-transition metadata.
* Migration `977_users_promoted_to_creator_at.sql` adds a nullable `promoted_to_creator_at TIMESTAMPTZ` column — distinguishes organic self-promotions from admin-assigned roles for analytics.
* Tests: 6 Go (happy path, unverified, already-creator, admin idempotent, 404, no-auth) + 7 Vitest (verified button, unverified state, hidden for creator, hidden for admin, refetch on success, idempotent message, server-error toast).

#### Fix 2 — Upload size limits from a single source (`5848c2e40`)

The v1.0.5 audit flagged a "front 500MB vs back 100MB" mismatch. In reality every live pair was aligned (tracks 100/100, cloud 500/500, video 500/500) — the real architectural bug was **five duplicated hardcoded values** that could drift silently as soon as anyone tuned one.

* `internal/config/upload_limits.go`: `AudioLimit`, `ImageLimit`, `VideoLimit` expose `Bytes()`, `MB()`, `HumanReadable()`, `AllowedMIMEs`. Read lazily from env (`MAX_UPLOAD_AUDIO_MB`, `MAX_UPLOAD_IMAGE_MB`, `MAX_UPLOAD_VIDEO_MB`, defaults 100/10/500). Invalid, negative, or zero env values fall back to the default.
* `track/service.go`, `track_upload_handler.go`, `education_handler.go`, and `upload.go:GetUploadLimits` all consume the single source. Changing one env var retunes every path.
* Frontend `useUploadLimits()` hook: react-query with 5 min stale, 30 min gc, 1 retry, then optimistic fallback to baked-in defaults so the dropzone stays responsive even without the network round trip. `useUploadModal` replaces the `MAX_FILE_SIZE` constant with the live value; `UploadModal` forwards `audioMaxHuman` to `UploadModalDropzone` so the label and the error toast track the env.
* Out of scope (tracked for later): `CloudUploadModal.tsx` still hardcodes 500MB — cloud uploads accept audio+zip+midi with a different category semantic than the three exposed by `/upload/limits`. Unifying them deserves its own design pass.
* Tests: 4 Go (defaults, env override, invalid-env fallback, MIME lists) + 4 Vitest (sync fallback, typed mapping, partial-payload fallback per category, network failure keeps the fallback).

#### Fix 3 — Unified SMTP env schema (`066144352`)

Two email services in-tree read *different* env vars for the same fields — surfaced during the v1.0.5.1 hotfix:

    internal/email/sender.go    internal/services/email_service.go
    SMTP_USERNAME               SMTP_USER
    SMTP_FROM                   FROM_EMAIL
    SMTP_FROM_NAME              FROM_NAME

v1.0.6 reconciles both onto the canonical `SMTP_*` names, with a migration fallback to the legacy names that logs a structured deprecation warning (`remove_in: v1.1.0`).

* `internal/email/sender.go` is the single loader — both services delegate to it via `LoadSMTPConfigFromEnvWithLogger(*zap.Logger)`. Canonical wins over deprecated; no precedence surprise.
* `docker-compose.yml` backend-api env: `FROM_EMAIL` / `FROM_NAME` → `SMTP_FROM` / `SMTP_FROM_NAME` to match the canonical schema.
* `.env.template` trimmed — only canonical vars ship; the old names are removed (still accepted in a running env for zero-downtime rollover).
* No default injected for Host/Port in the loader. `Host==""` → callers go log-only (matches historic dev behavior). Dev defaults stay in `.env.template`, so prod fails fast instead of silently dialing localhost.
* Tests: 5 Go (empty env, canonical direct, deprecated fallback + warning emission, canonical silently wins over deprecated, nil logger allowed).

#### Fix 4 — Refund reverse-charge with idempotent webhook (`959031667`)

The structural one. Before v1.0.6, `RefundOrder` wrote `status='refunded'` to the DB and called Hyperswitch synchronously, treating the API ack as terminal. In reality Hyperswitch returns `pending` and only finalizes via webhook. Customers could see "refunded" while their bank was still uncredited, and the seller balance kept its credit even on successful refunds.

* Two-phase flow:
  1. **Open pending refund** (short row-locked tx): validate permissions + the 14-day window + the double-submit guard; persist `Refund{status=pending}`; flip the order to `refund_pending` (not `refunded` — that's the webhook's job).
  2. **PSP call outside the tx**: `Provider.CreateRefund` returns `(refund_id, status, err)`. On error, mark the refund failed and roll the order back to `completed`. On success, capture the `hyperswitch_refund_id` as the idempotency key — and stay in `pending` even if the sync status is "succeeded" (per customer guidance: never trust the sync ack, always wait for the webhook).
  3. **`ProcessRefundWebhook`** drives terminal state. Row-lock + `IsTerminal()` short-circuit: any duplicate Hyperswitch retry is a no-op 200. On `refund.succeeded`: flip the refund + order to succeeded/refunded, revoke licenses, debit the seller balance, and mark every `SellerTransfer` for the order as `reversed`.
* Migration `978_refunds_table.sql` with `UNIQUE(hyperswitch_refund_id)` — this is the load-bearing idempotency guarantee.
* Webhook routing: `HyperswitchWebhookPayload.IsRefundEvent()` dispatches `refund.*` events to `ProcessRefundWebhook`; payment events keep flowing through the existing `ProcessPaymentWebhook`.
* `DebitSellerBalance` ported off the Postgres-only `GREATEST()` to a portable `CASE WHEN`; the path wasn't exercised before v1.0.6, so this is a quality fix, not a regression.
* Partial refunds: the signature carries `amount *int64` (nil = full) but the service call site passes nil — full-only for v1.0.6. Partial-refund UX is deferred to v1.0.7.
* Stripe Connect Transfers:reversal call flagged TODO(v1.0.7). Internal balance + transfer status are corrected here so buyer and seller views match the moment the PSP confirms; the missing piece is the money-movement round-trip at Stripe. Internal accounting is consistent — external settlement catches up in v1.0.7.
* Tests: 15 Go cases covering Phase 1 (pending state, PSP error rollback, double-submit, permissions, window), webhook finalization (succeeded, failed, idempotent replay with `succeeded_at` timestamp invariant, unknown refund_id, missing refund_id, non-terminal ignored), and dispatcher logic (6 `IsRefundEvent` cases across flat/nested/event_type shapes).

#### Fix 5 — RTMP ingest health banner on Go Live (`64fa0c9ac`)

"Go Live" was silent when `nginx-rtmp` wasn't running. An artist could copy the RTMP URL + stream key, fire up OBS, and broadcast into the void with no in-UI signal.

* `GET /api/v1/live/health` TCP-dials `NGINX_RTMP_ADDR` (default `localhost:1935`), 2s timeout, 15s TTL cache protected by a mutex so a burst of page loads can't hammer the ingest. Returns a UI-safe `error` string (no raw hostname leak) and `Cache-Control: private, max-age=15` so browsers honor the same window.
* Unreachable path emits a WARN log so operators see the outage before users do.
* Frontend `useLiveHealth()` hook: react-query 15s stale, 1 retry, then optimistic `{ rtmpReachable: true }` — better to miss a banner than flash a false negative on a transient health-endpoint blip.
* `LiveRtmpHealthBanner` at the top of `GoLivePage`: amber, non-blocking, copy explicitly tells the artist the stream key is still valid but broadcasting won't reach anyone, with a Retry button that invalidates the health query.
* Tests: 3 Go (listener reachable + Cache-Control; dead port unreachable + UI-safe error asserting no `127.0.0.1` leak; TTL cache survives listener teardown) + 3 Vitest (hidden when reachable, visible with Retry when unreachable, Retry invalidates the right query key).

#### Fix 6 — RabbitMQ publish failures no longer silent (`bf688af35`)

`RabbitMQEventBus.Publish` returned the broker error but did not log it. Callers that wrapped `Publish` in fire-and-forget (`_ = eb.Publish(...)`) lost events with zero trace during RMQ outages.
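The failure mode — a fire-and-forget caller discarding the broker error — can be sketched as follows. The types and logger here are illustrative stand-ins, not the real `RabbitMQEventBus` (which wraps an AMQP channel and logs via zap).

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBrokerDown = errors.New("broker unreachable")

// EventBus is a hypothetical stand-in: logf replaces *zap.Logger,
// send replaces the AMQP publish.
type EventBus struct {
	enabled bool
	logf    func(level, msg string, kv ...any)
	send    func(exchange, routingKey string, body []byte) error
}

// Publish still returns the broker error, but now also emits a
// structured ERROR with enough context to trace the lost event,
// so even `_ = eb.Publish(...)` call-sites leave an audit trail.
func (b *EventBus) Publish(exchange, routingKey string, body []byte) error {
	if !b.enabled {
		b.logf("WARN", "EventBus disabled", "payload_bytes", len(body))
		return errors.New("event bus unavailable")
	}
	if err := b.send(exchange, routingKey, body); err != nil {
		b.logf("ERROR", "publish failed",
			"exchange", exchange,
			"routing_key", routingKey,
			"payload_bytes", len(body),
			"error", err)
		return err // callers that do check the error keep working
	}
	return nil
}

func main() {
	b := &EventBus{
		enabled: true,
		logf: func(level, msg string, kv ...any) {
			fmt.Println(level, msg, kv)
		},
		send: func(string, string, []byte) error { return ErrBrokerDown },
	}
	// Fire-and-forget caller: the error is dropped, but the ERROR log survives.
	_ = b.Publish("events", "order.refunded", []byte(`{"id":42}`))
}
```

Logging *and* returning the error is the key design point: it fixes observability without changing the function's contract for existing callers.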
* `Publish` now emits a structured ERROR on broker failure with the exchange, routing_key, payload_bytes, content_type, and message_id context. The function still returns the error, so call-sites that actually check it keep working.
* `EventBus disabled` warning kept but upgraded with `payload_bytes` so dashboards can quantify drops when RMQ is intentionally off.
* Aligns the legacy `internal/eventbus` with `infrastructure/eventbus`, which already had this pattern.
* Tests: 2 Go (disabled bus emits WARN + returns `EventBusUnavailableError`; nil logger stays panic-free for legacy callers).

### Breaking changes

* `marketplace.MarketplaceService.RefundOrder` now returns `(*Refund, error)` instead of `error`. Callers consuming the service directly need to accept the pending refund row.
* `marketplace.refundProvider` internal interface: `Refund(...) error` → `CreateRefund(...) (refundID, status string, err error)`. `hyperswitch.Provider` implements both; external mocks must be updated.
* Order status machine gains `refund_pending` as an intermediate state. Clients reading `orders.status` should treat it as "in-flight refund, don't show as refunded yet".

### Known gaps (parked for v1.0.7)

* Partial refunds — UX decision + call-site wiring.
* Stripe Connect Transfers:reversal — actually move money back at the PSP level (internal accounting is correct today).
* `CloudUploadModal.tsx` hardcoded 500 MB — the category semantics don't map to the three categories exposed by `/upload/limits`.
* Smoke test of the refund flow against the Hyperswitch sandbox (manual, outside CI).

## [v1.0.5.1] - 2026-04-16

### Hotfix — dev SMTP ergonomics

Follow-up to the v1.0.5 smoke test: a fresh clone + `cp .env.template .env` + `make dev-full` produced a backend with `SMTP_HOST=""`, which silently short-circuits `EmailService.sendEmail` to a log-only path. New contributors hit register → "where's my verification email?" with no obvious cue that the SMTP hookup was missing.
- `veza-backend-api/.env.template`: `SMTP_HOST` / `SMTP_PORT` now default to the MailHog instance that ships with `make infra-up-dev` (`localhost:1025`, UI on `:8025`). `FROM_EMAIL` / `FROM_NAME` seeded with local-safe values. Comment rewritten to point at both the dev path and the prod override.
- Also exports the duplicate variable names (`SMTP_USERNAME`, `SMTP_FROM`, `SMTP_FROM_NAME`) read by `internal/email/sender.go` — a TODO flagged for v1.0.6 to reconcile the two email services onto a single env schema. Until then, both sets cover every code path.

No code change, no migration, no version bump in the Go module. Pure config hotfix.

## [v1.0.5] - 2026-04-16

### Hardening sprint — seven critical-path fixes before public opening

Audit follow-up on the `register → verify → play` critical path. The app was functional on the surface but broken underneath: the player was silent, emails weren't really sent, the marketplace gave products away in production, the chat silently de-synced across pods, maintenance mode was per-pod only, orphaned tracks accumulated forever in `processing`, and the response cache was corrupting range-aware media responses. Seven targeted fixes, each with its own commit, its own tests, and no behaviour change outside its scope.

#### Fix 1 — Silent player (`veza-backend-api` + `apps/web`)

- New `GET /api/v1/tracks/:id/stream` handler in `internal/core/track/track_hls_handler.go`. Serves the raw file via `http.ServeContent` — `Range`, `If-Modified-Since` and `If-None-Match` handled for free, so `