Phase 0 of the MinIO upload migration (FUNCTIONAL_AUDIT §4 item 2).
Schema + config only — Phase 1 will wire TrackService.UploadTrack()
to actually route writes to S3 when the flag is flipped.
Schema (migration 985):
- tracks.storage_backend VARCHAR(16) NOT NULL DEFAULT 'local'
CHECK in ('local', 's3')
- tracks.storage_key VARCHAR(512) NULL (S3 object key when backend=s3)
- Partial index on storage_backend = 's3' (migration progress queries)
- Rollback drops both columns + index; safe only while all rows are
still 'local' (guard query in the rollback comment)
Go model (internal/models/track.go):
- StorageBackend string (default 'local', not null)
- StorageKey *string (nullable)
- Both tagged json:"-" — internal plumbing, never exposed publicly
Config (internal/config/config.go):
- New field Config.TrackStorageBackend
- Read from TRACK_STORAGE_BACKEND env var (default 'local')
- Production validation rule #11 (ValidateForEnvironment):
- Must be 'local' or 's3' (reject typos like 'S3' or 'minio')
- If 's3', requires AWS_S3_ENABLED=true (fail fast, do not boot with
TrackStorageBackend=s3 while S3StorageService is nil)
- Dev/staging warns and falls back to 'local' instead of fail — keeps
iteration fast while still flagging misconfig.
Docs:
- docs/ENV_VARIABLES.md §13 restructured as "HLS + track storage backend"
with a migration playbook (local → s3 → migrate-storage CLI)
- docs/ENV_VARIABLES.md §28 validation rules: +2 entries for new rules
- docs/ENV_VARIABLES.md §29 drift findings: TRACK_STORAGE_BACKEND added
to "missing from template" list before it was fixed
- veza-backend-api/.env.template: TRACK_STORAGE_BACKEND=local with
comment pointing at Phase 1/2/3 plans
No behavior change yet — TrackService.UploadTrack() still hardcodes the
local path via copyFileAsync(). Phase 1 wires it.
Refs:
- AUDIT_REPORT.md §9 item (deferrals v1.0.8)
- FUNCTIONAL_AUDIT.md §4 item 2 "Stockage local disque only"
- /home/senke/.claude/plans/audit-fonctionnel-wild-hickey.md Item 3
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New ReconcileHyperswitchWorker sweeps for pending orders and refunds
whose terminal webhook never arrived. Pulls live PSP state for each
stuck row and synthesises a webhook payload to feed the normal
ProcessPaymentWebhook / ProcessRefundWebhook dispatcher. The existing
terminal-state guards on those handlers make reconciliation
idempotent against real webhooks — a late webhook after the reconciler
resolved the row is a no-op.
Three stuck-state classes covered:
1. Stuck orders (pending > 30m, non-empty payment_id) → GetPaymentStatus
+ synthetic payment.<status> webhook.
2. Stuck refunds with PSP id (pending > 30m, non-empty
hyperswitch_refund_id) → GetRefundStatus + synthetic
refund.<status> webhook (error_message forwarded).
3. Orphan refunds (pending > 5m, EMPTY hyperswitch_refund_id) →
mark failed + roll order back to completed + log ERROR. This
is the "we crashed between Phase 1 and Phase 2 of RefundOrder"
case, operator-attention territory.
New interfaces:
* marketplace.HyperswitchReadClient — read-only PSP surface the
worker depends on (GetPaymentStatus, GetRefundStatus). The
worker never calls CreatePayment / CreateRefund.
* hyperswitch.Client.GetRefund + RefundStatus struct added.
* hyperswitch.Provider gains GetRefundStatus + GetPaymentStatus
pass-throughs that satisfy the marketplace interface.
Configuration (all env-var tunable with sensible defaults):
* RECONCILE_WORKER_ENABLED=true
* RECONCILE_INTERVAL=1h (ops can drop to 5m during incident
response without a code change)
* RECONCILE_ORDER_STUCK_AFTER=30m
* RECONCILE_REFUND_STUCK_AFTER=30m
* RECONCILE_REFUND_ORPHAN_AFTER=5m (shorter because "app crashed"
is a different signal from "network hiccup")
Operational details:
* Batch limit 50 rows per phase per tick so a 10k-row backlog
doesn't hammer Hyperswitch. Next tick picks up the rest.
* PSP read errors leave the row untouched — next tick retries.
Reconciliation is always safe to replay.
* Structured log on every action so `grep reconcile` tells the
ops story: which order/refund got synced, against what status,
how long it was stuck.
* Worker wired in cmd/api/main.go, gated on
HyperswitchEnabled + HyperswitchAPIKey. Graceful shutdown
registered.
* RunOnce exposed as public API for ad-hoc ops trigger during
incident response.
Tests — 10 cases, all green (sqlite :memory:):
* TestReconcile_StuckOrder_SyncsViaSyntheticWebhook
* TestReconcile_RecentOrder_NotTouched
* TestReconcile_CompletedOrder_NotTouched
* TestReconcile_OrderWithEmptyPaymentID_NotTouched
* TestReconcile_PSPReadErrorLeavesRowIntact
* TestReconcile_OrphanRefund_AutoFails_OrderRollsBack
* TestReconcile_RecentOrphanRefund_NotTouched
* TestReconcile_StuckRefund_SyncsViaSyntheticWebhook
* TestReconcile_StuckRefund_FailureStatus_PassesErrorMessage
* TestReconcile_AllTerminalStates_NoOp
CHANGELOG v1.0.7-rc1 updated with the full item C section between D
and the existing E block, matching the order convention (ship order:
A → D → B → E → C, CHANGELOG order follows).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every POST /webhooks/hyperswitch delivery now writes a row to
`hyperswitch_webhook_log` regardless of signature-valid or
processing outcome. Captures both legitimate deliveries and attack
probes — a forensics query now has the actual bytes to read, not
just a "webhook rejected" log line. Disputes (axis-1 P1.6) ride
along: the log captures dispute.* events alongside payment and
refund events, ready for when disputes get a handler.
Table shape (migration 984):
* payload TEXT — readable in psql, invalid UTF-8 replaced with
empty (forensics value is in headers + ip + timing for those
attacks, not the binary body).
* signature_valid BOOLEAN + partial index for "show me attack
attempts" being instantaneous.
* processing_result TEXT — 'ok' / 'error: <msg>' /
'signature_invalid' / 'skipped'. Matches the P1.5 action
semantic exactly.
* source_ip, user_agent, request_id — forensics essentials.
request_id is captured from Hyperswitch's X-Request-Id header
when present, else a server-side UUID so every row correlates
to VEZA's structured logs.
* event_type — best-effort extract from the JSON payload, NULL
on malformed input.
Hardening:
* 64KB body cap via io.LimitReader rejects oversize with 413
before any INSERT — prevents log-spam DoS.
* Single INSERT per delivery with final state; no two-phase
update race on signature-failure path. signature_invalid and
processing-error rows both land.
* DB persistence failures are logged but swallowed — the
endpoint's contract is to ack Hyperswitch, not perfect audit.
Retention sweep:
* CleanupHyperswitchWebhookLog in internal/jobs, daily tick,
batched DELETE (10k rows + 100ms pause) so a large backlog
doesn't lock the table.
* HYPERSWITCH_WEBHOOK_LOG_RETENTION_DAYS (default 90).
* Same goroutine-ticker pattern as ScheduleOrphanTracksCleanup.
* Wired in cmd/api/main.go alongside the existing cleanup jobs.
Tests: 5 in webhook_log_test.go (persistence, request_id auto-gen,
invalid-JSON leaves event_type empty, invalid-signature capture,
extractEventType 5 sub-cases) + 4 in cleanup_hyperswitch_webhook_
log_test.go (deletes-older-than, noop, default-on-zero,
context-cancel). Migration 984 applied cleanly to local Postgres;
all indexes present.
Also (v107-plan.md):
* Item G acceptance gains an explicit Idempotency-Key threading
requirement with an empty-key loud-fail test — "literally
copy-paste D's 4-line test skeleton". Closes the risk that
item G silently reopens the HTTP-retry duplicate-charge
exposure D closed.
Out of scope for E (noted in CHANGELOG):
* Rate limit on the endpoint — pre-existing middleware covers
it at the router level; adding a per-endpoint limit is
separate scope.
* Readable-payload SQL view — deferred, the TEXT column is
already human-readable; a convenience view is a nice-to-have
not a ship-blocker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Day-2 cut of item B: the reversal path becomes async. Pre-v1.0.7
(and v1.0.7 day 1) the refund handler flipped seller_transfers
straight from completed to reversed without ever calling Stripe —
the ledger said "reversed" while the seller's Stripe balance still
showed the original transfer as settled. The new flow:
refund.succeeded webhook
→ reverseSellerAccounting transitions row: completed → reversal_pending
→ StripeReversalWorker (every REVERSAL_CHECK_INTERVAL, default 1m)
→ calls ReverseTransfer on Stripe
→ success: row → reversed + persist stripe_reversal_id
→ 404 already-reversed (dead code until day 3): row → reversed + log
→ 404 resource_missing (dead code until day 3): row → permanently_failed
→ transient error: stay reversal_pending, bump retry_count,
exponential backoff (base * 2^retry, capped at backoffMax)
→ retries exhausted: row → permanently_failed
→ buyer-facing refund completes immediately regardless of Stripe health
State machine enforcement:
* New `SellerTransfer.TransitionStatus(tx, to, extras)` wraps every
mutation: validates against AllowedTransferTransitions, guarded
UPDATE with WHERE status=<from> (optimistic lock semantics), no
RowsAffected = stale state / concurrent winner detected.
* processSellerTransfers no longer mutates .Status in place —
terminal status is decided before struct construction, so the
row is Created with its final state.
* transfer_retry.retryOne and admin RetryTransfer route through
TransitionStatus. Legacy direct assignment removed.
* TestNoDirectTransferStatusMutation greps the package for any
`st.Status = "..."` / `t.Status = "..."` / GORM
Model(&SellerTransfer{}).Update("status"...) outside the
allowlist and fails if found. Verified by temporarily injecting
a violation during development — test caught it as expected.
Configuration (v1.0.7 item B):
* REVERSAL_WORKER_ENABLED=true (default)
* REVERSAL_MAX_RETRIES=5 (default)
* REVERSAL_CHECK_INTERVAL=1m (default)
* REVERSAL_BACKOFF_BASE=1m (default)
* REVERSAL_BACKOFF_MAX=1h (default, caps exponential growth)
* .env.template documents TRANSFER_RETRY_* and REVERSAL_* env vars
so an ops reader can grep them.
Interface change: TransferService.ReverseTransfer(ctx,
stripe_transfer_id, amount *int64, reason) (reversalID, error)
added. All four mocks extended (process_webhook, transfer_retry,
admin_transfer_handler, payment_flow integration). amount=nil means
full reversal; v1.0.7 always passes nil (partial reversal is future
scope per axis-1 P2).
Stripe 404 disambiguation (ErrTransferAlreadyReversed /
ErrTransferNotFound) is wired in the worker as dead code — the
sentinels are declared and the worker branches on them, but
StripeConnectService.ReverseTransfer doesn't yet emit them. Day 3
will parse stripe.Error.Code and populate the sentinels; no worker
change needed at that point. Keeping the handling skeleton in day 2
so the worker's branch shape doesn't change between days and the
tests can already cover all four paths against the mock.
Worker unit tests (9 cases, all green, sqlite :memory:):
* happy path: reversal_pending → reversed + stripe_reversal_id set
* already reversed (mock returns sentinel): → reversed + log
* not found (mock returns sentinel): → permanently_failed + log
* transient 503: retry_count++, next_retry_at set with backoff,
stays reversal_pending
* backoff capped at backoffMax (verified with base=1s, max=10s,
retry_count=4 → capped at 10s not 16s)
* max retries exhausted: → permanently_failed
* legacy row with empty stripe_transfer_id: → permanently_failed,
does not call Stripe
* only picks up reversal_pending (skips all other statuses)
* respects next_retry_at (future rows skipped)
Existing test updated: TestProcessRefundWebhook_SucceededFinalizesState
now asserts the row lands at reversal_pending with next_retry_at
set (worker's responsibility to drive to reversed), not reversed.
Worker wired in cmd/api/main.go alongside TransferRetryWorker,
sharing the same StripeConnectService instance. Shutdown path
registered for graceful stop.
Cut from day 2 scope (per agreed-upon discipline), landing in day 3:
* Stripe 404 disambiguation implementation (parse error.Code)
* End-to-end smoke probe (refund → reversal_pending → worker
processes → reversed) against local Postgres + mock Stripe
* Batch-size tuning / inter-batch sleep — batchLimit=20 today is
safely under Stripe's 100 req/s default rate limit; revisit if
observed load warrants
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third item of the v1.0.6 backlog. The v1.0.5.1 hotfix surfaced that two
email paths in-tree read *different* env vars for the same configuration:
internal/email/sender.go internal/services/email_service.go
SMTP_USERNAME SMTP_USER
SMTP_FROM FROM_EMAIL
SMTP_FROM_NAME FROM_NAME
The hotfix worked around it by exporting both sets in `.env.template`.
This commit reconciles them onto a single schema so the workaround can
go away.
Changes
* `internal/email/sender.go` is now the single loader. The canonical
names (`SMTP_USERNAME`, `SMTP_FROM`, `SMTP_FROM_NAME`) are read
first; the legacy names (`SMTP_USER`, `FROM_EMAIL`, `FROM_NAME`)
stay supported as a migration fallback that logs a structured
deprecation warning ("remove_in: v1.1.0"). Canonical always wins
over deprecated — no silent precedence flip.
* `NewSMTPEmailSender` callers keep working unchanged; a new
`LoadSMTPConfigFromEnvWithLogger(*zap.Logger)` variant lets callers
opt into the warning stream.
* `internal/services/email_service.go` drops its six inline
`os.Getenv` reads and delegates to the shared loader, so
`AuthService.Register` and `RequestPasswordReset` now see exactly
the same config as the async job worker.
* `.env.template`: the duplicate (SMTP_USER + FROM_EMAIL + FROM_NAME)
block added in v1.0.5.1 is removed — only the canonical SMTP_*
names ship for new contributors.
* `docker-compose.yml` (backend-api service): FROM_EMAIL / FROM_NAME
renamed to SMTP_FROM / SMTP_FROM_NAME to match the canonical schema.
* No Host/Port default injected in the loader. If SMTP_HOST is
empty, callers see Host=="" and log-only (historic dev behavior).
Dev defaults (MailHog localhost:1025) live in `.env.template`, so
a fresh clone still works; a misconfigured prod pod fails loud
instead of silently dialing localhost.
Tests
* 5 new Go tests in `internal/email/smtp_env_test.go`: empty-env
returns empty config; canonical names read directly; deprecated
names fall back (one warning per var); canonical wins over
deprecated silently; nil logger is allowed.
* Existing `TestLoadSMTPConfigFromEnv`, `TestSMTPEmailSender_Send`,
and every auth/services package remained green (40+ packages).
Import-cycle note: the loader deliberately lives in `internal/email`,
not `internal/config`, because `internal/config` already depends on
`internal/email` (wiring `EmailSender` at boot). Putting the loader in
`email` keeps the dependency flow one-way.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A fresh clone + `cp veza-backend-api/.env.template .env` + `make dev-full`
booted the backend with `SMTP_HOST=""` — `EmailService.sendEmail` short-
circuits to log-only when the host is empty, so `register` + `password
reset` produced users stuck with no way to verify (or recover) in dev,
and the smoke test caught MailHog empty despite the service being up.
- `.env.template` now ships MailHog-ready defaults (`localhost:1025`,
UI on `:8025`, `FROM_EMAIL=no-reply@veza.local`) so a bare clone +
copy gives a working register flow. Comment rewritten to point at
both the dev path and the prod override.
- Also exports duplicate variable names (`SMTP_USERNAME`, `SMTP_FROM`,
`SMTP_FROM_NAME`) read by `internal/email/sender.go`. The two email
services in-tree disagree on env schema (`SMTP_USER` vs
`SMTP_USERNAME`, `FROM_EMAIL` vs `SMTP_FROM`, `FROM_NAME` vs
`SMTP_FROM_NAME`); until v1.0.6 reconciles them, both sets are
populated so whichever path fires finds its names.
Pure config hotfix. No code change, no migration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add DATABASE_READ_URL config and InitReadReplica in database package
- Add ForRead() helper for read-only handler routing
- Update TrackService and TrackSearchService to use read replica for reads
- Document setup in DEPLOYMENT_GUIDE.md and .env.template