senke/veza - Talas Project: Beyond coding. We Forge.

Author	SHA1	Message	Date
senke	d03232c85c	feat(storage): add track storage_backend column + config prep (v1.0.8 P0) Phase 0 of the MinIO upload migration (FUNCTIONAL_AUDIT §4 item 2). Schema + config only; Phase 1 will wire TrackService.UploadTrack() to actually route writes to S3 when the flag is flipped. Schema (migration 985): - tracks.storage_backend VARCHAR(16) NOT NULL DEFAULT 'local' CHECK in ('local', 's3') - tracks.storage_key VARCHAR(512) NULL (S3 object key when backend=s3) - Partial index on storage_backend = 's3' (migration progress queries) - Rollback drops both columns + index; safe only while all rows are still 'local' (guard query in the rollback comment) Go model (internal/models/track.go): - StorageBackend string (default 'local', not null) - StorageKey *string (nullable) - Both tagged json:"-" (internal plumbing, never exposed publicly) Config (internal/config/config.go): - New field Config.TrackStorageBackend - Read from TRACK_STORAGE_BACKEND env var (default 'local') - Production validation rule #11 (ValidateForEnvironment): - Must be 'local' or 's3' (reject typos like 'S3' or 'minio') - If 's3', requires AWS_S3_ENABLED=true (fail fast, do not boot with TrackStorageBackend=s3 while S3StorageService is nil) - Dev/staging warns and falls back to 'local' instead of failing, which keeps iteration fast while still flagging misconfig.
Docs: - docs/ENV_VARIABLES.md §13 restructured as "HLS + track storage backend" with a migration playbook (local → s3 → migrate-storage CLI) - docs/ENV_VARIABLES.md §28 validation rules: +2 entries for new rules - docs/ENV_VARIABLES.md §29 drift findings: TRACK_STORAGE_BACKEND added to "missing from template" list before it was fixed - veza-backend-api/.env.template: TRACK_STORAGE_BACKEND=local with comment pointing at Phase 1/2/3 plans No behavior change yet — TrackService.UploadTrack() still hardcodes the local path via copyFileAsync(). Phase 1 wires it. Refs: - AUDIT_REPORT.md §9 item (deferrals v1.0.8) - FUNCTIONAL_AUDIT.md §4 item 2 "Stockage local disque only" - /home/senke/.claude/plans/audit-fonctionnel-wild-hickey.md Item 3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:54:28 +02:00
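The validation rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's ValidateForEnvironment: the function name resolveTrackStorageBackend, its parameters, and the use of the string "production" to select strict mode are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// resolveTrackStorageBackend is a hypothetical helper that mirrors the rule
// from the commit message: only "local" or "s3" are accepted, and "s3"
// requires S3 to be enabled. Treating env == "production" as the strict
// environment is an assumption of this sketch.
func resolveTrackStorageBackend(env, backend string, s3Enabled bool) (string, error) {
	if backend == "" {
		return "local", nil // variable unset: documented default
	}
	var problem string
	switch {
	case backend != "local" && backend != "s3":
		// Case-sensitive on purpose: 'S3' and 'minio' are rejected.
		problem = fmt.Sprintf("TRACK_STORAGE_BACKEND must be 'local' or 's3', got %q", backend)
	case backend == "s3" && !s3Enabled:
		problem = "TRACK_STORAGE_BACKEND=s3 requires AWS_S3_ENABLED=true"
	default:
		return backend, nil
	}
	if env == "production" {
		return "", errors.New(problem) // fail fast, do not boot
	}
	// Dev/staging: warn and fall back instead of failing.
	log.Printf("WARN: %s; falling back to 'local'", problem)
	return "local", nil
}

func main() {
	cases := []struct {
		env, backend string
		s3           bool
	}{
		{"production", "s3", true},
		{"production", "S3", true},
		{"development", "s3", false},
	}
	for _, c := range cases {
		got, err := resolveTrackStorageBackend(c.env, c.backend, c.s3)
		fmt.Printf("env=%s backend=%q s3=%v -> %q, err=%v\n", c.env, c.backend, c.s3, got, err)
	}
}
```

Returning the effective backend rather than mutating a config struct keeps the fallback explicit at the call site.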
senke	7d03ee6686	docs(env): canonicalize ENV_VARIABLES.md + add HLS_STREAMING template Resolves AUDIT_REPORT §9 item #15 (last real item before v1.0.7 final) and FUNCTIONAL_AUDIT §4 stability item 5. docs/ENV_VARIABLES.md: - Complete rewrite from 172 → ~600 lines covering all ~180 env vars surveyed directly from code (os.Getenv in Go, std::env::var in Rust, import.meta.env in React). - 30 sections: core, DB, Redis, JWT, OAuth, CORS, rate-limit, SMTP, Hyperswitch, Stripe Connect, RabbitMQ, S3/MinIO, HLS, stream server, Elasticsearch, ClamAV, Sentry, logging, metrics, frontend Vite, feature flags, password policy, build info, RTMP/misc, Rust stream schema, security headers recap, deprecated vars, prod validation rules, drift findings, startup checklist. - Documents 8 production-critical validation rules (validation.go:869-1018). - Flags 14 deprecated vars with canonical replacements for v1.1.0 cleanup. - Catalogs 11 vars used by code but missing from template (HLS_STREAMING, SLOW_REQUEST_THRESHOLD_MS, CONFIG_WATCH, HANDLER_TIMEOUT, VAPID_*, etc). veza-backend-api/.env.template: - Add HLS_STREAMING=false with documentation of fallback behavior (/tracks/:id/stream with Range support when off). - Add HLS_STORAGE_DIR=/tmp/veza-hls. Closes last blocker before v1.0.7 final tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:36:44 +02:00
senke	7e180a2c08	feat(workers): hyperswitch reconciliation sweep for stuck pending states — v1.0.7 item C New ReconcileHyperswitchWorker sweeps for pending orders and refunds whose terminal webhook never arrived. Pulls live PSP state for each stuck row and synthesises a webhook payload to feed the normal ProcessPaymentWebhook / ProcessRefundWebhook dispatcher. The existing terminal-state guards on those handlers make reconciliation idempotent against real webhooks — a late webhook after the reconciler resolved the row is a no-op. Three stuck-state classes covered: 1. Stuck orders (pending > 30m, non-empty payment_id) → GetPaymentStatus + synthetic payment.<status> webhook. 2. Stuck refunds with PSP id (pending > 30m, non-empty hyperswitch_refund_id) → GetRefundStatus + synthetic refund.<status> webhook (error_message forwarded). 3. Orphan refunds (pending > 5m, EMPTY hyperswitch_refund_id) → mark failed + roll order back to completed + log ERROR. This is the "we crashed between Phase 1 and Phase 2 of RefundOrder" case, operator-attention territory. New interfaces: * marketplace.HyperswitchReadClient — read-only PSP surface the worker depends on (GetPaymentStatus, GetRefundStatus). The worker never calls CreatePayment / CreateRefund. * hyperswitch.Client.GetRefund + RefundStatus struct added. * hyperswitch.Provider gains GetRefundStatus + GetPaymentStatus pass-throughs that satisfy the marketplace interface. Configuration (all env-var tunable with sensible defaults): * RECONCILE_WORKER_ENABLED=true * RECONCILE_INTERVAL=1h (ops can drop to 5m during incident response without a code change) * RECONCILE_ORDER_STUCK_AFTER=30m * RECONCILE_REFUND_STUCK_AFTER=30m * RECONCILE_REFUND_ORPHAN_AFTER=5m (shorter because "app crashed" is a different signal from "network hiccup") Operational details: * Batch limit 50 rows per phase per tick so a 10k-row backlog doesn't hammer Hyperswitch. Next tick picks up the rest. * PSP read errors leave the row untouched — next tick retries. 
Reconciliation is always safe to replay. * Structured log on every action so `grep reconcile` tells the ops story: which order/refund got synced, against what status, how long it was stuck. * Worker wired in cmd/api/main.go, gated on HyperswitchEnabled + HyperswitchAPIKey. Graceful shutdown registered. * RunOnce exposed as public API for ad-hoc ops trigger during incident response. Tests — 10 cases, all green (sqlite :memory:): * TestReconcile_StuckOrder_SyncsViaSyntheticWebhook * TestReconcile_RecentOrder_NotTouched * TestReconcile_CompletedOrder_NotTouched * TestReconcile_OrderWithEmptyPaymentID_NotTouched * TestReconcile_PSPReadErrorLeavesRowIntact * TestReconcile_OrphanRefund_AutoFails_OrderRollsBack * TestReconcile_RecentOrphanRefund_NotTouched * TestReconcile_StuckRefund_SyncsViaSyntheticWebhook * TestReconcile_StuckRefund_FailureStatus_PassesErrorMessage * TestReconcile_AllTerminalStates_NoOp CHANGELOG v1.0.7-rc1 updated with the full item C section between D and the existing E block, matching the order convention (ship order: A → D → B → E → C, CHANGELOG order follows). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 03:08:15 +02:00
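The three stuck-state classes in this commit amount to a small decision table. The sketch below is one possible reading of it; the row struct, the action names, and the classify function are hypothetical and only the thresholds come from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// Thresholds mirror the defaults listed in the commit message.
const (
	orderStuckAfter   = 30 * time.Minute
	refundStuckAfter  = 30 * time.Minute
	refundOrphanAfter = 5 * time.Minute
)

type action string

const (
	actionNone       action = "none"
	actionSyncOrder  action = "sync_order"  // GetPaymentStatus + synthetic webhook
	actionSyncRefund action = "sync_refund" // GetRefundStatus + synthetic webhook
	actionFailOrphan action = "fail_orphan" // mark failed, roll order back
)

// row is a hypothetical, simplified view of an order or refund record.
type row struct {
	kind   string // "order" or "refund"
	status string
	pspID  string // payment_id for orders, hyperswitch_refund_id for refunds
	age    time.Duration
}

// classify decides what the reconciler would do with a single row.
func classify(r row) action {
	if r.status != "pending" {
		return actionNone // terminal states are never touched
	}
	switch r.kind {
	case "order":
		if r.pspID != "" && r.age > orderStuckAfter {
			return actionSyncOrder
		}
	case "refund":
		if r.pspID == "" {
			if r.age > refundOrphanAfter {
				return actionFailOrphan
			}
			return actionNone
		}
		if r.age > refundStuckAfter {
			return actionSyncRefund
		}
	}
	return actionNone
}

func main() {
	rows := []row{
		{"order", "pending", "pay_123", 45 * time.Minute},
		{"order", "pending", "", 45 * time.Minute},
		{"refund", "pending", "", 6 * time.Minute},
		{"refund", "pending", "ref_9", 10 * time.Minute},
	}
	for _, r := range rows {
		fmt.Printf("%+v -> %s\n", r, classify(r))
	}
}
```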
senke	3c4d0148be	feat(webhooks): persist raw hyperswitch payloads to audit log — v1.0.7 item E Every POST /webhooks/hyperswitch delivery now writes a row to `hyperswitch_webhook_log` regardless of signature-valid or processing outcome. Captures both legitimate deliveries and attack probes — a forensics query now has the actual bytes to read, not just a "webhook rejected" log line. Disputes (axis-1 P1.6) ride along: the log captures dispute.* events alongside payment and refund events, ready for when disputes get a handler. Table shape (migration 984): * payload TEXT — readable in psql, invalid UTF-8 replaced with empty (forensics value is in headers + ip + timing for those attacks, not the binary body). * signature_valid BOOLEAN + partial index for "show me attack attempts" being instantaneous. * processing_result TEXT — 'ok' / 'error: <msg>' / 'signature_invalid' / 'skipped'. Matches the P1.5 action semantic exactly. * source_ip, user_agent, request_id — forensics essentials. request_id is captured from Hyperswitch's X-Request-Id header when present, else a server-side UUID so every row correlates to VEZA's structured logs. * event_type — best-effort extract from the JSON payload, NULL on malformed input. Hardening: * 64KB body cap via io.LimitReader rejects oversize with 413 before any INSERT — prevents log-spam DoS. * Single INSERT per delivery with final state; no two-phase update race on signature-failure path. signature_invalid and processing-error rows both land. * DB persistence failures are logged but swallowed — the endpoint's contract is to ack Hyperswitch, not perfect audit. Retention sweep: * CleanupHyperswitchWebhookLog in internal/jobs, daily tick, batched DELETE (10k rows + 100ms pause) so a large backlog doesn't lock the table. * HYPERSWITCH_WEBHOOK_LOG_RETENTION_DAYS (default 90). * Same goroutine-ticker pattern as ScheduleOrphanTracksCleanup. * Wired in cmd/api/main.go alongside the existing cleanup jobs. 
Tests: 5 in webhook_log_test.go (persistence, request_id auto-gen, invalid-JSON leaves event_type empty, invalid-signature capture, extractEventType 5 sub-cases) + 4 in cleanup_hyperswitch_webhook_log_test.go (deletes-older-than, noop, default-on-zero, context-cancel). Migration 984 applied cleanly to local Postgres; all indexes present. Also (v107-plan.md): * Item G acceptance gains an explicit Idempotency-Key threading requirement with an empty-key loud-fail test: "literally copy-paste D's 4-line test skeleton". Closes the risk that item G silently reopens the HTTP-retry duplicate-charge exposure D closed. Out of scope for E (noted in CHANGELOG): * Rate limit on the endpoint: pre-existing middleware covers it at the router level; adding a per-endpoint limit is separate scope. * Readable-payload SQL view: deferred, the TEXT column is already human-readable; a convenience view is a nice-to-have, not a ship-blocker. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 02:44:58 +02:00
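Two of the hardening points in this commit, the 64KB body cap and the best-effort event_type extraction, might look like the sketch below. The helper readCapped, the error value, and the assumption that the payload carries a top-level "event_type" string are illustrative; the project's own extractEventType may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

const maxWebhookBody = 64 << 10 // 64KB cap from the commit message

var errBodyTooLarge = errors.New("webhook body exceeds cap")

// readCapped reads at most limit bytes. It asks the LimitReader for one
// extra byte so an oversize body can be told apart from one that is exactly
// at the limit; the caller would answer 413 on errBodyTooLarge.
func readCapped(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errBodyTooLarge
	}
	return data, nil
}

// extractEventType returns the event type, or nil on malformed input
// (stored as NULL). The "event_type" key is an assumption of this sketch.
func extractEventType(payload []byte) *string {
	var probe struct {
		EventType *string `json:"event_type"`
	}
	if err := json.Unmarshal(payload, &probe); err != nil {
		return nil
	}
	return probe.EventType
}

func main() {
	body, err := readCapped(strings.NewReader(`{"event_type":"refund.succeeded"}`), maxWebhookBody)
	if err != nil {
		fmt.Println("reject:", err)
		return
	}
	if et := extractEventType(body); et != nil {
		fmt.Println("event_type:", *et)
	}

	_, err = readCapped(strings.NewReader(strings.Repeat("x", maxWebhookBody+1)), maxWebhookBody)
	fmt.Println("oversize:", err)
}
```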
senke	d2bb9c0e78	feat(marketplace): async stripe connect reversal worker — v1.0.7 item B day 2 Day-2 cut of item B: the reversal path becomes async. Pre-v1.0.7 (and v1.0.7 day 1) the refund handler flipped seller_transfers straight from completed to reversed without ever calling Stripe — the ledger said "reversed" while the seller's Stripe balance still showed the original transfer as settled. The new flow: refund.succeeded webhook → reverseSellerAccounting transitions row: completed → reversal_pending → StripeReversalWorker (every REVERSAL_CHECK_INTERVAL, default 1m) → calls ReverseTransfer on Stripe → success: row → reversed + persist stripe_reversal_id → 404 already-reversed (dead code until day 3): row → reversed + log → 404 resource_missing (dead code until day 3): row → permanently_failed → transient error: stay reversal_pending, bump retry_count, exponential backoff (base * 2^retry, capped at backoffMax) → retries exhausted: row → permanently_failed → buyer-facing refund completes immediately regardless of Stripe health State machine enforcement: * New `SellerTransfer.TransitionStatus(tx, to, extras)` wraps every mutation: validates against AllowedTransferTransitions, guarded UPDATE with WHERE status=<from> (optimistic lock semantics), no RowsAffected = stale state / concurrent winner detected. * processSellerTransfers no longer mutates .Status in place — terminal status is decided before struct construction, so the row is Created with its final state. * transfer_retry.retryOne and admin RetryTransfer route through TransitionStatus. Legacy direct assignment removed. * TestNoDirectTransferStatusMutation greps the package for any `st.Status = "..."` / `t.Status = "..."` / GORM Model(&SellerTransfer{}).Update("status"...) outside the allowlist and fails if found. Verified by temporarily injecting a violation during development — test caught it as expected. 
Configuration (v1.0.7 item B): * REVERSAL_WORKER_ENABLED=true (default) * REVERSAL_MAX_RETRIES=5 (default) * REVERSAL_CHECK_INTERVAL=1m (default) * REVERSAL_BACKOFF_BASE=1m (default) * REVERSAL_BACKOFF_MAX=1h (default, caps exponential growth) * .env.template documents TRANSFER_RETRY_* and REVERSAL_* env vars so an ops reader can grep them. Interface change: TransferService.ReverseTransfer(ctx, stripe_transfer_id, amount *int64, reason) (reversalID, error) added. All four mocks extended (process_webhook, transfer_retry, admin_transfer_handler, payment_flow integration). amount=nil means full reversal; v1.0.7 always passes nil (partial reversal is future scope per axis-1 P2). Stripe 404 disambiguation (ErrTransferAlreadyReversed / ErrTransferNotFound) is wired in the worker as dead code: the sentinels are declared and the worker branches on them, but StripeConnectService.ReverseTransfer doesn't yet emit them. Day 3 will parse stripe.Error.Code and populate the sentinels; no worker change needed at that point. Keeping the handling skeleton in day 2 so the worker's branch shape doesn't change between days and the tests can already cover all four paths against the mock.
Worker unit tests (9 cases, all green, sqlite :memory:): * happy path: reversal_pending → reversed + stripe_reversal_id set * already reversed (mock returns sentinel): → reversed + log * not found (mock returns sentinel): → permanently_failed + log * transient 503: retry_count++, next_retry_at set with backoff, stays reversal_pending * backoff capped at backoffMax (verified with base=1s, max=10s, retry_count=4 → capped at 10s not 16s) * max retries exhausted: → permanently_failed * legacy row with empty stripe_transfer_id: → permanently_failed, does not call Stripe * only picks up reversal_pending (skips all other statuses) * respects next_retry_at (future rows skipped) Existing test updated: TestProcessRefundWebhook_SucceededFinalizesState now asserts the row lands at reversal_pending with next_retry_at set (worker's responsibility to drive to reversed), not reversed. Worker wired in cmd/api/main.go alongside TransferRetryWorker, sharing the same StripeConnectService instance. Shutdown path registered for graceful stop. Cut from day 2 scope (per agreed-upon discipline), landing in day 3: * Stripe 404 disambiguation implementation (parse error.Code) * End-to-end smoke probe (refund → reversal_pending → worker processes → reversed) against local Postgres + mock Stripe * Batch-size tuning / inter-batch sleep: batchLimit=20 today is safely under Stripe's 100 req/s default rate limit; revisit if observed load warrants Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 15:34:29 +02:00
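The capped exponential backoff this commit describes (base * 2^retry, capped at backoffMax) can be sketched as below. The function name backoffDelay and its signature are assumptions; the figures in main reproduce the case quoted in the test list (base=1s, max=10s, retry_count=4 gives 10s, not 16s).

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns base * 2^retry, capped at max. It doubles step by
// step and stops as soon as the cap is reached, so a large retry count
// cannot overflow time.Duration when max is positive.
func backoffDelay(base, max time.Duration, retry int) time.Duration {
	d := base
	for i := 0; i < retry; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	for retry := 0; retry <= 5; retry++ {
		fmt.Printf("retry=%d delay=%s\n", retry, backoffDelay(time.Second, 10*time.Second, retry))
	}
}
```

With the defaults from the commit (REVERSAL_BACKOFF_BASE=1m, REVERSAL_BACKOFF_MAX=1h), the same function would reach the cap at the sixth retry, since 2^6 minutes exceeds one hour.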
senke	9002e91d91	refactor(backend,infra): unify SMTP env schema on canonical SMTP_* names Third item of the v1.0.6 backlog. The v1.0.5.1 hotfix surfaced that two email paths in-tree read different env vars for the same configuration: internal/email/sender.go reads SMTP_USERNAME, SMTP_FROM, SMTP_FROM_NAME; internal/services/email_service.go reads SMTP_USER, FROM_EMAIL, FROM_NAME. The hotfix worked around it by exporting both sets in `.env.template`. This commit reconciles them onto a single schema so the workaround can go away. Changes * `internal/email/sender.go` is now the single loader. The canonical names (`SMTP_USERNAME`, `SMTP_FROM`, `SMTP_FROM_NAME`) are read first; the legacy names (`SMTP_USER`, `FROM_EMAIL`, `FROM_NAME`) stay supported as a migration fallback that logs a structured deprecation warning ("remove_in: v1.1.0"). Canonical always wins over deprecated; no silent precedence flip. * `NewSMTPEmailSender` callers keep working unchanged; a new `LoadSMTPConfigFromEnvWithLogger(zap.Logger)` variant lets callers opt into the warning stream. `internal/services/email_service.go` drops its six inline `os.Getenv` reads and delegates to the shared loader, so `AuthService.Register` and `RequestPasswordReset` now see exactly the same config as the async job worker. * `.env.template`: the duplicate (SMTP_USER + FROM_EMAIL + FROM_NAME) block added in v1.0.5.1 is removed; only the canonical SMTP_* names ship for new contributors. * `docker-compose.yml` (backend-api service): FROM_EMAIL / FROM_NAME renamed to SMTP_FROM / SMTP_FROM_NAME to match the canonical schema. * No Host/Port default injected in the loader. If SMTP_HOST is empty, callers see Host=="" and log-only (historic dev behavior). Dev defaults (MailHog localhost:1025) live in `.env.template`, so a fresh clone still works; a misconfigured prod pod fails loud instead of silently dialing localhost.
Tests * 5 new Go tests in `internal/email/smtp_env_test.go`: empty-env returns empty config; canonical names read directly; deprecated names fall back (one warning per var); canonical wins over deprecated silently; nil logger is allowed. * Existing `TestLoadSMTPConfigFromEnv`, `TestSMTPEmailSender_Send`, and every auth/services package remained green (40+ packages). Import-cycle note: the loader deliberately lives in `internal/email`, not `internal/config`, because `internal/config` already depends on `internal/email` (wiring `EmailSender` at boot). Putting the loader in `email` keeps the dependency flow one-way. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 20:44:09 +02:00
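The precedence rule in this commit (canonical name first, deprecated name as a warned fallback, canonical wins silently) could be expressed as the small helper below. The name lookupWithFallback and the injected getenv and warn functions are assumptions for the sake of a testable example; the project's loader uses a zap logger instead.

```go
package main

import (
	"fmt"
	"os"
)

// lookupWithFallback reads the canonical variable first. Only when it is
// empty does it consult the deprecated name, emitting one warning for it.
// A nil warn function is allowed, matching the "nil logger" test case.
func lookupWithFallback(getenv func(string) string, warn func(string), canonical, deprecated string) string {
	if v := getenv(canonical); v != "" {
		return v // canonical wins silently, even if the deprecated name is set
	}
	if v := getenv(deprecated); v != "" {
		if warn != nil {
			warn(fmt.Sprintf("%s is deprecated, use %s (remove_in: v1.1.0)", deprecated, canonical))
		}
		return v
	}
	return ""
}

func main() {
	warn := func(msg string) { fmt.Fprintln(os.Stderr, "WARN:", msg) }
	user := lookupWithFallback(os.Getenv, warn, "SMTP_USERNAME", "SMTP_USER")
	from := lookupWithFallback(os.Getenv, warn, "SMTP_FROM", "FROM_EMAIL")
	name := lookupWithFallback(os.Getenv, warn, "SMTP_FROM_NAME", "FROM_NAME")
	fmt.Printf("username=%q from=%q from_name=%q\n", user, from, name)
}
```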
senke	070e31a463	chore(release): v1.0.5.1 - dev SMTP ergonomics hotfix A fresh clone + `cp veza-backend-api/.env.template .env` + `make dev-full` booted the backend with `SMTP_HOST=""`. `EmailService.sendEmail` short-circuits to log-only when the host is empty, so `register` + `password reset` produced users stuck with no way to verify (or recover) in dev, and the smoke test found MailHog empty despite the service being up. - `.env.template` now ships MailHog-ready defaults (`localhost:1025`, UI on `:8025`, `FROM_EMAIL=no-reply@veza.local`) so a bare clone + copy gives a working register flow. Comment rewritten to point at both the dev path and the prod override. - Also exports duplicate variable names (`SMTP_USERNAME`, `SMTP_FROM`, `SMTP_FROM_NAME`) read by `internal/email/sender.go`. The two email services in-tree disagree on env schema (`SMTP_USER` vs `SMTP_USERNAME`, `FROM_EMAIL` vs `SMTP_FROM`, `FROM_NAME` vs `SMTP_FROM_NAME`); until v1.0.6 reconciles them, both sets are populated so whichever path fires finds its names. Pure config hotfix. No code change, no migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 18:16:54 +02:00
senke	2df921abd5	v0.9.1	2026-03-05 19:22:31 +01:00
senke	ae81e171c7	feat(seller): add Stripe Connect config	2026-02-23 22:09:23 +01:00
senke	b73387af3c	feat(api): add PostgreSQL read replica support (3.7) - Add DATABASE_READ_URL config and InitReadReplica in database package - Add ForRead() helper for read-only handler routing - Update TrackService and TrackSearchService to use read replica for reads - Document setup in DEPLOYMENT_GUIDE.md and .env.template	2026-02-14 22:50:23 +01:00
senke	92f432fb9e	chore: consolidate pending changes (Hyperswitch, PostCard, dashboard, stream server, etc.)	2026-02-14 21:45:15 +01:00
senke	30f17dfc2a	chore(backend): config, router, auth, stream service, sanitizer, tests Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-11 22:19:09 +01:00
senke	712bfb6b8c	config(template): add comprehensive .env.template Created centralized environment template with all configuration variables documented and categorized. Categories: - REQUIRED: DATABASE_URL, JWT_SECRET (min 32 chars), REDIS - RECOMMENDED: SENTRY_DSN, COOKIE_SECURE, CORS_ALLOWED_ORIGINS - OPTIONAL: RABBITMQ, SMTP, CLAMAV, S3 Features: - Clear documentation for each variable - Default values specified - Validation rules documented - Environment-specific guidance (dev vs prod) - Security notes for sensitive values Impact: Single source of truth for configuration, reduces config drift. Fixes: P3.4 (part 1) from audit AUDIT_TEMP_29_01_2026.md	2026-01-29 23:32:18 +01:00

13 commits