Real User Monitoring closes the gap between synthetic probes (which
already cover server-side latency) and what users actually see in
their browsers. Slow CDN edges, third-party scripts, mobile-CPU
regressions, and bundle bloat all surface here but stay invisible
to backend-side dashboards.
Frontend (apps/web):
- new dependency: web-vitals@^4.2.4
- src/observability/webVitals.ts collects LCP / CLS / INP / FID /
TTFB via the npm web-vitals package and POSTs to the backend
using sendBeacon (with fetch keepalive fallback)
- A single pageload-level sampling decision (flip the coin once;
  contribute all metrics or none) avoids per-metric histogram bias
- Sample rate via VITE_RUM_SAMPLE_RATE (default 1.0 dev / 0.25 prod)
- main.tsx wires initWebVitals() right after initSentry()
- Route slug derived client-side (strips uuid-ish + numeric ids
to keep cardinality low)
Backend:
- internal/handlers/web_vitals_handler.go: POST
  /api/v1/observability/web-vitals; anonymous, IP rate-limited
  (reuses FrontendLogRateLimit), validates value ranges, normalizes
  route + device labels for cardinality
- internal/monitoring/web_vitals.go: Prometheus histograms with
  buckets aligned to Google's good/needs-improvement/poor
  thresholds (sketched after this list), plus beacons-received /
  beacons-rejected counters
- Tests: 6 handler tests + 3 helper-function tests + 10 frontend
  vitest tests (all passing)
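For the bucket alignment, a minimal sketch of one histogram (metric
name and exact bucket edges are illustrative; the shipped definitions
live in internal/monitoring/web_vitals.go). Google's LCP thresholds
are 2.5s (good) and 4s (poor), so both appear as bucket boundaries:

```go
package monitoring

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// LCP histogram whose buckets straddle Google's thresholds (2.5s good,
// 4s poor) so p75 alert queries land cleanly on bucket boundaries.
var webVitalsLCP = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "veza_web_vitals_lcp_seconds", // illustrative name
	Help:    "Largest Contentful Paint reported by RUM beacons.",
	Buckets: []float64{0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 8.0, 12.0},
}, []string{"route", "device"})

// Beacon accounting, per the counters described above.
var beaconsReceived = promauto.NewCounter(prometheus.CounterOpts{
	Name: "veza_web_vitals_beacons_received_total",
	Help: "Web-vitals beacons accepted.",
})
```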
Alerts (alert_rules.yml, veza_rum group):
- WebVitalsLCPP75Poor (p75 LCP > 4s on a route+device for 30m)
- WebVitalsCLSP75Poor (p75 CLS > 0.25 for 30m)
- WebVitalsINPP75Poor (p75 INP > 500ms for 30m)
- WebVitalsBeaconsStopped (zero beacons for 30m versus the same window yesterday)
Cardinality discipline: labels are bounded to {route, device},
where route is alnum/dash, ≤32 chars, and device is one of
mobile/desktop/tablet/unknown. No per-user labels.
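The normalization is cheap to sketch (helper names are assumptions;
the shipped versions live next to the handler):

```go
package handlers

import (
	"regexp"
	"strings"
)

var routeChars = regexp.MustCompile(`[^a-z0-9-]`)

// normalizeRoute bounds the route label: lowercased, alnum/dash only,
// at most 32 chars. Anything that normalizes to empty is "unknown".
func normalizeRoute(raw string) string {
	r := routeChars.ReplaceAllString(strings.ToLower(raw), "")
	if len(r) > 32 {
		r = r[:32]
	}
	if r == "" {
		return "unknown"
	}
	return r
}

// normalizeDevice maps anything outside the fixed enum to "unknown".
func normalizeDevice(raw string) string {
	switch raw {
	case "mobile", "desktop", "tablet":
		return raw
	}
	return "unknown"
}
```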
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main.go's config-load failure path silently calls os.Exit(1), which
means lumberjack's file-rotation buffer never flushes before exit and
the journal only sees "started → exited 1" with zero diagnostics. The
last deploy run's app log had only the "Logger initialized" line; the
actual NewConfig error never made it to disk because os.Exit doesn't
run defers.
A plain fmt.Fprintf to stderr reaches the systemd journal
synchronously, so the next probe rescue dump will show what's actually
failing.
The original \"don't write to stderr to avoid broken pipe with
journald\" comment cited a concern that doesn't apply at this point in
startup: there's no parent to break the pipe to, and journald accepts
arbitrary bytes on stderr. Keep the os.Exit but print first.
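A minimal sketch of the resulting failure path (the loadConfig stub
and the message string are illustrative, not the project's code):

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// loadConfig stands in for the project's config.NewConfig.
func loadConfig() (struct{}, error) {
	return struct{}{}, errors.New("DATABASE_URL: missing")
}

func main() {
	_, err := loadConfig()
	if err != nil {
		// os.Exit skips deferred flushes (zap sync, lumberjack), so write
		// the failure synchronously to stderr; journald captures stderr
		// bytes as-is, and there is no parent pipe to break this early
		// in startup.
		fmt.Fprintf(os.Stderr, "fatal: config load failed: %v\n", err)
		os.Exit(1)
	}
}
```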
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CDN edge in front of S3/MinIO via origin-pull. The backend signs URLs
with Bunny.net token-auth (SHA-256 over security_key + path + expires)
so edges verify the token before serving cached objects; the origin is
never hit for a validly signed, cached object. Cloudflare CDN / R2 /
CloudFront stubs are kept.
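A minimal sketch of the scheme as described, assuming the token and
expires query-parameter names from Bunny's token-auth convention (the
shipped implementation is generateBunnySignedURL in
internal/services/cdn_service.go):

```go
package cdn

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strconv"
	"time"
)

// bunnySignedURL signs path for ttl per the scheme described above.
func bunnySignedURL(baseURL, securityKey, path string, ttl time.Duration) (string, error) {
	if securityKey == "" {
		// Fail fast rather than silently falling back to unsigned URLs.
		return "", errors.New("cdn: empty SecurityKey")
	}
	expires := strconv.FormatInt(time.Now().Add(ttl).Unix(), 10)
	sum := sha256.Sum256([]byte(securityKey + path + expires))
	// URL-safe base64 without padding, per the Bunny token-auth format.
	token := base64.RawURLEncoding.EncodeToString(sum[:])
	return fmt.Sprintf("%s%s?token=%s&expires=%s", baseURL, path, token, expires), nil
}
```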
- internal/services/cdn_service.go: new providers CDNProviderBunny +
  CDNProviderCloudflareR2. SecurityKey added to CDNConfig.
  generateBunnySignedURL implements the documented Bunny scheme
  (url-safe base64, no padding, expires query). HLSSegmentCacheHeaders
  + HLSPlaylistCacheHeaders helpers exported for handlers.
- internal/services/cdn_service_test.go: pin Bunny URL shape +
  base64-url charset; assert empty SecurityKey fails fast (no
  silent fallback to unsigned URLs).
- internal/core/track/service.go: new CDNURLSigner interface +
  SetCDNService(cdn). GetStorageURL prefers the CDN signed URL when
  cdnService.IsEnabled and falls back to a direct S3 presign on
  signing error so a CDN partial outage doesn't block playback (see
  the sketch after this list).
- internal/api/routes_tracks.go + routes_core.go : wire SetCDNService
on the two TrackService construction sites that serve stream/download.
- internal/config/config.go: 4 new env vars (CDN_ENABLED, CDN_PROVIDER,
  CDN_BASE_URL, CDN_SECURITY_KEY). config.CDNService is always non-nil
  after init; IsEnabled gates the actual usage.
- internal/handlers/hls_handler.go : segments now return
Cache-Control: public, max-age=86400, immutable (content-addressed
filenames make this safe). Playlists at max-age=60.
- veza-backend-api/.env.template : 4 placeholder env vars.
- docs/ENV_VARIABLES.md §12: provider matrix + Bunny vs Cloudflare
  vs R2 trade-offs.
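Roughly, the track-service seam looks like this (a sketch; SignURL and
the presigner method are assumed names, only CDNURLSigner,
SetCDNService, and GetStorageURL come from the change above):

```go
package track

import (
	"context"
	"time"
)

// CDNURLSigner is the narrow seam the track service depends on; the
// concrete implementation is the CDN service wired in via SetCDNService.
type CDNURLSigner interface {
	IsEnabled() bool
	SignURL(path string, ttl time.Duration) (string, error) // name assumed
}

// presigner stands in for the S3/MinIO client.
type presigner interface {
	PresignGetObject(ctx context.Context, path string, ttl time.Duration) (string, error)
}

type Service struct {
	cdn CDNURLSigner
	s3  presigner
}

func (s *Service) SetCDNService(cdn CDNURLSigner) { s.cdn = cdn }

// GetStorageURL prefers a CDN-signed URL when the CDN is enabled and
// falls back to a direct S3 presign on signing error, so a partial
// CDN outage never blocks playback.
func (s *Service) GetStorageURL(ctx context.Context, path string) (string, error) {
	if s.cdn != nil && s.cdn.IsEnabled() {
		if u, err := s.cdn.SignURL(path, time.Hour); err == nil {
			return u, nil
		}
	}
	return s.s3.PresignGetObject(ctx, path, time.Hour)
}
```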
Bug-fix collateral: v1.0.9 Day 11 introduced veza_cache_hits_total,
which collided by name with monitoring.CacheHitsTotal (different
label set ⇒ promauto MustRegister panic at process init; see the
sketch below). Day 13 deletes the monitoring duplicate and restores
the metrics-package counter as the single source of truth (label:
subsystem). All 8 affected packages are green: services, core/track,
handlers, middleware, websocket/chat, metrics, monitoring, config.
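A hypothetical reduction of the collision, with both collectors in one
file for brevity (the real pair lived in the metrics and monitoring
packages). promauto registers against the default registerer, so the
second registration panics during package init:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// metrics-package counter (the one Day 13 keeps): label "subsystem".
var cacheHits = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "veza_cache_hits_total",
	Help: "Cache hits.",
}, []string{"subsystem"})

// monitoring-package duplicate (the one Day 13 deletes): same name,
// different label set, so this second registration panics at init with
// "a previously registered descriptor with the same fully-qualified
// name ... has different label names or a different help string".
var cacheHitsDup = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "veza_cache_hits_total",
	Help: "Cache hits.",
}, []string{"cache"})

func main() {}
```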
Acceptance (Day 13): the code path is wired; verifying against a real
Bunny edge requires a Pull Zone provisioned by the user (EX-? in the
roadmap). On the user side: create a Pull Zone with origin = MinIO,
copy the token-auth key into CDN_SECURITY_KEY, set CDN_ENABLED=true.
W3 progress: Redis Sentinel ✓ · MinIO distributed ✓ · CDN ✓ ·
DMCA ⏳ Day 14 · embed ⏳ Day 15.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three pre-existing infra issues surfaced by the Day 1→Day 3 push wave.
Each is independent; they're bundled here because the goal is "ci.yml +
e2e.yml green" before the v1.0.9 tag, and all three are small.
(1) gofmt — ci.yml golangci-lint v2 step
Five files were unformatted on main. Pre-existing (untouched by my
Item G work, but the formatter caught them now):
- internal/api/router.go
- internal/core/marketplace/reconcile_hyperswitch_test.go
- internal/models/user.go
- internal/monitoring/ledger_metrics.go
- internal/monitoring/ledger_metrics_test.go
Pure whitespace via `gofmt -w` — no behavior change.
(2) e2e silent-fail — playwright webServer port collision
The e2e workflow pre-starts the backend in step 9 ("Build + start
backend API") so it can fail-fast on a non-ok health check. But
playwright.config.ts had `reuseExistingServer: !process.env.CI` on
the backend webServer entry, meaning that in CI Playwright tried to
spawn a SECOND backend on port 18080. The spawn collided with
EADDRINUSE and Playwright exited silently before printing any test
output. The artifact upload then warned "No files were found"
because tests/e2e/playwright-report/ never got written, and the job
ended in `Failure` for an unrelated reason (the artifact-upload
step's GHESNotSupportedError).
Fix: the backend entry now sets `reuseExistingServer: true`
unconditionally, since both the workflow and dev pre-start the
backend on 18080. Vite stays `!process.env.CI` because the workflow
doesn't pre-start it. A comment in playwright.config.ts documents the
symptom so the next person debugging gets the pointer immediately.
(3) orders.hyperswitch_payment_id missing in fresh DBs — migration 080
skip-branch + 099 ordering drift
Migration 080 (`add_payment_fields`) wraps its ALTERs in
"skip if orders doesn't exist". At authoring time orders existed
earlier in the migration sequence; that ordering has since shifted
(orders is now created at 099_z_create_orders.sql, AFTER 080).
Result: in any freshly-migrated DB (CI, fresh dev, future restore
drills) migration 080 takes the skip branch and the columns are
never added — even though the Order model and the marketplace code
rely on them.
Symptom: every CI run logs
  `pq: column "hyperswitch_payment_id" does not exist`
from the periodic ledger_metrics worker. Order checkout would also
fail to persist payment_id at write time, breaking reconciliation.
Fix: append-only migration 987 with an idempotent
`ADD COLUMN IF NOT EXISTS` plus a partial index on the reconciliation
hot path (sketched below). Production envs that picked up 080 in the
original order see no-ops; fresh envs converge to the same end state.
Rollback in migrations/rollback/.
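A sketch of the shape, not the shipped file (column type and index
predicate are assumptions):

```sql
-- 987 (sketch): idempotent, safe on both fresh and already-migrated DBs
ALTER TABLE orders
    ADD COLUMN IF NOT EXISTS hyperswitch_payment_id TEXT;

-- Partial index on the reconciliation hot path: only rows that
-- actually carry a payment id get indexed.
CREATE INDEX IF NOT EXISTS idx_orders_hyperswitch_payment_id
    ON orders (hyperswitch_payment_id)
    WHERE hyperswitch_payment_id IS NOT NULL;
```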
Verified locally:
```sh
$ cd veza-backend-api && go build ./... && \
    VEZA_SKIP_INTEGRATION=1 go test -short -count=1 ./internal/...
```
(all green)
SKIP_TESTS=1: this commit touches only backend Go, Playwright config,
and SQL; frontend unit tests are irrelevant to it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backend Go:
- Complete replacement of the old migrations with the V1 baseline aligned on ORIGIN.
- Global hardening of JSON parsing (BindAndValidateJSON + RespondWithAppError).
- Hardened config.go, CORS, health statuses, and monitoring.
- Implemented the P0 transactions (RBAC, playlist duplication, social toggles).
- Added a structured job worker (emails, analytics, thumbnails) plus associated tests.
- New backend docs: AUDIT_CONFIG, BACKEND_CONFIG, AUTH_PASSWORD_RESET, JOB_WORKER_*.
Chat server (Rust):
- Reworked the JWT pipeline plus security, auditing, and advanced rate limiting.
- Full implementation of the message lifecycle (read receipts, delivered, edit/delete, typing).
- Cleaned up panics; robust error handling and structured logs.
- Chat migrations aligned with the UUID schema and the new features.
Stream server (Rust):
- Reworked the streaming engine (encoding pipeline + HLS) and the core modules.
- P0 transactions for jobs and segments, with atomicity guarantees.
- Detailed pipeline documentation (AUDIT_STREAM_*, DESIGN_STREAM_PIPELINE, TRANSACTIONS_P0_IMPLEMENTATION).
Documentation & audits:
- TRIAGE.md and AUDIT_STABILITY.md brought up to date with the real state of the 3 services.
- Complete mapping of migrations and transactions (DB_MIGRATIONS_*, DB_TRANSACTION_PLAN, AUDIT_DB_TRANSACTIONS, TRANSACTION_TESTS_PHASE3).
- Reset and cleanup scripts for the lab DB and the V1 baseline.
This commit freezes the entire P0 stabilization effort (UUID, backend, chat, and stream) ahead of the next phases (Coherence Guardian, WS hardening, etc.).