Commit graph

26 commits

Author SHA1 Message Date
senke
94dfc80b73 feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F
Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").

Gauges (veza_ledger_, sampled every 60s):
  * orphan_refund_rows — THE canary. Pending refunds with empty
    hyperswitch_refund_id older than 5m = Phase 2 crash in
    RefundOrder. Alert: > 0 for 5m → page.
  * stuck_orders_pending — order pending > 30m with non-empty
    payment_id. Alert: > 0 for 10m → page.
  * stuck_refunds_pending — refund pending > 30m with hs_id.
  * failed_transfers_at_max_retry — permanently_failed rows.
  * reversal_pending_transfers — item B rows stuck > 30m.

Reconciler metrics (veza_reconciler_):
  * actions_total{phase} — counter by phase.
  * orphan_refunds_total — two-phase-bug canary.
  * sweep_duration_seconds — exponential histogram.
  * last_run_timestamp — alert: stale > 2h → page (worker dead).

Implementation notes:
  * Sampler thresholds hardcoded to match reconciler defaults —
    intentional mismatch allowed (alerts fire while reconciler
    already working = correct behavior).
  * Query error sets gauge to -1 (sentinel for "sampler broken").
  * marketplace package routes through monitoring recorders so it
    doesn't import prometheus directly.
  * Sampler runs regardless of Hyperswitch enablement; gauges
    default 0 when pipeline idle.
  * Graceful shutdown wired in cmd/api/main.go.

Alert rules in config/alertmanager/ledger.yml with runbook
pointers + detailed descriptions — each alert explains WHAT
happened, WHY the reconciler may not resolve it, and WHERE to
look first.

Grafana dashboard config/grafana/dashboards/ledger-health.json —
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick
+ orphan cumulative.

Tests — 6 cases, all green (sqlite :memory:):
  * CountsStuckOrdersPending (includes the filter on
    non-empty payment_id)
  * StuckOrdersZeroWhenAllCompleted
  * CountsOrphanRefunds (THE canary)
  * CountsStuckRefundsWithHsID (gauge-orthogonality check)
  * CountsFailedAndReversalPendingTransfers
  * ReconcilerRecorders (counter + gauge shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 03:40:14 +02:00
senke
113210734c chore(infra): J6 — mark 3 dormant docker-compose files as deprecated
Audit cross-checked against active composes shows three dormant compose
files that duplicate functionality already covered by the canonical
docker-compose.{,dev,prod,staging,test}.yml at the repo root. None are
referenced from Make targets, scripts, or CI workflows. They have
diverged from the active set (different ports, older Postgres version,
no shared volume names, etc.) and are a footgun for new contributors.

Files marked DEPRECATED with a header pointing at the canonical compose
to use instead:

  veza-stream-server/docker-compose.yml
    Standalone stream-server compose. Same service is provided by the
    root docker-compose.yml under the `docker-dev` profile.

  infra/docker-compose.lab.yml
    Lab Postgres on default port 5432. Conflicts with a host Postgres on
    most setups; root docker-compose.dev.yml uses non-default ports for
    a reason.

  config/docker/docker-compose.local.yml
    Local Postgres 15 variant on port 5433. Redundant with root
    docker-compose.dev.yml (Postgres 16, project-wide port mapping).

Not in this commit (intentionally limited J6 scope, per audit plan
"verify, don't refactor"):

  - No `extends:` consolidation across the active composes — that is a
    1-2 day refactor on its own and not a v1.0.4 concern.
  - The five active composes were syntactically validated locally
    (docker compose config); production and staging both require
    operator-injected env vars (DB_PASS, S3_*, RABBITMQ_PASS, etc.)
    which is the intended behavior, not a bug.
  - Cross-compose audit confirms zero references to the removed
    chat-server or any other dead service / image. Only one residual
    deprecation warning across all active composes: the obsolete
    `version:` field on docker-compose.{prod,test,test}.yml — cosmetic,
    not blocking.
  - Test suite verification (Go / Rust / Vitest) deferred to Forgejo CI
    rather than re-running locally. The pre-push hook + remote pipeline
    will gate the next push.

Follow-up candidates (not blocking v1.0.4):
  - Delete the three deprecated files once a 2-month grace period
    confirms no local dev workflow references them.
  - Drop the obsolete `version:` field across the active composes.

Refs: AUDIT_REPORT.md §6.1, §10 P7
2026-04-15 12:58:39 +02:00
senke
d5bfe4a558 docs: add project documentation, logging config, status script
Some checks failed
Backend API CI / test-unit (push) Failing after 0s
Backend API CI / test-integration (push) Failing after 0s
Frontend CI / test (push) Failing after 0s
Storybook Audit / Build & audit Storybook (push) Failing after 0s
- docs/VEZA_PROJECT_DOCUMENTATION.md
- config/logging.toml
- status.sh utility script

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:36:36 +01:00
senke
c0e2fe2e12 fix(v0.12.6.1): remediate remaining 15 MEDIUM + LOW pentest findings
MEDIUM-002: Remove manual X-Forwarded-For parsing in metrics_protection.go,
  use c.ClientIP() only (respects SetTrustedProxies)
MEDIUM-003: Pin ClamAV Docker image to 1.4 across all compose files
MEDIUM-004: Add clampLimit(100) to 15+ handlers that parsed limit directly
MEDIUM-006: Remove unsafe-eval from CSP script-src on Swagger routes
MEDIUM-007: Pin all GitHub Actions to SHA in 11 workflow files
MEDIUM-008: Replace rabbitmq:3-management-alpine with rabbitmq:3-alpine in prod
MEDIUM-009: Add trial-already-used check in subscription service
MEDIUM-010: Add 60s periodic token re-validation to WebSocket connections
MEDIUM-011: Mask email in auth handler logs with maskEmail() helper
MEDIUM-012: Add k-anonymity threshold (k=5) to playback analytics stats
LOW-001: Align frontend password policy to 12 chars (matching backend)
LOW-003: Replace deprecated dotenv with dotenvy crate in Rust stream server
LOW-004: Enable xpack.security in Elasticsearch dev/local compose files
LOW-005: Accept context.Context in CleanupExpiredSessions instead of Background()
LOW-002: Noted — Hyperswitch version update deferred (requires payment integration tests)

29/30 findings remediated. 1 noted (LOW-002).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 06:13:38 +01:00
senke
f5bca2b642 v0.9.5 2026-03-06 10:02:53 +01:00
senke
65375a61aa chore(release): v0.952 — Observe (Grafana v1-overview, Prometheus alert_rules_v1) 2026-03-02 19:08:55 +01:00
senke
83ed4f315b chore(release): v0.602 — Payout, Dette Technique & Tests E2E
Some checks failed
Backend API CI / test-unit (push) Failing after 0s
Backend API CI / test-integration (push) Failing after 0s
Frontend CI / test (push) Failing after 0s
Storybook Audit / Build & audit Storybook (push) Failing after 0s
- Stripe Connect: onboarding, balance, SellerDashboardView
- Interceptors: auth.ts, error.ts extracted, facade
- Grafana: dashboards enriched (p50, top endpoints, 4xx, WS, commerce)
- E2E commerce: product->order->review->invoice
- SMOKE_TEST_V0602, RETROSPECTIVE_V0602, PAYOUT_MANUAL
- Archive V0_602 scope, V0_603 placeholder, SCOPE_CONTROL v0.603
- Fix sanitizer regex (Go no backreferences)
- Marketplace test schema: product_licenses, product_images, orders, licenses
2026-02-23 22:32:01 +01:00
senke
30bc31f3a6 feat(monitoring): add Alertmanager with Slack notifications
- config/alertmanager/alertmanager.yml: route, slack-default and null receivers
- config/prometheus.yml: alerting.alertmanagers -> alertmanager:9093
- docker-compose.prod.yml: alertmanager service (port 9093)
2026-02-23 19:54:55 +01:00
senke
c002e74031 feat(monitoring): add 3 Grafana dashboards (API, Chat, Commerce)
- api-overview.json: request rate, p95 latency, 5xx errors, DB pool
- chat-overview.json: WebSocket upgrade rate, chat API
- commerce-overview.json: marketplace/commerce/orders metrics
- system-overview.json: replaces veza-dashboard.json
2026-02-23 19:54:01 +01:00
senke
0ff8a85684 feat(infra): blue-green deployment via HAProxy
- HAProxy: api/stream/web backends with blue+green servers (backup)
- docker-compose.prod: backend-api-blue/green, stream-server-blue/green, web-blue/green
- haproxy-blue.cfg, haproxy-green.cfg: config variants for active stack
- scripts/deploy-blue-green.sh: switch traffic via config copy + HUP reload
2026-02-23 19:52:19 +01:00
senke
279a10d317 chore(cleanup): remove veza-chat-server directory and all operational references
Chat functionality is now fully handled by the Go backend (since v0.502).
Remove the deprecated Rust chat server and all its references from:
- CI/CD workflows (ci.yml, cd.yml, rust-ci.yml, chat-ci.yml)
- Monitoring & proxy config (prometheus, caddy, haproxy)
- Incus deployment scripts and documentation
- Monorepo config (package.json, dependabot, GH templates)
2026-02-22 21:13:00 +01:00
senke
6b25ccc9da feat(monitoring): add Prometheus alerting rules for critical conditions
INF-08: Alert rules for service_down, high_error_rate (>5%),
high_latency (P99>2s), and redis_unreachable. Enabled rule_files
in prometheus.yml.
2026-02-22 17:36:07 +01:00
senke
3e0e1b5286 feat(infra): complete staging compose with chat, stream, and reverse proxy
INF-07: Added chat-server, stream-server, Caddy reverse proxy,
and healthchecks for all services in staging compose.
2026-02-22 17:36:03 +01:00
senke
98f6db3a1d fix(streaming): ensure HLS audio chain works end-to-end
- HAProxy: route /hls to stream server
- Vite proxy: /ws, /stream, /hls for dev
- HLS_BASE_URL: empty when STREAM_URL relative (proxy)
- FEATURE_STATUS: HLS_STREAMING operational
2026-02-18 12:42:42 +01:00
senke
b657776892 fix(infra): HAProxy HTTPS and stats security
P1.1 - Enable HTTPS in HAProxy for production:
- HTTP to HTTPS redirect (301)
- HTTPS frontend on port 443 with veza.pem
- config/ssl/ structure with README and generate-ssl-cert.sh
- docker-compose.prod.yml volume for certs

P1.3 - Restrict HAProxy stats to internal network:
- ACL from_internal (127.0.0.1, 172.20.0.0/16)
- stats admin if from_internal

Also: remove errorfile directives (use HAProxy built-in defaults)
2026-02-15 15:58:51 +01:00
senke
ae586f6134 Phase 2 stabilisation: code mort, Modal→Dialog, feature flags, tests, router split, Rust legacy
Bloc A - Code mort:
- Suppression Studio (components, views, features)
- Suppression gamification + services mock (projectService, storageService, gamificationService)
- Mise à jour Sidebar, Navbar, locales

Bloc B - Frontend:
- Suppression modal.tsx deprecated, Modal.stories (doublon Dialog)
- Feature flags: PLAYLIST_SEARCH, PLAYLIST_RECOMMENDATIONS, ROLE_MANAGEMENT = true
- Suppression 19 tests orphelins, retrait exclusions vitest.config

Bloc C - Backend:
- Extraction routes_auth.go depuis router.go

Bloc D - Rust:
- Suppression security_legacy.rs (code mort, patterns déjà dans security/)
2026-02-14 17:23:32 +01:00
senke
6e0d4457d9 chore(docker): document docker-compose file usage and purpose
Add config/docker/README.md with:
- Table of all remaining docker-compose files and their purposes
- Usage commands for each environment
- List of deleted deprecated files (from C9)
- Required environment variables for production deployment

Addresses audit finding: debt item 8 (12 docker-compose files confusion).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-11 22:53:24 +01:00
senke
0a269ed664 chore: remove dead code, backups, and deprecated docker-compose files
Removed files:
- apps/web/src/utils/storeSelectors.ts.backup (committed backup file)
- apps/web/desy/ (69 files, unused legacy design system)
- docker-compose.production.yml (root, superseded by docker-compose.prod.yml)
- config/docker/docker-compose.production.yml (deprecated copy)
- veza-stream-server/docker-compose.production.yml (deprecated copy)

Annotated ghost features in MSW handlers:
- Education endpoints marked as GHOST FEATURE (no backend)
- Gamification endpoints marked as GHOST FEATURE (no backend)

Not removed (out of scope for this commit):
- veza-desktop/ and veza-mobile/ (separate issue)
- Root-level audit markdown reports (product owner decision)

Addresses audit findings: debt items 12-18 (dead code, ghost features).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-11 22:52:59 +01:00
senke
f53b7f7d8a chore: update docker-compose, make, tmt config
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-11 22:18:57 +01:00
senke
44ddd3b858 chore(incus): add env template, document setup
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-11 19:49:01 +01:00
senke
d7bb127920 fix(security): stop tracking veza-stream-server/.env and config/incus env files
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-11 19:48:51 +01:00
senke
023b8a89c6 fix: Corriger URL Swagger et finaliser implémentation DeveloperPage
- Ajouter fallback pour Swagger UI si doc.json ne fonctionne pas
- Améliorer message d'erreur avec bouton pour ouvrir Swagger UI directement
- Les fonctionnalités API Keys et Usage Stats sont maintenant complètes et fonctionnelles
- Tous les onglets de DeveloperPage sont maintenant implémentés
2026-01-18 13:55:28 +01:00
senke
f0ba7de543 state-ownership: delete unused optimisticStoreUpdates.ts file
- Deleted apps/web/src/utils/optimisticStoreUpdates.ts (unused file)
- File was unused - no imports found in codebase
- Mutations already use React Query's onMutate pattern
- No TypeScript errors after deletion
- Actions 4.4.1.2 and 4.4.1.3 complete
2026-01-15 19:26:53 +01:00
senke
76d95ecfb4 incus deployement fully implemented, Makefile updated and make fmt ran 2026-01-13 19:47:57 +01:00
senke
f74b020d4b api-contracts: install openapi-generator-cli and create type generation script
- Completed Action 1.1.2.1: Installed @openapitools/openapi-generator-cli
- Completed Action 1.1.2.2: Created generate-types.sh script
- Added swagger annotations to cmd/modern-server/main.go
- Regenerated swagger.yaml with proper info section
- Successfully generated TypeScript types to src/types/generated/

The script generates types from veza-backend-api/openapi.yaml using
typescript-axios generator and creates barrel exports.
2026-01-11 16:30:43 +01:00
okinrev
327ac36a30 BASE: completing the initial repo state 2025-12-03 22:56:50 +01:00