Commit graph

31 commits

Author SHA1 Message Date
senke
d86815561c feat(infra): MinIO distributed EC:2 + migration script (W3 Day 12)
Some checks failed
Veza CI / Rust (Stream Server) (push) Successful in 5m21s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 54s
Veza CI / Backend (Go) (push) Failing after 8m27s
Veza CI / Notify on failure (push) Successful in 6s
E2E Playwright / e2e (full) (push) Failing after 12m42s
Veza CI / Frontend (Web) (push) Successful in 15m49s
Four-node distributed MinIO cluster, single erasure set EC:2, tolerates
2 simultaneous node losses at 50% storage efficiency (arithmetic below).
Pinned to RELEASE.2025-09-07T16-13-09Z to match docker-compose so
dev/prod parity is preserved.
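
For reference, those two figures are just the erasure-set arithmetic
for 4 drives with 2 parity shards:

  usable fraction  = (drives - parity) / drives = (4 - 2) / 4 = 50%
  tolerated losses = parity shards = 2 nodes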

- infra/ansible/roles/minio_distributed/ : install pinned binary,
  systemd unit pointed at MINIO_VOLUMES with bracket-expansion form,
  EC:2 forced via MINIO_STORAGE_CLASS_STANDARD. Vault assertion
  blocks shipping placeholder credentials to staging/prod.
- bucket init : creates veza-prod-tracks, enables versioning, applies
  lifecycle.json (30d noncurrent expiry + 7d abort-multipart). Cold-tier
  transition ready but inert until minio_remote_tier_name is set.
- infra/ansible/playbooks/minio_distributed.yml : provisions the 4
  containers, applies common baseline + role.
- infra/ansible/inventory/lab.yml : new minio_nodes group.
- infra/ansible/tests/test_minio_resilience.sh : kill 2 nodes,
  verify EC:2 reconstruction (read OK + checksum matches), restart,
  wait for self-heal.
- scripts/minio-migrate-from-single.sh : mc mirror --preserve from
  the single-node bucket to the new cluster, count-verifies, prints
  rollout next-steps.
- config/prometheus/alert_rules.yml : MinIODriveOffline (warn) +
  MinIONodesUnreachable (page) — page fires at >= 2 nodes unreachable
  because that's the redundancy ceiling for EC:2.
- docs/ENV_VARIABLES.md §12 : MinIO migration cross-ref.

Acceptance (Day 12) : EC:2 survives 2 concurrent kills + self-heals.
Lab apply pending. No backend code change — interface stays AWS S3
(client sketch below).
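
A minimal sketch of what the unchanged interface looks like on the Go
side, assuming aws-sdk-go-v2; the minio-1 endpoint and function name
are illustrative, not from this commit:

  import (
      "context"

      "github.com/aws/aws-sdk-go-v2/aws"
      "github.com/aws/aws-sdk-go-v2/config"
      "github.com/aws/aws-sdk-go-v2/service/s3"
  )

  func newTracksClient(ctx context.Context) (*s3.Client, error) {
      cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
      if err != nil {
          return nil, err
      }
      // Same S3 API as before; only the endpoint points at the cluster.
      return s3.NewFromConfig(cfg, func(o *s3.Options) {
          o.BaseEndpoint = aws.String("http://minio-1:9000") // hypothetical node
          o.UsePathStyle = true // MinIO is usually addressed path-style
      }), nil
  }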

W3 progress : Redis Sentinel ✓ (Day 11), MinIO distributed ✓ (this),
CDN → Day 13, DMCA → Day 14, embed → Day 15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:46:42 +02:00
senke
a36d9b2d59 feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11)
Some checks failed
Veza CI / Backend (Go) (push) Failing after 8m56s
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 5m3s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 53s
Three Incus containers, each running redis-server + redis-sentinel
(co-located). redis-1 = master at first boot, redis-2/3 = replicas.
Sentinel quorum=2 of 3 ; failover-timeout=30s satisfies the W3
acceptance criterion.

- internal/config/redis_init.go : initRedis branches on
  REDIS_SENTINEL_ADDRS ; non-empty -> redis.NewFailoverClient with
  MasterName + SentinelAddrs + SentinelPassword (sketched after this
  list). Empty -> existing single-instance NewClient (dev/local keeps
  its parametric config).
- internal/config/config.go : 3 new fields (RedisSentinelAddrs,
  RedisSentinelMasterName, RedisSentinelPassword) read from env.
  parseRedisSentinelAddrs trims+filters CSV.
- internal/metrics/cache_hit_rate.go : new RecordCacheHit / Miss
  counters, labelled by subsystem, cardinality bounded (sketched
  below, after the acceptance note).
- internal/middleware/rate_limiter.go : instrument 3 Eval call sites
  (DDoS, frontend log throttle, upload throttle). Hit = Redis answered,
  Miss = error -> in-memory fallback.
- internal/services/chat_pubsub.go : instrument Publish + PublishPresence.
- internal/websocket/chat/presence_service.go : instrument SetOnline /
  SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit
  (legitimate empty result).
- infra/ansible/roles/redis_sentinel/ : install Redis 7 + Sentinel,
  render redis.conf + sentinel.conf, systemd units. Vault assertion
  prevents shipping placeholder passwords to staging/prod.
- infra/ansible/playbooks/redis_sentinel.yml : provisions the 3
  containers + applies common baseline + role.
- infra/ansible/inventory/lab.yml : new groups redis_ha + redis_ha_master.
- infra/ansible/tests/test_redis_failover.sh : kills the master
  container, polls Sentinel for the new master, asserts elapsed < 30s.
- config/grafana/dashboards/redis-cache-overview.json : 3 hit-rate
  stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown.
- docs/ENV_VARIABLES.md §3 : 3 new REDIS_SENTINEL_* env vars.
- veza-backend-api/.env.template : 3 placeholders (empty default).
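
A minimal sketch of the initRedis branch described in the first item,
assuming go-redis v9; the Config fields mirror the commit text, the
rest is illustrative:

  import (
      "strings"

      "github.com/redis/go-redis/v9"
  )

  type Config struct {
      RedisAddr               string
      RedisSentinelAddrs      string
      RedisSentinelMasterName string
      RedisSentinelPassword   string
  }

  func initRedis(cfg Config) redis.UniversalClient {
      addrs := parseRedisSentinelAddrs(cfg.RedisSentinelAddrs)
      if len(addrs) > 0 {
          // Sentinel-aware client: discovers the current master and
          // re-resolves it after a failover.
          return redis.NewFailoverClient(&redis.FailoverOptions{
              MasterName:       cfg.RedisSentinelMasterName,
              SentinelAddrs:    addrs,
              SentinelPassword: cfg.RedisSentinelPassword,
          })
      }
      // Empty -> existing single-instance path (dev/local).
      return redis.NewClient(&redis.Options{Addr: cfg.RedisAddr})
  }

  // parseRedisSentinelAddrs trims + filters the CSV, per the commit.
  func parseRedisSentinelAddrs(csv string) []string {
      var out []string
      for _, a := range strings.Split(csv, ",") {
          if a = strings.TrimSpace(a); a != "" {
              out = append(out, a)
          }
      }
      return out
  }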

Acceptance (Day 11) : Sentinel failover < 30s ; cache hit-rate
dashboard populated. Lab test pending Sentinel deployment.
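
And a sketch of the RecordCacheHit / RecordCacheMiss recorders,
assuming the standard Prometheus Go client; the metric name is a
guess in the veza_* convention:

  import "github.com/prometheus/client_golang/prometheus"

  var cacheOps = prometheus.NewCounterVec(
      prometheus.CounterOpts{
          Name: "veza_cache_ops_total", // hypothetical name
          Help: "Cache operations by subsystem and outcome.",
      },
      // subsystem is a small fixed set, outcome is hit|miss,
      // so label cardinality stays bounded.
      []string{"subsystem", "outcome"},
  )

  func init() { prometheus.MustRegister(cacheOps) }

  func RecordCacheHit(subsystem string)  { cacheOps.WithLabelValues(subsystem, "hit").Inc() }
  func RecordCacheMiss(subsystem string) { cacheOps.WithLabelValues(subsystem, "miss").Inc() }

  // Per the presence instrumentation above: redis.Nil is a legitimate
  // empty result, so callers count it as a hit, not a miss.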

W3 verification gate progress : Redis Sentinel ✓ (this commit),
MinIO EC4+2 → Day 12, CDN → Day 13, DMCA → Day 14, embed → Day 15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:36:55 +02:00
senke
c78bf1b765 feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)
Some checks failed
Veza CI / Rust (Stream Server) (push) Successful in 5m4s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 42s
Veza CI / Backend (Go) (push) Failing after 15m45s
Veza CI / Frontend (Web) (push) Successful in 18m7s
Veza CI / Notify on failure (push) Successful in 6s
E2E Playwright / e2e (full) (push) Successful in 24m9s
Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology) :
  * SLO_API_AVAILABILITY  : 99.5% on read (GET) endpoints
  * SLO_API_LATENCY       : 99% writes p95 < 500ms
  * SLO_PAYMENT_SUCCESS   : 99.5% on POST /api/v1/orders -> 2xx

Each SLO has two alerts :
  * <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
  * <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m
    windows; burn-rate factors derived below)
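
For reference, the thresholds map to burn-rate factors via the SRE
workbook arithmetic (assuming the standard 30-day SLO window, which
the commit implies but does not state):

  fast burn : 2% of budget in 1h -> rate = 0.02 × 720h / 1h = 14.4
  slow burn : 5% of budget in 6h -> rate = 0.05 × 720h / 6h = 6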

- config/prometheus/slo.yml : 12 recording rules + 6 alerts ; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml : routing tree splits page-oncall (slack
  + PagerDuty) from ticket-oncall (slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon : one stub
  per likely page. Each lists first moves under 5min + common causes.

Acceptance (Day 10) : promtool check rules green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 01:30:34 +02:00
senke
84e92a75e2 feat(observability): OTel SDK + collector + Tempo + 4 hot path spans (W2 Day 9)
Some checks failed
Veza CI / Notify on failure (push) Blocked by required conditions
Security Scan / Secret Scanning (gitleaks) (push) Waiting to run
Veza CI / Backend (Go) (push) Has been cancelled
Veza CI / Rust (Stream Server) (push) Has been cancelled
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Wires distributed tracing end-to-end. Backend exports OTLP/gRPC to a
collector, which tail-samples (errors + slow always, 10% rest) and
ships to Tempo. Grafana service-map dashboard pivots on the 4
instrumented hot paths.

- internal/tracing/otlp_exporter.go : InitOTLPTracer + Provider.Shutdown,
  BatchSpanProcessor (5s/512 batch), ParentBased(TraceIDRatio) sampler,
  W3C trace-context + baggage propagators (sketched after this list).
  OTEL_SDK_DISABLED=true short-circuits to a no-op. Failure to dial the
  collector is non-fatal.
- cmd/api/main.go : init at boot, defer Shutdown(5s) on exit. appVersion
  ldflag-overridable for resource attributes.
- 4 hot paths instrumented :
    * handlers/auth.go::Login           → "auth.login"
    * core/track/track_upload_handler.go::InitiateChunkedUpload → "track.upload.initiate"
    * core/marketplace/service.go::ProcessPaymentWebhook → "payment.webhook"
    * handlers/search_handlers.go::Search → "search.query"
  PII guarded — email masked, query content not recorded (length only).
- infra/ansible/roles/otel_collector : pin v0.116.1 contrib build,
  systemd unit, tail-sampling config (errors + > 500ms always kept).
- infra/ansible/roles/tempo : pin v2.7.1 monolithic, local-disk backend
  (S3 deferred to v1.1), 14d retention.
- infra/ansible/playbooks/observability.yml : provisions both Incus
  containers + applies common baseline + roles in order.
- inventory/lab.yml : new groups observability, otel_collectors, tempo.
- config/grafana/dashboards/service-map.json : node graph + 4 hot-path
  span tables + collector throughput/queue panels.
- docs/ENV_VARIABLES.md §30 : 4 OTEL_* env vars documented.
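
A minimal sketch of the InitOTLPTracer wiring described above,
assuming the upstream OTel Go SDK; the sampling ratio argument is an
assumption (the commit leaves the SDK-side ratio unstated):

  import (
      "context"
      "os"
      "time"

      "go.opentelemetry.io/otel"
      "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
      "go.opentelemetry.io/otel/propagation"
      sdktrace "go.opentelemetry.io/otel/sdk/trace"
  )

  func InitOTLPTracer(ctx context.Context, ratio float64) (*sdktrace.TracerProvider, error) {
      // Short-circuit to a provider that records nothing.
      if os.Getenv("OTEL_SDK_DISABLED") == "true" {
          return sdktrace.NewTracerProvider(sdktrace.WithSampler(sdktrace.NeverSample())), nil
      }
      // otlptracegrpc connects lazily, so an unreachable collector
      // does not block boot (matches "failure to dial is non-fatal").
      exp, err := otlptracegrpc.New(ctx)
      if err != nil {
          return nil, err
      }
      tp := sdktrace.NewTracerProvider(
          sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))),
          // 5s / 512 batch, per the commit.
          sdktrace.WithBatcher(exp,
              sdktrace.WithBatchTimeout(5*time.Second),
              sdktrace.WithMaxExportBatchSize(512)),
      )
      otel.SetTracerProvider(tp)
      // W3C trace-context + baggage propagation.
      otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
          propagation.TraceContext{}, propagation.Baggage{}))
      return tp, nil
  }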

Acceptance criterion (Day 9) : login → span visible in Tempo UI. Lab
deployment to validate with `ansible-playbook -i inventory/lab.yml
playbooks/observability.yml` once roles/postgres_ha is up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 01:15:11 +02:00
senke
bf31a91ae6 feat(infra): pgbackrest role + dr-drill + Prometheus backup alerts (W2 Day 8)
Some checks failed
Veza CI / Frontend (Web) (push) Failing after 16m6s
Veza CI / Notify on failure (push) Successful in 11s
E2E Playwright / e2e (full) (push) Successful in 19m59s
Veza CI / Rust (Stream Server) (push) Successful in 4m57s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 49s
Veza CI / Backend (Go) (push) Successful in 6m4s
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 8 deliverable:
  - Postgres backups land in MinIO via pgbackrest
  - dr-drill restores them weekly into an ephemeral Incus container
    and asserts the data round-trips
  - Prometheus alerts fire when the drill fails OR when the timer
    has stopped firing for >8 days

Cadence:
  full   — weekly  (Sun 02:00 UTC, systemd timer)
  diff   — daily   (Mon-Sat 02:00 UTC, systemd timer)
  WAL    — continuous (postgres archive_command, archive_timeout=60s)
  drill  — weekly  (Sun 04:00 UTC — runs 2h after the Sun full so
           the restore exercises fresh data)

RPO ≈ 1 min (archive_timeout). RTO ≤ 30 min (drill measures actual
restore wall-clock).

Files:
  infra/ansible/roles/pgbackrest/
    defaults/main.yml — repo1-* config (MinIO/S3, path-style,
      aes-256-cbc encryption, vault-backed creds), retention 4 full
      / 7 diff / 4 archive cycles, zstd@3 compression. The role's
      first task asserts the placeholder secrets are gone — refuses
      to apply until the vault carries real keys.
    tasks/main.yml — install pgbackrest, render
      /etc/pgbackrest/pgbackrest.conf, set archive_command on the
      postgres instance via ALTER SYSTEM, detect role at runtime
      via `pg_autoctl show state --json`, stanza-create from primary
      only, render + enable systemd timers (full + diff + drill).
    templates/pgbackrest.conf.j2 — global + per-stanza sections;
      pg1-path defaults to the pg_auto_failover state dir so the
      role plugs straight into the Day 6 formation.
    templates/pgbackrest-{full,diff,drill}.{service,timer}.j2 —
      systemd units. Backup services run as `postgres`,
      drill service runs as `root` (needs `incus`).
      RandomizedDelaySec on every timer to absorb clock skew + node
      collision risk.
    README.md — RPO/RTO guarantees, vault setup, repo wiring,
      operational cheatsheet (info / check / manual backup),
      restore procedure documented separately as the dr-drill.

  scripts/dr-drill.sh
    Acceptance script for the day. Sequence:
      0. pre-flight: required tools, latest backup metadata visible
      1. launch ephemeral `pg-restore-drill` Incus container
      2. install postgres + pgbackrest inside, push the SAME
         pgbackrest.conf as the host (read-only against the bucket
         by pgbackrest semantics — the same s3 keys get reused so
         the drill exercises the production credential path)
      3. `pgbackrest restore` — full + WAL replay
      4. start postgres, wait for pg_isready
      5. smoke query: SELECT count(*) FROM users — must be ≥ MIN_USERS_EXPECTED
      6. write veza_backup_drill_* metrics to the textfile-collector
      7. teardown (or --keep for postmortem inspection)
    Exit codes 0/1/2 (pass / drill failure / env problem) so a
    Prometheus runner can plug in directly (one possible runner is
    sketched below).
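
  One possible shape for such a runner, in Go for illustration; the
  metric suffixes are assumptions, while the veza_backup_drill_ prefix
  and textfile-collector path come from this commit:

    import (
        "fmt"
        "os"
        "os/exec"
        "time"
    )

    func runDrill() error {
        err := exec.Command("/usr/local/bin/dr-drill.sh").Run()
        code := 0
        if ee, ok := err.(*exec.ExitError); ok {
            code = ee.ExitCode() // 1 = drill failure, 2 = env problem
        } else if err != nil {
            return err // could not even start the script
        }
        success := 0
        if code == 0 {
            success = 1
        }
        // Publish atomically via the node_exporter textfile collector.
        body := fmt.Sprintf(
            "veza_backup_drill_success %d\nveza_backup_drill_last_run_timestamp %d\n",
            success, time.Now().Unix())
        tmp := "/var/lib/node_exporter/textfile_collector/dr_drill.prom.tmp"
        if err := os.WriteFile(tmp, []byte(body), 0o644); err != nil {
            return err
        }
        return os.Rename(tmp, "/var/lib/node_exporter/textfile_collector/dr_drill.prom")
    }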

  config/prometheus/alert_rules.yml — new `veza_backup` group:
    - BackupRestoreDrillFailed (critical, 5m): the last drill
      reported success=0. Pages because a backup we haven't proved
      restorable is technical debt waiting for a disaster.
    - BackupRestoreDrillStale (warning, 1h after >8 days): the
      drill timer has stopped firing. Catches a broken cron / unit
      / runner before the failure-mode alert above ever sees data.
    Both annotations include a runbook_url stub
    (veza.fr/runbooks/...) — those land alongside W2 day 10's
    SLO runbook batch.

  infra/ansible/playbooks/postgres_ha.yml
    Two new plays:
      6. apply pgbackrest role to postgres_ha_nodes (install +
         config + full/diff timers on every data node;
         pgbackrest's repo lock arbitrates collision)
      7. install dr-drill on the incus_hosts group (push
         /usr/local/bin/dr-drill.sh + render drill timer + ensure
         /var/lib/node_exporter/textfile_collector exists)

Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml          ← clean
  $ python3 -c "import yaml; yaml.safe_load(open('config/prometheus/alert_rules.yml'))"
  YAML OK
  $ bash -n scripts/dr-drill.sh
  syntax OK

Real apply + drill needs the lab R720 + a populated MinIO bucket
+ the secrets in vault — operator's call.

Out of scope (deferred per ROADMAP §2):
  - Off-site backup replica (B2 / Bunny.net) — v1.1+
  - Logical export pipeline for RGPD per-user dumps — separate
    feature track, not a backup-system concern
  - PITR admin UI — CLI-only via `--type=time` for v1.0
  - pgbackrest_exporter Prometheus integration — W2 day 9
    alongside the OTel collector

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 00:51:00 +02:00
senke
94dfc80b73 feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F
Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").

Gauges (veza_ledger_, sampled every 60s):
  * orphan_refund_rows — THE canary. Pending refunds with empty
    hyperswitch_refund_id older than 5m = Phase 2 crash in
    RefundOrder. Alert: > 0 for 5m → page.
  * stuck_orders_pending — order pending > 30m with non-empty
    payment_id. Alert: > 0 for 10m → page.
  * stuck_refunds_pending — refund pending > 30m with hs_id.
  * failed_transfers_at_max_retry — permanently_failed rows.
  * reversal_pending_transfers — item B rows stuck > 30m.

Reconciler metrics (veza_reconciler_):
  * actions_total{phase} — counter by phase.
  * orphan_refunds_total — two-phase-bug canary.
  * sweep_duration_seconds — exponential histogram.
  * last_run_timestamp — alert: stale > 2h → page (worker dead).

Implementation notes:
  * Sampler thresholds hardcoded to match reconciler defaults —
    intentional mismatch allowed (alerts fire while reconciler
    already working = correct behavior).
  * Query error sets gauge to -1 (sentinel for "sampler broken");
    see the sampler sketch after these notes.
  * marketplace package routes through monitoring recorders so it
    doesn't import prometheus directly.
  * Sampler runs regardless of Hyperswitch enablement; gauges
    default 0 when pipeline idle.
  * Graceful shutdown wired in cmd/api/main.go.
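
A minimal sketch of that sampler contract, assuming the Prometheus Go
client; the query helper is hypothetical, while the -1 sentinel and
the 60s cadence come from this commit:

  import (
      "context"
      "time"

      "github.com/prometheus/client_golang/prometheus"
  )

  var orphanRefundRows = prometheus.NewGauge(prometheus.GaugeOpts{
      Name: "veza_ledger_orphan_refund_rows",
      Help: "Pending refunds with empty hyperswitch_refund_id older than 5m.",
  })

  func init() { prometheus.MustRegister(orphanRefundRows) }

  func sampleLoop(ctx context.Context, countOrphans func(context.Context) (float64, error)) {
      t := time.NewTicker(60 * time.Second) // sampled every 60s
      defer t.Stop()
      for {
          select {
          case <-ctx.Done():
              return // graceful shutdown wired from cmd/api/main.go
          case <-t.C:
              n, err := countOrphans(ctx)
              if err != nil {
                  orphanRefundRows.Set(-1) // sentinel: the sampler itself is broken
                  continue
              }
              orphanRefundRows.Set(n)
          }
      }
  }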

Alert rules in config/alertmanager/ledger.yml with runbook
pointers + detailed descriptions — each alert explains WHAT
happened, WHY the reconciler may not resolve it, and WHERE to
look first.

Grafana dashboard config/grafana/dashboards/ledger-health.json —
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick
+ orphan cumulative.

Tests — 6 cases, all green (sqlite :memory:):
  * CountsStuckOrdersPending (includes the filter on
    non-empty payment_id)
  * StuckOrdersZeroWhenAllCompleted
  * CountsOrphanRefunds (THE canary)
  * CountsStuckRefundsWithHsID (gauge-orthogonality check)
  * CountsFailedAndReversalPendingTransfers
  * ReconcilerRecorders (counter + gauge shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 03:40:14 +02:00
senke
113210734c chore(infra): J6 — mark 3 dormant docker-compose files as deprecated
Audit cross-checked against active composes shows three dormant compose
files that duplicate functionality already covered by the canonical
docker-compose.{,dev,prod,staging,test}.yml at the repo root. None are
referenced from Make targets, scripts, or CI workflows. They have
diverged from the active set (different ports, older Postgres version,
no shared volume names, etc.) and are a footgun for new contributors.

Files marked DEPRECATED with a header pointing at the canonical compose
to use instead:

  veza-stream-server/docker-compose.yml
    Standalone stream-server compose. Same service is provided by the
    root docker-compose.yml under the `docker-dev` profile.

  infra/docker-compose.lab.yml
    Lab Postgres on default port 5432. Conflicts with a host Postgres on
    most setups; root docker-compose.dev.yml uses non-default ports for
    a reason.

  config/docker/docker-compose.local.yml
    Local Postgres 15 variant on port 5433. Redundant with root
    docker-compose.dev.yml (Postgres 16, project-wide port mapping).

Not in this commit (intentionally limited J6 scope, per audit plan
"verify, don't refactor"):

  - No `extends:` consolidation across the active composes — that is a
    1-2 day refactor on its own and not a v1.0.4 concern.
  - The five active composes were syntactically validated locally
    (docker compose config); production and staging both require
    operator-injected env vars (DB_PASS, S3_*, RABBITMQ_PASS, etc.)
    which is the intended behavior, not a bug.
  - Cross-compose audit confirms zero references to the removed
    chat-server or any other dead service / image. Only one residual
    deprecation warning across all active composes: the obsolete
    `version:` field on docker-compose.{prod,test}.yml — cosmetic,
    not blocking.
  - Test suite verification (Go / Rust / Vitest) deferred to Forgejo CI
    rather than re-running locally. The pre-push hook + remote pipeline
    will gate the next push.

Follow-up candidates (not blocking v1.0.4):
  - Delete the three deprecated files once a 2-month grace period
    confirms no local dev workflow references them.
  - Drop the obsolete `version:` field across the active composes.

Refs: AUDIT_REPORT.md §6.1, §10 P7
2026-04-15 12:58:39 +02:00
senke
d5bfe4a558 docs: add project documentation, logging config, status script
Some checks failed
Backend API CI / test-unit (push) Failing after 0s
Backend API CI / test-integration (push) Failing after 0s
Frontend CI / test (push) Failing after 0s
Storybook Audit / Build & audit Storybook (push) Failing after 0s
- docs/VEZA_PROJECT_DOCUMENTATION.md
- config/logging.toml
- status.sh utility script

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:36:36 +01:00
senke
c0e2fe2e12 fix(v0.12.6.1): remediate remaining 15 MEDIUM + LOW pentest findings
MEDIUM-002: Remove manual X-Forwarded-For parsing in metrics_protection.go,
  use c.ClientIP() only (respects SetTrustedProxies)
MEDIUM-003: Pin ClamAV Docker image to 1.4 across all compose files
MEDIUM-004: Add clampLimit(100) to 15+ handlers that parsed limit directly (sketch below)
MEDIUM-006: Remove unsafe-eval from CSP script-src on Swagger routes
MEDIUM-007: Pin all GitHub Actions to SHA in 11 workflow files
MEDIUM-008: Replace rabbitmq:3-management-alpine with rabbitmq:3-alpine in prod
MEDIUM-009: Add trial-already-used check in subscription service
MEDIUM-010: Add 60s periodic token re-validation to WebSocket connections
MEDIUM-011: Mask email in auth handler logs with maskEmail() helper (sketch below)
MEDIUM-012: Add k-anonymity threshold (k=5) to playback analytics stats
LOW-001: Align frontend password policy to 12 chars (matching backend)
LOW-003: Replace deprecated dotenv with dotenvy crate in Rust stream server
LOW-004: Enable xpack.security in Elasticsearch dev/local compose files
LOW-005: Accept context.Context in CleanupExpiredSessions instead of Background()
LOW-002: Noted — Hyperswitch version update deferred (requires payment integration tests)
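
Plausible shapes for the MEDIUM-004 and MEDIUM-011 helpers, sketches
only; the commit names the functions but not their bodies or exact
signatures:

  import "strings"

  // clampLimit caps a client-supplied page size at 100 (MEDIUM-004).
  func clampLimit(limit int) int {
      const max = 100
      switch {
      case limit <= 0:
          return 10 // assumed default page size
      case limit > max:
          return max
      }
      return limit
  }

  // maskEmail hides the local part before logging (MEDIUM-011),
  // e.g. "alice@example.com" -> "a***@example.com".
  func maskEmail(email string) string {
      at := strings.IndexByte(email, '@')
      if at <= 0 {
          return "***"
      }
      return email[:1] + "***" + email[at:]
  }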

29/30 findings remediated. 1 noted (LOW-002).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 06:13:38 +01:00
senke
f5bca2b642 v0.9.5 2026-03-06 10:02:53 +01:00
senke
65375a61aa chore(release): v0.952 — Observe (Grafana v1-overview, Prometheus alert_rules_v1) 2026-03-02 19:08:55 +01:00
senke
83ed4f315b chore(release): v0.602 — Payout, Technical Debt & E2E Tests
Some checks failed
Backend API CI / test-unit (push) Failing after 0s
Backend API CI / test-integration (push) Failing after 0s
Frontend CI / test (push) Failing after 0s
Storybook Audit / Build & audit Storybook (push) Failing after 0s
- Stripe Connect: onboarding, balance, SellerDashboardView
- Interceptors: auth.ts, error.ts extracted, facade
- Grafana: dashboards enriched (p50, top endpoints, 4xx, WS, commerce)
- E2E commerce: product->order->review->invoice
- SMOKE_TEST_V0602, RETROSPECTIVE_V0602, PAYOUT_MANUAL
- Archive V0_602 scope, V0_603 placeholder, SCOPE_CONTROL v0.603
- Fix sanitizer regex (Go regexp has no backreferences)
- Marketplace test schema: product_licenses, product_images, orders, licenses
2026-02-23 22:32:01 +01:00
senke
30bc31f3a6 feat(monitoring): add Alertmanager with Slack notifications
- config/alertmanager/alertmanager.yml: route, slack-default and null receivers
- config/prometheus.yml: alerting.alertmanagers -> alertmanager:9093
- docker-compose.prod.yml: alertmanager service (port 9093)
2026-02-23 19:54:55 +01:00
senke
c002e74031 feat(monitoring): add 3 Grafana dashboards (API, Chat, Commerce)
- api-overview.json: request rate, p95 latency, 5xx errors, DB pool
- chat-overview.json: WebSocket upgrade rate, chat API
- commerce-overview.json: marketplace/commerce/orders metrics
- system-overview.json: replaces veza-dashboard.json
2026-02-23 19:54:01 +01:00
senke
0ff8a85684 feat(infra): blue-green deployment via HAProxy
- HAProxy: api/stream/web backends with blue+green servers (backup)
- docker-compose.prod: backend-api-blue/green, stream-server-blue/green, web-blue/green
- haproxy-blue.cfg, haproxy-green.cfg: config variants for active stack
- scripts/deploy-blue-green.sh: switch traffic via config copy + HUP reload
2026-02-23 19:52:19 +01:00
senke
279a10d317 chore(cleanup): remove veza-chat-server directory and all operational references
Chat functionality is now fully handled by the Go backend (since v0.502).
Remove the deprecated Rust chat server and all its references from:
- CI/CD workflows (ci.yml, cd.yml, rust-ci.yml, chat-ci.yml)
- Monitoring & proxy config (prometheus, caddy, haproxy)
- Incus deployment scripts and documentation
- Monorepo config (package.json, dependabot, GH templates)
2026-02-22 21:13:00 +01:00
senke
6b25ccc9da feat(monitoring): add Prometheus alerting rules for critical conditions
INF-08: Alert rules for service_down, high_error_rate (>5%),
high_latency (P99>2s), and redis_unreachable. Enabled rule_files
in prometheus.yml.
2026-02-22 17:36:07 +01:00
senke
3e0e1b5286 feat(infra): complete staging compose with chat, stream, and reverse proxy
INF-07: Added chat-server, stream-server, Caddy reverse proxy,
and healthchecks for all services in staging compose.
2026-02-22 17:36:03 +01:00
senke
98f6db3a1d fix(streaming): ensure HLS audio chain works end-to-end
- HAProxy: route /hls to stream server
- Vite proxy: /ws, /stream, /hls for dev
- HLS_BASE_URL: empty when STREAM_URL relative (proxy)
- FEATURE_STATUS: HLS_STREAMING operational
2026-02-18 12:42:42 +01:00
senke
b657776892 fix(infra): HAProxy HTTPS and stats security
P1.1 - Enable HTTPS in HAProxy for production:
- HTTP to HTTPS redirect (301)
- HTTPS frontend on port 443 with veza.pem
- config/ssl/ structure with README and generate-ssl-cert.sh
- docker-compose.prod.yml volume for certs

P1.3 - Restrict HAProxy stats to internal network:
- ACL from_internal (127.0.0.1, 172.20.0.0/16)
- stats admin if from_internal

Also: remove errorfile directives (use HAProxy built-in defaults)
2026-02-15 15:58:51 +01:00
senke
ae586f6134 Phase 2 stabilization: dead code, Modal→Dialog, feature flags, tests, router split, legacy Rust
Block A - Dead code:
- Remove Studio (components, views, features)
- Remove gamification + mock services (projectService, storageService, gamificationService)
- Update Sidebar, Navbar, locales

Block B - Frontend:
- Remove deprecated modal.tsx and Modal.stories (duplicate of Dialog)
- Feature flags: PLAYLIST_SEARCH, PLAYLIST_RECOMMENDATIONS, ROLE_MANAGEMENT = true
- Remove 19 orphaned tests, drop the vitest.config exclusions

Block C - Backend:
- Extract routes_auth.go from router.go

Block D - Rust:
- Remove security_legacy.rs (dead code, patterns already in security/)
2026-02-14 17:23:32 +01:00
senke
6e0d4457d9 chore(docker): document docker-compose file usage and purpose
Add config/docker/README.md with:
- Table of all remaining docker-compose files and their purposes
- Usage commands for each environment
- List of deleted deprecated files (from C9)
- Required environment variables for production deployment

Addresses audit finding: debt item 8 (12 docker-compose files confusion).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-11 22:53:24 +01:00
senke
0a269ed664 chore: remove dead code, backups, and deprecated docker-compose files
Removed files:
- apps/web/src/utils/storeSelectors.ts.backup (committed backup file)
- apps/web/desy/ (69 files, unused legacy design system)
- docker-compose.production.yml (root, superseded by docker-compose.prod.yml)
- config/docker/docker-compose.production.yml (deprecated copy)
- veza-stream-server/docker-compose.production.yml (deprecated copy)

Annotated ghost features in MSW handlers:
- Education endpoints marked as GHOST FEATURE (no backend)
- Gamification endpoints marked as GHOST FEATURE (no backend)

Not removed (out of scope for this commit):
- veza-desktop/ and veza-mobile/ (separate issue)
- Root-level audit markdown reports (product owner decision)

Addresses audit findings: debt items 12-18 (dead code, ghost features).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-11 22:52:59 +01:00
senke
f53b7f7d8a chore: update docker-compose, make, tmt config
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-11 22:18:57 +01:00
senke
44ddd3b858 chore(incus): add env template, document setup
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-11 19:49:01 +01:00
senke
d7bb127920 fix(security): stop tracking veza-stream-server/.env and config/incus env files
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-11 19:48:51 +01:00
senke
023b8a89c6 fix: correct Swagger URL and finalize DeveloperPage implementation
- Add a fallback for Swagger UI when doc.json does not work
- Improve the error message with a button that opens Swagger UI directly
- The API Keys and Usage Stats features are now complete and functional
- All DeveloperPage tabs are now implemented
2026-01-18 13:55:28 +01:00
senke
f0ba7de543 state-ownership: delete unused optimisticStoreUpdates.ts file
- Deleted apps/web/src/utils/optimisticStoreUpdates.ts (unused file)
- File was unused - no imports found in codebase
- Mutations already use React Query's onMutate pattern
- No TypeScript errors after deletion
- Actions 4.4.1.2 and 4.4.1.3 complete
2026-01-15 19:26:53 +01:00
senke
76d95ecfb4 incus deployment fully implemented, Makefile updated, make fmt run 2026-01-13 19:47:57 +01:00
senke
f74b020d4b api-contracts: install openapi-generator-cli and create type generation script
- Completed Action 1.1.2.1: Installed @openapitools/openapi-generator-cli
- Completed Action 1.1.2.2: Created generate-types.sh script
- Added swagger annotations to cmd/modern-server/main.go
- Regenerated swagger.yaml with proper info section
- Successfully generated TypeScript types to src/types/generated/

The script generates types from veza-backend-api/openapi.yaml using
typescript-axios generator and creates barrel exports.
2026-01-11 16:30:43 +01:00
okinrev
327ac36a30 BASE: completing the initial repo state 2025-12-03 22:56:50 +01:00