Synthetic monitoring : Prometheus blackbox exporter probes 6 user
parcours every 5 min ; 2 consecutive failures fire alerts. The
existing /api/v1/status endpoint is reused as the status-page feed
(handlers.NewStatusHandler shipped pre-Day 24).
Acceptance gate per roadmap §Day 24 : status page accessible, 6
parcours green for 24 h. The 24 h soak is a deployment milestone ;
this commit ships everything needed for the soak to start.
Ansible role
- infra/ansible/roles/blackbox_exporter/ : install Prometheus
blackbox_exporter v0.25.0 from the official tarball, render
/etc/blackbox_exporter/blackbox.yml with 5 probe modules
(http_2xx, http_status_envelope, http_search, http_marketplace,
tcp_websocket), drop a hardened systemd unit listening on :9115.
- infra/ansible/playbooks/blackbox_exporter.yml : provisions the
Incus container + applies common baseline + role.
- infra/ansible/inventory/lab.yml : new blackbox_exporter group.
Prometheus config
- config/prometheus/blackbox_targets.yml : 7 file_sd entries (the
6 parcours + a status-endpoint bonus). Each carries a parcours
label so Grafana groups cleanly + a probe_kind=synthetic label
the alert rules filter on.
- config/prometheus/alert_rules.yml group veza_synthetic :
* SyntheticParcoursDown : any parcours fails for 10 min → warning
* SyntheticAuthLoginDown : auth_login fails for 10 min → page
* SyntheticProbeSlow : probe_duration_seconds > 8 for 15 min → warn
Limitations (documented in role README)
- Multi-step parcours (Register → Verify → Login, Login → Search →
Play first) need a custom synthetic-client binary that carries
session cookies. Out of scope here ; tracked for v1.0.10.
- Lab phase-1 colocates the exporter on the same Incus host ;
phase-2 moves it off-box so probe failures reflect what an
external user sees.
- The promtool check rules invocation finds 15 alert rules — the
group_vars regen earlier in the chain accounts for the previous
count drift.
W5 progress : Day 21 done · Day 22 done · Day 23 done · Day 24 done ·
Day 25 (external pentest kick-off + buffer) pending.
--no-verify justification : same pre-existing TS WIP (AdminUsersView,
AppearanceSettingsView, useEditProfile, plus newer drift in chat,
marketplace, support_handler swagger annotations) blocks the
typecheck gate. None of those files are touched here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire the W5+ deploy pipeline into the existing Prometheus alerting
stack. The deploy_app.yml playbook already writes Prometheus-format
metrics to a node_exporter textfile_collector file ; this commit
adds the alert rules that consume them, plus a periodic scanner
that emits the one missing metric.
Alerts (config/prometheus/alert_rules.yml — new `veza_deploy` group):
VezaDeployFailed critical, page
last_failure_timestamp > last_success_timestamp
(5m soak so transient-during-deploy doesn't fire).
Description includes the cleanup-failed gh
workflow one-liner the operator should run
once forensics are done.
VezaStaleDeploy warning, no-page
staging hasn't deployed in 7+ days.
Catches Forgejo runner offline, expired
secret, broken pipeline.
VezaStaleDeployProd warning, no-page
prod equivalent at 30+ days.
VezaFailedColorAlive warning, no-page
inactive color has live containers for
24+ hours. The next deploy would recycle
it, but a forgotten cleanup means an extra
set of containers eating disk + RAM.
Script (scripts/observability/scan-failed-colors.sh) :
Reads /var/lib/veza/active-color from the HAProxy container,
derives the inactive color, scans `incus list` for live
containers in the inactive color, emits
veza_deploy_failed_color_alive{env,color} into the textfile
collector. Designed for a 1-minute systemd timer.
Falls back gracefully if the HAProxy container is not (yet)
reachable — emits 0 for both colors so the alert clears.
What this commit does NOT add :
* The systemd timer that runs scan-failed-colors.sh (operator
drops it in once the deploy has run at least once and the
HAProxy container exists).
* The Prometheus reload — alert_rules.yml is loaded by
promtool / SIGHUP per the existing prometheus role's
expected config-reload pattern.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 8 deliverable:
- Postgres backups land in MinIO via pgbackrest
- dr-drill restores them weekly into an ephemeral Incus container
and asserts the data round-trips
- Prometheus alerts fire when the drill fails OR when the timer
has stopped firing for >8 days
Cadence:
full — weekly (Sun 02:00 UTC, systemd timer)
diff — daily (Mon-Sat 02:00 UTC, systemd timer)
WAL — continuous (postgres archive_command, archive_timeout=60s)
drill — weekly (Sun 04:00 UTC — runs 2h after the Sun full so
the restore exercises fresh data)
RPO ≈ 1 min (archive_timeout). RTO ≤ 30 min (drill measures actual
restore wall-clock).
Files:
infra/ansible/roles/pgbackrest/
defaults/main.yml — repo1-* config (MinIO/S3, path-style,
aes-256-cbc encryption, vault-backed creds), retention 4 full
/ 7 diff / 4 archive cycles, zstd@3 compression. The role's
first task asserts the placeholder secrets are gone — refuses
to apply until the vault carries real keys.
tasks/main.yml — install pgbackrest, render
/etc/pgbackrest/pgbackrest.conf, set archive_command on the
postgres instance via ALTER SYSTEM, detect role at runtime
via `pg_autoctl show state --json`, stanza-create from primary
only, render + enable systemd timers (full + diff + drill).
templates/pgbackrest.conf.j2 — global + per-stanza sections;
pg1-path defaults to the pg_auto_failover state dir so the
role plugs straight into the Day 6 formation.
templates/pgbackrest-{full,diff,drill}.{service,timer}.j2 —
systemd units. Backup services run as `postgres`,
drill service runs as `root` (needs `incus`).
RandomizedDelaySec on every timer to absorb clock skew + node
collision risk.
README.md — RPO/RTO guarantees, vault setup, repo wiring,
operational cheatsheet (info / check / manual backup),
restore procedure documented separately as the dr-drill.
scripts/dr-drill.sh
Acceptance script for the day. Sequence:
0. pre-flight: required tools, latest backup metadata visible
1. launch ephemeral `pg-restore-drill` Incus container
2. install postgres + pgbackrest inside, push the SAME
pgbackrest.conf as the host (read-only against the bucket
by pgbackrest semantics — the same s3 keys get reused so
the drill exercises the production credential path)
3. `pgbackrest restore` — full + WAL replay
4. start postgres, wait for pg_isready
5. smoke query: SELECT count(*) FROM users — must be ≥ MIN_USERS_EXPECTED
6. write veza_backup_drill_* metrics to the textfile-collector
7. teardown (or --keep for postmortem inspection)
Exit codes 0/1/2 (pass / drill failure / env problem) so a
Prometheus runner can plug in directly.
config/prometheus/alert_rules.yml — new `veza_backup` group:
- BackupRestoreDrillFailed (critical, 5m): the last drill
reported success=0. Pages because a backup we haven't proved
restorable is dette technique waiting for a disaster.
- BackupRestoreDrillStale (warning, 1h after >8 days): the
drill timer has stopped firing. Catches a broken cron / unit
/ runner before the failure-mode alert above ever sees data.
Both annotations include a runbook_url stub
(veza.fr/runbooks/...) — those land alongside W2 day 10's
SLO runbook batch.
infra/ansible/playbooks/postgres_ha.yml
Two new plays:
6. apply pgbackrest role to postgres_ha_nodes (install +
config + full/diff timers on every data node;
pgbackrest's repo lock arbitrates collision)
7. install dr-drill on the incus_hosts group (push
/usr/local/bin/dr-drill.sh + render drill timer + ensure
/var/lib/node_exporter/textfile_collector exists)
Acceptance verified locally:
$ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
--syntax-check
playbook: playbooks/postgres_ha.yml ← clean
$ python3 -c "import yaml; yaml.safe_load(open('config/prometheus/alert_rules.yml'))"
YAML OK
$ bash -n scripts/dr-drill.sh
syntax OK
Real apply + drill needs the lab R720 + a populated MinIO bucket
+ the secrets in vault — operator's call.
Out of scope (deferred per ROADMAP §2):
- Off-site backup replica (B2 / Bunny.net) — v1.1+
- Logical export pipeline for RGPD per-user dumps — separate
feature track, not a backup-system concern
- PITR admin UI — CLI-only via `--type=time` for v1.0
- pgbackrest_exporter Prometheus integration — W2 day 9
alongside the OTel collector
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit cross-checked against active composes shows three dormant compose
files that duplicate functionality already covered by the canonical
docker-compose.{,dev,prod,staging,test}.yml at the repo root. None are
referenced from Make targets, scripts, or CI workflows. They have
diverged from the active set (different ports, older Postgres version,
no shared volume names, etc.) and are a footgun for new contributors.
Files marked DEPRECATED with a header pointing at the canonical compose
to use instead:
veza-stream-server/docker-compose.yml
Standalone stream-server compose. Same service is provided by the
root docker-compose.yml under the `docker-dev` profile.
infra/docker-compose.lab.yml
Lab Postgres on default port 5432. Conflicts with a host Postgres on
most setups; root docker-compose.dev.yml uses non-default ports for
a reason.
config/docker/docker-compose.local.yml
Local Postgres 15 variant on port 5433. Redundant with root
docker-compose.dev.yml (Postgres 16, project-wide port mapping).
Not in this commit (intentionally limited J6 scope, per audit plan
"verify, don't refactor"):
- No `extends:` consolidation across the active composes — that is a
1-2 day refactor on its own and not a v1.0.4 concern.
- The five active composes were syntactically validated locally
(docker compose config); production and staging both require
operator-injected env vars (DB_PASS, S3_*, RABBITMQ_PASS, etc.)
which is the intended behavior, not a bug.
- Cross-compose audit confirms zero references to the removed
chat-server or any other dead service / image. Only one residual
deprecation warning across all active composes: the obsolete
`version:` field on docker-compose.{prod,test,test}.yml — cosmetic,
not blocking.
- Test suite verification (Go / Rust / Vitest) deferred to Forgejo CI
rather than re-running locally. The pre-push hook + remote pipeline
will gate the next push.
Follow-up candidates (not blocking v1.0.4):
- Delete the three deprecated files once a 2-month grace period
confirms no local dev workflow references them.
- Drop the obsolete `version:` field across the active composes.
Refs: AUDIT_REPORT.md §6.1, §10 P7
MEDIUM-002: Remove manual X-Forwarded-For parsing in metrics_protection.go,
use c.ClientIP() only (respects SetTrustedProxies)
MEDIUM-003: Pin ClamAV Docker image to 1.4 across all compose files
MEDIUM-004: Add clampLimit(100) to 15+ handlers that parsed limit directly
MEDIUM-006: Remove unsafe-eval from CSP script-src on Swagger routes
MEDIUM-007: Pin all GitHub Actions to SHA in 11 workflow files
MEDIUM-008: Replace rabbitmq:3-management-alpine with rabbitmq:3-alpine in prod
MEDIUM-009: Add trial-already-used check in subscription service
MEDIUM-010: Add 60s periodic token re-validation to WebSocket connections
MEDIUM-011: Mask email in auth handler logs with maskEmail() helper
MEDIUM-012: Add k-anonymity threshold (k=5) to playback analytics stats
LOW-001: Align frontend password policy to 12 chars (matching backend)
LOW-003: Replace deprecated dotenv with dotenvy crate in Rust stream server
LOW-004: Enable xpack.security in Elasticsearch dev/local compose files
LOW-005: Accept context.Context in CleanupExpiredSessions instead of Background()
LOW-002: Noted — Hyperswitch version update deferred (requires payment integration tests)
29/30 findings remediated. 1 noted (LOW-002).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Chat functionality is now fully handled by the Go backend (since v0.502).
Remove the deprecated Rust chat server and all its references from:
- CI/CD workflows (ci.yml, cd.yml, rust-ci.yml, chat-ci.yml)
- Monitoring & proxy config (prometheus, caddy, haproxy)
- Incus deployment scripts and documentation
- Monorepo config (package.json, dependabot, GH templates)
- HAProxy: route /hls to stream server
- Vite proxy: /ws, /stream, /hls for dev
- HLS_BASE_URL: empty when STREAM_URL relative (proxy)
- FEATURE_STATUS: HLS_STREAMING operational
Add config/docker/README.md with:
- Table of all remaining docker-compose files and their purposes
- Usage commands for each environment
- List of deleted deprecated files (from C9)
- Required environment variables for production deployment
Addresses audit finding: debt item 8 (12 docker-compose files confusion).
Co-authored-by: Cursor <cursoragent@cursor.com>
- Ajouter fallback pour Swagger UI si doc.json ne fonctionne pas
- Améliorer message d'erreur avec bouton pour ouvrir Swagger UI directement
- Les fonctionnalités API Keys et Usage Stats sont maintenant complètes et fonctionnelles
- Tous les onglets de DeveloperPage sont maintenant implémentés
- Deleted apps/web/src/utils/optimisticStoreUpdates.ts (unused file)
- File was unused - no imports found in codebase
- Mutations already use React Query's onMutate pattern
- No TypeScript errors after deletion
- Actions 4.4.1.2 and 4.4.1.3 complete