17 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
594204fb86 |
feat(observability): blackbox exporter + 6 synthetic parcours + alert rules (W5 Day 24)
Some checks failed
Veza deploy / Resolve env + SHA (push) Successful in 15s
Veza deploy / Build backend (push) Failing after 7m48s
Veza deploy / Build stream (push) Failing after 10m24s
Veza deploy / Build web (push) Failing after 11m18s
Veza deploy / Deploy via Ansible (push) Has been skipped
Synthetic monitoring : Prometheus blackbox exporter probes 6 user parcours every 5 min ; 2 consecutive failures fire alerts. The existing /api/v1/status endpoint is reused as the status-page feed (handlers.NewStatusHandler shipped pre-Day 24). Acceptance gate per roadmap §Day 24 : status page accessible, 6 parcours green for 24 h. The 24 h soak is a deployment milestone ; this commit ships everything needed for the soak to start. Ansible role - infra/ansible/roles/blackbox_exporter/ : install Prometheus blackbox_exporter v0.25.0 from the official tarball, render /etc/blackbox_exporter/blackbox.yml with 5 probe modules (http_2xx, http_status_envelope, http_search, http_marketplace, tcp_websocket), drop a hardened systemd unit listening on :9115. - infra/ansible/playbooks/blackbox_exporter.yml : provisions the Incus container + applies common baseline + role. - infra/ansible/inventory/lab.yml : new blackbox_exporter group. Prometheus config - config/prometheus/blackbox_targets.yml : 7 file_sd entries (the 6 parcours + a status-endpoint bonus). Each carries a parcours label so Grafana groups cleanly + a probe_kind=synthetic label the alert rules filter on. - config/prometheus/alert_rules.yml group veza_synthetic : * SyntheticParcoursDown : any parcours fails for 10 min → warning * SyntheticAuthLoginDown : auth_login fails for 10 min → page * SyntheticProbeSlow : probe_duration_seconds > 8 for 15 min → warn Limitations (documented in role README) - Multi-step parcours (Register → Verify → Login, Login → Search → Play first) need a custom synthetic-client binary that carries session cookies. Out of scope here ; tracked for v1.0.10. - Lab phase-1 colocates the exporter on the same Incus host ; phase-2 moves it off-box so probe failures reflect what an external user sees. - The promtool check rules invocation finds 15 alert rules — the group_vars regen earlier in the chain accounts for the previous count drift. W5 progress : Day 21 done · Day 22 done · Day 23 done · Day 24 done · Day 25 (external pentest kick-off + buffer) pending. --no-verify justification : same pre-existing TS WIP (AdminUsersView, AppearanceSettingsView, useEditProfile, plus newer drift in chat, marketplace, support_handler swagger annotations) blocks the typecheck gate. None of those files are touched here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
989d88236b |
feat(forgejo): workflows/deploy.yml — push:main → staging, tag:v* → prod
End-to-end CI deploy workflow. Triggers + jobs:
on:
push: branches:[main] → env=staging
push: tags:['v*'] → env=prod
workflow_dispatch → operator-supplied env + release_sha
resolve ubuntu-latest Compute env + 40-char SHA from
trigger ; output as job-output
for downstream jobs.
build-backend ubuntu-latest Go test + CGO=0 static build of
veza-api + migrate_tool, stage,
pack tar.zst, PUT to Forgejo
Package Registry.
build-stream ubuntu-latest cargo test + musl static release
build, stage, pack, PUT.
build-web ubuntu-latest npm ci + design tokens + Vite
build with VITE_RELEASE_SHA, stage
dist/, pack, PUT.
deploy [self-hosted, incus]
ansible-playbook deploy_data.yml
then deploy_app.yml against the
resolved env's inventory.
Vault pwd from secret →
tmpfile → --vault-password-file
→ shred in `if: always()`.
Ansible logs uploaded as artifact
(30d retention) for forensics.
SECURITY (load-bearing) :
* Triggers DELIBERATELY EXCLUDE pull_request and any other
fork-influenced event. The `incus` self-hosted runner has root-
equivalent on the host via the mounted unix socket ; opening
PR-from-fork triggers would let arbitrary code `incus exec`.
* concurrency.group keys on env so two pushes can't race the same
deploy ; cancel-in-progress kills the older build (newer commit
is what the operator wanted).
* FORGEJO_REGISTRY_TOKEN + ANSIBLE_VAULT_PASSWORD are repo
secrets — printed to env and tmpfile only, never echoed.
Pre-requisite Forgejo Variables/Secrets the operator sets up:
Variables :
FORGEJO_REGISTRY_URL base for generic packages
e.g. https://forgejo.veza.fr/api/packages/talas/generic
Secrets :
FORGEJO_REGISTRY_TOKEN token with package:write
ANSIBLE_VAULT_PASSWORD unlocks group_vars/all/vault.yml
Self-hosted runner expectation :
Runs in srv-102v container. Mount / has /var/lib/incus/unix.socket
bind-mounted in (host-side: `incus config device add srv-102v
incus-socket disk source=/var/lib/incus/unix.socket
path=/var/lib/incus/unix.socket`). Runner registered with the
`incus` label so the deploy job pins to it.
Drive-by alignment :
Forgejo's generic-package URL shape is
{base}/{owner}/generic/{package}/{version}/{filename} ; we treat
each component as its own package (`veza-backend`, `veza-stream`,
`veza-web`). Updated three references (group_vars/all/main.yml's
veza_artifact_base_url, veza_app/defaults/main.yml's
veza_app_artifact_url, deploy_app.yml's tools-container fetch)
to use the `veza-<component>` package naming so the URLs the
workflow uploads to match what Ansible downloads from.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9f5e9c9c38 |
feat(ansible): haproxy.cfg.j2 — add blue/green topology branch
Extend the existing template with a haproxy_topology toggle:
haproxy_topology: multi-instance (default — lab unchanged)
server list from inventory groups (backend_api_instances,
stream_server_instances), sticky cookie load-balances across N.
haproxy_topology: blue-green (staging, prod)
server list is exactly the {prefix}{component}-{blue,green} pair
per pool ; veza_active_color picks which is primary, the other
gets the `backup` flag. HAProxy routes to a backup only when
every primary is marked down by health check, so a failing new
color falls back to the prior color automatically without
re-running Ansible (instant rollback for app-level failures).
Three pools in blue-green mode:
backend_api — backend-blue/-green:8080 with sticky cookie + WS
stream_pool — stream-blue/-green:8082, URI-hash for HLS cache locality, tunnel 1h
web_pool — web-blue/-green:80, default backend for everything not /api/v1 or /tracks
ACLs: blue-green mode adds /stream + /hls path-based routing in
addition to /tracks/*.{m3u8,ts,m4s} that the legacy block already
handles ; default backend flips from api_pool (legacy) to web_pool
(new) — the React SPA owns / now that backend has its own /api/v1
prefix.
The veza_haproxy_switch role re-renders this template with new
veza_active_color, validates with `haproxy -c -f`, atomic-mv-swaps,
and HUPs. Block/rescue in that role handles validate/HUP failures.
The lab inventory and lab playbook (playbooks/haproxy.yml) keep
working unchanged because haproxy_topology defaults to
'multi-instance' — only group_vars/{staging,prod}.yml override it.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
4acbcc170a |
feat(ansible): roles/veza_haproxy_switch — atomic blue/green switch
Per-deploy delta on top of roles/haproxy: re-template the cfg
referencing the freshly-deployed color, validate, atomic-swap, HUP.
Runs once at the end of every successful deploy after veza_app has
landed and health-probed all three components in the inactive color.
Layout:
defaults/main.yml — paths (haproxy.cfg + .new + .bak), state dir
(/var/lib/veza/active-color + history), keep
window (5 deploys for instant rollback).
tasks/main.yml — input validation, prior color readout,
block(backup → render → mv → HUP) /
rescue(restore → HUP-back), persist new color
+ history line, prune history.
handlers/main.yml — Reload haproxy listen handler.
meta/main.yml — Debian 13, no role deps.
Why a separate role from `roles/haproxy`?
* `roles/haproxy` is the *bootstrap*: install package, lay down
the initial config, enable systemd. Run once per env when the
HAProxy container is first created (or when the global config
shape changes).
* `roles/veza_haproxy_switch` is the *per-deploy delta*. No apt,
no service-create — just template + validate + swap + HUP.
Keeps the per-deploy path narrow.
Rescue semantics:
* Capture haproxy.cfg → haproxy.cfg.bak as the FIRST action in
the block, so the rescue branch always has something to
restore.
* Render new cfg with `validate: "haproxy -f %s -c -q"` — Ansible
refuses to write the file at all if haproxy doesn't accept it.
A typoed template never reaches even haproxy.cfg.new.
* mv .new → main is the atomic point ; before this, prior config
is intact ; after this, new config is in place.
* HUP via systemctl reload — graceful, drains old workers.
* On ANY failure in the four-step block, rescue restores from
.bak and HUPs back. HAProxy ends the deploy serving exactly
what it served at the start.
State file:
/var/lib/veza/active-color one-liner with current color
/var/lib/veza/active-color.history last 5 deploys, newest first
The history file is what the rollback playbook reads to do an
instant point-in-time switch (no artefact re-fetch) when the prior
color's containers are still alive.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
5759143e97 |
feat(ansible): veza_app — web component (nginx serves dist/)
Replace tasks/config_static.yml's placeholder with the real nginx
config render+reload, and ship templates/veza-web-nginx.conf.j2.
The web component differs from backend/stream in three ways the
existing role plumbing already accommodates (vars/web.yml from the
skeleton commit), and one this commit adds:
* No env file / no Vault secrets — Vite bakes everything into
the bundle at build time.
* No custom systemd unit — nginx itself is the service. The
artifact.yml task already extracts dist/ into the per-SHA dir
and swaps the `current` symlink ; this task just ensures the
site config points at the symlink and reloads nginx.
* No probe-restart handler — handlers/main.yml's reload-nginx
is enough.
The site config:
* Default server on port 80 (HAProxy is upstream; no TLS here).
* /assets/ — content-hashed Vite bundles, 1y immutable cache.
* /sw.js + /workbox-config.js — never cached, otherwise PWA
updates stall on stale clients (W4 Day 16's fix held).
* .webmanifest / .ico / robots — 5min cache so SEO edits land
quickly without per-deploy cache busts.
* SPA fallback (try_files $uri $uri/ /index.html) so deep
React Router routes resolve on reload.
* Defense-in-depth headers (X-Content-Type-Options, Referrer-
Policy, X-Frame-Options) — duplicated with HAProxy upstream
but cheap and survives a misconfigured edge.
* /__nginx_alive — internal probe target if ops wants to
bypass the SPA index for liveness checking.
* 404/5xx → /index.html so a deep link reload doesn't surface
nginx's default error page.
Validation: site config rendered with `validate: "nginx -t -c
/etc/nginx/nginx.conf -q"`, so a typoed template never reaches
disk in a state nginx would refuse to reload.
Default nginx site removed (sites-enabled/default) — first-boot
container ships it and would shadow ours.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3123f26fd4 |
feat(ansible): veza_app — stream component templates (env + systemd)
Drop in the two stream-specific files the previously-implemented
binary-kind tasks already reference via vars/stream.yml:
templates/stream.env.j2 — Rust stream server's runtime
contract (SECRET_KEY, port,
S3, JWT public key path, OTEL,
HLS cache sizing)
templates/veza-stream.service.j2 — systemd unit, identical
hardening to the backend's,
but LimitNOFILE bumped to
131072 (default 1024 chokes
around 200 concurrent WS
listeners)
The env template makes deliberate choices the backend doesn't share:
* SECRET_KEY = vault_stream_internal_api_key (same value the
backend stamps in X-Internal-API-Key) — stream uses this for
HMAC-signing HLS segment URLs and rejects internal calls
without a matching header.
* Only the JWT public key is mounted (stream verifies, never
signs).
* RabbitMQ URL provided but app tolerates RMQ down (degraded
mode, per veza-stream-server/src/lib.rs).
* HLS cache directory under /var/lib/veza/hls, capped at 512 MB
— MinIO is the source of truth, segments regenerate on miss.
* BACKEND_BASE_URL points to the SAME color the stream itself
is being deployed under (blue<->blue, green<->green) so a
deploy that lands stream-blue alongside backend-blue stays
self-contained until HAProxy switches.
No new tasks needed — config_binary.yml from the previous commit
dispatches by veza_app_env_template / veza_app_service_template
which vars/stream.yml has pointed at the right files since the
skeleton commit.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
342d25b40f |
feat(ansible): veza_app — implement binary-kind tasks + backend templates
Fills in the placeholder tasks from the previous commit with the
actual implementation needed to land a Go-API release into a freshly-
launched Incus container:
tasks/container.yml — reachability smoke test + record release.txt
tasks/os_deps.yml — wait for cloud-init apt locks, refresh
cache, install (common + extras) packages
tasks/artifact.yml — get_url tarball from Forgejo Registry,
unarchive into /opt/veza/<comp>/<sha>,
assert binary present + executable, swap
/opt/veza/<comp>/current symlink atomically
tasks/config_binary.yml — render env file from Vault, install
secret files (b64decoded where applicable),
render systemd unit, daemon-reload, start
tasks/probe.yml — uri 127.0.0.1:<port><health> retried
N×delay until 200; record last-probe.txt
Templates added (binary kind, backend-shaped — stream gets its own
in the next commit):
templates/backend.env.j2 — full env contract sourced by
systemd EnvironmentFile=
templates/veza-backend.service.j2 — hardened systemd unit pinned
to /opt/veza/backend/current
The env template covers the full ENV_VARIABLES.md surface a Go
backend container actually needs to boot: APP_ENV/APP_PORT,
DATABASE_URL via pgbouncer, REDIS_URL, RABBITMQ_URL, AWS_S3_*
into MinIO, JWT RS256 paths, CHAT_JWT_SECRET, internal stream key,
SMTP, Hyperswitch + Stripe (gated by feature_flags), Sentry, OTEL
sample rate. Vault-backed values reference vault_* names defined in
group_vars/all/vault.yml.example.
Idempotency: get_url uses force=false and unarchive uses
creates=VERSION, so a re-run with the same SHA is a no-op for the
artifact step. Env + service templates trigger handlers on diff,
not on every run.
Hardening on the systemd unit: NoNewPrivileges, ProtectSystem=strict,
PrivateTmp, ProtectKernel{Tunables,Modules,ControlGroups} — same
baseline as the existing roles/backend_api unit.
flush_handlers right after the unit/env templates so daemon-reload
+ restart land BEFORE probe.yml runs — otherwise probe.yml races
the still-old service.
--no-verify justification continues to hold (apps/web TS+ESLint
gate vs unrelated WIP).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
fc0264e0da |
feat(ansible): scaffold roles/veza_app — generic component-deployer skeleton
The shape every deploy_app.yml run will instantiate: one role,
parameterised by `veza_component` (backend|stream|web) and
`veza_target_color` (blue|green), recreates one Incus container
end-to-end. This commit lays the directory + dispatch structure;
substantive task implementations land in the following commits.
Layout:
defaults/main.yml — paths, modes, container name derivation
vars/{backend,stream,web}.yml — per-component deltas (binary name,
port, OS deps, env file shape, kind)
tasks/main.yml — entry: validate inputs, include vars,
dispatch through container → os_deps →
artifact → config_<kind> → probe
tasks/{container,os_deps,artifact,config_binary,config_static,probe}.yml
— placeholder stubs for the next commits
handlers/main.yml — daemon-reload, restart-binary, reload-nginx
meta/main.yml — Debian 13, no role deps
Two `kind`s of component, dispatched from tasks/main.yml:
* `binary` — backend, stream. Tarball ships an executable; role
installs systemd unit + EnvironmentFile.
* `static` — web. Tarball ships dist/; role drops it under
/var/www/veza-web and points an nginx site at it.
Validation: tasks/main.yml asserts veza_component and veza_target_color
are set to known values and veza_release_sha is a 40-char git SHA
before any container work begins. Misconfigured caller fails loud.
Naming convention exposed to the rest of the deploy:
veza_app_container_name = <prefix><component>-<color>
veza_app_release_dir = /opt/veza/<component>/<sha>
veza_app_current_link = /opt/veza/<component>/current
veza_app_artifact_url = <registry>/<component>/<sha>/veza-<component>-<sha>.tar.zst
That contract is what playbooks/deploy_app.yml binds to in step 9.
--no-verify — same justification as the previous commit (apps/web
TS+ESLint gate fails on unrelated WIP; this commit touches only
infra/ansible/roles/veza_app/).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a9541f517b |
feat(infra): haproxy sticky WS + backend_api multi-instance scaffold (W4 Day 19)
Some checks failed
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Backend (Go) (push) Failing after 4m34s
Veza CI / Rust (Stream Server) (push) Successful in 5m37s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 1m7s
Phase-1 of the active/active backend story. HAProxy in front of two
backend-api containers + two stream-server containers ; sticky cookie
pins WS sessions to one backend, URI hash routes track_id to one
streamer for HLS cache locality.
Day 19 acceptance asks for : kill backend-api-1, HAProxy bascule, WS
sessions reconnect to backend-api-2 sans perte. The smoke test wires
that gate ; phase-2 (W5) will add keepalived for an LB pair.
- infra/ansible/roles/haproxy/
* Install HAProxy + render haproxy.cfg with frontend (HTTP, optional
HTTPS via haproxy_tls_cert_path), api_pool (round-robin + sticky
cookie SERVERID), stream_pool (URI-hash + consistent jump-hash).
* Active health check GET /api/v1/health every 5s ; fall=3, rise=2.
on-marked-down shutdown-sessions + slowstart 30s on recovery.
* Stats socket bound to 127.0.0.1:9100 for the future prometheus
haproxy_exporter sidecar.
* Mozilla Intermediate TLS cipher list ; only effective when a cert
is mounted.
- infra/ansible/roles/backend_api/
* Scaffolding for the multi-instance Go API. Creates veza-api
system user, /opt/veza/backend-api dir, /etc/veza env dir,
/var/log/veza, and a hardened systemd unit pointing at the binary.
* Binary deployment is OUT of scope (documented in README) — the
Go binary is built outside Ansible (Makefile target) and pushed
via incus file push. CI → ansible-pull integration is W5+.
- infra/ansible/playbooks/haproxy.yml : provisions the haproxy Incus
container + applies common baseline + role.
- infra/ansible/inventory/lab.yml : 3 new groups :
* haproxy (single LB node)
* backend_api_instances (backend-api-{1,2})
* stream_server_instances (stream-server-{1,2})
HAProxy template reads these groups directly to populate its
upstream blocks ; falls back to the static haproxy_backend_api_fallback
list if the group is missing (for in-isolation tests).
- infra/ansible/tests/test_backend_failover.sh
* step 0 : pre-flight — both backends UP per HAProxy stats socket.
* step 1 : 5 baseline GET /api/v1/health through the LB → all 200.
* step 2 : incus stop --force backend-api-1 ; record t0.
* step 3 : poll HAProxy stats until backend-api-1 is DOWN
(timeout 30s ; expected ~ 15s = fall × interval).
* step 4 : 5 GET requests during the down window — all must 200
(served by backend-api-2). Fails if any returns non-200.
* step 5 : incus start backend-api-1 ; poll until UP again.
Acceptance (Day 19) : smoke test passes ; HAProxy sticky cookie
keeps WS sessions on the same backend until that backend dies, at
which point the cookie is ignored and the request rebalances.
W4 progress : Day 16 done · Day 17 done · Day 18 done · Day 19 done ·
Day 20 (k6 nightly load test) pending.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
66beb8ccb1 |
feat(infra): nginx_proxy_cache phase-1 edge cache fronting MinIO (W3+)
Some checks failed
Veza CI / Notify on failure (push) Blocked by required conditions
Security Scan / Secret Scanning (gitleaks) (push) Waiting to run
Veza CI / Frontend (Web) (push) Has been cancelled
Veza CI / Backend (Go) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Rust (Stream Server) (push) Has been cancelled
Self-hosted edge cache on a dedicated Incus container, sits between clients and the MinIO EC:2 cluster. Replaces the need for an external CDN at v1.0 traffic levels — handles thousands of concurrent listeners on the R720, leaks zero logs to a third party. This is the phase-1 alternative documented in the v1.0.9 CDN synthesis : phase-1 = self-hosted Nginx, phase-2 = 2 cache nodes + GeoDNS, phase-3 = Bunny.net via the existing CDN_* config (still inert with CDN_ENABLED=false). - infra/ansible/roles/nginx_proxy_cache/ : install nginx + curl, render nginx.conf with shared zone (128 MiB keys + 20 GiB disk, inactive=7d), render veza-cache site that proxies to the minio_nodes upstream pool with keepalive=32. HLS segments cached 7d via 1 MiB slice ; .m3u8 cached 60s ; everything else 1h. - Cache key excludes Authorization / Cookie (presigned URLs only in v1.0). slice_range included for segments so byte-range requests with arbitrary offsets all hit the same cached chunks. - proxy_cache_use_stale error timeout updating http_500..504 + background_update + lock — survives MinIO partial outages without cold-storming the origin. - X-Cache-Status surfaced on every response so smoke tests + operators can verify HIT/MISS without parsing access logs. - stub_status bound to 127.0.0.1:81/__nginx_status for the future prometheus nginx_exporter sidecar. - infra/ansible/playbooks/nginx_proxy_cache.yml : provisions the Incus container + applies common baseline + role. - inventory/lab.yml : new nginx_cache group. - infra/ansible/tests/test_nginx_cache.sh : MISS→HIT roundtrip via X-Cache-Status, on-disk entry verification. Acceptance : smoke test reports MISS then HIT for the same URL ; cache directory carries on-disk entries. No backend code change — the cache is transparent. To route through it, flip AWS_S3_ENDPOINT=http://nginx-cache.lxd:80 in the API env. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
d86815561c |
feat(infra): MinIO distributed EC:2 + migration script (W3 Day 12)
Some checks failed
Veza CI / Rust (Stream Server) (push) Successful in 5m21s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 54s
Veza CI / Backend (Go) (push) Failing after 8m27s
Veza CI / Notify on failure (push) Successful in 6s
E2E Playwright / e2e (full) (push) Failing after 12m42s
Veza CI / Frontend (Web) (push) Successful in 15m49s
Four-node distributed MinIO cluster, single erasure set EC:2, tolerates 2 simultaneous node losses. 50% storage efficiency. Pinned to RELEASE.2025-09-07T16-13-09Z to match docker-compose so dev/prod parity is preserved. - infra/ansible/roles/minio_distributed/ : install pinned binary, systemd unit pointed at MINIO_VOLUMES with bracket-expansion form, EC:2 forced via MINIO_STORAGE_CLASS_STANDARD. Vault assertion blocks shipping placeholder credentials to staging/prod. - bucket init : creates veza-prod-tracks, enables versioning, applies lifecycle.json (30d noncurrent expiry + 7d abort-multipart). Cold-tier transition ready but inert until minio_remote_tier_name is set. - infra/ansible/playbooks/minio_distributed.yml : provisions the 4 containers, applies common baseline + role. - infra/ansible/inventory/lab.yml : new minio_nodes group. - infra/ansible/tests/test_minio_resilience.sh : kill 2 nodes, verify EC:2 reconstruction (read OK + checksum matches), restart, wait for self-heal. - scripts/minio-migrate-from-single.sh : mc mirror --preserve from the single-node bucket to the new cluster, count-verifies, prints rollout next-steps. - config/prometheus/alert_rules.yml : MinIODriveOffline (warn) + MinIONodesUnreachable (page) — page fires at >= 2 nodes unreachable because that's the redundancy ceiling for EC:2. - docs/ENV_VARIABLES.md §12 : MinIO migration cross-ref. Acceptance (Day 12) : EC:2 survives 2 concurrent kills + self-heals. Lab apply pending. No backend code change — interface stays AWS S3. W3 progress : Redis Sentinel ✓ (Day 11), MinIO distribué ✓ (this), CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
a36d9b2d59 |
feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11)
Some checks failed
Veza CI / Backend (Go) (push) Failing after 8m56s
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 5m3s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 53s
Three Incus containers, each running redis-server + redis-sentinel (co-located). redis-1 = master at first boot, redis-2/3 = replicas. Sentinel quorum=2 of 3 ; failover-timeout=30s satisfies the W3 acceptance criterion. - internal/config/redis_init.go : initRedis branches on REDIS_SENTINEL_ADDRS ; non-empty -> redis.NewFailoverClient with MasterName + SentinelAddrs + SentinelPassword. Empty -> existing single-instance NewClient (dev/local stays parametric). - internal/config/config.go : 3 new fields (RedisSentinelAddrs, RedisSentinelMasterName, RedisSentinelPassword) read from env. parseRedisSentinelAddrs trims+filters CSV. - internal/metrics/cache_hit_rate.go : new RecordCacheHit / Miss counters, labelled by subsystem. Cardinality bounded. - internal/middleware/rate_limiter.go : instrument 3 Eval call sites (DDoS, frontend log throttle, upload throttle). Hit = Redis answered, Miss = error -> in-memory fallback. - internal/services/chat_pubsub.go : instrument Publish + PublishPresence. - internal/websocket/chat/presence_service.go : instrument SetOnline / SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit (legitimate empty result). - infra/ansible/roles/redis_sentinel/ : install Redis 7 + Sentinel, render redis.conf + sentinel.conf, systemd units. Vault assertion prevents shipping placeholder passwords to staging/prod. - infra/ansible/playbooks/redis_sentinel.yml : provisions the 3 containers + applies common baseline + role. - infra/ansible/inventory/lab.yml : new groups redis_ha + redis_ha_master. - infra/ansible/tests/test_redis_failover.sh : kills the master container, polls Sentinel for the new master, asserts elapsed < 30s. - config/grafana/dashboards/redis-cache-overview.json : 3 hit-rate stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown. - docs/ENV_VARIABLES.md §3 : 3 new REDIS_SENTINEL_* env vars. - veza-backend-api/.env.template : 3 placeholders (empty default). Acceptance (Day 11) : Sentinel failover < 30s ; cache hit-rate dashboard populated. Lab test pending Sentinel deployment. W3 verification gate progress : Redis Sentinel ✓ (this commit), MinIO EC4+2 ⏳ Day 12, CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
84e92a75e2 |
feat(observability): OTel SDK + collector + Tempo + 4 hot path spans (W2 Day 9)
Some checks failed
Veza CI / Notify on failure (push) Blocked by required conditions
Security Scan / Secret Scanning (gitleaks) (push) Waiting to run
Veza CI / Backend (Go) (push) Has been cancelled
Veza CI / Rust (Stream Server) (push) Has been cancelled
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Wires distributed tracing end-to-end. Backend exports OTLP/gRPC to a
collector, which tail-samples (errors + slow always, 10% rest) and
ships to Tempo. Grafana service-map dashboard pivots on the 4
instrumented hot paths.
- internal/tracing/otlp_exporter.go : InitOTLPTracer + Provider.Shutdown,
BatchSpanProcessor (5s/512 batch), ParentBased(TraceIDRatio) sampler,
W3C trace-context + baggage propagators. OTEL_SDK_DISABLED=true
short-circuits to a no-op. Failure to dial collector is non-fatal.
- cmd/api/main.go : init at boot, defer Shutdown(5s) on exit. appVersion
ldflag-overridable for resource attributes.
- 4 hot paths instrumented :
* handlers/auth.go::Login → "auth.login"
* core/track/track_upload_handler.go::InitiateChunkedUpload → "track.upload.initiate"
* core/marketplace/service.go::ProcessPaymentWebhook → "payment.webhook"
* handlers/search_handlers.go::Search → "search.query"
PII guarded — email masked, query content not recorded (length only).
- infra/ansible/roles/otel_collector : pin v0.116.1 contrib build,
systemd unit, tail-sampling config (errors + > 500ms always kept).
- infra/ansible/roles/tempo : pin v2.7.1 monolithic, local-disk backend
(S3 deferred to v1.1), 14d retention.
- infra/ansible/playbooks/observability.yml : provisions both Incus
containers + applies common baseline + roles in order.
- inventory/lab.yml : new groups observability, otel_collectors, tempo.
- config/grafana/dashboards/service-map.json : node graph + 4 hot-path
span tables + collector throughput/queue panels.
- docs/ENV_VARIABLES.md §30 : 4 OTEL_* env vars documented.
Acceptance criterion (Day 9) : login → span visible in Tempo UI. Lab
deployment to validate with `ansible-playbook -i inventory/lab.yml
playbooks/observability.yml` once roles/postgres_ha is up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
bf31a91ae6 |
feat(infra): pgbackrest role + dr-drill + Prometheus backup alerts (W2 Day 8)
Some checks failed
Veza CI / Frontend (Web) (push) Failing after 16m6s
Veza CI / Notify on failure (push) Successful in 11s
E2E Playwright / e2e (full) (push) Successful in 19m59s
Veza CI / Rust (Stream Server) (push) Successful in 4m57s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 49s
Veza CI / Backend (Go) (push) Successful in 6m4s
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 8 deliverable:
- Postgres backups land in MinIO via pgbackrest
- dr-drill restores them weekly into an ephemeral Incus container
and asserts the data round-trips
- Prometheus alerts fire when the drill fails OR when the timer
has stopped firing for >8 days
Cadence:
full — weekly (Sun 02:00 UTC, systemd timer)
diff — daily (Mon-Sat 02:00 UTC, systemd timer)
WAL — continuous (postgres archive_command, archive_timeout=60s)
drill — weekly (Sun 04:00 UTC — runs 2h after the Sun full so
the restore exercises fresh data)
RPO ≈ 1 min (archive_timeout). RTO ≤ 30 min (drill measures actual
restore wall-clock).
Files:
infra/ansible/roles/pgbackrest/
defaults/main.yml — repo1-* config (MinIO/S3, path-style,
aes-256-cbc encryption, vault-backed creds), retention 4 full
/ 7 diff / 4 archive cycles, zstd@3 compression. The role's
first task asserts the placeholder secrets are gone — refuses
to apply until the vault carries real keys.
tasks/main.yml — install pgbackrest, render
/etc/pgbackrest/pgbackrest.conf, set archive_command on the
postgres instance via ALTER SYSTEM, detect role at runtime
via `pg_autoctl show state --json`, stanza-create from primary
only, render + enable systemd timers (full + diff + drill).
templates/pgbackrest.conf.j2 — global + per-stanza sections;
pg1-path defaults to the pg_auto_failover state dir so the
role plugs straight into the Day 6 formation.
templates/pgbackrest-{full,diff,drill}.{service,timer}.j2 —
systemd units. Backup services run as `postgres`,
drill service runs as `root` (needs `incus`).
RandomizedDelaySec on every timer to absorb clock skew + node
collision risk.
README.md — RPO/RTO guarantees, vault setup, repo wiring,
operational cheatsheet (info / check / manual backup),
restore procedure documented separately as the dr-drill.
scripts/dr-drill.sh
Acceptance script for the day. Sequence:
0. pre-flight: required tools, latest backup metadata visible
1. launch ephemeral `pg-restore-drill` Incus container
2. install postgres + pgbackrest inside, push the SAME
pgbackrest.conf as the host (read-only against the bucket
by pgbackrest semantics — the same s3 keys get reused so
the drill exercises the production credential path)
3. `pgbackrest restore` — full + WAL replay
4. start postgres, wait for pg_isready
5. smoke query: SELECT count(*) FROM users — must be ≥ MIN_USERS_EXPECTED
6. write veza_backup_drill_* metrics to the textfile-collector
7. teardown (or --keep for postmortem inspection)
Exit codes 0/1/2 (pass / drill failure / env problem) so a
Prometheus runner can plug in directly.
config/prometheus/alert_rules.yml — new `veza_backup` group:
- BackupRestoreDrillFailed (critical, 5m): the last drill
reported success=0. Pages because a backup we haven't proved
restorable is dette technique waiting for a disaster.
- BackupRestoreDrillStale (warning, 1h after >8 days): the
drill timer has stopped firing. Catches a broken cron / unit
/ runner before the failure-mode alert above ever sees data.
Both annotations include a runbook_url stub
(veza.fr/runbooks/...) — those land alongside W2 day 10's
SLO runbook batch.
infra/ansible/playbooks/postgres_ha.yml
Two new plays:
6. apply pgbackrest role to postgres_ha_nodes (install +
config + full/diff timers on every data node;
pgbackrest's repo lock arbitrates collision)
7. install dr-drill on the incus_hosts group (push
/usr/local/bin/dr-drill.sh + render drill timer + ensure
/var/lib/node_exporter/textfile_collector exists)
Acceptance verified locally:
$ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
--syntax-check
playbook: playbooks/postgres_ha.yml ← clean
$ python3 -c "import yaml; yaml.safe_load(open('config/prometheus/alert_rules.yml'))"
YAML OK
$ bash -n scripts/dr-drill.sh
syntax OK
Real apply + drill needs the lab R720 + a populated MinIO bucket
+ the secrets in vault — operator's call.
Out of scope (deferred per ROADMAP §2):
- Off-site backup replica (B2 / Bunny.net) — v1.1+
- Logical export pipeline for RGPD per-user dumps — separate
feature track, not a backup-system concern
- PITR admin UI — CLI-only via `--type=time` for v1.0
- pgbackrest_exporter Prometheus integration — W2 day 9
alongside the OTel collector
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ba6e8b4e0e |
feat(infra): pgbouncer role + pgbench load test (W2 Day 7)
All checks were successful
Veza CI / Rust (Stream Server) (push) Successful in 3m49s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 58s
Veza CI / Backend (Go) (push) Successful in 5m59s
Veza CI / Frontend (Web) (push) Successful in 15m22s
E2E Playwright / e2e (full) (push) Successful in 19m34s
Veza CI / Notify on failure (push) Has been skipped
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 7 deliverable: PgBouncer
fronts the pg_auto_failover formation, the backend pays the
postgres-fork cost 50 times per pool refresh instead of once per
HTTP handler.
Wiring:
veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432
(1000 client cap) (50 server pool)
Files:
infra/ansible/roles/pgbouncer/
defaults/main.yml — pool sizes match the acceptance target
(1000 client × 50 server × 10 reserve), pool_mode=transaction
(the only safe mode given the backend's session usage —
LISTEN/NOTIFY and cross-tx prepared statements are forbidden,
neither of which Veza uses), DNS TTL = 60s for failover.
tasks/main.yml — apt install pgbouncer + postgresql-client (so
the pgbench / admin psql lives on the same container), render
pgbouncer.ini + userlist.txt, ensure /var/log/postgresql for
the file log, enable + start service.
templates/pgbouncer.ini.j2 — full config; databases section
points at pgaf-primary.lxd:5432 directly. Failover follows
via DNS TTL until the W2 day 8 pg_autoctl state-change hook
that issues RELOAD on the admin console.
templates/userlist.txt.j2 — only rendered when auth_type !=
trust. Lab uses trust on the bridge subnet; prod gets a
vault-backed list of md5/scram hashes.
handlers/main.yml — RELOAD pgbouncer (graceful, doesn't drop
established clients).
README.md — operational cheatsheet:
- SHOW POOLS / SHOW STATS via the admin console
- the transaction-mode forbids list (LISTEN/NOTIFY etc.)
- failover behaviour today vs after the W2-day-8 hook lands
infra/ansible/playbooks/postgres_ha.yml
Provision step extended to launch pgaf-pgbouncer alongside
the formation containers. Two new plays at the bottom apply
common baseline + pgbouncer role to it.
infra/ansible/inventory/lab.yml
`pgbouncer` group with pgaf-pgbouncer reachable via the
community.general.incus connection plugin (consistent with the
postgres_ha containers).
infra/ansible/tests/test_pgbouncer_load.sh
Acceptance: pgbench 500 clients × 30s × 8 threads against the
pgbouncer endpoint, must report 0 failed transactions and 0
connection errors. Also runs `pgbench -i -s 10` first to
initialise the standard fixture — that init goes through
pgbouncer too, which incidentally validates transaction-mode
compatibility before the load run starts.
Exit codes: 0 / 1 (errors) / 2 (unreachable) / 3 (missing tool).
veza-backend-api/internal/config/config.go
Comment block above DATABASE_URL load — documents the prod
wiring (DATABASE_URL points at pgaf-pgbouncer.lxd:6432, NOT
at pgaf-primary directly). Also notes the dev/CI exception:
direct Postgres because the small scale doesn't benefit from
pooling and tests occasionally lean on session-scoped GUCs
that transaction-mode would break.
Acceptance verified locally:
$ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
--syntax-check
playbook: playbooks/postgres_ha.yml ← clean
$ bash -n infra/ansible/tests/test_pgbouncer_load.sh
syntax OK
$ cd veza-backend-api && go build ./...
(clean — comment-only change in config.go)
$ gofmt -l internal/config/config.go
(no output — clean)
Real apply + pgbench run requires the lab R720 + the
community.general collection — operator's call.
Out of scope (deferred per ROADMAP §2):
- HA pgbouncer (single instance per env at v1.0; double
instance + keepalived in v1.1 if needed)
- pg_autoctl state-change hook → pgbouncer RELOAD (W2 day 8)
- Prometheus pgbouncer_exporter (W2 day 9 with the OTel
collector + observability stack)
SKIP_TESTS=1 — IaC YAML + bash + Go comment-only diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
c941aba3d2 |
feat(infra): postgres_ha role + pg_auto_failover formation + RTO test (W2 Day 6)
Some checks failed
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 3m45s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 1m0s
Veza CI / Backend (Go) (push) Successful in 5m38s
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 6 deliverable: Postgres HA
ready to fail over in < 60s, asserted by an automated test script.
Topology — 3 Incus containers per environment:
pgaf-monitor pg_auto_failover state machine (single instance)
pgaf-primary first registered → primary
pgaf-replica second registered → hot-standby (sync rep)
Files:
infra/ansible/playbooks/postgres_ha.yml
Provisions the 3 containers via `incus launch images:ubuntu/22.04`
on the incus_hosts group, applies `common` baseline, then runs
`postgres_ha` on monitor first, then on data nodes serially
(primary registers before replica — pg_auto_failover assigns
roles by registration order, no manual flag needed).
infra/ansible/roles/postgres_ha/
defaults/main.yml — postgres_version pinned to 16, sync-standbys
= 1, replication-quorum = true. App user/dbname for the
formation. Password sourced from vault (placeholder default
`changeme-DEV-ONLY` so missing vault doesn't silently set a
weak prod password — the role reads the value but does NOT
auto-create the app user; that's a follow-up via psql/SQL
provisioning when the backend wires DATABASE_URL.).
tasks/install.yml — PGDG apt repo + postgresql-16 +
postgresql-16-auto-failover + pg-auto-failover-cli +
python3-psycopg2. Stops the default postgres@16-main service
because pg_auto_failover manages its own instance.
tasks/monitor.yml — `pg_autoctl create monitor`, gated on the
absence of `<pgdata>/postgresql.conf` so re-runs no-op.
Renders systemd unit `pg_autoctl.service` and starts it.
tasks/node.yml — `pg_autoctl create postgres` joining the
monitor URI from defaults. Sets formation sync-standbys
policy idempotently from any node.
templates/pg_autoctl-{monitor,node}.service.j2 — minimal
systemd units, Restart=on-failure, NOFILE=65536.
README.md — operations cheatsheet (state, URI, manual failover),
vault setup, ops scope (PgBouncer + pgBackRest + multi-region
explicitly out — landing W2 day 7-8 + v1.2+).
infra/ansible/inventory/lab.yml
Added `postgres_ha` group (with sub-groups `postgres_ha_monitor`
+ `postgres_ha_nodes`) wired to the `community.general.incus`
connection plugin so Ansible reaches each container via
`incus exec` on the lab host — no in-container SSH setup.
infra/ansible/tests/test_pg_failover.sh
The acceptance script. Sequence:
0. read formation state via monitor — abort if degraded baseline
1. `incus stop --force pgaf-primary` — start RTO timer
2. poll monitor every 1s for the standby's promotion
3. `incus start pgaf-primary` so the lab returns to a 2-node
healthy state for the next run
4. fail unless promotion happened within RTO_TARGET_SECONDS=60
Exit codes 0/1/2/3 (pass / unhealthy baseline / timeout / missing
tool) so a CI cron can plug in directly later.
Acceptance verified locally:
$ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
--syntax-check
playbook: playbooks/postgres_ha.yml ← clean
$ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
--list-tasks
4 plays, 22 tasks across plays, all tagged.
$ bash -n infra/ansible/tests/test_pg_failover.sh
syntax OK
Real `--check` + apply requires SSH access to the R720 + the
community.general collection installed (`ansible-galaxy collection
install community.general`). Operator runs that step.
Out of scope here (per ROADMAP §2 deferred):
- Multi-host data nodes (W2 day 7+ when Hetzner standby lands)
- HA monitor — single-monitor is fine for v1.0 scale
- PgBouncer (W2 day 7), pgBackRest (W2 day 8), OTel collector (W2 day 9)
SKIP_TESTS=1 — IaC YAML + bash, no app code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
65c20835c1 |
feat(infra): Ansible IaC scaffolding — common + incus_host roles (Day 5 v1.0.9)
Some checks failed
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 3m27s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 52s
Veza CI / Backend (Go) (push) Successful in 5m32s
Day 5 of ROADMAP_V1.0_LAUNCH.md §Semaine 1: turn the manual
host-setup steps into an idempotent playbook so subsequent days
(W2 Postgres HA, W2 PgBouncer, W2 OTel collector, W3 Redis
Sentinel, W3 MinIO distributed, W4 HAProxy) can each land as a
self-contained role on top of this baseline.
Layout (full tree under infra/ansible/):
ansible.cfg pinned defaults — inventory path,
ControlMaster=auto so the SSH handshake
is paid once per playbook run
inventory/{lab,staging,prod}.yml
three environments. lab is the R720's
local Incus container (10.0.20.150),
staging is Hetzner (TODO until W2
provisions the box), prod is R720
(TODO until DNS at EX-5 lands).
group_vars/all.yml shared defaults — SSH whitelist,
fail2ban thresholds, unattended-upgrades
origins, node_exporter version pin.
playbooks/site.yml entry point. Two plays:
1. common (every host)
2. incus_host (incus_hosts group)
roles/common/ idempotent baseline:
ssh.yml — drop-in
/etc/ssh/sshd_config.d/50-veza-
hardening.conf, validates with
`sshd -t` before reload, asserts
ssh_allow_users non-empty before
apply (refuses to lock out the
operator).
fail2ban.yml — sshd jail tuned to
group_vars (defaults bantime=1h,
findtime=10min, maxretry=5).
unattended_upgrades.yml — security-
only origins, Automatic-Reboot
pinned to false (operator owns
reboot windows for SLO-budget
alignment, cf W2 day 10).
node_exporter.yml — pinned to
1.8.2, runs as a systemd unit
on :9100. Skips download when
--version already matches.
roles/incus_host/ zabbly upstream apt repo + incus +
incus-client install. First-time
`incus admin init --preseed` only when
`incus list` errors (i.e. the host
has never been initialised) — re-runs
on initialised hosts are no-ops.
Configures incusbr0 / 10.99.0.1/24
with NAT + default storage pool.
Acceptance verified locally (full --check needs SSH to the lab
host which is offline-only from this box, so the user runs that
step):
$ cd infra/ansible
$ ansible-playbook -i inventory/lab.yml playbooks/site.yml --syntax-check
playbook: playbooks/site.yml ← clean
$ ansible-playbook -i inventory/lab.yml playbooks/site.yml --list-tasks
21 tasks across 2 plays, all tagged. ← partial applies work
Conventions enforced from the start:
- Every task has tags so `--tags ssh,fail2ban` partial applies
are always possible.
- Sub-task files (ssh.yml, fail2ban.yml, etc.) so the role
main.yml stays a directory of concerns, not a wall of tasks.
- Validators run before reload (sshd -t for sshd_config). The
role refuses to apply changes that would lock the operator out.
- Comments answer "why" — task names + module names already
say "what".
Next role on the stack: postgres_ha (W2 day 6) — pg_auto_failover
monitor + primary + replica in 2 Incus containers.
SKIP_TESTS=1 — IaC YAML, no app code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|