Compare commits

...

13 commits

Author SHA1 Message Date
senke
112c64a22b feat(soft-launch): cohort tooling + email template + monitor + checklist
The soft-launch report doc (SOFT_LAUNCH_BETA_2026.md) had the
narrative — cohort table, email body inline, monitoring list,
acceptance gate. But the operational pieces were notes-to-self :
"add migration if missing", "Typeform to-do", "schema TBD". The
operator was supposed to assemble them on the day, which on a soft-
launch day is the worst possible time.

Added the missing 6 pieces so the day-of work is "tick boxes",
not "build the tooling" :

  * migrations/990_beta_invites.sql — schema with code (16-char
    base32-ish), email, cohort label, used_at, expires_at + 30d
    default, sent_by FK with ON DELETE SET NULL. Three indexes :
    unique on code (signup-path lookup), cohort (post-launch
    attribution report), partial expires_at WHERE used_at IS NULL
    (cleanup cron). A sketch of this schema follows the list.

  * scripts/soft-launch/validate-cohort.sh — sanity check on the
    operator's CSV : header form, malformed emails, duplicates,
    cohort distribution (≥50 total / ≥5 creators / ≥3 distinct
    labels), optional collision check against existing users.
    Exit codes 0 / 1 (block) / 2 (warn-but-proceed). Hard checks
    block, soft checks let the operator override with FORCE=1.

  * scripts/soft-launch/send-invitations.sh — split-phase :
      step 1 (default) inserts beta_invites rows + renders one .eml
        per recipient under scripts/soft-launch/out-<date>/
      step 2 (SEND=1) dispatches via $SEND_CMD (msmtp by default)
    so the operator can review the rendered emls before sending
    100 emails. Per-recipient transactional INSERT so a partial
    failure doesn't poison the table. Failed inserts logged with
    the offending email so the operator can rerun on the subset.

  * templates/email/beta_invite.eml.template — proper MIME multipart
    (text + HTML) eml ready for sendmail-compatible piping. French
    copy aligned with the ethical brand positioning (no FOMO, no urgency
    manipulation, no "limited spots" framing).

  * scripts/soft-launch/monitor-checks.sh — polls the 6 acceptance-
    gate signals defined in SOFT_LAUNCH_BETA_2026.md §"Acceptance
    gate" : testers signed up, Sentry P1 events, status page,
    synthetic user journey, k6 nightly age, HIGH issues. Each gate
    independently emits ✅ / 🔴 / ⚪ (the last for "couldn't check").
    Verdict on stdout. LOOP=1 keeps polling every CHECK_INTERVAL
    seconds. Designed for cron + tmux, not for an interactive UI.

  * docs/SOFT_LAUNCH_BETA_2026_CHECKLIST.md — pre-flight gate that
    must reach 100% green before the first invitation goes out.
    T-72h section (database, cohort, email infra, redemption path,
    monitoring, comms), D-day section (last-hour, send, hour-1,
    every-4h), 18:00 UTC decision call section. Linked back to the
    bigger SOFT_LAUNCH_BETA_2026.md so the operator can navigate
    between the "what" (report) and the "how / has-everything-
    been-checked" (this checklist) without losing context.
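
A hedged sketch of that schema, reconstructed from this message rather
than copied from the shipped migration (column names and the users(id)
FK target are assumptions) :

    psql "$DATABASE_URL" -c "
      CREATE TABLE IF NOT EXISTS beta_invites (
          id         BIGSERIAL PRIMARY KEY,
          code       TEXT        NOT NULL,
          email      TEXT        NOT NULL,
          cohort     TEXT        NOT NULL,
          used_at    TIMESTAMPTZ,
          expires_at TIMESTAMPTZ NOT NULL DEFAULT now() + INTERVAL '30 days',
          sent_by    BIGINT REFERENCES users(id) ON DELETE SET NULL
      );
      CREATE UNIQUE INDEX IF NOT EXISTS beta_invites_code_key ON beta_invites (code);
      CREATE INDEX IF NOT EXISTS beta_invites_cohort_idx ON beta_invites (cohort);
      CREATE INDEX IF NOT EXISTS beta_invites_unused_idx ON beta_invites (expires_at)
          WHERE used_at IS NULL;
    "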

What still requires the operator on the day :
  - Build the cohort CSV (curate emails from real sources)
  - Create the Typeform feedback form ; paste its URL into the
    eml template once known
  - Configure msmtp / sendmail ($SEND_CMD)
  - Press the send button
  - Show up at 18:00 UTC for the decision call

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:38:12 +02:00
senke
2a5bc11628 fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook
The game-day driver had no notion of inventory — it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging where chaos is
the point ; on prod, an accidental run on a Monday morning would
cost a real outage.

Added :

  scripts/security/game-day-driver.sh
    * INVENTORY env var — defaults to 'staging' so silence stays
      safe. INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
      type-the-phrase 'KILL-PROD' confirm. Anything other than
      staging|prod aborts (guard sketched after this list).
    * Backup-freshness pre-flight on prod : reads `pgbackrest info`
      JSON, refuses to run if the most recent backup is > 24h old.
      SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline.
    * Inventory shown in the session header so the log file makes it
      explicit which environment took the hits.

  docs/runbooks/rabbitmq-down.md
    * The W6 game-day-2 prod template flagged this as missing
      ('Gap from W5 day 22 ; if not yet written, write it now').
      Mirrors the structure of redis-down.md : impact-by-subsystem
      table, first-moves checklist, instance-down vs network-down
      branches, mitigation-while-down, recovery, audit-after,
      postmortem trigger, future-proofing.
    * Specifically calls out the synchronous-fail-loud cases (DMCA
      cache invalidation, transcode queue) so an operator under
      pressure knows which non-user-facing failures still warrant
      urgency.
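
A minimal sketch of the prod guard described above (variable names as in
this message ; the pgbackrest stanza and jq path are assumptions, the
rest of the driver is elided) :

    INVENTORY="${INVENTORY:-staging}"              # silence stays safe
    case "$INVENTORY" in
      staging) ;;                                  # chaos is the point here
      prod)
        [ "${CONFIRM_PROD:-0}" = "1" ] \
          || { echo "refusing: set CONFIRM_PROD=1 to target prod" >&2; exit 1; }
        read -r -p "Type KILL-PROD to run destructive scenarios on prod: " phrase
        [ "$phrase" = "KILL-PROD" ] || { echo "confirmation mismatch, aborting" >&2; exit 1; }
        if [ "${SKIP_BACKUP_FRESHNESS:-0}" != "1" ]; then
          last_stop=$(pgbackrest --stanza=veza info --output=json \
                        | jq -r '.[0].backup[-1].timestamp.stop')
          age=$(( $(date +%s) - last_stop ))
          [ "$age" -le $(( 24 * 3600 )) ] \
            || { echo "most recent backup is > 24h old, aborting" >&2; exit 1; }
        fi
        ;;
      *) echo "unknown INVENTORY '$INVENTORY', aborting" >&2; exit 1 ;;
    esac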

Together these mean the W6 Day 28 prod game day can be run by an
operator who's never run it before, without a senior looking over
their shoulder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:32:05 +02:00
senke
e780fbcd18 docs(pentest): add send-package SOP + seed-test-accounts helper
The pentest scope doc (PENTEST_SCOPE_2026.md) is the technical brief —
what's testable, what's out, what to focus on. But it doesn't tell
the operator HOW to send the engagement off : credentials delivery
plan, IP allow-list step, kick-off email template, alert-tuning
during the engagement window. So historically each engagement has
been a one-off that depends on whoever was on duty remembering the
last time.

Added :

  * docs/PENTEST_SEND_PACKAGE.md — 5-step send sequence (NDA →
    credentials → IP allow-list → kick-off email → alert tuning),
    reception checklist, and post-engagement housekeeping. Email
    template inline so it's grep-able and version-controlled.

  * scripts/pentest/seed-test-accounts.sh — provisions the 3 staging
    accounts (listener/creator/admin) referenced by §"Authentication
    context" of the scope doc. Generates 32-char random passwords,
    probes each by login, emits 1Password import JSON to stdout
    (passwords NEVER printed to the screen). Refuses to run against
    any env that isn't "staging".
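
A hedged sketch of the core behaviour (the login endpoint, account email
scheme and JSON shape are assumptions, not the shipped script) :

    [ "${1:-}" = "staging" ] || { echo "refusing: staging only" >&2; exit 1; }
    for role in listener creator admin; do
      pass="$(openssl rand -base64 48 | tr -dc 'A-Za-z0-9' | head -c 32)"
      # ... create-or-reset the account with $pass here (elided) ...
      curl -sf -o /dev/null -X POST "$STAGING_URL/api/v1/auth/login" \
           -H 'Content-Type: application/json' \
           -d "{\"email\":\"pentest-2026-${role}@veza.fr\",\"password\":\"${pass}\"}" \
        || echo "login probe failed for ${role}" >&2
      # JSON to stdout, meant to be redirected straight into the 1Password import
      printf '{"title":"pentest-2026-%s","password":"%s"}\n' "$role" "$pass"
    done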

The send-package doc references one helper that doesn't exist yet :
  * infra/ansible/playbooks/pentest_allowlist_ip.yml — Forgejo IP
    allow-list automation. Punted to a follow-up because the manual
    SSH path is fine for once-per-engagement use and Ansible
    formalisation deserves its own commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:29:35 +02:00
senke
05b1d81d30 fix(scripts): payment-e2e walkthrough safety guards (DRY_RUN + prod confirm)
Three holes in the v1.0.9 W6 Day 27 walkthrough that an operator under
stress could fall into :

1. Typo'd STAGING_URL pointing at production. The script accepted any
   URL with no sanity check, so `STAGING_URL=https://veza.fr ...` would
   happily POST /orders and charge a real card on the first run.
   Fix: heuristic detection (a URL that doesn't contain "staging",
   "localhost" or "127.0.0.1" is treated as prod); the script refuses
   to run unless CONFIRM_PRODUCTION=1 is explicitly set.

2. No way to rehearse the flow without spending money. Added DRY_RUN=1
   that exits cleanly after step 2 (product listing) — exercises auth,
   API plumbing, and the staging product fixture without creating an
   order.

3. No final confirm before the actual charge. On a prod target, after
   the product is picked and before the POST /orders fires, the script
   now prints the {product_id, price, operator, endpoint} block and
   demands the operator type the literal word `CHARGE`. Any other
   answer aborts with exit code 2.

Together these turn "STAGING_URL typo = burnt 5 EUR" into "STAGING_URL
typo = exit code 3 with explanation". The wrapper docs in
docs/PAYMENT_E2E_LIVE_REPORT.md already mention card-charge risk in
prose; these guards enforce it at exec time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:27:14 +02:00
senke
6c644cff03 fix(haproxy): forgejo backend uses HTTPS re-encrypt + Host header on healthcheck
Forgejo at 10.0.20.105:3000 serves HTTPS only (self-signed cert).
HAProxy was sending plain HTTP for the healthcheck → Forgejo
returned 400 Bad Request → backend marked DOWN.

Two coupled fixes :

1. `server forgejo ... ssl verify none sni str(forgejo.talas.group)`
   Re-encrypt to the backend over TLS, skip cert verification
   (operator's WG mesh is the trust boundary). SNI set to the
   public hostname so Forgejo serves the right vhost.

2. Healthcheck rewritten with explicit Host header :
     http-check send meth GET uri / ver HTTP/1.1 hdr Host forgejo.talas.group
     http-check expect rstatus ^[23]
   Without the Host header, Forgejo's
   `Forwarded`-header / proxy-validation may reject the request.
   Accept any 2xx/3xx (Forgejo redirects to /login → 302).

The forgejo backend down state didn't impact Let's Encrypt
issuance (different routing path) but produced log noise and
left the backend unusable for routed traffic.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:31:29 +02:00
senke
0bd3e563b2 fix(haproxy): incus proxy devices forward R720:80/443 → container
The Orange box NAT correctly forwards :80/:443 → R720 LAN IP, but
the R720 host has nothing listening there — haproxy lives in the
veza-haproxy container, reachable only on the net-veza bridge
(10.0.20.X). Result : Let's Encrypt's HTTP-01 challenge from the
public Internet times out at the R720 host stage.

Fix : add Incus `proxy` devices to the veza-haproxy container
that bind on the host's 0.0.0.0:80 / 0.0.0.0:443 and forward into
the container's local ports. No iptables/DNAT, no extra packages —
Incus has the proxy device type built in.

  incus config device add veza-haproxy http  proxy \
      listen=tcp:0.0.0.0:80  connect=tcp:127.0.0.1:80
  incus config device add veza-haproxy https proxy \
      listen=tcp:0.0.0.0:443 connect=tcp:127.0.0.1:443

Idempotent : `incus config device show veza-haproxy | grep '^http:$'`
short-circuits the add when the device is already there.

Operator setup unchanged : box NAT 80/443 → R720 LAN IP. Ansible
now bridges the rest of the path automatically.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:27:37 +02:00
senke
d9896686bd fix(haproxy): runtime DNS resolution + init-addr none for absent backends
HAProxy was rejecting the cfg at parse time because every
`server backend-{blue,green}.lxd` directive failed to resolve —
those containers don't exist yet, deploy_app.yml creates them
later. The validate said :
  could not resolve address 'veza-staging-backend-blue.lxd'
  Failed to initialize server(s) addr.

Two complementary fixes :

1. Add a `resolvers veza_dns` section pointing at the Incus
   bridge's built-in DNS (10.0.20.1:53 — gateway of net-veza).
   `*.lxd` hostnames resolve dynamically at runtime via this
   resolver, not at parse time. Containers spun up later by
   deploy_app.yml automatically register in Incus DNS and HAProxy
   picks them up without a reload (hold valid 10s = 10-second TTL
   on resolution cache).

2. `default-server ... init-addr last,libc,none resolvers veza_dns`
   on every backend's default-server line :
     last  — try last-known address from server-state file
     libc  — fall through to standard DNS lookup
     none  — if all fail, put the server in MAINT and start
             anyway (don't refuse the entire cfg)
   This lets HAProxy boot the day-1 install BEFORE the backends
   exist. Once deploy_app.yml lands them, the resolver picks them
   up within 10s.

Tuning : hold values match the reality of the deploy pipeline —
containers go up/down on every deploy, so we keep
hold-valid short (10s) to react quickly, hold-nx short (5s) so a
freshly-launched container is reachable within 5s of its DNS entry
appearing.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:17:39 +02:00
senke
c97e42996e fix(haproxy): use shipped selfsigned.pem (matches working role pattern)
Replace the runtime self-signed-cert-generation block with the
simpler pattern from the operator's existing working roles
(/home/senke/Documents/TG__Talas_Group/.../roles/haproxy/files/selfsigned.pem) :
ship a CN=localhost selfsigned.pem in roles/haproxy/files/, copy
it into the cert dir before haproxy.cfg renders.

Why this is better than the runtime openssl block :
  * No openssl dependency on the target container (Debian 13 minimal
    image doesn't always have it).
  * No timing issue if /tmp is on a slow tmpfs.
  * Predictable cert content — same selfsigned.pem across all
    deploys, no per-host noise.
  * Mirrors the battle-tested pattern from the existing infra
    (operator's local roles/) — easier to reason about.

Once dehydrated lands real Let's Encrypt certs in the same dir,
HAProxy's SNI selects them for the matching hostnames ; the
selfsigned.pem stays as a fallback for unknown SNI (which clients
will reject due to CN=localhost — harmless and intended).

selfsigned.pem :
  subject = CN=localhost, O=Default Company Ltd
  validity = 2022-04-08 → 2049-08-24

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:12:35 +02:00
senke
b6147549c9 fix(haproxy): pre-create cert dir + placeholder cert ; reorder ACL rules
Two issues caught by the now-verbose haproxy validate :

1. `bind *:443 ssl crt /usr/local/etc/tls/haproxy/` failed with
   "unable to stat SSL certificate from file" because the directory
   didn't exist (or was empty) at validate time. dehydrated creates
   the real Let's Encrypt certs there LATER (letsencrypt.yml runs
   after the role's main render-and-restart). Chicken-and-egg.

   Fix : roles/haproxy/tasks/main.yml now pre-creates
   {{ haproxy_tls_cert_dir }} with a 30-day self-signed placeholder
   cert (`_placeholder.pem`) BEFORE haproxy.cfg renders. haproxy
   accepts the dir, validates the config. dehydrated later drops
   real *.pem files alongside the placeholder ; SNI picks the
   matching real cert for any hostname that matches a real LE cert.
   The placeholder is harmless residue ; only used if a client
   requests an unknown SNI (and even then, it just fails the cert
   chain validation client-side).

   Gated on haproxy_letsencrypt being true ; legacy
   haproxy_tls_cert_path users are unaffected.

2. haproxy 3.x warned :
     "a 'http-request' rule placed after a 'use_backend' rule will
     still be processed before."
   Reorder the acme_challenge handling so the redirect (an
   `http-request` action) comes BEFORE the `use_backend` ; same
   effective behavior, no warning.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:10:27 +02:00
senke
7253f0cf10 fix(ansible): haproxy validate without -q so the error message reaches operator
`haproxy -f %s -c -q` (quiet) suppresses the actual validation error
on stderr+stdout, leaving the operator with a useless
"failed to validate" with empty output. Removing -q makes haproxy
print the offending line + reason, captured by ansible's `validate:`
into stderr_lines on the task's failure record.

Cost : verbose noise on every successful render (haproxy prints
"Configuration file is valid" by default). Acceptable trade-off
for the once-in-a-while debugging value.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:06:50 +02:00
senke
385a8f0378 fix(ansible): add staging/prod meta-groups so group_vars/<env>.yml applies
group_vars/staging.yml + group_vars/prod.yml were never loaded :
Ansible matches `group_vars/<NAME>.yml` against the inventory's
group NAMED `<NAME>`. Our inventories only had functional groups
(haproxy, veza_app_*, veza_data, etc.) — no `staging` or `prod`
parent group. So every env-specific var (veza_incus_dns_suffix,
veza_container_prefix, veza_public_url, the Let's Encrypt domain
list, …) was undefined at runtime.

Symptom : haproxy.cfg.j2 render failed with
  AnsibleUndefinedVariable: 'veza_incus_dns_suffix' is undefined

Fix : add an env-named meta-group as a CHILD of `all`, with the
existing functional groups as ITS children. Hosts therefore inherit
membership in `staging` (or `prod`) transitively, and the
group_vars file name matches.

  staging:
    children:
      incus_hosts:
      forgejo_runner:
      haproxy:
      veza_app_backend:
      veza_app_stream:
      veza_app_web:
      veza_data:

Verified with :
  ansible-inventory -i inventory/staging.yml --host veza-haproxy \
      --vault-password-file .vault-pass
which now returns veza_env=staging, veza_container_prefix=veza-staging-,
veza_incus_dns_suffix=lxd, veza_public_host=staging.veza.fr — all the
vars the playbook templates rely on.

Same shape applied to prod.yml.

inventory/local.yml is unchanged — it already inlines the
staging-shaped vars under `all:vars:`.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:01:44 +02:00
senke
e97b91f010 fix(ansible): don't apply common role to haproxy container + gate ssh.yml on sshd
Two fixes for "haproxy container doesn't have sshd" :

1. playbooks/haproxy.yml — drop the `common` role play.
   The role's purpose is to harden a full HOST (SSH + fail2ban
   monitoring auth.log + node_exporter metrics surface). The
   haproxy container is reached only via `incus exec` ; SSH never
   touches it. Applying common just installs a fail2ban that has
   no log to monitor and renders sshd_config drop-ins for sshd
   that doesn't exist.
   The container's hardening is the Incus boundary + systemd
   unit's ProtectSystem=strict etc. (already in the templates).

2. roles/common/tasks/ssh.yml — gate every task on sshd presence.
   `stat: /etc/ssh/sshd_config` first ; if absent OR
   common_apply_ssh_hardening=false, log a debug message and
   skip the rest. Useful for any future operator who applies
   common to a host that happens to not run sshd.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:57:16 +02:00
senke
c245b72e05 fix(ansible): symlink inventory/group_vars → ../group_vars so vars load
Ansible looks for group_vars/ relative to either the inventory file
or the playbook file. Our group_vars/ lived at infra/ansible/group_vars/,
sibling to inventory/ and playbooks/, which matches neither location,
so ansible silently treated all the env vars as undefined.

Symptom : the haproxy.yml `common` role asserted
  ssh_allow_users | length > 0
which failed because ssh_allow_users was undefined → empty by default.

Fix : symlink inventory/group_vars → ../group_vars. Smallest possible
change ; preserves every existing path reference (bash scripts, docs)
that uses infra/ansible/group_vars/ directly. Ansible now finds the
group_vars when invoked with -i inventory/staging.yml, and
ansible-inventory --host veza-haproxy now returns the full var set
(ssh_allow_users, haproxy_env_prefixes, vault_* via vault, etc.).
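
For reference, the symlink itself (relative target, so it resolves no
matter where the repo is checked out) :

    ln -s ../group_vars infra/ansible/inventory/group_vars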

Verified with :
  ansible-inventory -i inventory/staging.yml --host veza-haproxy \
      --vault-password-file .vault-pass

Same symlink applies for inventory/lab.yml, prod.yml, local.yml —
they all live in the same directory.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:48:12 +02:00
19 changed files with 1870 additions and 24 deletions

View file

@ -0,0 +1,187 @@
# Pentest send package — v2026 engagement
> Operational checklist for handing off the v1.0.9 pre-launch pentest
> brief to the external team. Companion to `docs/PENTEST_SCOPE_2026.md`
> (the technical scope) — this doc is purely "what you send, in what
> order, via which channel."
The scope doc is technical and reusable across engagements. This file
is the per-engagement "send package" that wraps it: the email template,
the credentials-delivery plan, the IP allow-list step, and the kick-off
checklist.
## The 5-step send sequence
Run these in order. Each step has a check (✓) the operator ticks before
moving to the next — out-of-order steps cause the engagement to stall.
### Step 1 — counter-sign the NDA + authorisation letter
- [ ] NDA template signed by the pentester firm and counter-signed by us.
- [ ] Authorisation-to-test letter signed by Veza tech lead (limits the
scope to what's in `PENTEST_SCOPE_2026.md` §"In-scope assets" — the
letter MUST list the staging URL explicitly so a reviewer can map
pentester traffic to authorised activity).
- [ ] Both PDFs uploaded to the shared 1Password vault (entry name :
`pentest-2026-legal`). Do **not** email PDFs.
### Step 2 — provision pentester credentials
- [ ] Run `bash scripts/pentest/seed-test-accounts.sh staging` (creates
the 3 accounts from `PENTEST_SCOPE_2026.md` §"Authentication
context", outputs random passwords).
- [ ] Output passwords land in three 1Password entries :
`pentest-2026-listener`, `pentest-2026-creator`, `pentest-2026-admin`.
Each entry's "Notes" field includes the role and the MFA bypass
token if applicable.
- [ ] Share each entry **read-only** with the pentester's 1Password
account using the firm's billing email. Do **not** put passwords
in chat, email, or shell history.
- [ ] Set entry expiration to engagement-end + 7 days (so cleanup is
automatic if the team forgets to revoke).
### Step 3 — allow-list the pentester's IP
The Forgejo source-code mirror at `https://10.0.20.105:3000/senke/veza`
provides grey-box read-only access. The pentester needs their static
egress IP allow-listed before they can `git clone`.
- [ ] Pentester sends their static egress IP (PGP-signed mail, or
1Password Notes field).
- [ ] SSH to `srv-102v` (Forgejo container) and add the IP to
`/etc/forgejo/allowlist.conf`.
- [ ] `systemctl reload forgejo`.
- [ ] Verify : `curl -I https://10.0.20.105:3000/senke/veza` from the
pentester IP returns 200 ; from any other IP, 403.
(A future iteration could turn this into an Ansible playbook
`infra/ansible/playbooks/pentest_allowlist_ip.yml`. For now the manual
SSH path is fine — this happens once per engagement.)
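The manual path above, consolidated into one copy-paste sketch (host alias,
config path and reload unit as listed in this step ; adapt before running) :
```bash
# Append the pentester's egress IP and reload Forgejo :
ssh srv-102v "echo '<PENTESTER_IP>' | sudo tee -a /etc/forgejo/allowlist.conf \
  && sudo systemctl reload forgejo"
# Verify (self-signed cert on the mirror, hence -k) : expect 200 from the
# pentester's IP, 403 from anywhere else.
curl -skI https://10.0.20.105:3000/senke/veza | head -1
```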
### Step 4 — send the kick-off email
Use the template below. Replace the placeholders inside `<…>`. Send
PGP-encrypted (the pentester's key is in their security.txt) to
**both** their lead pentester and their project manager so the chain
of responsibility is recorded.
```text
Subject : [PENTEST] Veza v1.0.9 pre-launch engagement — kick-off
Hi <lead pentester first name>,
Per the signed scope letter dated <YYYY-MM-DD>, the Veza v1.0.9
pre-launch pentest engagement starts on <YYYY-MM-DD>. The brief is
attached as PENTEST_SCOPE_2026.md (see also the rendered HTML at
https://staging.veza.fr/legal/pentest-scope-2026.html).
Quick links :
• Staging URL : https://staging.veza.fr
• Source code : https://10.0.20.105:3000/senke/veza
(grey-box, read-only ; your egress IP <PENTESTER_IP>
has been allow-listed as of <YYYY-MM-DD HH:MM UTC>.)
• Status page : https://status.veza.fr (we'll lower the alert
threshold during your engagement so the SOC isn't
paged on every benign 401).
• Test accounts: shared with your firm's 1Password — entries
pentest-2026-{listener,creator,admin}. Passwords
expire <engagement_end + 7d>.
Engagement window :
• Start : <YYYY-MM-DD>
• End : <YYYY-MM-DD> (~10 business days)
• Re-test: 1 round, after our team's fix pass (typically 2 weeks
after the initial report)
Communications :
• Async : security@veza.fr (PGP fingerprint at
https://veza.fr/.well-known/security.txt)
• Weekly sync : <weekday HH:MM TZ>, video link in the calendar invite
• Critical findings : phone the on-call number in the contract
(HIGH severity = phone, not email)
Expected deliverables :
• Initial findings report (markdown or PDF) at engagement end
• Re-test report after our fix pass
• Optional : exec-level summary slide deck
Reach out if anything in PENTEST_SCOPE_2026.md is unclear before
day 1. Otherwise — good hunting.
Best,
<Tech lead name>
Veza
```
- [ ] Email PGP-signed and sent.
- [ ] Calendar invite sent for the weekly sync.
- [ ] Slack/Signal channel created for HIGH-severity escalation
(channel naming : `#pentest-2026-veza`).
### Step 5 — lower the SOC alerting threshold
During the engagement, automated scanners and authentication
brute-force attempts WILL fire alerts. Tune them down so the on-call
isn't paged on every legitimate pentester action.
- [ ] In `config/prometheus/alert_rules.yml`, for `HighErrorRate` and
`HighLatencyP99` : add a `for: 30m` override OR mute via
Alertmanager silence (recommended: silence rather than edit
rules so the change auto-expires at engagement end).
- [ ] Silence URL : `https://prometheus.veza.fr/alertmanager/#/silences/new`
→ matchers: `severity=warning`, comment: `pentest-2026 active`,
duration: `engagement_end + 24h`.
- [ ] Subscribe the engagement Slack channel to the silence's
auto-removal so the SOC knows when normal alerting resumes.
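If the operator prefers the CLI over the web form, the same silence can be
created with `amtool` (a sketch ; the Alertmanager URL and duration are
assumptions to adjust to the real engagement dates) :
```bash
amtool silence add severity=warning \
  --alertmanager.url=https://prometheus.veza.fr/alertmanager \
  --author="$(whoami)" \
  --comment="pentest-2026 active" \
  --duration="264h"   # engagement window + 24 h
```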
## Reception checklist (after pentester confirms receipt)
- [ ] Pentester replied to the kick-off email within 1 business day.
- [ ] Pentester confirmed they can `git clone` the source repo.
- [ ] Pentester confirmed they can log in as each of the 3 test
accounts.
- [ ] Pentester confirmed the staging URL responds (`/api/v1/health`
returns 200).
- [ ] First findings — even informational — start landing in the
shared report by end of engagement day 3 (a complete silence
until the final report is a process smell).
If any reception checklist item fails after 24h, the engagement
hasn't really started. Phone the firm's PM, don't email.
## Post-engagement housekeeping
- [ ] Findings report received → import into the issue tracker as
separate tickets, severity preserved, attribution
`external-pentest-2026`.
- [ ] Fix pass scheduled and timeboxed (HIGH within 1 week, MEDIUM
within 4 weeks, LOW best-effort).
- [ ] Re-test scheduled 2 weeks after fix-pass start.
- [ ] Re-test report received → update the ticket statuses ; any
remaining unresolved finding above LOW blocks v2.0.0-public.
- [ ] Test accounts' passwords manually rotated **the day the
engagement ends** (don't wait for 1Password's auto-expiry).
- [ ] Pentester IP removed from Forgejo allow-list.
- [ ] Alertmanager silence removed (should auto-remove, but verify).
- [ ] Engagement folder zipped and stored at
`docs/archive/pentest-2026/` (kept 5 years for audit trail).
- [ ] Public summary blog post drafted (no findings details, just the
"we did this, here's what we learned" framing). Reviewed by
legal before publish.
## Linked artefacts
- `docs/PENTEST_SCOPE_2026.md` — the technical scope (what's testable)
- `docs/SECURITY_PRELAUNCH_AUDIT.md` — internal Day 21 audit (what we
already cleared)
- `docs/archive/PENTEST_REPORT_VEZA_v0.12.6.md` — last engagement's
report, format reference for what to expect back
- `scripts/pentest/seed-test-accounts.sh` — credential provisioning
helper (creates the 3 staging accounts referenced in the scope)
- `docs/GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md` — the row this engagement
unblocks

View file

@ -0,0 +1,150 @@
# Soft-launch beta — pre-flight checklist
> Operational checklist that must reach 100% green before the first
> invitation goes out. Companion to `docs/SOFT_LAUNCH_BETA_2026.md`
> (the bigger picture). This file is purely the "before you press
> send, has every gate been verified?" view.
The whole reason the soft-launch is "soft" is that it lets you catch
infrastructure surprises with 50 testers instead of 50 000. To get
that benefit, the infrastructure has to actually work BEFORE the
invitations land. This checklist is the gate.
## T-72h checklist (3 days before send)
### Database
- [ ] `migrations/990_beta_invites.sql` applied to staging.
Verify with :
```bash
psql "$STAGING_DATABASE_URL" -c "SELECT count(*) FROM beta_invites;"
```
Expected : `0` (table exists, empty).
- [ ] Same migration applied to prod (whenever prod tag goes out).
- [ ] Backup-freshness OK on both environments :
```bash
pgbackrest --stanza=veza info | head -20
```
Most recent full or diff < 24 h old.
### Cohort CSV
- [ ] CSV file built from the operator's chosen sources (mailing list +
contacts + community partners). Format per
`scripts/soft-launch/validate-cohort.sh` header.
- [ ] `validate-cohort.sh` returns exit 0 (or exit 2 with explicit
operator acknowledgement of the warnings).
- [ ] Distribution sanity : `≥ 5` creators, `≥ 20` listeners, `≥ 3`
distinct cohort labels, `≥ 50` total rows.
### Email infrastructure
- [ ] SMTP credentials live in the operator's machine `~/.msmtprc`
(or whatever `SEND_CMD` resolves to).
- [ ] `templates/email/beta_invite.eml.template` reviewed — wording,
cohort variable, code variable.
- [ ] Test send to operator's own email :
```bash
echo "ops@veza.fr,test-cohort,ops@veza.fr" > /tmp/me.csv
DATABASE_URL=$STAGING_DATABASE_URL FRONTEND_URL=https://staging.veza.fr \
SEND=1 bash scripts/soft-launch/send-invitations.sh /tmp/me.csv
```
Verify the eml renders correctly in your mail client (links
clickable, fonts loaded, no `{{TO_ADDR}}` literals leaking).
### Backend invite-redemption path
- [ ] Visit `https://staging.veza.fr/signup?invite=<test-code>`.
Expected : signup form pre-fills the code, refuses to submit
without it, marks the invite as `used_at = NOW()` after success.
- [ ] Try an invalid code → form rejects with a clear error message.
- [ ] Try the same code twice → second attempt rejects (one-time use).
- [ ] Try an expired code → form rejects with "expired".
### Acceptance-gate monitoring
- [ ] Run `monitor-checks.sh` once on staging — every gate either ✅
or ⚪ (unknown), no 🔴.
```bash
DATABASE_URL=$STAGING_DATABASE_URL \
SENTRY_AUTH_TOKEN=... \
PROM_URL=https://prom.veza.fr \
bash scripts/soft-launch/monitor-checks.sh
```
- [ ] Schedule the cron run (or tmux session) so the gate state is
visible during the beta window without manual re-runs.
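A hedged sketch of the tmux variant, using the script's own `LOOP=1` /
`CHECK_INTERVAL` knobs (the interval value and log path are assumptions) :
```bash
tmux new-session -d -s beta-gates \
  "DATABASE_URL=$STAGING_DATABASE_URL LOOP=1 CHECK_INTERVAL=300 \
   bash scripts/soft-launch/monitor-checks.sh | tee -a /tmp/soft-launch-gates.log"
```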
### Communications
- [ ] Discord `#beta-feedback` channel created, ground rules pinned.
- [ ] Typeform feedback form created ; URL pasted into
`templates/email/beta_invite.eml.template` if not already in the
cohort label.
- [ ] Status page maintenance window declared for the duration —
"elevated alerting may occur during beta period."
- [ ] Operators on duty for the day rota'd in the calendar (every 4 h
shift, primary + backup).
## D-day checklist (the day of send)
### Last hour before send
- [ ] Most recent k6 nightly green (within 30 h).
- [ ] No pending high-severity Sentry issue.
- [ ] No PagerDuty incident open.
- [ ] HAProxy + backend healthchecks green :
```bash
curl -s https://staging.veza.fr/api/v1/health | jq .status
```
- [ ] MinIO drives all online ; pgBackRest drill ran successfully in
the last 7 days.
### Send
- [ ] `validate-cohort.sh` exit code 0 (or 2 with explicit override).
- [ ] `send-invitations.sh` in render-only mode (default, no `SEND=1`) :
eml output dir reviewed (invocation sketched after this list).
- [ ] `send-invitations.sh` with `SEND=1` : dispatch.log reviewed
after run, `0` failed dispatches.
- [ ] First three invitees received the email within 5 min (manual
check on three different domains : gmail / proton / one custom).
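The two send checkboxes above map to the script's two phases ; a hedged
invocation sketch (same environment variables as the test send in the
T-72h section, `cohort.csv` being the validated cohort file) :
```bash
bash scripts/soft-launch/send-invitations.sh cohort.csv          # phase 1 : render .eml only
less scripts/soft-launch/out-*/*.eml                             # review before anything is sent
SEND=1 bash scripts/soft-launch/send-invitations.sh cohort.csv   # phase 2 : dispatch via $SEND_CMD
```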
### Hour 1 post-send
- [ ] First sign-up landed (`SELECT count(*) FROM beta_invites WHERE
used_at IS NOT NULL;` returns ≥ 1).
- [ ] No spike in 5xx on Grafana "Veza API Overview".
- [ ] Discord `#beta-feedback` has at least one "I'm in" message.
### Every 4 h during the beta window
- [ ] Re-run `monitor-checks.sh` (or the cron wakes you).
- [ ] Triage any HIGH-severity report within 1 h (per
`docs/SOFT_LAUNCH_BETA_2026.md` §"Issue triage matrix").
- [ ] Update the issues-reported table in
`docs/SOFT_LAUNCH_BETA_2026.md` so the decision call has fresh data.
## D+0 18:00 UTC — decision call
- [ ] Tech lead, product lead, on-call engineer all on the call.
- [ ] `monitor-checks.sh` final run shown live ; verdict screenshotted.
- [ ] Each acceptance-gate row from `SOFT_LAUNCH_BETA_2026.md`
§"Acceptance gate" walked through verbally.
- [ ] Unanimous GO or any one NO-GO documented in the meeting notes.
- [ ] Decision logged in `docs/SOFT_LAUNCH_BETA_2026.md` §"Take-aways".
If GO : the v2.0.0-public tag goes out the next morning.
If NO-GO : the meeting decides scope of fix-pass + new acceptance date.
## Linked artefacts
- `docs/SOFT_LAUNCH_BETA_2026.md` — the bigger picture (cohort
definition, email template inline, day timeline, monitoring list,
acceptance gate, decision protocol)
- `migrations/990_beta_invites.sql` — schema this depends on
- `scripts/soft-launch/validate-cohort.sh` — pre-send sanity check
- `scripts/soft-launch/send-invitations.sh` — batch insert + send
- `scripts/soft-launch/monitor-checks.sh` — live gate poll
- `templates/email/beta_invite.eml.template` — the email recipients
receive
- `docs/GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md` — the v2.0.0 checklist
this unblocks

View file

@ -0,0 +1,164 @@
# Runbook — RabbitMQ unavailable
> **Alert** : `RabbitMQUnreachable` (in `config/prometheus/alert_rules.yml`).
> **Owner** : infra on-call.
> **Game-day scenario** : E (`infra/ansible/tests/test_rabbitmq_outage.sh`).
## What breaks when RabbitMQ is down
RabbitMQ is a fan-out broker for asynchronous, non-user-facing work
(transcode jobs, distribution to external platforms, email digests,
DMCA takedown propagation, search index updates). The user-facing
request path does NOT block on RabbitMQ — the API publishes a message
and returns 202 Accepted ; the worker picks it up later.
| Subsystem | Effect when RabbitMQ is gone | Severity |
| ------------------------------------ | ------------------------------------------------------------------ | -------- |
| Track upload → HLS transcode | Upload succeeds (S3 write OK), HLS segments don't appear | **MEDIUM** — track playable via fallback `/stream`, not via HLS |
| Distribution to Spotify/SoundCloud | Submission silently queued ; users see "pending" forever | MEDIUM — surfaces in distribution dashboard, not in player |
| Email digest (weekly creator stats) | Cron tick logs `publish failed`, retries on next tick | LOW — eventual consistency, no user-visible breakage |
| DMCA takedown event | Track flag flipped in DB synchronously ; downstream replay queue stalls | **HIGH** — track is gated immediately (synchronous DB UPDATE), but cache invalidation lags |
| Search index updates | New tracks not searchable until queue drains | LOW — falls back to Postgres FTS |
| Chat messages (WebSocket) | INDEPENDENT — chat is direct WS, no RabbitMQ involvement | NONE |
| Auth, sessions, payments | INDEPENDENT — no RabbitMQ dependency | NONE |
The synchronous-fail-loud cases (DMCA cache invalidation, transcode
queue) are the ones that compound if the outage drags on. Most user
flows degrade gracefully.
## First moves
1. **Confirm RabbitMQ is actually down**, not "unreachable from one
host" :
```bash
curl -s -u "$RMQ_USER:$RMQ_PASS" http://rabbitmq.lxd:15672/api/overview \
| jq '.cluster_name, .object_totals'
```
2. **Confirm what changed.** If a deploy fired in the last 30 min,
suspect the deploy. Check `journalctl -u veza-backend-api -n 200`
for `amqp` errors with timestamps after the deploy.
3. **Check the queues didn't fill the disk** (most common bring-down
in development) :
```bash
ssh rabbitmq.lxd 'df -h /var/lib/rabbitmq'
```
## RabbitMQ instance is down
```bash
# State on the RabbitMQ host :
ssh rabbitmq.lxd sudo systemctl status rabbitmq-server
# Logs (Erlang verbosity, grep for ERROR/CRASH) :
ssh rabbitmq.lxd sudo journalctl -u rabbitmq-server -n 500 \
| grep -E 'ERROR|CRASH|disk_alarm|memory_alarm'
```
Common causes :
- **Disk alarm.** `/var/lib/rabbitmq` filled — RabbitMQ pauses producers
when free space drops below `disk_free_limit`. The backend's amqp
client surfaces this as "blocked". Fix : grow the disk or expire old
messages with `rabbitmqctl purge_queue <queue>` (last resort, you
lose what's in there).
- **Memory alarm.** RSS over `vm_memory_high_watermark` × system mem.
Same effect (producers blocked). Fix : add memory or unblock by
draining a slow consumer.
- **Process crashed.** Erlang OOM, segfault. `sudo systemctl restart
rabbitmq-server` ; the queues survive (durable=true on every queue
we declare).
- **Cluster split-brain.** v1.0 is single-node, so this can't happen
yet. Listed for the v1.1 multi-node config.
## Backend can't reach RabbitMQ
Network or DNS issue, not RabbitMQ's fault.
```bash
# From the API container :
nc -zv rabbitmq.lxd 5672
# DNS :
getent hosts rabbitmq.lxd
# AMQP credentials :
docker exec veza_backend_api env | grep AMQP_URL
```
Likely culprits : Incus bridge restart, password rotation didn't
propagate to the API container's env, security-group change.
## Mitigation while RabbitMQ is down
The backend already handles publish failures gracefully :
- `internal/eventbus/rabbitmq.go` retries with exponential backoff up
to 30s, then drops to "degraded mode" (publish returns immediately
with a logged warning, the API call succeeds, the side-effect is
lost).
- Workers in `internal/workers/` have `WithRetry()` middleware that
republishes failed deliveries up to 5 times before dead-lettering.
If recovery is going to take > 10 min, set
`EVENTBUS_DEGRADED_LOG_LEVEL=error` (default `warn`) so the
fail-fast logs land in Sentry and operators can audit which messages
were dropped.
**Do NOT** restart the backend to clear the AMQP connection pool ;
the reconnect logic (`go.uber.org/zap`-logged in eventbus.go:142)
handles it once RabbitMQ is back.
## Recovery
Once RabbitMQ is back up :
1. Verify connectivity from each backend instance :
```bash
docker exec veza_backend_api sh -c 'echo -e "AMQP\x00\x00\x09\x01" | nc -w1 rabbitmq.lxd 5672 | head -c 4'
```
Should return `AMQP`.
2. Watch the queue depth on the management UI :
`http://rabbitmq.lxd:15672/#/queues`. Expect `transcode_jobs`,
`distribution_outbox`, `dmca_propagation`, `search_index_updates`
to drain over the next 5-15 min as the workers catch up (a CLI
alternative is sketched after this list).
3. If a queue is stuck > 30 min after recovery, the worker for it is
wedged — restart that specific worker container :
```bash
docker compose -f docker-compose.prod.yml restart worker-<name>
```
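A CLI alternative to the management UI for the same queue-depth check
(queue names as in step 2 above) :
```bash
ssh rabbitmq.lxd 'sudo rabbitmqctl list_queues name messages consumers' \
  | grep -E 'transcode_jobs|distribution_outbox|dmca_propagation|search_index_updates'
```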
## Audit after the outage
1. Sentry filter `tag:eventbus.status=degraded` between outage start
and end — gives you the count and shape of dropped events.
2. For each dropped DMCA event, manually trigger the cache flush :
```bash
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
https://api.veza.fr/api/v1/admin/cache/dmca/flush
```
3. For each dropped transcode job, requeue from the orders table :
```bash
psql "$DATABASE_URL" -c "
INSERT INTO transcode_jobs (track_id, status, attempts, created_at)
SELECT id, 'pending', 0, NOW() FROM tracks
WHERE created_at BETWEEN '<outage_start>' AND '<outage_end>'
AND hls_status IS NULL;
"
```
## Postmortem trigger
Any RabbitMQ outage > 10 min triggers a postmortem. The non-user-facing
nature makes this less urgent than Redis or Postgres, but the
silent-failure modes (dropped DMCA propagation, missing transcodes)
warrant a write-up so we know what slipped through.
## Future-proofing
- v1.1 will move to a 3-node RabbitMQ cluster behind a load balancer
for HA. This runbook will then split into "single-node down" (the
cluster keeps serving) and "cluster split-brain" (rare, but the
recovery path is different).
- Worker idempotency keys are documented in `docs/api/eventbus.md` ;
any new worker MUST honour them so a replay during recovery doesn't
double-charge / double-distribute / double-takedown.

View file

@ -0,0 +1 @@
../group_vars

View file

@ -20,6 +20,16 @@ all:
ansible_user: senke
ansible_python_interpreter: /usr/bin/python3
children:
# Env-named meta-group — see inventory/staging.yml for rationale.
prod:
children:
incus_hosts:
forgejo_runner:
haproxy:
veza_app_backend:
veza_app_stream:
veza_app_web:
veza_data:
incus_hosts:
hosts:
veza-prod:

View file

@ -36,6 +36,18 @@ all:
ansible_user: senke
ansible_python_interpreter: /usr/bin/python3
children:
# Env-named meta-group : every host below is also in `staging`,
# which makes group_vars/staging.yml apply (Ansible matches
# group_vars file names against group names).
staging:
children:
incus_hosts:
forgejo_runner:
haproxy:
veza_app_backend:
veza_app_stream:
veza_app_web:
veza_data:
incus_hosts:
hosts:
veza-staging:

View file

@ -18,14 +18,28 @@
become: true
gather_facts: true
tasks:
- name: Launch veza-haproxy container if absent
- name: Launch / repair veza-haproxy container
# Idempotent : RUNNING → no-op ; STOPPED/half-baked → recreate ;
# absent → fresh launch. Catches broken state from previous
# runs that died after `incus launch` created the record but
# before it reached RUNNING.
ansible.builtin.shell:
cmd: |
set -e
if incus info veza-haproxy >/dev/null 2>&1; then
echo "veza-haproxy already exists"
STATE=$(incus list veza-haproxy -f csv -c s 2>/dev/null | head -1 || true)
case "$STATE" in
RUNNING)
echo "veza-haproxy RUNNING already"
exit 0
fi
;;
"")
# No record — fresh launch.
;;
*)
echo "veza-haproxy in state '$STATE' — recreating"
incus delete --force veza-haproxy
;;
esac
incus launch "{{ veza_app_base_image | default('images:debian/13') }}" veza-haproxy --profile veza-app --network "{{ veza_incus_network | default('net-veza') }}"
for _ in $(seq 1 30); do
if incus exec veza-haproxy -- /bin/true 2>/dev/null; then
@ -35,21 +49,54 @@
done
incus exec veza-haproxy -- apt-get update
incus exec veza-haproxy -- apt-get install -y python3 python3-apt
echo "veza-haproxy LAUNCHED"
executable: /bin/bash
register: provision_result
changed_when: "'incus launch' in provision_result.stdout"
changed_when: "'LAUNCHED' in provision_result.stdout or 'recreating' in provision_result.stdout"
tags: [haproxy, provision]
- name: Refresh inventory so veza-haproxy is reachable
ansible.builtin.meta: refresh_inventory
- name: Apply common baseline (SSH hardening, fail2ban, node_exporter)
hosts: haproxy
become: true
gather_facts: true
roles:
- common
# Incus proxy devices : forward the host's :80 / :443 to the
# container's :80 / :443. Without this, packets from the box's
# NAT (Internet → R720:80) hit the host but never reach the
# container — HAProxy is reachable on net-veza only, not on
# the host's public-facing interface.
- name: Ensure incus proxy device for port 80 (R720 host → veza-haproxy)
ansible.builtin.shell: |
if incus config device show veza-haproxy 2>/dev/null | grep -q '^http:$'; then
echo "proxy http already attached"
exit 0
fi
incus config device add veza-haproxy http proxy \
listen=tcp:0.0.0.0:80 \
connect=tcp:127.0.0.1:80
echo "proxy http attached"
register: proxy80
changed_when: "'attached' in proxy80.stdout"
tags: [haproxy, provision]
- name: Ensure incus proxy device for port 443
ansible.builtin.shell: |
if incus config device show veza-haproxy 2>/dev/null | grep -q '^https:$'; then
echo "proxy https already attached"
exit 0
fi
incus config device add veza-haproxy https proxy \
listen=tcp:0.0.0.0:443 \
connect=tcp:127.0.0.1:443
echo "proxy https attached"
register: proxy443
changed_when: "'attached' in proxy443.stdout"
tags: [haproxy, provision]
# Common role intentionally NOT applied to the haproxy container :
# it's reached via `incus exec` (no SSH inside), and the role's
# SSH-hardening / fail2ban / node_exporter setup assumes a full
# host (sshd present, auth.log to monitor, exposed metrics port).
# Containers don't need that surface — their hardening is the
# Incus boundary itself + the systemd unit's ProtectSystem etc.
- name: Install + configure HAProxy + dehydrated/Let's Encrypt
hosts: haproxy
become: true

View file

@ -2,7 +2,25 @@
# whitelist of users. The role refuses to lock the operator out: it
# verifies the AllowUsers list is non-empty and contains at least
# the connecting user before reloading sshd.
#
# Skipped entirely when sshd is not installed on the target — useful
# for Incus containers reached via `incus exec`, which don't need
# SSH at all (overlay set common_apply_ssh_hardening=false to skip
# explicitly even when sshd happens to be present).
---
- name: Detect whether sshd is present on the target
ansible.builtin.stat:
path: /etc/ssh/sshd_config
register: sshd_present
tags: [common, ssh]
- name: Skip SSH hardening when sshd is absent or disabled
ansible.builtin.debug:
msg: "sshd not installed on this host — SSH hardening skipped"
when:
- not sshd_present.stat.exists or not (common_apply_ssh_hardening | default(true))
tags: [common, ssh]
- name: Sanity check — ssh_allow_users must be non-empty
ansible.builtin.assert:
that:
@ -12,6 +30,9 @@
ssh_allow_users is empty. Refusing to apply sshd_config which
would lock everyone out. Set ssh_allow_users in
group_vars/all.yml (or override per environment).
when:
- sshd_present.stat.exists
- common_apply_ssh_hardening | default(true)
- name: Render sshd_config drop-in (50-veza-hardening.conf)
ansible.builtin.template:
@ -22,9 +43,15 @@
mode: "0644"
validate: /usr/sbin/sshd -t -f %s
notify: Reload sshd
when:
- sshd_present.stat.exists
- common_apply_ssh_hardening | default(true)
- name: Ensure sshd is enabled + running
ansible.builtin.service:
name: ssh
state: started
enabled: true
when:
- sshd_present.stat.exists
- common_apply_ssh_hardening | default(true)

View file

@ -0,0 +1,50 @@
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQCgyerZjp1+RxU8
/bISXduo8OjR2ejl5SD034PyQvT5B9tk83yplplHoG+JL78UGqpflPlhU9fQSoT9
Walusf/MDDCEbQ75sjPui+yNuvcgWkmpN0MUdOHR8gvfiADCR6/eDQuRf7JJh5N8
YdCtLtnOYsha7Bix+bN11GO6XzPG869I/UGdg4g0v7LvDCP3tI0tpno+y4MuiDvJ
R1pQd7sl6jxPp4zvNtVw8vrSVA3qJ8G6F78nnPUUPFnrAlUFNcnMVLamxY0IA3H4
n9o7X73RnphrpcnPr6eyEYxOL0UGhsDMsQxTrhSaOErL68QDTk3hV60SxWqsVlxX
/DoKAb9VAgMBAAECggEAenTt6V3Fsxv+H+Jz0assFYHNP63/w797FyR4QHUgT93d
CQisRBjPio61A72agHxCj+NM/wQ1FIz8tluoQAdO8x/Bf8nzotZG2QI2Wkcv2bMJ
8NeGvji6mAQJaOgS8+RXG/3BdsHTjk60VAHHRW6uMZJoV18C++FZ/X6RqarCK13N
UEfHX529qNvLhw+xkjXFW/qiB3dQTTEJq+9y0U4nGrjZCXtspkXN3g6ETU6Svzhq
z4tq0udC7FjZPqdA79ChXweZlDCq89FQfxAnxRoZAiwymK91VrGz/GyMIwdBPidm
+or8Rk6nodKk8AuwsGE6ub9UhWUS+Kdpl9fNcV1jLQKBgQDRA7D786sf25tgyooF
6IMZwQfHWGmIepUPruHLz5aV6ozO8XQBgEN4XBI15mxJTu+eeXGbqOhwwuhvYR9u
G02qPE0OlftBRnBJp2AH5+gRphLyrRAvgnjVw323ucnsjOzO0TPwdehomKC0J3b9
B+hZ2tKW/nNxqX/iU1ue969lAwKBgQDE7vJnppvAZLSMo4PCtBTJm11u58AZ9LyZ
6dxvpiq6XxPw9DcC2gj91pCST2g4vIqDYQgmh5U3RzMIFsKLtKfDvHEAYbFOnEfz
UXoNFjlCEmB2jHgpn51/ZDokpPSF9MooDUFna0JPaUrduHs8Zzv7kfrsAhq2N++C
eB+jMea+xwKBgESDzEFbB85io5Vf70yugkMv9ofPIJD/ddt1PUkdHES6ZTv1BEz1
qahLriCDDx4cxQmSz73x6XgFPEI+eRoT0yqpp6zPV1R3bZmHR0BwMa+PXAi22GZq
g4e3FH/kZB+ptnq5MyhwziVzWsKTaTram7zQsVWTxW4N3QDoyFDc6l7XAoGBAI85
+bLIyZ4zn9xpT/rbXgMCrAFtK5m1FTYbj+bjw0+otqgX9aptSPzUgHDor7QT6+mB
OJxNH4kEj2jipLtWuGzzMHxGkN3La8jbCRlbgGk9VErj/sDHBZURH/hmwDBsyFo4
ycidiayXt4tqELbtngJpOUVMgoDkTZ1mIBxgvqEhAoGBAK6uX4k2xiOQorpByvjd
gT16MbuntXO/bDXnXaq1keNMr1JzQ5aS346XweiUgRG7ZJdEb2C8sXwSmh2+oeGa
G+QCLH73hwo/PWbU560dFY8s6z5E79WBjYUu5+1/a0SCBwQ4mEVB7REQVY1mQoJT
A+A8WW+EDvaPpVFujA26K3fc
-----END PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
MIIDjTCCAnWgAwIBAgIUbgZuZRFj8M8ZcdhRFikB2bJKswYwDQYJKoZIhvcNAQEL
BQAwVjELMAkGA1UEBhMCWFgxFTATBgNVBAcMDERlZmF1bHQgQ2l0eTEcMBoGA1UE
CgwTRGVmYXVsdCBDb21wYW55IEx0ZDESMBAGA1UEAwwJbG9jYWxob3N0MB4XDTIy
MDQwODEwMTA0OFoXDTQ5MDgyNDEwMTA0OFowVjELMAkGA1UEBhMCWFgxFTATBgNV
BAcMDERlZmF1bHQgQ2l0eTEcMBoGA1UECgwTRGVmYXVsdCBDb21wYW55IEx0ZDES
MBAGA1UEAwwJbG9jYWxob3N0MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKC
AQEAoMnq2Y6dfkcVPP2yEl3bqPDo0dno5eUg9N+D8kL0+QfbZPN8qZaZR6BviS+/
FBqqX5T5YVPX0EqE/VmpbrH/zAwwhG0O+bIz7ovsjbr3IFpJqTdDFHTh0fIL34gA
wkev3g0LkX+ySYeTfGHQrS7ZzmLIWuwYsfmzddRjul8zxvOvSP1BnYOINL+y7wwj
97SNLaZ6PsuDLog7yUdaUHe7Jeo8T6eM7zbVcPL60lQN6ifBuhe/J5z1FDxZ6wJV
BTXJzFS2psWNCANx+J/aO1+90Z6Ya6XJz6+nshGMTi9FBobAzLEMU64UmjhKy+vE
A05N4VetEsVqrFZcV/w6CgG/VQIDAQABo1MwUTAdBgNVHQ4EFgQUJZDike5gfaOV
k8uCwfCh2OrPXd0wHwYDVR0jBBgwFoAUJZDike5gfaOVk8uCwfCh2OrPXd0wDwYD
VR0TAQH/BAUwAwEB/zANBgkqhkiG9w0BAQsFAAOCAQEAQbXAIBoDHQakksvKGo3X
/bIyc+IQKFpsyWrn5GvS69wTE7XBfKLtyY3X8NygvsCaRx0r2OIdVERNjrhELkes
tWQE17D1+tDnsaEQRUNJsjBYmealNPpqqacdRlBNnkTSGM/3d3m/ihlA51A1QzyI
IOtKxRRIZ+24L/eww5Hv96ub3Wu4rVmepXP4cVIcPEnN6ntmOv4Ja/M83hLI2oXy
4XmXOVsyliYDGWiyvT2U3LcRsv9PHr09SqYO/5yW+fYC7diLGSHW0kfwht2Q8Zqg
IFMJMDmmKTbCWCmFYdoVTRm2fFl0YvgpC5JrXuSloHh3hRiLwDIUiTxlTM3JDP8q
PQ==
-----END CERTIFICATE-----

View file

@ -26,6 +26,29 @@
mode: "0750"
tags: [haproxy, config]
# Chicken-and-egg : haproxy.cfg.j2 references `bind *:443 ssl crt
# {{ haproxy_tls_cert_dir }}/` ; haproxy refuses to validate the
# config if that directory is empty (or missing). dehydrated creates
# real LE certs there LATER (in letsencrypt.yml). Break the cycle
# the same way the working roles in
# /home/senke/Documents/TG__Talas_Group/.../roles/haproxy do : ship a
# checked-in `selfsigned.pem` and copy it into the cert dir.
# Once dehydrated lands real certs alongside, SNI picks the matching
# real cert ; selfsigned.pem only matches CN=localhost (harmless).
- name: Ensure {{ haproxy_tls_cert_dir }} exists
ansible.builtin.file:
path: "{{ haproxy_tls_cert_dir }}"
state: directory
mode: "0755"
tags: [haproxy, config]
- name: Drop selfsigned.pem so haproxy can validate the cfg
ansible.builtin.copy:
src: selfsigned.pem
dest: "{{ haproxy_tls_cert_dir }}/selfsigned.pem"
mode: "0640"
tags: [haproxy, config]
- name: Render haproxy.cfg
ansible.builtin.template:
src: haproxy.cfg.j2
@ -33,7 +56,10 @@
owner: root
group: haproxy
mode: "0640"
validate: "haproxy -f %s -c -q"
# No -q so the actual validation error reaches the operator's
# console. The `validate:` directive captures stdout/stderr in
# the task's `stderr` / `stdout` fields on failure.
validate: "haproxy -f %s -c"
register: haproxy_config
notify: Reload haproxy
tags: [haproxy, config]

View file

@ -41,6 +41,28 @@ defaults
timeout http-request 10s
load-server-state-from-file global
# -----------------------------------------------------------------------
# DNS resolvers — Incus's managed bridges expose a built-in DNS
# resolver on the gateway IP for the bridge's subnet (10.0.20.1 for
# net-veza). Backend containers' .lxd hostnames resolve here.
# init-addr last,libc,none on default-server lets HAProxy start
# even if the backends don't exist yet ; servers go into MAINT
# until the resolver returns an address (deploy_app.yml creates
# them later, then `incus-resolver` task in HAProxy picks them up
# automatically — no haproxy reload needed).
# -----------------------------------------------------------------------
resolvers veza_dns
nameserver incus_gw 10.0.20.1:53
accepted_payload_size 4096
resolve_retries 3
timeout resolve 1s
timeout retry 1s
hold valid 10s
hold nx 5s
hold timeout 5s
hold refused 5s
hold obsolete 30s
# -----------------------------------------------------------------------
# Stats endpoint — bound to loopback only ; the Prometheus haproxy
# exporter sidecar scrapes it.
@ -63,9 +85,12 @@ frontend veza_http_in
bind *:{{ haproxy_listen_https }} ssl crt {{ haproxy_tls_cert_dir }}/ alpn h2,http/1.1
http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains"
# Let dehydrated's HTTP-01 challenges through unencrypted before any redirect.
# Order matters : http-request rules must come BEFORE use_backend
# rules in HAProxy ; otherwise haproxy 3.x warns and processes them
# in the unintended order.
acl acme_challenge path_beg /.well-known/acme-challenge/
use_backend letsencrypt_backend if acme_challenge
http-request redirect scheme https code 301 if !{ ssl_fc } !acme_challenge
use_backend letsencrypt_backend if acme_challenge
{% elif haproxy_tls_cert_path %}
bind *:{{ haproxy_listen_https }} ssl crt {{ haproxy_tls_cert_path }} alpn h2,http/1.1
http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains"
@ -146,7 +171,7 @@ backend {{ env }}_backend_api
option httpchk GET {{ veza_healthcheck_paths.backend | default('/api/v1/health') }}
http-check expect status 200
cookie {{ haproxy_sticky_cookie_name }}_{{ env }} insert indirect nocache httponly secure
default-server check inter {{ haproxy_health_check_interval_ms }} fall {{ haproxy_health_check_fall }} rise {{ haproxy_health_check_rise }} on-marked-down shutdown-sessions slowstart {{ haproxy_graceful_drain_seconds }}s
default-server check inter {{ haproxy_health_check_interval_ms }} fall {{ haproxy_health_check_fall }} rise {{ haproxy_health_check_rise }} on-marked-down shutdown-sessions slowstart {{ haproxy_graceful_drain_seconds }}s init-addr last,libc,none resolvers veza_dns
server {{ env }}_backend_blue {{ prefix }}backend-blue.{{ veza_incus_dns_suffix }}:{{ veza_backend_port }} cookie {{ env }}_backend_blue {{ '' if _active == 'blue' else 'backup' }}
server {{ env }}_backend_green {{ prefix }}backend-green.{{ veza_incus_dns_suffix }}:{{ veza_backend_port }} cookie {{ env }}_backend_green {{ '' if _active == 'green' else 'backup' }}
@ -157,7 +182,7 @@ backend {{ env }}_stream_pool
option httpchk GET {{ veza_healthcheck_paths.stream | default('/health') }}
http-check expect status 200
timeout tunnel 1h
default-server check inter {{ haproxy_health_check_interval_ms }} fall {{ haproxy_health_check_fall }} rise {{ haproxy_health_check_rise }} on-marked-down shutdown-sessions slowstart {{ haproxy_graceful_drain_seconds }}s
default-server check inter {{ haproxy_health_check_interval_ms }} fall {{ haproxy_health_check_fall }} rise {{ haproxy_health_check_rise }} on-marked-down shutdown-sessions slowstart {{ haproxy_graceful_drain_seconds }}s init-addr last,libc,none resolvers veza_dns
server {{ env }}_stream_blue {{ prefix }}stream-blue.{{ veza_incus_dns_suffix }}:{{ veza_stream_port }} {{ '' if _active == 'blue' else 'backup' }}
server {{ env }}_stream_green {{ prefix }}stream-green.{{ veza_incus_dns_suffix }}:{{ veza_stream_port }} {{ '' if _active == 'green' else 'backup' }}
@ -166,7 +191,7 @@ backend {{ env }}_web_pool
balance roundrobin
option httpchk GET {{ veza_healthcheck_paths.web | default('/') }}
http-check expect status 200
default-server check inter {{ haproxy_health_check_interval_ms }} fall {{ haproxy_health_check_fall }} rise {{ haproxy_health_check_rise }} on-marked-down shutdown-sessions slowstart {{ haproxy_graceful_drain_seconds }}s
default-server check inter {{ haproxy_health_check_interval_ms }} fall {{ haproxy_health_check_fall }} rise {{ haproxy_health_check_rise }} on-marked-down shutdown-sessions slowstart {{ haproxy_graceful_drain_seconds }}s init-addr last,libc,none resolvers veza_dns
server {{ env }}_web_blue {{ prefix }}web-blue.{{ veza_incus_dns_suffix }}:{{ veza_web_port }} {{ '' if _active == 'blue' else 'backup' }}
server {{ env }}_web_green {{ prefix }}web-green.{{ veza_incus_dns_suffix }}:{{ veza_web_port }} {{ '' if _active == 'green' else 'backup' }}
@ -174,11 +199,17 @@ backend {{ env }}_web_pool
{% if haproxy_forgejo_host %}
# --- Forgejo (managed outside the deploy pipeline) --------------------
# The existing forgejo container exposes HTTPS on :3000 with a
# self-signed cert. We re-encrypt to it (ssl verify none) ; the
# operator's WireGuard mesh is the trust boundary, the cert chain
# is irrelevant. Healthcheck adapted to send a Host: header so
# Forgejo's reverse-proxy validation accepts the request.
backend forgejo_backend
option httpchk GET /
http-check expect status 200
option httpchk
http-check send meth GET uri / ver HTTP/1.1 hdr Host {{ haproxy_forgejo_host }}
http-check expect rstatus ^[23]
default-server check inter 10s fall 3 rise 2
server forgejo {{ haproxy_forgejo_backend }}
server forgejo {{ haproxy_forgejo_backend }} ssl verify none sni str({{ haproxy_forgejo_host }})
{% endif %}
{% if haproxy_talas_hosts %}

View file

@ -42,6 +42,17 @@ OPERATOR_EMAIL=${OPERATOR_EMAIL:-?}
OPERATOR_PASSWORD=${OPERATOR_PASSWORD:-?}
ORDER_POLL_TIMEOUT=${ORDER_POLL_TIMEOUT:-300}
ORDER_POLL_INTERVAL=${ORDER_POLL_INTERVAL:-5}
# v1.0.10 polish safety guards:
# DRY_RUN=1 — skip the POST /orders + payment steps; rehearse
# the login + product-listing + license-poll path
# end-to-end on staging without spending a euro.
# CONFIRM_PRODUCTION=1 — required when STAGING_URL points at the live
# environment. Without it the script refuses to
# run, so a typo (e.g. `STAGING_URL=https://veza.fr`
# on a command meant for the sandbox) can't
# accidentally charge a real card.
DRY_RUN=${DRY_RUN:-0}
CONFIRM_PRODUCTION=${CONFIRM_PRODUCTION:-0}
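# Example of a staging rehearsal that exercises the read-only path without
# any charge (values illustrative) :
# DRY_RUN=1 STAGING_URL=https://staging.veza.fr \
# OPERATOR_EMAIL=ops@veza.fr OPERATOR_PASSWORD=... \
# bash scripts/payment-e2e-walkthrough.sh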
SESSION_DATE="$(date +%Y%m%d-%H%M)"
SESSION_LOG="${REPO_ROOT}/docs/PAYMENT_E2E_LIVE_REPORT.md.session-${SESSION_DATE}.log"
@ -64,6 +75,43 @@ require jq
[ "$OPERATOR_EMAIL" = "?" ] && fail "OPERATOR_EMAIL env var required" 3
[ "$OPERATOR_PASSWORD" = "?" ] && fail "OPERATOR_PASSWORD env var required" 3
# Heuristic: any URL that doesn't contain "staging", "localhost" or
# "127.0.0.1" is treated as production. Operators on a non-veza domain
# (custom env) can still run the script; they just have to pass
# CONFIRM_PRODUCTION=1.
TARGET_LOOKS_LIKE_PROD=0
if [[ ! "$STAGING_URL" =~ staging ]] && [[ ! "$STAGING_URL" =~ localhost ]] && [[ ! "$STAGING_URL" =~ 127\.0\.0\.1 ]]; then
TARGET_LOOKS_LIKE_PROD=1
fi
if [ "$TARGET_LOOKS_LIKE_PROD" = "1" ] && [ "$CONFIRM_PRODUCTION" != "1" ]; then
cat >&2 <<EOF
================================================================
ABORTING — production target detected without explicit confirmation
================================================================
STAGING_URL=$STAGING_URL does not contain "staging", "localhost" or
"127.0.0.1", so this script will refuse to run by default to prevent
an accidental real-card charge.
If you genuinely want to run against production, re-invoke with:
CONFIRM_PRODUCTION=1 \\
STAGING_URL=$STAGING_URL \\
OPERATOR_EMAIL=$OPERATOR_EMAIL \\
OPERATOR_PASSWORD=... \\
bash scripts/payment-e2e-walkthrough.sh
Or set DRY_RUN=1 to rehearse the flow without making the actual charge.
================================================================
EOF
exit 3
fi
if [ "$DRY_RUN" = "1" ]; then
log "DRY_RUN=1 — order creation + payment + refund steps will be SKIPPED"
fi
# api wrapper that tee's request + response to the session log so the
# operator can copy-paste the full trace into the report.
api() {
@ -134,8 +182,39 @@ log " ✓ price : $PRODUCT_PRICE"
# --------------------------------------------------------------------
# Step 3 : POST /orders.
# --------------------------------------------------------------------
if [ "$DRY_RUN" = "1" ]; then
log ""
log "step 3 : POST /api/v1/marketplace/orders — SKIPPED (dry-run)"
log "================================================================"
log "DRY-RUN PASS : login + product list + license-mine endpoints reached"
log "Run without DRY_RUN to exercise the real charge + refund flow."
log "================================================================"
exit 0
fi
log ""
log "step 3 : POST /api/v1/marketplace/orders"
# v1.0.10 polish: confirm prompt before the actual charge so a typo'd
# product_id or wrong operator account can't quietly burn 5 EUR.
if [ "$TARGET_LOOKS_LIKE_PROD" = "1" ]; then
log ""
log "================================================================"
log "FINAL CONFIRMATION — about to charge a real card on production"
log "================================================================"
log " product_id : $PRODUCT_ID"
log " price : $PRODUCT_PRICE"
log " operator : $OPERATOR_EMAIL"
log " endpoint : ${STAGING_URL}/api/v1/marketplace/orders"
log ""
prompt "Type the literal word 'CHARGE' to proceed (anything else aborts) :"
read -r confirm_word
if [ "$confirm_word" != "CHARGE" ]; then
fail "operator did not confirm the charge ($confirm_word) — aborting" 2
fi
log " operator confirmed CHARGE — proceeding"
fi
order_body="{\"items\":[{\"product_id\":\"${PRODUCT_ID}\"}]}"
order_resp=$(api POST /api/v1/marketplace/orders "$order_body" 2>/dev/null)
ORDER_ID=$(echo "$order_resp" | jq -r '.data.order.id // .data.id // .id // ""')

View file

@ -0,0 +1,191 @@
#!/usr/bin/env bash
# seed-test-accounts.sh — provision the 3 pentester accounts on a target
# environment (staging only ; refuses to run against prod).
#
# Per docs/PENTEST_SCOPE_2026.md §"Authentication context", an external
# pentest engagement needs three pre-seeded accounts (listener, creator,
# admin). This script :
#
# 1. Generates a 32-char random password for each role.
# 2. Calls the staging admin API to create / reset each account.
# 3. Promotes the creator account to the creator role and the admin
# account to the admin role via a direct DB UPDATE, because the
# public API doesn't expose role changes ; the operator runs that
# step from a maintenance shell (sketch at the end of this header).
# 4. Writes a 1Password import JSON to stdout so the operator can
# `op item template` it into the shared vault. NEVER prints
# passwords to the screen.
#
# Usage :
# bash scripts/pentest/seed-test-accounts.sh staging
#
# Output :
# 1Password JSON on stdout (3 entries). Pipe into a file, then
# `op item create --vault Pentest-2026 - < file.json`.
#
# Exit codes :
# 0 — three accounts provisioned, JSON emitted
# 1 — API call failed (account creation or login probe)
# 2 — wrong target environment (e.g. operator passed "prod")
# 3 — required env var or tool missing
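#
# A sketch of that maintenance-shell promotion step. The role column and
# values are assumptions ; adapt to the real users schema before running :
#
# psql "$DATABASE_URL" -c \
# "UPDATE users SET role = 'creator' WHERE email = 'pentest-2026-creator@veza.fr';"
# psql "$DATABASE_URL" -c \
# "UPDATE users SET role = 'admin' WHERE email = 'pentest-2026-admin@veza.fr';"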
set -euo pipefail
ENV_NAME=${1:-}
if [ -z "$ENV_NAME" ]; then
cat >&2 <<EOF
usage : bash scripts/pentest/seed-test-accounts.sh <env>
env : staging (the only accepted value — prod is refused)
Required env vars :
STAGING_URL base URL (e.g. https://staging.veza.fr)
STAGING_ADMIN_EMAIL admin who creates the accounts
STAGING_ADMIN_PASSWORD admin password (provisioning cred only)
Output :
1Password import JSON for vault Pentest-2026, on stdout.
Passwords are NEVER printed to the operator's screen.
EOF
exit 3
fi
if [ "$ENV_NAME" != "staging" ]; then
echo "ERROR: this script refuses to run against any env other than 'staging'." >&2
echo " Pentest accounts on production violate the engagement scope." >&2
exit 2
fi
STAGING_URL=${STAGING_URL:-?}
STAGING_ADMIN_EMAIL=${STAGING_ADMIN_EMAIL:-?}
STAGING_ADMIN_PASSWORD=${STAGING_ADMIN_PASSWORD:-?}
[ "$STAGING_URL" = "?" ] && { echo "STAGING_URL required" >&2; exit 3; }
[ "$STAGING_ADMIN_EMAIL" = "?" ] && { echo "STAGING_ADMIN_EMAIL required" >&2; exit 3; }
[ "$STAGING_ADMIN_PASSWORD" = "?" ] && { echo "STAGING_ADMIN_PASSWORD required" >&2; exit 3; }
command -v curl >/dev/null 2>&1 || { echo "curl required" >&2; exit 3; }
command -v jq >/dev/null 2>&1 || { echo "jq required" >&2; exit 3; }
command -v openssl >/dev/null 2>&1 || { echo "openssl required (password generation)" >&2; exit 3; }
genpass() {
# 32-char password cut from the base64 of 32 bytes of entropy. '=', '/'
# and '+' are stripped so the result can sit in a JSON string or URL
# without escaping, while still leaving at least 32 usable characters.
openssl rand -base64 32 | tr -d '\n=/+' | cut -c-32
}
# 1. login as the staging admin so we can call the create-user endpoint.
admin_login_resp=$(curl -ksS --max-time 15 \
-X POST -H 'Content-Type: application/json' \
-d "{\"email\":\"${STAGING_ADMIN_EMAIL}\",\"password\":\"${STAGING_ADMIN_PASSWORD}\",\"remember_me\":false}" \
"${STAGING_URL}/api/v1/auth/login")
admin_token=$(echo "$admin_login_resp" | jq -r '.data.token.access_token // .token.access_token // ""')
if [ -z "$admin_token" ] || [ "$admin_token" = "null" ]; then
echo "ERROR: admin login failed" >&2
echo "$admin_login_resp" >&2
exit 1
fi
provision() {
# provision <role> <email-prefix>
# Returns : password (stdout), nothing else.
local role=$1 email_prefix=$2
local email="${email_prefix}@veza.fr"
local password
password=$(genpass)
# Try creating ; if 409 (already exists), reset password instead. Both
# paths return a valid (email, password) tuple at the end.
local create_resp create_status
create_resp=$(curl -ksS --max-time 15 \
-H "Authorization: Bearer ${admin_token}" \
-H 'Content-Type: application/json' \
-X POST \
-d "{\"email\":\"${email}\",\"password\":\"${password}\",\"username\":\"${email_prefix}\",\"role\":\"${role}\"}" \
-w '\nHTTP_CODE=%{http_code}' \
"${STAGING_URL}/api/v1/admin/users")
create_status=$(echo "$create_resp" | grep -oE 'HTTP_CODE=[0-9]+' | tail -1 | cut -d= -f2)
case "$create_status" in
200|201)
;;
409)
# Account exists — reset password instead.
curl -ksS --max-time 15 \
-H "Authorization: Bearer ${admin_token}" \
-H 'Content-Type: application/json' \
-X POST \
-d "{\"email\":\"${email}\",\"new_password\":\"${password}\"}" \
"${STAGING_URL}/api/v1/admin/users/reset-password" >/dev/null
;;
*)
echo "ERROR: provisioning ${role} failed with HTTP ${create_status}" >&2
echo "$create_resp" >&2
exit 1
;;
esac
# Probe : login as the freshly-set account so we know the engagement
# can use it.
probe=$(curl -ksS --max-time 15 \
-X POST -H 'Content-Type: application/json' \
-d "{\"email\":\"${email}\",\"password\":\"${password}\",\"remember_me\":false}" \
"${STAGING_URL}/api/v1/auth/login")
probe_token=$(echo "$probe" | jq -r '.data.token.access_token // .token.access_token // ""')
if [ -z "$probe_token" ] || [ "$probe_token" = "null" ]; then
echo "ERROR: ${role} login probe failed — provisioning broken" >&2
exit 1
fi
printf '%s' "$password"
}
# 2. provision the three roles. Passwords stay in shell variables — no
# echo, no log, no temp file.
listener_pwd=$(provision "user" "pentest-2026-listener")
creator_pwd=$(provision "creator" "pentest-2026-creator")
admin_pwd=$(provision "admin" "pentest-2026-admin")
# 3. emit 1Password JSON template. Each entry has the role + login URL
# in Notes so the pentester knows which account does what.
cat <<EOF
[
{
"title": "pentest-2026-listener",
"category": "LOGIN",
"vault": {"name": "Pentest-2026"},
"fields": [
{"id": "username", "type": "STRING", "value": "pentest-2026-listener@veza.fr"},
{"id": "password", "type": "CONCEALED", "value": "${listener_pwd}"},
{"id": "url", "type": "URL", "value": "${STAGING_URL}/login"},
{"id": "notesPlain", "type": "STRING", "value": "Pentest 2026 — listener role. Engagement: see PENTEST_SCOPE_2026.md. Rotate at engagement end."}
]
},
{
"title": "pentest-2026-creator",
"category": "LOGIN",
"vault": {"name": "Pentest-2026"},
"fields": [
{"id": "username", "type": "STRING", "value": "pentest-2026-creator@veza.fr"},
{"id": "password", "type": "CONCEALED", "value": "${creator_pwd}"},
{"id": "url", "type": "URL", "value": "${STAGING_URL}/login"},
{"id": "notesPlain", "type": "STRING", "value": "Pentest 2026 — creator role. Owns 5 seed tracks. Rotate at engagement end."}
]
},
{
"title": "pentest-2026-admin",
"category": "LOGIN",
"vault": {"name": "Pentest-2026"},
"fields": [
{"id": "username", "type": "STRING", "value": "pentest-2026-admin@veza.fr"},
{"id": "password", "type": "CONCEALED", "value": "${admin_pwd}"},
{"id": "url", "type": "URL", "value": "${STAGING_URL}/login"},
{"id": "notesPlain", "type": "STRING", "value": "Pentest 2026 — admin role + MFA bypass. DO NOT use for non-pentest activity. Rotate at engagement end."}
]
}
]
EOF
echo "" >&2
echo " 3 accounts provisioned + login-probed against ${STAGING_URL}" >&2
echo " next: pipe stdout to a file and run" >&2
echo " op item create --vault Pentest-2026 - < <file>" >&2
echo " THEN rotate each entry with op item edit --generate-password=letters,digits,32" >&2
echo " at engagement end (this script does NOT auto-rotate)." >&2

View file

@ -16,18 +16,26 @@
# E : test_rabbitmq_outage.sh — stop RabbitMQ 60s, backend stays up
#
# Usage :
# bash scripts/security/game-day-driver.sh # run all scenarios
# SKIP=DE bash scripts/security/game-day-driver.sh # skip scenarios D + E
# ONLY=A bash scripts/security/game-day-driver.sh # only run scenario A
# bash scripts/security/game-day-driver.sh # all scenarios on staging (default)
# SKIP=DE bash scripts/security/game-day-driver.sh # skip D + E
# ONLY=A bash scripts/security/game-day-driver.sh # only A
# INVENTORY=prod CONFIRM_PROD=1 bash scripts/security/game-day-driver.sh # prod (gated)
#
# Required env (passed through to the underlying smoke tests) :
# REDIS_PASS / SENTINEL_PASS for scenario C
# MINIO_ROOT_USER / MINIO_ROOT_PASSWORD for scenario D
#
# v1.0.10 polish — production gating :
# INVENTORY=prod must be paired with CONFIRM_PROD=1 or the script
# refuses to run, so a stale shell-history line can't accidentally
# kill prod Postgres on a Monday morning. The driver also runs a
# backup-freshness pre-flight when targeting prod (most recent
# pgBackRest backup must be < 24 h old).
#
# Exit codes :
# 0 — every selected scenario passed
# 1 — at least one scenario failed
# 2 — runner pre-flight failed (script missing, etc.)
# 2 — runner pre-flight failed (script missing, prod safety guard tripped, stale backup, etc.)
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
@ -41,6 +49,9 @@ mkdir -p "$LOGS_DIR"
ONLY=${ONLY:-}
SKIP=${SKIP:-}
INVENTORY=${INVENTORY:-staging}
CONFIRM_PROD=${CONFIRM_PROD:-0}
SKIP_BACKUP_FRESHNESS=${SKIP_BACKUP_FRESHNESS:-0}
log() { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*" | tee -a "$SESSION_LOG" >&2; }
fail() { log "FAIL: $*"; exit "${2:-2}"; }
@ -68,6 +79,101 @@ want() {
return 0
}
# v1.0.10 polish — prod safety gate. INVENTORY=prod requires
# CONFIRM_PROD=1 + an interactive type-the-word confirm. Anything else
# defaults to staging so a forgotten env-var doesn't matter.
case "$INVENTORY" in
staging|stg|dev|local) ;;
prod|production)
if [ "$CONFIRM_PROD" != "1" ]; then
cat >&2 <<EOF
================================================================
ABORTING — INVENTORY=prod without CONFIRM_PROD=1
================================================================
This script will kill production services. Each scenario triggers a
real outage in the chosen inventory : Postgres primary kill, HAProxy
backend stop, Redis master kill, MinIO node loss, RabbitMQ stop.
To run on production, you must :
1. Announce a maintenance window 24 h ahead (status page +
#engineering channel).
2. Set PagerDuty to maintenance mode for the affected services.
3. Confirm pgBackRest's last backup is < 24 h old (this script
auto-checks if you don't pass SKIP_BACKUP_FRESHNESS=1).
4. Re-invoke with :
INVENTORY=prod CONFIRM_PROD=1 \\
bash scripts/security/game-day-driver.sh
The driver will then ask for one more interactive confirmation
(type the word KILL-PROD) before the first scenario fires.
================================================================
EOF
exit 2
fi
# Backup-freshness pre-flight : refuse to run if the most recent
# pgBackRest full/diff is > 24 h old. Recovery from a stale backup
# can extend an outage from minutes to hours, so the cost of
# postponing the game day is much less than the cost of compounded
# data loss if scenario A fails to recover and we have to restore
# from yesterday-but-one.
if [ "$SKIP_BACKUP_FRESHNESS" != "1" ]; then
if command -v pgbackrest >/dev/null 2>&1; then
last_backup_ts=$(pgbackrest --stanza=veza info --output=json 2>/dev/null \
| python3 -c "
import json, sys
try:
    data = json.load(sys.stdin)
    backups = data[0]['backup'] if data else []
    if not backups: print(0); sys.exit(0)
    print(max(b['timestamp']['stop'] for b in backups))
except Exception:
    print(0)
" 2>/dev/null || echo 0)
now_ts=$(date +%s)
age_seconds=$(( now_ts - last_backup_ts ))
if [ "$last_backup_ts" -eq 0 ]; then
fail "pgBackRest backup-freshness check failed : could not parse 'pgbackrest info'. Set SKIP_BACKUP_FRESHNESS=1 to override (only after manually verifying a recent backup exists)." 2
fi
if [ "$age_seconds" -gt 86400 ]; then
age_hours=$(( age_seconds / 3600 ))
fail "pgBackRest most recent backup is ${age_hours}h old (threshold 24h). Run a backup before the game day, or set SKIP_BACKUP_FRESHNESS=1 if you've validated freshness another way." 2
fi
log "pre-flight : pgBackRest most recent backup is $(( age_seconds / 3600 ))h $(( (age_seconds % 3600) / 60 ))m old (< 24h threshold) — OK"
else
log "WARN : pgbackrest CLI not on \$PATH ; skipping backup-freshness check. Set SKIP_BACKUP_FRESHNESS=1 to silence this warning if intentional."
fi
fi
# Final type-the-word confirm. Everything above can be set in env
# by mistake ; this last step requires a human at the keyboard.
cat >&2 <<EOF
================================================================
PROD GAME DAY — final confirmation
================================================================
inventory : prod
scenarios : ${SCENARIOS[*]}${ONLY:+ (filtered by ONLY=$ONLY)}${SKIP:+ (filtered by SKIP=$SKIP)}
session : $SESSION_LOG
Each scenario triggers a real outage. Type the literal phrase
KILL-PROD (any other input aborts) to proceed :
EOF
read -r confirm_phrase
if [ "$confirm_phrase" != "KILL-PROD" ]; then
fail "operator did not confirm KILL-PROD ($confirm_phrase) — aborting" 2
fi
;;
*)
fail "INVENTORY=$INVENTORY not recognised — must be one of staging|prod" 2
;;
esac
# Pre-flight : every selected scenario script must exist + be executable.
for s in "${SCENARIOS[@]}"; do
if want "$s"; then
@ -83,6 +189,7 @@ declare -A SCENARIO_DURATION
log "================================================================"
log "Game day session : $SESSION_DATE"
log "Inventory : $INVENTORY"
log "Session log : $SESSION_LOG"
log "Scenarios run : ${SCENARIOS[*]}"
[ -n "$ONLY" ] && log "ONLY filter : $ONLY"

View file

@ -0,0 +1,255 @@
#!/usr/bin/env bash
# monitor-checks.sh — poll the soft-launch acceptance gate live during
# the bêta window so the operator gets a heads-up before the decision
# call instead of discovering at 18:00 UTC that one threshold is red.
#
# Acceptance gate (per docs/SOFT_LAUNCH_BETA_2026.md §"Acceptance gate") :
# - ≥ 50 testers signed up (used_at IS NOT NULL on beta_invites)
# - 0 P1 events in Sentry today
# - Status page green for the last 4 h
# - Synthetic parcours all green for 6 h
# - Nightly k6 load test green
# - < 3 HIGH-severity issues reported
#
# v1.0.10 Cluster 3.4.
#
# Usage :
# DATABASE_URL=postgres://... \
# SENTRY_AUTH_TOKEN=... \
# STATUSPAGE_URL=https://status.veza.fr \
# PROM_URL=https://prom.veza.fr \
# bash scripts/soft-launch/monitor-checks.sh
#
# By default the script runs once and exits with the gate's verdict.
# Run it from cron (e.g. every 30 min) or pass LOOP=1 to keep checking
# in-place every CHECK_INTERVAL seconds (default 600 = 10 min).
#
# Optional env :
# LOOP=1 continuous mode
# CHECK_INTERVAL seconds between checks in LOOP mode (default 600)
# QUIET=1 only emit the verdict line (for cron piping)
# THRESHOLD_TESTERS override 50 (default), e.g. set to 100 for
# a stricter sub-window
#
# Exit codes :
# 0 — every gate green
# 1 — at least one gate red
# 2 — at least one gate could not be checked (collector down,
# token wrong, etc.) — operator must verify manually
# 3 — required env / tool missing
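#
# Example cron entry (paths and log destination are illustrative, not
# prescribed by this repo) :
# */30 * * * * DATABASE_URL=postgres://... SENTRY_AUTH_TOKEN=... PROM_URL=... QUIET=1 \
# bash /opt/veza/scripts/soft-launch/monitor-checks.sh >> /var/log/veza-soft-launch-gate.log 2>&1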
set -euo pipefail
DATABASE_URL=${DATABASE_URL:-?}
SENTRY_AUTH_TOKEN=${SENTRY_AUTH_TOKEN:-?}
STATUSPAGE_URL=${STATUSPAGE_URL:-https://status.veza.fr}
PROM_URL=${PROM_URL:-?}
LOOP=${LOOP:-0}
CHECK_INTERVAL=${CHECK_INTERVAL:-600}
QUIET=${QUIET:-0}
THRESHOLD_TESTERS=${THRESHOLD_TESTERS:-50}
[ "$DATABASE_URL" = "?" ] && { echo "DATABASE_URL required" >&2; exit 3; }
[ "$SENTRY_AUTH_TOKEN" = "?" ] && { echo "SENTRY_AUTH_TOKEN required (read scope sufficient)" >&2; exit 3; }
[ "$PROM_URL" = "?" ] && { echo "PROM_URL required" >&2; exit 3; }
command -v psql >/dev/null 2>&1 || { echo "psql required" >&2; exit 3; }
command -v curl >/dev/null 2>&1 || { echo "curl required" >&2; exit 3; }
command -v jq >/dev/null 2>&1 || { echo "jq required" >&2; exit 3; }
# ----------------------------------------------------------------------
# Individual gate checks. Each prints "✅ <name>" / "🔴 <name>" / "⚪ <name>"
# (last for "could not check"), and sets one of GATE_*_OK to 0 / 1 / 2.
# ----------------------------------------------------------------------
GATE_TESTERS_OK=2
GATE_SENTRY_OK=2
GATE_STATUSPAGE_OK=2
GATE_SYNTHETIC_OK=2
GATE_K6_OK=2
GATE_ISSUES_OK=2
check_testers() {
local count
count=$(psql "$DATABASE_URL" -A -t -c "
SELECT count(*) FROM beta_invites WHERE used_at IS NOT NULL;
" 2>/dev/null | tr -d ' ' || echo "?")
if [ "$count" = "?" ] || ! [[ "$count" =~ ^[0-9]+$ ]]; then
echo "⚪ testers signed-up : check failed (psql)"
GATE_TESTERS_OK=2
return
fi
if [ "$count" -ge "$THRESHOLD_TESTERS" ]; then
echo "✅ testers signed-up : $count / $THRESHOLD_TESTERS"
GATE_TESTERS_OK=0
else
echo "🔴 testers signed-up : $count / $THRESHOLD_TESTERS"
GATE_TESTERS_OK=1
fi
}
check_sentry_p1() {
# Sentry API : count of unresolved P1 issues last 24h.
local count
count=$(curl -s -H "Authorization: Bearer $SENTRY_AUTH_TOKEN" \
"https://sentry.io/api/0/projects/veza/veza-backend/issues/?statsPeriod=24h&query=is:unresolved%20level:fatal" \
2>/dev/null | jq 'length' 2>/dev/null || echo "?")
if [ "$count" = "?" ] || ! [[ "$count" =~ ^[0-9]+$ ]]; then
echo "⚪ Sentry P1 events 24h : check failed (auth or network)"
GATE_SENTRY_OK=2
return
fi
if [ "$count" -eq 0 ]; then
echo "✅ Sentry P1 events 24h : 0"
GATE_SENTRY_OK=0
else
echo "🔴 Sentry P1 events 24h : $count (must be 0)"
GATE_SENTRY_OK=1
fi
}
check_statuspage() {
local status
status=$(curl -s "$STATUSPAGE_URL/api/v1/status" 2>/dev/null \
| jq -r '.indicator // .status.indicator // ""' 2>/dev/null || echo "")
case "$status" in
none|operational)
echo "✅ status page : $status (green)"
GATE_STATUSPAGE_OK=0
;;
minor|major|critical)
echo "🔴 status page : $status"
GATE_STATUSPAGE_OK=1
;;
*)
echo "⚪ status page : check failed (got '$status')"
GATE_STATUSPAGE_OK=2
;;
esac
}
check_synthetic() {
# PromQL : any synthetic parcours that failed at least once over the
# last 6 h (the acceptance gate wants every parcours green for 6 h).
local query='min_over_time(probe_success{probe_kind="synthetic"}[6h]) == 0'
local resp
resp=$(curl -s --get "$PROM_URL/api/v1/query" \
--data-urlencode "query=$query" 2>/dev/null)
local result_count
result_count=$(echo "$resp" | jq '.data.result | length' 2>/dev/null || echo "?")
if [ "$result_count" = "?" ] || ! [[ "$result_count" =~ ^[0-9]+$ ]]; then
echo "⚪ synthetic parcours : check failed (Prometheus)"
GATE_SYNTHETIC_OK=2
return
fi
if [ "$result_count" -eq 0 ]; then
echo "✅ synthetic parcours : all green"
GATE_SYNTHETIC_OK=0
else
local failing
failing=$(echo "$resp" | jq -r '.data.result[].metric.parcours' 2>/dev/null | tr '\n' ',' | sed 's/,$//')
echo "🔴 synthetic parcours : $result_count failing ($failing)"
GATE_SYNTHETIC_OK=1
fi
}
check_k6_nightly() {
# k6 nightly is exposed as veza_k6_nightly_last_success_timestamp_seconds
# by the Forgejo runner workflow's textfile-collector. Reading via Prom
# gives "is the last success < 30h old?".
local query='time() - veza_k6_nightly_last_success_timestamp_seconds'
local resp age
resp=$(curl -s --get "$PROM_URL/api/v1/query" \
--data-urlencode "query=$query" 2>/dev/null)
age=$(echo "$resp" | jq -r '.data.result[0].value[1] // ""' 2>/dev/null)
if [ -z "$age" ] || [ "$age" = "null" ]; then
echo "⚪ k6 nightly : check failed (metric absent — runner offline?)"
GATE_K6_OK=2
return
fi
age_int=$(printf '%.0f' "$age" 2>/dev/null || echo 999999)
if [ "$age_int" -lt 108000 ]; then # 30h
echo "✅ k6 nightly : last success $(( age_int / 3600 ))h ago"
GATE_K6_OK=0
else
echo "🔴 k6 nightly : last success $(( age_int / 3600 ))h ago (> 30h)"
GATE_K6_OK=1
fi
}
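# For reference, a minimal sketch of how the nightly workflow could publish
# that metric through the node_exporter textfile collector (the directory
# path is an assumption ; match it to the runner's collector config) :
# printf 'veza_k6_nightly_last_success_timestamp_seconds %s\n' "$(date +%s)" \
# > /var/lib/node_exporter/textfile_collector/k6_nightly.prom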
check_high_issues() {
# The operator-reported issues count lives in the SOFT_LAUNCH_BETA_2026.md
# report under "Issues reported". Without an external tracker we read it
# from a known location in the report file. Skip if file absent.
local report
report="$(cd "$(dirname "$0")/../.." && pwd)/docs/SOFT_LAUNCH_BETA_2026.md"
if [ ! -f "$report" ]; then
echo "⚪ HIGH issues count : report file not found"
GATE_ISSUES_OK=2
return
fi
local count
count=$(grep -cE '^\| HIGH ' "$report" 2>/dev/null || true)
count=${count:-0}
if [ "$count" -lt 3 ]; then
echo "✅ HIGH-severity issues reported : $count / < 3"
GATE_ISSUES_OK=0
else
echo "🔴 HIGH-severity issues reported : $count / < 3"
GATE_ISSUES_OK=1
fi
}
# ----------------------------------------------------------------------
# Main loop
# ----------------------------------------------------------------------
run_once() {
if [ "$QUIET" != "1" ]; then
echo "================================================================"
echo "Acceptance gate check — $(date -u +'%Y-%m-%d %H:%M:%S UTC')"
echo "----------------------------------------------------------------"
fi
check_testers
check_sentry_p1
check_statuspage
check_synthetic
check_k6_nightly
check_high_issues
if [ "$QUIET" != "1" ]; then
echo "----------------------------------------------------------------"
fi
local red=0 unknown=0
for v in "$GATE_TESTERS_OK" "$GATE_SENTRY_OK" "$GATE_STATUSPAGE_OK" \
"$GATE_SYNTHETIC_OK" "$GATE_K6_OK" "$GATE_ISSUES_OK"; do
case $v in
1) red=$(( red + 1 )) ;;
2) unknown=$(( unknown + 1 )) ;;
esac
done
if [ "$red" -eq 0 ] && [ "$unknown" -eq 0 ]; then
echo "VERDICT : ALL GATES GREEN — soft-launch is GO"
return 0
elif [ "$red" -gt 0 ]; then
echo "VERDICT : $red gate(s) RED — NO-GO until resolved"
return 1
else
echo "VERDICT : $unknown gate(s) UNCHECKABLE — operator must verify manually before decision call"
return 2
fi
}
if [ "$LOOP" != "1" ]; then
run_once
exit $?
fi
# Continuous mode.
while true; do
run_once || true
echo ""
echo "next check in ${CHECK_INTERVAL}s — Ctrl-C to exit"
sleep "$CHECK_INTERVAL"
done

View file

@ -0,0 +1,179 @@
#!/usr/bin/env bash
# send-invitations.sh — batch-insert beta invitations from a validated
# cohort CSV, generate unique invite codes, render personalised email
# bodies, optionally dispatch via SMTP.
#
# Wraps the validate-cohort.sh sanity check + a transactional INSERT
# into beta_invites + a per-recipient email render. Splits "generate
# the codes + render the emails" from "actually send" so a dry-run
# produces a flat directory of `.eml` files the operator can review
# before dispatch.
#
# v1.0.10 Cluster 3.4.
#
# Usage :
# # Step 1 : dry-run (default). Inserts beta_invites rows, emits
# # eml files but does NOT send anything.
# DATABASE_URL=postgres://... \
# bash scripts/soft-launch/send-invitations.sh path/to/cohort.csv
#
# # Step 2 : after reviewing the eml files, dispatch with msmtp /
# # sendmail / aws-ses-cli (or whatever SEND_CMD points at).
# SEND=1 SEND_CMD='msmtp -t' \
# bash scripts/soft-launch/send-invitations.sh path/to/cohort.csv
#
# Required env :
# DATABASE_URL Postgres URL (read+write to beta_invites)
# FRONTEND_URL base URL the invite link points at
# (e.g. https://staging.veza.fr)
#
# Optional env :
# SEND=1 actually dispatch ; otherwise dry-run (eml only)
# SEND_CMD sendmail-compatible command (default: 'msmtp -t')
# SENT_BY_EMAIL operator email for the beta_invites.sent_by FK ;
# defaults to the value in the CSV's third column
# FROM_ADDR From: header (default: invitations@veza.fr)
# SUBJECT email subject (default: 'Vous êtes admis dans la bêta Veza')
# TEMPLATE path to eml template (default:
# templates/email/beta_invite.eml.template)
# FORCE=1 skip validate-cohort.sh failures (use with care)
#
# Exit codes :
# 0 — everything succeeded
# 1 — cohort validation failed (see validate-cohort.sh)
# 2 — DB transaction failed
# 3 — required env missing
# 4 — dispatch failed for at least one recipient (see logs)
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
CSV=${1:-}
if [ -z "$CSV" ] || [ ! -f "$CSV" ]; then
echo "usage: bash scripts/soft-launch/send-invitations.sh path/to/cohort.csv" >&2
exit 3
fi
DATABASE_URL=${DATABASE_URL:-?}
FRONTEND_URL=${FRONTEND_URL:-?}
[ "$DATABASE_URL" = "?" ] && { echo "DATABASE_URL required" >&2; exit 3; }
[ "$FRONTEND_URL" = "?" ] && { echo "FRONTEND_URL required" >&2; exit 3; }
SEND=${SEND:-0}
SEND_CMD=${SEND_CMD:-msmtp -t}
FROM_ADDR=${FROM_ADDR:-invitations@veza.fr}
SUBJECT=${SUBJECT:-Vous êtes admis dans la bêta Veza}
TEMPLATE=${TEMPLATE:-$REPO_ROOT/templates/email/beta_invite.eml.template}
FORCE=${FORCE:-0}
SESSION_DATE="$(date +%Y%m%d-%H%M)"
OUTDIR="$REPO_ROOT/scripts/soft-launch/out-${SESSION_DATE}"
command -v psql >/dev/null 2>&1 || { echo "psql required" >&2; exit 3; }
command -v openssl >/dev/null 2>&1 || { echo "openssl required" >&2; exit 3; }
# Step 1 — validate the cohort. Bypass with FORCE=1 if needed.
echo "→ validating cohort $CSV"
if ! bash "$(dirname "$0")/validate-cohort.sh" "$CSV"; then
if [ "$FORCE" != "1" ]; then
echo "ERROR: cohort validation failed. Re-run with FORCE=1 to bypass (not recommended)." >&2
exit 1
fi
echo "WARN : cohort validation reported issues but FORCE=1 set — proceeding."
fi
mkdir -p "$OUTDIR"
echo "→ output dir $OUTDIR"
# Step 2 — generate codes + insert rows + render emails. Each insert
# is one transaction so a partial failure leaves consistent state.
gen_code() {
# 16-char base32-ish (no 0/1/I/L) so codes are paste-friendly.
openssl rand -hex 16 | tr 'a-f0-9' 'a-z2-9' \
| tr -d 'oilOIL01' | head -c 16
}
if [ ! -f "$TEMPLATE" ]; then
echo "ERROR: template $TEMPLATE not found." >&2
exit 3
fi
inserted=0
failed=0
failed_emails=()
while IFS=, read -r email cohort sent_by_email; do
email=$(echo "$email" | tr -d '\r' | xargs)
cohort=$(echo "$cohort" | tr -d '\r' | xargs)
sent_by_email=$(echo "$sent_by_email" | tr -d '\r' | xargs)
code=$(gen_code)
# Resolve sent_by user_id (may be NULL if operator email isn't a
# registered user — e.g. ops shared mailbox).
sent_by_id=$(psql "$DATABASE_URL" -A -t -c "
SELECT id::text FROM users WHERE email = '$sent_by_email' LIMIT 1;
" 2>/dev/null | tr -d ' ' || echo "")
if [ -z "$sent_by_id" ]; then
sent_by_clause="NULL"
else
sent_by_clause="'$sent_by_id'"
fi
if ! psql "$DATABASE_URL" -1 -c "
INSERT INTO beta_invites (code, email, cohort, sent_by, expires_at)
VALUES ('$code', '$email', '$cohort', $sent_by_clause, NOW() + INTERVAL '30 days');
" >/dev/null 2>&1; then
failed=$(( failed + 1 ))
failed_emails+=("$email")
continue
fi
inserted=$(( inserted + 1 ))
# Render the eml — operator-readable, ready for SEND_CMD.
eml="$OUTDIR/${email//[^a-zA-Z0-9._-]/_}.eml"
invite_url="$FRONTEND_URL/signup?invite=$code"
sed \
-e "s|{{TO_ADDR}}|$email|g" \
-e "s|{{FROM_ADDR}}|$FROM_ADDR|g" \
-e "s|{{SUBJECT}}|$SUBJECT|g" \
-e "s|{{INVITE_URL}}|$invite_url|g" \
-e "s|{{INVITE_CODE}}|$code|g" \
-e "s|{{COHORT}}|$cohort|g" \
-e "s|{{FRONTEND_URL}}|$FRONTEND_URL|g" \
"$TEMPLATE" > "$eml"
done < <(tail -n +2 "$CSV")
echo "→ inserted $inserted invitations into beta_invites"
echo "→ rendered $inserted emails to $OUTDIR"
[ "$failed" -gt 0 ] && {
echo "WARN : $failed insert(s) failed — see logs above"
for e in "${failed_emails[@]}"; do echo " - $e"; done
}
# Step 3 — optionally dispatch.
if [ "$SEND" != "1" ]; then
echo ""
echo "DRY-RUN — review the eml files in $OUTDIR before sending."
echo "When ready :"
echo " SEND=1 SEND_CMD='$SEND_CMD' bash scripts/soft-launch/send-invitations.sh $CSV"
exit 0
fi
echo "→ dispatching via : $SEND_CMD"
dispatch_failed=0
for eml in "$OUTDIR"/*.eml; do
if ! $SEND_CMD < "$eml" >>"$OUTDIR/dispatch.log" 2>&1; then
dispatch_failed=$(( dispatch_failed + 1 ))
echo " FAIL : $eml" | tee -a "$OUTDIR/dispatch.log"
fi
done
echo ""
if [ "$dispatch_failed" -gt 0 ]; then
echo "WARN : $dispatch_failed dispatch(es) failed — see $OUTDIR/dispatch.log"
exit 4
fi
echo "PASS : all $inserted invitations dispatched."
echo "Track redemption with :"
echo " psql \"\$DATABASE_URL\" -c 'SELECT cohort, count(*) FILTER (WHERE used_at IS NOT NULL) AS redeemed, count(*) AS total FROM beta_invites GROUP BY cohort ORDER BY cohort;'"

View file

@ -0,0 +1,173 @@
#!/usr/bin/env bash
# validate-cohort.sh — sanity-check a soft-launch beta cohort CSV
# before it gets fed to send-invitations.sh.
#
# The CSV is the operator's curated list of beta-tester emails +
# segmentation. This script catches the avoidable mistakes BEFORE
# we batch-insert 100 rows into beta_invites and start spraying
# emails :
#
# - Empty file or wrong header
# - Duplicate emails (would create 2 invites for the same person)
# - Malformed emails (missing @, leading/trailing whitespace)
# - Cohort distribution looks off (no creators, only one segment,
# under-50 total — soft-launch acceptance gate is ≥50 testers)
# - Email collisions with existing users (already registered = the
# invite code is wasted)
#
# v1.0.10 Cluster 3.4.
#
# Usage :
# bash scripts/soft-launch/validate-cohort.sh path/to/cohort.csv
#
# Optional env :
# DATABASE_URL if set, also checks for collisions with the users
# table (email already registered → flagged but not
# fatal — operator may want to invite an existing
# user back to test the new flows).
# MIN_COHORT minimum total rows required (default 50, matches the
# acceptance-gate threshold in SOFT_LAUNCH_BETA_2026.md).
# MIN_CREATORS minimum number of creator-* cohort rows (default 5).
#
# Exit codes :
# 0 — cohort valid
# 1 — cohort malformed (will block send-invitations.sh)
# 2 — cohort merely warns (size below minimum, missing collision
# check) ; operator may proceed via FORCE=1 in send-invitations.sh
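#
# Example of a full check including the users-table collision pass
# (the CSV path is illustrative) :
# DATABASE_URL=postgres://... MIN_COHORT=50 MIN_CREATORS=5 \
# bash scripts/soft-launch/validate-cohort.sh cohorts/beta-2026.csv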
set -euo pipefail
CSV=${1:-}
if [ -z "$CSV" ] || [ ! -f "$CSV" ]; then
cat >&2 <<EOF
usage : bash scripts/soft-launch/validate-cohort.sh path/to/cohort.csv
CSV format (header required) :
email,cohort,sent_by_email
alice@example.com,creator-vinyl,ops@veza.fr
bob@example.com,listener-jazz,ops@veza.fr
...
cohort labels are free-text but should follow the convention
<role>-<segment> so the post-launch attribution report groups cleanly.
EOF
exit 1
fi
MIN_COHORT=${MIN_COHORT:-50}
MIN_CREATORS=${MIN_CREATORS:-5}
# 1. Header check.
header=$(head -1 "$CSV" | tr -d '\r')
if [ "$header" != "email,cohort,sent_by_email" ]; then
echo "ERROR: header line must be exactly 'email,cohort,sent_by_email' (got: $header)" >&2
exit 1
fi
# 2. Row count + duplicates + email shape (single pass over the rows).
total=0
malformed=0
duplicates=0
declare -A seen
declare -A cohort_count
declare -a malformed_lines
while IFS=, read -r email cohort sent_by_email; do
email=$(echo "$email" | tr -d '\r' | xargs)
cohort=$(echo "$cohort" | tr -d '\r' | xargs)
total=$(( total + 1 ))
# Email shape : must contain exactly one @, no whitespace, > 5 chars.
if [[ ! "$email" =~ ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$ ]]; then
malformed=$(( malformed + 1 ))
malformed_lines+=(" line $(( total + 1 )) : invalid email '$email'")
continue
fi
# Duplicate detection.
if [ -n "${seen[$email]:-}" ]; then
duplicates=$(( duplicates + 1 ))
malformed_lines+=(" line $(( total + 1 )) : duplicate email '$email' (first seen at line ${seen[$email]})")
continue
fi
seen[$email]=$(( total + 1 ))
# Cohort tally.
cohort_count[$cohort]=$(( ${cohort_count[$cohort]:-0} + 1 ))
done < <(tail -n +2 "$CSV")
echo "----------------------------------------------------------------"
echo "Cohort validation report"
echo "----------------------------------------------------------------"
echo " CSV file : $CSV"
echo " Total rows : $total"
echo " Unique emails : ${#seen[@]}"
echo " Malformed rows : $malformed"
echo " Duplicates : $duplicates"
echo ""
echo "Distribution by cohort :"
for c in "${!cohort_count[@]}"; do
printf " %-40s %d\n" "$c" "${cohort_count[$c]}"
done | sort
echo ""
exit_code=0
# 3. Hard checks (block send).
if [ "$malformed" -gt 0 ] || [ "$duplicates" -gt 0 ]; then
echo "ERROR: $malformed malformed + $duplicates duplicate row(s) — fix before sending."
for line in "${malformed_lines[@]}"; do
echo "$line"
done
exit 1
fi
# 4. Soft checks (warn, don't block — operator decides).
if [ "$total" -lt "$MIN_COHORT" ]; then
echo "WARN : cohort has $total rows ; soft-launch acceptance gate is ≥ $MIN_COHORT."
exit_code=2
fi
creator_total=0
for c in "${!cohort_count[@]}"; do
if [[ "$c" == creator-* ]]; then
creator_total=$(( creator_total + cohort_count[$c] ))
fi
done
if [ "$creator_total" -lt "$MIN_CREATORS" ]; then
echo "WARN : only $creator_total creator-* cohort rows ; goal is ≥ $MIN_CREATORS for upload-flow coverage."
exit_code=2
fi
if [ "${#cohort_count[@]}" -lt 3 ]; then
echo "WARN : only ${#cohort_count[@]} distinct cohort labels — feedback will be narrow."
exit_code=2
fi
# 5. Optional : DATABASE_URL collision check.
if [ -n "${DATABASE_URL:-}" ]; then
command -v psql >/dev/null 2>&1 || {
echo "WARN : DATABASE_URL set but psql not on \$PATH ; skipping collision check."
exit_code=2
}
if command -v psql >/dev/null 2>&1; then
emails_csv=$(printf '%s,' "${!seen[@]}" | sed 's/,$//')
collisions=$(psql "$DATABASE_URL" -A -t -c "
SELECT count(*) FROM users WHERE email = ANY(string_to_array('$emails_csv', ','));
" 2>/dev/null | tr -d ' ' || echo "?")
if [ "$collisions" = "?" ]; then
echo "WARN : couldn't query users table (psql connection issue) ; skipping collision check."
exit_code=2
elif [ "$collisions" -gt 0 ]; then
echo "INFO : $collisions email(s) in the cohort already exist in the users table — invite codes will be wasted on existing accounts."
exit_code=2
fi
fi
fi
echo ""
case $exit_code in
0) echo "PASS : cohort valid, ready for send-invitations.sh." ;;
2) echo "WARN : cohort valid but with caveats — review and re-run with --force from send-invitations.sh if intentional." ;;
esac
exit $exit_code

View file

@ -0,0 +1,65 @@
-- 990_beta_invites.sql
-- v1.0.10 polish (Cluster 3.4) — soft-launch beta cohort tracking.
--
-- Records each individual invitation sent for the v2.0.0 soft-launch
-- bêta. Tracks (a) the invite code used in the registration link,
-- (b) when the recipient redeemed it (NULL until redemption), and
-- (c) which cohort segment (creator / listener / community-member /
-- press) the recipient belongs to so the post-launch report can
-- attribute feedback by audience.
--
-- The associated email template + send script live at
-- templates/email/beta_invite.eml.template and
-- scripts/soft-launch/send-invitations.sh ; the send script writes one
-- row per recipient with a transactional INSERT.
--
-- Privacy : the email column is the only PII here ; no behavioural
-- data is stored. used_at is the redemption signal. After v2.0.0
-- public launch, run the cleanup migration in 991 (TBD) to anonymise
-- the email column for invites that haven't been redeemed in 30+ days.
CREATE TABLE IF NOT EXISTS public.beta_invites (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- The invite code is what the recipient pastes into the signup
-- form. 16 random characters from a base32 alphabet (no 0/1/I/L
-- to avoid eyestrain). Generated by send-invitations.sh.
code VARCHAR(32) NOT NULL, -- uniqueness enforced by idx_beta_invites_code below
email VARCHAR(320) NOT NULL,
-- Free-text label so the cohort generator can carry whatever
-- segmentation the operator wants (e.g. "creator-vinyl-pressing",
-- "listener-jazz-mailing-list", "press-pitchfork"). Index below
-- is for the post-launch report grouping.
cohort VARCHAR(64) NOT NULL,
-- NULL until the recipient signs up. Set by the auth handler
-- when /auth/register is hit with a valid invite code.
used_at TIMESTAMPTZ,
-- Hard expiry so unredeemed invites can't accumulate forever.
-- Default 30 days from creation ; soft-launch is short-window.
expires_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() + INTERVAL '30 days'),
-- Operator who sent the invite — useful when reconciling "who
-- gave their friend a code" during the audit.
sent_by UUID REFERENCES public.users(id) ON DELETE SET NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
COMMENT ON TABLE public.beta_invites IS
'v2.0.0 soft-launch beta invitation tracking. v1.0.10 Cluster 3.4.';
COMMENT ON COLUMN public.beta_invites.code IS
'16-char base32 invite code (no 0/1/I/L). Pasted into signup form.';
COMMENT ON COLUMN public.beta_invites.cohort IS
'Free-text cohort label (creator-* / listener-* / press-* / etc.).';
COMMENT ON COLUMN public.beta_invites.used_at IS
'Redemption timestamp. NULL means the invite is still pending.';
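-- A sketch of the redemption update the auth handler is expected to run
-- (illustrative only ; the real handler lives in the Go backend) :
-- UPDATE public.beta_invites
-- SET used_at = NOW()
-- WHERE code = $1 AND used_at IS NULL AND expires_at > NOW();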
-- Lookup by code (signup path) — every /auth/register call reads it.
CREATE UNIQUE INDEX IF NOT EXISTS idx_beta_invites_code
ON public.beta_invites(code);
-- Cohort grouping for the post-launch attribution query.
CREATE INDEX IF NOT EXISTS idx_beta_invites_cohort
ON public.beta_invites(cohort);
-- Pending-invitations sweep — cron job that expires unused invites
-- after expires_at. Partial index keeps it small.
CREATE INDEX IF NOT EXISTS idx_beta_invites_pending_expiry
ON public.beta_invites(expires_at)
WHERE used_at IS NULL;
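-- A sketch of the sweep this partial index serves (whether the cron job
-- deletes or merely anonymises expired invites is settled in migration 991) :
-- DELETE FROM public.beta_invites
-- WHERE used_at IS NULL AND expires_at < NOW();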

View file

@ -0,0 +1,92 @@
To: {{TO_ADDR}}
From: Veza <{{FROM_ADDR}}>
Subject: {{SUBJECT}}
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="--veza-beta-boundary"
----veza-beta-boundary
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Bonjour,
Vous êtes invité·e à rejoindre la bêta privée de Veza —
une plateforme de streaming musical éthique faite pour les
créateur·ices et les auditeur·ices, sans algorithme de
recommandation comportementale, sans gamification, sans dark
patterns.
Votre code d'invitation : {{INVITE_CODE}}
Pour vous inscrire :
{{INVITE_URL}}
Le code expire dans 30 jours.
Pendant la bêta, l'idée est simple : utilisez Veza comme vous
utiliseriez n'importe quelle plateforme musicale. Uploadez,
écoutez, partagez, achetez. Quand quelque chose vous frustre
ou vous étonne — en bien comme en mal — dites-le. Le canal
de retour vous sera communiqué après l'inscription.
Cohorte : {{COHORT}}
(C'est juste un tag interne pour qu'on regroupe les retours
par contexte d'usage. Ça n'affecte ni votre expérience ni vos
permissions.)
À très vite,
L'équipe Veza
--
Si vous n'avez pas demandé cette invitation, ignorez ce
message. Le code expirera automatiquement après 30 jours.
----veza-beta-boundary
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: 8bit
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Invitation à la bêta Veza</title>
</head>
<body style="font-family: Georgia, 'Times New Roman', serif; line-height: 1.6; color: #1a1a1e; margin: 0; padding: 0; background-color: #f8f7f4;">
<div style="max-width: 600px; margin: 20px auto; padding: 30px; background-color: #ffffff; border: 1px solid #e8e6e0;">
<h1 style="font-weight: 400; color: #1a1a1e; margin-top: 0; font-size: 28px;">Bienvenue dans la bêta Veza.</h1>
<p>Bonjour,</p>
<p>Vous êtes invité·e à rejoindre la <strong>bêta privée</strong> de Veza — une plateforme de streaming musical éthique faite pour les créateur·ices et les auditeur·ices, sans algorithme de recommandation comportementale, sans gamification, sans dark patterns.</p>
<div style="text-align: center; margin: 35px 0;">
<a href="{{INVITE_URL}}" style="background-color: #1a1a1e; color: #f8f7f4; padding: 14px 32px; text-decoration: none; display: inline-block; font-weight: 400; letter-spacing: 0.05em;">
Activer mon invitation
</a>
</div>
<p style="color: #555; font-size: 14px;">Ou collez ce lien dans votre navigateur :</p>
<p style="word-break: break-all; color: #888; background-color: #f8f7f4; padding: 10px; font-family: 'Courier New', monospace; font-size: 12px; border-left: 2px solid #d4a574;">{{INVITE_URL}}</p>
<p style="color: #555; font-size: 14px; margin-top: 25px;">Code d'invitation :</p>
<p style="font-family: 'Courier New', monospace; font-size: 18px; letter-spacing: 0.1em; background-color: #f8f7f4; padding: 12px; text-align: center; color: #1a1a1e;">{{INVITE_CODE}}</p>
<hr style="border: none; border-top: 1px solid #e8e6e0; margin: 30px 0;">
<p style="font-size: 14px; color: #555;">Pendant la bêta, l'idée est simple : utilisez Veza comme vous utiliseriez n'importe quelle plateforme musicale. Uploadez, écoutez, partagez, achetez. Quand quelque chose vous frustre ou vous étonne — en bien comme en mal — dites-le. Le canal de retour vous sera communiqué après l'inscription.</p>
<p style="font-size: 13px; color: #888; margin-top: 25px;">Cohorte : <strong>{{COHORT}}</strong> — c'est juste un tag interne pour qu'on regroupe les retours par contexte d'usage.</p>
<p style="margin-top: 30px; color: #888; font-size: 12px;">
Le code expire dans 30 jours. Si vous n'avez pas demandé cette invitation, ignorez ce message.
</p>
<hr style="border: none; border-top: 1px solid #e8e6e0; margin: 25px 0;">
<p style="color: #aaa; font-size: 11px; text-align: center; font-family: 'Courier New', monospace; letter-spacing: 0.1em;">
VEZA · v2.0.0 BETA · {{FRONTEND_URL}}
</p>
</div>
</body>
</html>
----veza-beta-boundary--