Compare commits
2 commits
594204fb86
...
8fa4b75387
| Author | SHA1 | Date |
|---|---|---|
| | `8fa4b75387` | |
| | `f9d00bbe4d` | |
6 changed files with 410 additions and 191 deletions
149
docs/PENTEST_SCOPE_2026.md
Normal file
@@ -0,0 +1,149 @@

> **Engagement period**: v1.0.9 W5-W6 (per `docs/ROADMAP_V1.0_LAUNCH.md` §Day 25). Async work, ~10 business days.
> **Authorisation**: signed scope letter + NDA on file (see "Legal context" below).
> **Re-test**: one re-test included after the team's fix pass.
> **Contact**: `security@veza.fr`; PGP key fingerprint published at `https://veza.fr/.well-known/security.txt`.

This brief is the technical hand-off for the external pentest team. It complements the contractual scope letter; the contract governs commercial terms, this document governs the technical surface.

## Engagement summary

**Target**: Veza, an ethical music streaming platform. The backend is Go 1.25 + Gin + GORM; streaming is Rust + Axum; the frontend is React 18 + Vite. Infrastructure is Incus (LXD) on a single self-hosted R720 in v1.0, moving to a multi-host Hetzner topology in v1.1.

**Version under test**: v1.0.9 (release candidate for the v2.0.0 public launch). The commit SHA is pinned at `<TBD-at-engagement-start>`; the staging environment freezes at this SHA for the engagement.

**Goals**:

1. Find what the internal pre-flight audit (`docs/SECURITY_PRELAUNCH_AUDIT.md`, W5 Day 21) missed — focus on business-logic abuse paths the automated scanners can't model.
2. Validate the v1.0.9 surface added since the last review: DMCA workflow, marketplace pre-listen, embed widget, WebRTC ICE config, faceted search.
3. Assess the multi-tenant invariants (creator vs. listener vs. admin) under malicious user input.

## In-scope assets

| Asset | Endpoint / surface | Notes |
| --- | --- | --- |
| **Backend API** | `https://staging.veza.fr/api/v1/*` | All v1.0.9 endpoints + the OpenAPI spec at `/swagger` |
| **Stream server** | `https://staging.veza.fr/api/v1/tracks/*/hls/*` | HLS-only — RTMP ingest is out of scope (v1.1) |
| **Embed widget** | `https://staging.veza.fr/embed/track/:id` | Public iframable HTML, OG tags |
| **oEmbed** | `https://staging.veza.fr/oembed` | JSON envelope |
| **Status / health** | `https://staging.veza.fr/api/v1/status`, `/health` | Public; intentional disclosure |
| **Frontend SPA** | `https://staging.veza.fr/` | React 18 + Vite; sourcemaps available on staging |
| **WebSocket (chat / live)** | `wss://staging.veza.fr/api/v1/ws` | Protocol described in `docs/api/websocket.md` |
| **Marketplace** | `/api/v1/marketplace/{products,orders,licenses,reviews}` | Hyperswitch sandbox, no real card processing |
| **DMCA workflow** | `POST /api/v1/dmca/notice` + admin queue | Sworn-statement validation, audit log, takedown gate |

## Out of scope

- **Production** (`api.veza.fr`, `app.veza.fr`). Testing against production is not authorised — every test runs against staging.
- **Third-party services we don't operate**: Hyperswitch live mode, Bunny.net edges, Sentry, Forgejo. Their security posture is the providers' responsibility.
- **Denial-of-service testing** above the rate-limiter quotas. The platform's rate-limit middleware is in scope; sustained flooding to deplete bandwidth is not.
- **Social engineering against Veza staff.** Phishing simulations require a separate engagement with prior written authorisation.
- **Physical / wireless attacks** against the R720 lab.
- **Source-code modification**: the engagement is grey-box (source available read-only at `https://10.0.20.105:3000/senke/veza` once the pentester's IP is allow-listed), but findings must be reproducible against staging without local patches.

## Authentication context

Three test accounts are pre-seeded on staging:

| Role | Email | Password | Notes |
| --- | --- | --- | --- |
| Listener | `pentest-listener@…` | `<delivered out-of-band>` | role=user, no 2FA, fully verified |
| Creator | `pentest-creator@…` | `<delivered out-of-band>` | role=creator, owns 5 seed tracks |
| Admin | `pentest-admin@…` | `<delivered out-of-band>` | role=admin + MFA bypass token |

Bearer tokens for synthetic-client-style testing are obtainable from `/api/v1/auth/login`. All passwords are randomised per engagement and rotated immediately after the engagement ends.

## High-priority focus areas

We're particularly interested in the following surfaces (in order of impact). The internal audit cleared the trivial OWASP Top 10 hits; here we want creative attacks.

### 1. Authentication + session lifecycle

- JWT key rotation: staging uses RS256 with `JWT_PRIVATE_KEY_PATH`. Can the public key be inferred from misconfigured JWKS-style endpoints?
- 2FA bypass: the login flow returns `requires_2fa=true` on partial auth. Is there a state-machine flaw between partial auth and full auth?
- Refresh-token replay after logout: the revocation list is Redis-backed. What happens if Redis is partitioned?
- Session fixation via the OAuth callback: `OAUTH_ALLOWED_REDIRECT_DOMAINS` is an allow-list — does the validation hold for IDN homograph URLs?
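
The homograph question reduces to how the redirect host is compared against the allow-list. A minimal Go sketch of the two comparison styles (the function names and the `evilveza.fr` domain are hypothetical illustrations, not Veza's actual validator):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// naiveAllowed: suffix match on the raw host (bypassable by domain grafting).
func naiveAllowed(raw string, allowed []string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	for _, d := range allowed {
		if strings.HasSuffix(u.Host, d) {
			return true
		}
	}
	return false
}

// strictAllowed: byte-exact match on the lowercased hostname. A Cyrillic
// homograph is never byte-equal to the ASCII allow-list entry, so it fails.
func strictAllowed(raw string, allowed []string) bool {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme != "https" {
		return false
	}
	host := strings.ToLower(u.Hostname())
	for _, d := range allowed {
		if host == d {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"veza.fr"}
	// attacker-controlled domain that merely *ends with* the allowed string
	fmt.Println(naiveAllowed("https://evilveza.fr/cb", allowed))  // true: bypass
	fmt.Println(strictAllowed("https://evilveza.fr/cb", allowed)) // false
	// Cyrillic 'а' homograph: not byte-equal to the ASCII entry
	fmt.Println(strictAllowed("https://vezа.fr/cb", allowed)) // false
}
```

Exact comparison of `url.Hostname()` against allow-list entries defeats both suffix-grafted domains and homographs; suffix or display-form matching is where homographs bite.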

### 2. Payment / marketplace

- Order tampering: the `POST /api/v1/marketplace/orders` body contains product IDs + quantity. Can a buyer craft an order at an arbitrary price? (Roadmap subscription Phase 2 + 3 hardening is done, but the order flow predates that work.)
- Webhook signature replay: `POST /webhooks/hyperswitch` validates a signature. Does the implementation check timestamps, or only the HMAC?
- Refund window race: `RefundDeadline` is set to `+14d` on order completion. Is there an exploitable check-then-act race when the buyer initiates a refund at exactly `14d - 1ms`?
- Pre-listen abuse: `?preview=30` is anonymous-OK when `products.preview_enabled=true`. The 30 s cap is **client-side** (HTML5 audio `currentTime`); can an attacker grab the full audio via byte-range requests despite the gate? (The trust model is documented as "tease-to-buy, not anti-rip", but we want to know how leaky it is in practice.)

### 3. DMCA workflow

- Notice forgery: `POST /api/v1/dmca/notice` is public + rate-limited. Can the rate limit be bypassed via header rotation, `X-Forwarded-For` spoofing, or IPv6 prefix walking?
- Sworn-statement bypass: the `sworn_statement: true` field is trusted. Can a malformed JSON body land a notice with `sworn_statement` absent (Go's zero value)?
- Admin takedown enumeration: `GET /api/v1/admin/dmca/notices` returns paginated pending notices. Does the offset + limit handling leak another tenant's claimant data?

### 4. Upload + transcoder pipeline

- Chunked-upload state pollution: `POST /api/v1/tracks/upload/initiate` allocates an `upload_id`. Can two users with the same `upload_id` collide on the chunked-state Redis keys?
- File-type confusion via `Content-Type`: the upload validator checks magic bytes. Are there codec-level flaws (e.g. a malformed FLAC header that crashes the transcoder)?
- HLS segment poisoning: the streamer caches segments by `track_id`. Can a crafted upload pollute another track's cache via path traversal in the segment filename?
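
For the segment-poisoning bullet, the core question is whether an attacker-supplied segment name can escape the per-track cache directory. A Go sketch of the naive join versus a contained join (the cache path and function names are illustrative only; the streamer itself is Rust, this just demonstrates the class of bug):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// naiveSegmentPath joins the attacker-supplied name straight under the track
// dir. filepath.Join *cleans* "..", so the result silently lands elsewhere.
func naiveSegmentPath(trackID, segment string) string {
	return filepath.Join("/var/cache/hls", trackID, segment)
}

// safeSegmentPath rejects anything that escapes the per-track directory
// after cleaning, instead of trusting the raw name.
func safeSegmentPath(trackID, segment string) (string, error) {
	base := filepath.Join("/var/cache/hls", trackID)
	p := filepath.Join(base, segment)
	if p != base && !strings.HasPrefix(p, base+string(filepath.Separator)) {
		return "", fmt.Errorf("segment name escapes track directory: %q", segment)
	}
	return p, nil
}

func main() {
	// the naive path resolves into another track's cache
	fmt.Println(naiveSegmentPath("track-a", "../track-b/seg0.ts"))
	if _, err := safeSegmentPath("track-a", "../track-b/seg0.ts"); err != nil {
		fmt.Println("rejected")
	}
}
```

The probe, then, is to put `../<victim-track>/...` (and encoded variants) into any field that feeds the segment filename and watch which cache entry gets written.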

### 5. WebRTC ICE config + embed

- The `/api/v1/config/webrtc` endpoint is intentionally public per `SECURITY_PRELAUNCH_AUDIT.md`. We want a second opinion on whether the short-lived TURN credentials are short-lived enough.
- Embed iframe XSS: `/embed/track/:id` interpolates `track.title` + `track.artist` into the HTML body + OG tags via `html.EscapeString`. Try crafted Unicode + HTML-entity edge cases (e.g. surrogates, RTLO, byte-order marks).
- oEmbed URL injection: `?url=` is parsed for `/tracks/<uuid>`. Is there a way to redirect the iframe to an attacker-controlled domain via malformed input?

### 6. Faceted search + share tokens

- SQL injection via the search facets: `genre` and `musical_key` are bounded by length and passed as parameterised values. Verify the parameterisation holds end-to-end.
- Share-token enumeration: the W5 Day 21 audit unified error responses to a single 403. Cross-check that there are no remaining timing oracles (DB latency vs. cache hit, Redis vs. Postgres-only paths).
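
One timing oracle the audit may not have covered is the token comparison itself: a short-circuiting equality check leaks the position of the first differing byte. A Go sketch of the constant-time alternative (`tokenMatches` is illustrative, not Veza's handler, and this addresses only the compare step, not the Redis-vs-Postgres latency split called out above):

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// tokenMatches compares in time independent of where the first differing
// byte sits, so an attacker cannot binary-search a token byte-by-byte from
// response latency. (== and bytes.Equal may short-circuit.)
func tokenMatches(presented, stored []byte) bool {
	if len(presented) != len(stored) {
		return false // length is assumed public (fixed token format)
	}
	return subtle.ConstantTimeCompare(presented, stored) == 1
}

func main() {
	stored := []byte("tok_8f3a91c2")
	fmt.Println(tokenMatches([]byte("tok_8f3a91c2"), stored)) // true
	fmt.Println(tokenMatches([]byte("tok_00000000"), stored)) // false
}
```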

## Internal audit — already fixed (skip these)

The W5 Day 21 audit already addressed the items below. They're listed so the external team doesn't waste time re-reporting them.

| Finding | Resolution | Commit ref |
| --- | --- | --- |
| Share-token enumeration via 404 vs 403 split | Unified to 403 + generic message in the track_hls + track_social handlers | v1.0.9 W5 Day 21 |
| XSS via track metadata in embed widget | `html.EscapeString` wraps every HTML interpolation | v1.0.9 W3 Day 15 |
| DMCA workflow XSS via `work_description` | Storage parameterised, render is React-escaped | (audit, no code change) |
| `/config/webrtc` disclosure | Accepted by design; short-lived TURN credentials | (audit, accepted) |

## Reporting protocol

- **Severity scale**: CVSS 3.1. Critical (9.0+), High (7.0–8.9), Medium (4.0–6.9), Low (0.1–3.9), Informational.
- **Reporting cadence**: ad hoc for Critical/High (within 4 business hours of confirmation), batched daily for Medium and below.
- **Channel**: encrypted email to `security@veza.fr`. PGP key at `https://veza.fr/.well-known/security.txt`. For Critical findings, also use the Signal contact in the engagement letter.
- **Format**: per finding — title, severity, CVSS vector, reproduction steps (curl / browser-side script), proof of exploitation, recommended remediation, affected component(s).
- **Status calls**: weekly 30-minute check-in (calendar invite from `security@veza.fr`).
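
The severity buckets above map mechanically from the CVSS 3.1 base score; a small Go helper makes the boundaries explicit (illustrative only, the scale itself is the contract):

```go
package main

import "fmt"

// severity maps a CVSS 3.1 base score to the engagement's reporting buckets.
func severity(score float64) string {
	switch {
	case score >= 9.0:
		return "Critical"
	case score >= 7.0:
		return "High"
	case score >= 4.0:
		return "Medium"
	case score >= 0.1:
		return "Low"
	default:
		return "Informational"
	}
}

func main() {
	fmt.Println(severity(9.8)) // Critical
	fmt.Println(severity(6.5)) // Medium
	fmt.Println(severity(0.0)) // Informational
}
```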

## Re-test

The engagement includes one re-test. After the team confirms remediation of all High+ findings, the pentester verifies each fix in the same environment and signs off on the report.

## Legal context

- Authorisation letter on file: signed by `<CEO name>` for Veza, signed by `<lead pentester>` for the firm. Effective `<start date>` to `<end date + 30 d for re-test>`.
- The NDA covers everything observed during the engagement, including findings, source code, internal architecture, and runbooks.
- Logs: Veza retains all server-side logs for 30 d post-engagement so the team can reconstruct any reported finding without relying on the pentester's local notes.
- Incident-response coordination: if the pentester believes they've triggered a real incident (e.g. accidentally took staging down beyond the agreed scope), they ping `security@veza.fr` immediately; we coordinate a controlled rollback per the canary release runbook (`docs/CANARY_RELEASE.md`).

## What we'll do with the report

- **Critical / High**: fixed before the v2.0.0 public launch. The launch GO/NO-GO checklist (W6 Day 26) blocks on these.
- **Medium**: fixed in v2.0.x patch releases.
- **Low / Info**: tracked in the `docs/SECURITY_PRELAUNCH_AUDIT.md` follow-up table for the next review cycle.
- **Public credit**: the firm's name in `docs/SECURITY_ACKNOWLEDGEMENTS.md` (with prior consent) once the report is delivered and remediation is shipped.

## Files for the pentester's first day

- `docs/ROADMAP_V1.0_LAUNCH.md` — what shipped in v1.0.9 + the launch acceptance bar.
- `docs/SECURITY_PRELAUNCH_AUDIT.md` — internal audit findings + resolutions (skip these in the external report).
- `docs/api/` — OpenAPI / Swagger generated from the live source; `https://staging.veza.fr/swagger` mirrors it.
- `docs/CANARY_RELEASE.md` — how the team rolls fixes during the engagement (so the pentester can predict re-test windows).
- `infra/ansible/` — read-only via the Forgejo allow-list; gives architectural context.

## Acceptance gate (Day 25 internal milestone)

- [ ] Pentester briefed (this doc + scope letter handed off)
- [ ] Staging access provisioned + test accounts delivered out-of-band
- [ ] Source-code repo allow-list includes the pentester's static IP
- [ ] Initial check-in scheduled
- [ ] Internal audit findings (W5 Day 21) confirmed fixed in the staging build the pentester is testing
@@ -28,33 +28,66 @@ all:
    ansible_connection: community.general.incus
    ansible_python_interpreter: /usr/bin/python3
    veza_app_backend:
      children:
        veza_app_backend_blue:
        veza_app_backend_green:
        veza_app_backend_tools:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    veza_app_backend_blue:
      hosts:
        veza-backend-blue:
    veza_app_backend_green:
      hosts:
        veza-backend-green:
    veza_app_backend_tools:
      hosts:
        veza-backend-tools:  # ephemeral, Phase A only
    veza_app_stream:
      children:
        veza_app_stream_blue:
        veza_app_stream_green:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    veza_app_stream_blue:
      hosts:
        veza-stream-blue:
    veza_app_stream_green:
      hosts:
        veza-stream-green:
    veza_app_web:
      children:
        veza_app_web_blue:
        veza_app_web_green:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    veza_app_web_blue:
      hosts:
        veza-web-blue:
    veza_app_web_green:
      hosts:
        veza-web-green:
    veza_data:
      children:
        veza_data_postgres:
        veza_data_redis:
        veza_data_rabbitmq:
        veza_data_minio:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    veza_data_postgres:
      hosts:
        veza-postgres:
    veza_data_redis:
      hosts:
        veza-redis:
    veza_data_rabbitmq:
      hosts:
        veza-rabbitmq:
    veza_data_minio:
      hosts:
        veza-minio:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
@@ -48,35 +48,68 @@ all:
    # container's /var/lib/veza/active-color file; both blue and
    # green sit in inventory so either color is reachable when needed.
    veza_app_backend:
      children:
        veza_app_backend_blue:
        veza_app_backend_green:
        veza_app_backend_tools:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    veza_app_backend_blue:
      hosts:
        veza-staging-backend-blue:
    veza_app_backend_green:
      hosts:
        veza-staging-backend-green:
    veza_app_backend_tools:
      hosts:
        veza-staging-backend-tools:  # ephemeral, Phase A only
    veza_app_stream:
      children:
        veza_app_stream_blue:
        veza_app_stream_green:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    veza_app_stream_blue:
      hosts:
        veza-staging-stream-blue:
    veza_app_stream_green:
      hosts:
        veza-staging-stream-green:
    veza_app_web:
      children:
        veza_app_web_blue:
        veza_app_web_green:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    veza_app_web_blue:
      hosts:
        veza-staging-web-blue:
    veza_app_web_green:
      hosts:
        veza-staging-web-green:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    # Data tier — never destroyed, only created if absent. ZFS
    # snapshots taken on every deploy as the safety net.
    veza_data:
      children:
        veza_data_postgres:
        veza_data_redis:
        veza_data_rabbitmq:
        veza_data_minio:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    veza_data_postgres:
      hosts:
        veza-staging-postgres:
    veza_data_redis:
      hosts:
        veza-staging-redis:
    veza_data_rabbitmq:
      hosts:
        veza-staging-rabbitmq:
    veza_data_minio:
      hosts:
        veza-staging-minio:
@@ -62,14 +62,9 @@
  tags: [phaseA]

- name: Phase A — install backend artifact + run migrate_tool inside tools
  hosts: "{{ veza_container_prefix + 'backend-tools' }}"
  hosts: veza_app_backend_tools
  become: true
  gather_facts: false
  vars:
    ansible_connection: community.general.incus
    ansible_python_interpreter: /usr/bin/python3
    veza_component: backend
    veza_target_color: tools  # not blue/green — bypass color logic in name
  tasks:
    - name: Apt deps for tools container
      ansible.builtin.apt:
@@ -125,13 +120,10 @@
# =====================================================================
# Phase B — Determine inactive color
# =====================================================================
- name: Phase B — read active color, compute inactive_color
  hosts: "{{ veza_container_prefix + 'haproxy' }}"
- name: Phase B — read active color, compute inactive_color, populate dynamic groups
  hosts: haproxy
  become: true
  gather_facts: false
  vars:
    ansible_connection: community.general.incus
    ansible_python_interpreter: /usr/bin/python3
  tasks:
    - name: Read currently-active color
      ansible.builtin.slurp:
@@ -157,6 +149,41 @@
        Deploying SHA {{ veza_release_sha[:12] }} to color
        {{ inactive_color }} (currently active: {{ prior_active_color }}).

    # Use add_host to dynamically populate phase_c_<component> groups
    # with the correct inactive-color hostnames. Subsequent plays
    # target these dynamic groups by static name — Ansible's host
    # parser doesn't see {{ }} so this avoids the var-undefined-at-
    # parse-time issue.
    - name: Stage inactive-color backend in phase_c_backend group
      ansible.builtin.add_host:
        name: "{{ veza_container_prefix }}backend-{{ inactive_color }}"
        groups: phase_c_backend
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
        veza_component: backend
        veza_target_color: "{{ inactive_color }}"
      changed_when: false

    - name: Stage inactive-color stream in phase_c_stream group
      ansible.builtin.add_host:
        name: "{{ veza_container_prefix }}stream-{{ inactive_color }}"
        groups: phase_c_stream
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
        veza_component: stream
        veza_target_color: "{{ inactive_color }}"
      changed_when: false

    - name: Stage inactive-color web in phase_c_web group
      ansible.builtin.add_host:
        name: "{{ veza_container_prefix }}web-{{ inactive_color }}"
        groups: phase_c_web
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
        veza_component: web
        veza_target_color: "{{ inactive_color }}"
      changed_when: false

# =====================================================================
# Phase C — destroy + relaunch the three app containers in inactive_color
# =====================================================================
@@ -165,28 +192,23 @@
  become: true
  gather_facts: false
  vars:
    inactive_color: "{{ hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
    inactive_color: "{{ hostvars[groups['haproxy'][0]]['inactive_color'] }}"
  tasks:
    - name: Destroy + launch each component container
      ansible.builtin.shell: |
        set -e
        CT="{{ veza_container_prefix }}{{ item }}-{{ inactive_color }}"
        # Force-delete is fine — these are stateless app containers; the
        # active color is untouched.
        incus delete --force "$CT" 2>/dev/null || true
        incus launch {{ veza_app_base_image }} "$CT" \
          --profile veza-app \
          --profile veza-net \
          --network "{{ veza_incus_network }}"
        for i in $(seq 1 {{ veza_app_container_ready_timeout | default(30) }}); do
          if incus exec "$CT" -- /bin/true 2>/dev/null; then
            exit 0
          fi
          sleep 1
        done
        echo "Container $CT did not become ready"
        exit 1
      args:
      ansible.builtin.shell:
        cmd: |
          set -e
          CT="{{ veza_container_prefix }}{{ item }}-{{ inactive_color }}"
          incus delete --force "$CT" 2>/dev/null || true
          incus launch "{{ veza_app_base_image }}" "$CT" --profile veza-app --profile veza-net --network "{{ veza_incus_network }}"
          for i in $(seq 1 {{ veza_app_container_ready_timeout | default(30) }}); do
            if incus exec "$CT" -- /bin/true 2>/dev/null; then
              exit 0
            fi
            sleep 1
          done
          echo "Container $CT did not become ready"
          exit 1
        executable: /bin/bash
      loop:
        - backend
@@ -200,40 +222,25 @@
  tags: [phaseC]

- name: Phase C — provision backend (inactive color) via veza_app role
  hosts: "{{ veza_container_prefix + 'backend-' + hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
  hosts: phase_c_backend
  become: true
  gather_facts: false
  vars:
    ansible_connection: community.general.incus
    ansible_python_interpreter: /usr/bin/python3
    veza_component: backend
    veza_target_color: "{{ hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
  roles:
    - veza_app
  tags: [phaseC, backend]

- name: Phase C — provision stream (inactive color)
  hosts: "{{ veza_container_prefix + 'stream-' + hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
  hosts: phase_c_stream
  become: true
  gather_facts: false
  vars:
    ansible_connection: community.general.incus
    ansible_python_interpreter: /usr/bin/python3
    veza_component: stream
    veza_target_color: "{{ hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
  roles:
    - veza_app
  tags: [phaseC, stream]

- name: Phase C — provision web (inactive color)
  hosts: "{{ veza_container_prefix + 'web-' + hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
  hosts: phase_c_web
  become: true
  gather_facts: false
  vars:
    ansible_connection: community.general.incus
    ansible_python_interpreter: /usr/bin/python3
    veza_component: web
    veza_target_color: "{{ hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
  roles:
    - veza_app
  tags: [phaseC, web]
@@ -244,12 +251,9 @@
# is up locally but unreachable via Incus DNS.
# =====================================================================
- name: Phase D — probe each component via Incus DNS (cross-container)
  hosts: "{{ veza_container_prefix + 'haproxy' }}"
  hosts: haproxy
  become: true
  gather_facts: false
  vars:
    ansible_connection: community.general.incus
    ansible_python_interpreter: /usr/bin/python3
  tasks:
    - name: Curl each component's health endpoint
      ansible.builtin.uri:
@@ -274,12 +278,10 @@
# cfg on failure.
# =====================================================================
- name: Phase E — switch HAProxy to the new color
  hosts: "{{ veza_container_prefix + 'haproxy' }}"
  hosts: haproxy
  become: true
  gather_facts: true  # roles/veza_haproxy_switch wants ansible_date_time
  vars:
    ansible_connection: community.general.incus
    ansible_python_interpreter: /usr/bin/python3
    veza_active_color: "{{ inactive_color }}"  # the color we ARE switching TO
  roles:
    - veza_haproxy_switch
@@ -295,61 +297,71 @@
  become: true
  gather_facts: true
  vars:
    inactive_color: "{{ hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
    prior_active_color: "{{ hostvars[veza_container_prefix + 'haproxy']['prior_active_color'] }}"
    inactive_color: "{{ hostvars[groups['haproxy'][0]]['inactive_color'] }}"
    prior_active_color: "{{ hostvars[groups['haproxy'][0]]['prior_active_color'] }}"
  tasks:
    - name: Curl public health endpoint via HAProxy
      ansible.builtin.uri:
        url: "{{ veza_public_url }}/api/v1/health"
        method: GET
        status_code: [200]
        timeout: 10
        validate_certs: "{{ veza_public_url.startswith('https://') }}"
      register: public_health
      retries: 10
      delay: 3
      until: public_health.status == 200
      tags: [phaseF, verify]
    # Block/rescue at TASK level — Ansible doesn't accept rescue at play
    # level. Both the success path (verify + record) and the rescue path
    # (record failure + revert HAProxy + fail) live inside this block.
    - name: Verify externally and record state, with rollback-on-failure
      block:
        - name: Curl public health endpoint via HAProxy
          ansible.builtin.uri:
            url: "{{ veza_public_url }}/api/v1/health"
            method: GET
            status_code: [200]
            timeout: 10
            validate_certs: "{{ veza_public_url.startswith('https://') }}"
          register: public_health
          retries: 10
          delay: 3
          until: public_health.status == 200
          tags: [phaseF, verify]

        - name: Write deploy-state.json (consumed by node-exporter textfile)
          ansible.builtin.copy:
            dest: /var/lib/node_exporter/textfile_collector/veza_deploy.prom
            content: |
              # HELP veza_deploy_active_color 0=blue, 1=green.
              # TYPE veza_deploy_active_color gauge
              veza_deploy_active_color{env="{{ veza_env }}"} {{ 0 if inactive_color == 'blue' else 1 }}
              # HELP veza_deploy_release_sha info metric, label=sha.
              # TYPE veza_deploy_release_sha gauge
              veza_deploy_release_sha{env="{{ veza_env }}",sha="{{ veza_release_sha }}",color="{{ inactive_color }}"} 1
              # HELP veza_deploy_last_success_timestamp unix epoch of last successful deploy.
              # TYPE veza_deploy_last_success_timestamp gauge
              veza_deploy_last_success_timestamp{env="{{ veza_env }}"} {{ ansible_date_time.epoch }}
            mode: "0644"
          tags: [phaseF, metrics]
      rescue:
        - name: Public health failed — record the failure timestamp
          ansible.builtin.copy:
            dest: /var/lib/node_exporter/textfile_collector/veza_deploy.prom
            content: |
              # HELP veza_deploy_last_failure_timestamp unix epoch of last failed deploy.
              # TYPE veza_deploy_last_failure_timestamp gauge
              veza_deploy_last_failure_timestamp{env="{{ veza_env }}",sha="{{ veza_release_sha }}",color="{{ inactive_color }}"} {{ ansible_date_time.epoch }}
            mode: "0644"
          failed_when: false

        - name: Re-switch HAProxy back to the prior color
          ansible.builtin.import_role:
            name: veza_haproxy_switch
          vars:
            veza_active_color: "{{ prior_active_color }}"
          delegate_to: "{{ veza_container_prefix + 'haproxy' }}"
        - name: Re-switch HAProxy back to the prior color (delegated)
          delegate_to: "{{ groups['haproxy'][0] }}"
          vars:
            ansible_connection: community.general.incus
            ansible_python_interpreter: /usr/bin/python3
          block:
            - name: Apply veza_haproxy_switch with prior_active_color
              ansible.builtin.include_role:
                name: veza_haproxy_switch
              vars:
                veza_active_color: "{{ prior_active_color }}"

        - name: Fail the playbook
          ansible.builtin.fail:
            msg: >-
              Public health probe via HAProxy failed after deploy of SHA
              {{ veza_release_sha[:12] }} to color {{ inactive_color }}.
              HAProxy reverted to the prior color ({{ prior_active_color }}).
              The freshly-deployed {{ inactive_color }} containers are kept
              alive for forensics — inspect with:
              incus exec {{ veza_container_prefix }}backend-{{ inactive_color }} -- journalctl -u veza-backend -n 200
@@ -112,28 +112,23 @@
  gather_facts: false
  tasks:
    - name: Launch container if absent
      ansible.builtin.shell: |
        set -e
        if incus info "{{ item.name }}" >/dev/null 2>&1; then
          echo "{{ item.name }} already exists"
          exit 0
        fi
        incus launch {{ veza_app_base_image }} "{{ item.name }}" \
          --profile veza-data \
          --profile veza-net \
          --network "{{ veza_incus_network }}"
        # Wait for the container's API to respond before any subsequent task
        # (apt, systemd) hits a half-up container.
        for i in $(seq 1 {{ veza_app_container_ready_timeout | default(30) }}); do
          if incus exec "{{ item.name }}" -- /bin/true 2>/dev/null; then
            echo "Container {{ item.name }} ready"
            exit 0
          fi
          sleep 1
        done
        echo "Container {{ item.name }} did not become ready within timeout"
        exit 1
      args:
      ansible.builtin.shell:
        cmd: |
          set -e
          if incus info "{{ item.name }}" >/dev/null 2>&1; then
            echo "{{ item.name }} already exists"
            exit 0
          fi
          incus launch "{{ veza_app_base_image }}" "{{ item.name }}" --profile veza-data --profile veza-net --network "{{ veza_incus_network }}"
          for i in $(seq 1 {{ veza_app_container_ready_timeout | default(30) }}); do
            if incus exec "{{ item.name }}" -- /bin/true 2>/dev/null; then
              echo "Container {{ item.name }} ready"
              exit 0
            fi
            sleep 1
          done
          echo "Container {{ item.name }} did not become ready within timeout"
          exit 1
        executable: /bin/bash
      loop: "{{ veza_data_containers }}"
      register: launch_result
|
||||
|
|
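The launch task above polls `incus exec … /bin/true` once per second until the container answers or the timeout expires. The same retry shape, detached from Incus so it can run anywhere, can be sketched as follows (the `wait_ready` helper name is ours, not part of the playbook):

```shell
#!/bin/sh
# wait_ready CMD [TIMEOUT] — run CMD once per second until it succeeds,
# mirroring the readiness loop in the launch task. Returns 1 on timeout.
wait_ready() {
  cmd="$1"
  timeout="${2:-30}"
  i=1
  while [ "$i" -le "$timeout" ]; do
    if sh -c "$cmd" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# Hypothetical usage against a freshly launched container:
# wait_ready 'incus exec veza-backend-blue -- /bin/true' 30
```

Polling `/bin/true` inside the container is a cheap proxy for "the init system answers exec requests", which is exactly what the subsequent apt/systemd tasks need.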
@@ -150,7 +145,7 @@
   # tasks/<kind>.yml or role.
   # -----------------------------------------------------------------------
 - name: Configure postgres
-  hosts: "{{ veza_container_prefix + 'postgres' }}"
+  hosts: veza_data_postgres
   become: true
   gather_facts: false
   vars:

@@ -198,7 +193,7 @@
   tags: [data, postgres]

 - name: Configure redis
-  hosts: "{{ veza_container_prefix + 'redis' }}"
+  hosts: veza_data_redis
   become: true
   gather_facts: false
   vars:

@@ -250,7 +245,7 @@
   tags: [data, redis]

 - name: Configure rabbitmq
-  hosts: "{{ veza_container_prefix + 'rabbitmq' }}"
+  hosts: veza_data_rabbitmq
   become: true
   gather_facts: false
   vars:

@@ -295,7 +290,7 @@
   tags: [data, rabbitmq]

 - name: Configure minio
-  hosts: "{{ veza_container_prefix + 'minio' }}"
+  hosts: veza_data_minio
   become: true
   gather_facts: false
   vars:
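Throughout these playbooks the blue/green pair is driven by variables such as `veza_active_color`, `inactive_color` and `prior_active_color`. A minimal sketch of the complement computation they imply (the `other_color` helper name is ours, not a function from the repo):

```shell
#!/bin/sh
# other_color COLOR — the blue/green complement: the side a deploy targets
# while the named color keeps serving traffic behind HAProxy.
other_color() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown color: $1" >&2; return 1 ;;
  esac
}
```

Rejecting anything but `blue`/`green` keeps a typo in an extra-var from silently deploying to a nonexistent container set.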
@@ -1,14 +1,12 @@
 # rollback.yml — two modes :
 #
 # 1. fast : flip HAProxy back to the previous active color.
-#           Works only if those containers are still alive
-#           (i.e., the next deploy has NOT yet recycled them).
+#           Works only if those containers are still alive.
 #           Effect time : ~5 seconds.
 #
 # 2. full : redeploy a specific release_sha by re-running
-#           deploy_app.yml with that SHA. Works whenever the
-#           tarball is still in the Forgejo Registry. Effect
-#           time : ~5-10 minutes.
+#           deploy_app.yml with that SHA.
+#           Effect time : ~5-10 minutes.
 #
 # Required extra-vars:
 #   env           staging | prod

@@ -16,11 +14,7 @@
 #   target_color  (mode=fast only)  the color to flip TO
 #   release_sha   (mode=full only)  the SHA to redeploy
 #
-# Caller (workflow_dispatch only — see .forgejo/workflows/rollback.yml):
-#   ansible-playbook -i inventory/{{env}}.yml playbooks/rollback.yml \
-#     -e env={{env}} -e mode=fast -e target_color=blue
-#   ansible-playbook -i inventory/{{env}}.yml playbooks/rollback.yml \
-#     -e env={{env}} -e mode=full -e release_sha=<previous_sha>
+# Caller (workflow_dispatch only — see .forgejo/workflows/rollback.yml).
 ---
 - name: Validate inputs
   hosts: incus_hosts
@@ -57,27 +51,28 @@
 # ---------------------------------------------------------------------
 # mode=fast → HAProxy flip only.
 # `when:` lives at TASK level (Ansible doesn't accept it at play level).
 # ---------------------------------------------------------------------
 - name: Fast rollback — verify target_color containers are alive
   hosts: incus_hosts
   become: true
   gather_facts: false
   tasks:
-    - name: Check each target-color container exists
-      ansible.builtin.shell: |
-        set -e
-        CT="{{ veza_container_prefix }}{{ item }}-{{ target_color }}"
-        if ! incus info "$CT" >/dev/null 2>&1; then
-          echo "MISSING $CT"
-          exit 1
-        fi
-        STATE=$(incus list "$CT" -c s --format csv)
-        if [ "$STATE" != "RUNNING" ]; then
-          echo "$CT is $STATE (not RUNNING)"
-          exit 1
-        fi
-        echo "OK $CT"
-      args:
+    - name: Check each target-color container exists and is RUNNING
+      ansible.builtin.shell:
+        cmd: |
+          set -e
+          CT="{{ veza_container_prefix }}{{ item }}-{{ target_color }}"
+          if ! incus info "$CT" >/dev/null 2>&1; then
+            echo "MISSING $CT"
+            exit 1
+          fi
+          STATE=$(incus list "$CT" -c s --format csv)
+          if [ "$STATE" != "RUNNING" ]; then
+            echo "$CT is $STATE (not RUNNING)"
+            exit 1
+          fi
+          echo "OK $CT"
         executable: /bin/bash
       loop:
        - backend
@@ -85,29 +80,31 @@
         - web
       changed_when: false
       register: alive_check
       when: mode == 'fast'
       tags: [rollback, fast]

 - name: Fast rollback — flip HAProxy
-  hosts: "{{ veza_container_prefix + 'haproxy' }}"
+  hosts: haproxy
   become: true
   gather_facts: true
   vars:
     ansible_connection: community.general.incus
     ansible_python_interpreter: /usr/bin/python3
-    veza_active_color: "{{ target_color }}"
-    # Fast rollback re-uses the previous SHA from the history file.
-    veza_release_sha: "{{ lookup('ansible.builtin.file', '/var/lib/veza/active-color.history', errors='ignore') | regex_search('sha=([0-9a-f]+)', '\\1') | default(['rollback'], true) | first }}"
-  roles:
-    - veza_haproxy_switch
-  when: mode == 'fast'
-  tags: [rollback, fast]
+  tasks:
+    - name: Apply veza_haproxy_switch with target_color
+      ansible.builtin.include_role:
+        name: veza_haproxy_switch
+      vars:
+        veza_active_color: "{{ target_color }}"
+        # Fast rollback re-uses the previous SHA from the history file.
+        # Fallback to a synthetic 40-char SHA if the file is missing —
+        # the role's assert tolerates this for the rollback case.
+        veza_release_sha: "{{ (lookup('ansible.builtin.file', '/var/lib/veza/active-color.history', errors='ignore') | default('', true) | regex_search('sha=([0-9a-f]{40})', '\\1') | default('r0llback' + '0' * 32, true)) }}"
+      when: mode == 'fast'
+      tags: [rollback, fast]

 # ---------------------------------------------------------------------
-# mode=full → re-import deploy_app.yml with the rollback SHA.
-# Functionally identical to a fresh deploy of an older release.
+# mode=full → re-run deploy_app.yml with the rollback SHA.
+# `when:` IS valid on import_playbook (unlike on a regular play).
 # ---------------------------------------------------------------------
-- name: Full rollback — delegate to deploy_app.yml with release_sha={{ veza_release_sha | default('') }}
+- name: Full rollback — delegate to deploy_app.yml
   ansible.builtin.import_playbook: deploy_app.yml
   when: mode == 'full'
   tags: [rollback, full]
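The fast-rollback play recovers the previous release SHA from `/var/lib/veza/active-color.history` with a Jinja `regex_search`, falling back to a synthetic 40-character placeholder when the file is absent. A shell equivalent of that lookup, under the assumption that history lines carry `sha=<40-hex>` tokens (the `previous_sha` helper name is ours):

```shell
#!/bin/sh
# previous_sha FILE — first 40-hex "sha=" entry in the history file,
# else the same synthetic placeholder the playbook falls back to.
previous_sha() {
  sha=$(grep -oE 'sha=[0-9a-f]{40}' "$1" 2>/dev/null | head -n 1 | cut -d= -f2)
  if [ -z "$sha" ]; then
    # 8 chars + 32 zeros = 40, matching the role's SHA-length assert.
    sha="r0llback00000000000000000000000000000000"
  fi
  echo "$sha"
}
```

Anchoring the regex to exactly 40 hex characters mirrors the tightened Jinja pattern: a truncated or corrupted history entry falls through to the placeholder instead of producing a short, unusable SHA.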