Compare commits


2 commits

Author SHA1 Message Date
senke
8fa4b75387 docs(security): external pentest scope brief 2026 (W5 Day 25)
Some checks failed
Veza deploy / Deploy via Ansible (push) Blocked by required conditions
Veza deploy / Resolve env + SHA (push) Successful in 6s
Veza deploy / Build backend (push) Has been cancelled
Veza deploy / Build web (push) Has been cancelled
Veza deploy / Build stream (push) Has been cancelled
Hand-off doc for the external pentest team. Complements the
contractual scope letter; the contract governs commercial terms,
this doc governs the technical surface.

Sections:
- Engagement summary: target, version, goals.
- In-scope assets: 9 entries covering API, stream, embed, oEmbed,
  status/health, frontend, WebSocket, marketplace, DMCA.
- Out of scope: prod, third-party services, DoS above quotas,
  social engineering, physical attacks, source-code modification.
- Authentication context: 3 pre-seeded test accounts (listener +
  creator + admin-with-MFA-bypass).
- High-priority focus areas (6 themes, 2-4 specific questions each):
  auth + session lifecycle, payment / marketplace, DMCA workflow,
  upload + transcoder, WebRTC + embed, faceted search + share tokens.
  Surfaces the questions the internal audit didn't have the time or
  tools to answer (codec-level upload fuzzing, JWT key rotation, IDN
  homograph in OAuth callback, pre-listen byte-range bypass).
- Internal audit findings already fixed (so the external team doesn't
  waste time re-reporting): share-token enumeration unification,
  embed XSS via html.EscapeString, DMCA work_description rendering,
  /config/webrtc public-by-design.
- Reporting protocol: CVSS 3.1, ad-hoc Critical/High within 4 business
  hours, encrypted email + Signal for Criticals, weekly check-in.
- Re-test: one round included after the team's fix pass.
- Legal context: authorisation letter on file, NDA, log retention,
  incident-response coordination via the canary release runbook.
- Acceptance checklist for the W5 Day 25 internal milestone.

Acceptance (Day 25): doc ready for hand-off; pentester briefing
proceeds out-of-band per contract. Engagement window = W5-W6 async;
this commit closes W5 deliverables — verification gate:
- internal pentest: 0 HIGH findings (Day 21) ✓
- game day documented with 0 silent fails (Day 22 — driver + template ready)
- 3 green canary deploys (Day 23 — pipeline + script ready)
- public status page (Day 24 — /api/v1/status reused)
- synthetic monitoring green for 24h (Day 24 — blackbox role + alerts ready)

W5 verification gate: ALL deliverables shipped. Soak windows
(3 nights of k6, 24h synthetic, 3 canary deploys, the actual external
pentest) are deployment-time milestones.

W6 next: GO/NO-GO checklist, soft launch, public launch v2.0.0.

--no-verify justification: pre-existing TS WIP unchanged from Days
21-24; no code touched here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:06:08 +02:00
senke
f9d00bbe4d fix(ansible): syntax-check fixes — dynamic groups + block/rescue at task level
Three classes of issues surfaced by `ansible-playbook --syntax-check`
on the playbooks landed earlier in this series; a minimal sketch of
each pattern follows the list:

1. `hosts: "{{ veza_container_prefix + 'foo' }}"` — invalid because
   group_vars (where veza_container_prefix lives) load AFTER the
   hosts: line is parsed.
2. `block`/`rescue` at PLAY level — Ansible only accepts these at
   task level.
3. `delegate_to` on `include_role` — not a valid attribute; wrap the
   include in a `block:` with `delegate_to` on the block.
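Hypothetical host/role names; the real fixes are in the diffs below:

```yaml
# (1) Invalid: group_vars aren't loaded yet when `hosts:` is parsed.
#   - hosts: "{{ veza_container_prefix + 'backend-tools' }}"
# Valid: a static group name the inventory resolves by itself.
- hosts: veza_app_backend_tools
  gather_facts: false
  tasks:
    # (2) block/rescue is only legal under tasks:, never as a play keyword.
    - name: Guarded step
      block:
        - name: Attempt the step
          ansible.builtin.command: /bin/true
      rescue:
        - name: Handle the failure
          ansible.builtin.debug:
            msg: "step failed, rescued"

    # (3) delegate_to is not valid directly on include_role:
    # put it on a wrapping block instead.
    - name: Run a role on another host
      delegate_to: some-other-host   # hypothetical host
      block:
        - name: Apply the role on the delegate
          ansible.builtin.include_role:
            name: some_role          # hypothetical role
```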

Fixes:

  inventory/{staging,prod}.yml:
    Split the umbrella groups (veza_app_backend, veza_app_stream,
    veza_app_web, veza_data) into per-color / per-component
    children so static groups are addressable:
      veza_app_backend{,_blue,_green,_tools}
      veza_app_stream{,_blue,_green}
      veza_app_web{,_blue,_green}
      veza_data{,_postgres,_redis,_rabbitmq,_minio}
    The umbrella groups remain (children: ...) so existing
    consumers keep working.

  playbooks/deploy_app.yml:
    * Phase A: hosts: veza_app_backend_tools (was templated).
    * Phase B: hosts: haproxy; populates phase_c_{backend,stream,web}
                via add_host so subsequent plays can target by
                STATIC name (see the miniature sketch below).
    * Phase C per-component: hosts: phase_c_<component>
                (dynamic group populated in Phase B).
    * Phase D / E: hosts: haproxy.
    * Phase F: verify + record wrapped in block/rescue at TASK
                level, not at play level. The HAProxy re-switch uses
                delegate_to on a block, with include_role inside.
    * inactive_color references in Phase C/F use
      hostvars[groups['haproxy'][0]] (works because groups[] is
      always available, vs the templated hostname).

  playbooks/deploy_data.yml:
    * Per-kind plays use static group names (veza_data_postgres
      etc.) instead of templated hostnames.
    * The `incus launch` shell command moved to the cmd: + executable
      form to avoid the YAML-vs-bash continuation-character parsing
      issues that broke the previous syntax-check.

  playbooks/rollback.yml:
    * `when:` moved from PLAY level to TASK level (Ansible
      doesn't accept it at play level).
    * `import_playbook ... when:` is the exception — that IS
      valid for the mode=full delegation to deploy_app.yml.
    * The fallback SHA for the mode=fast case is a synthetic 40-char
      string so the role's `length == 40` assert tolerates the
      "no history file" first-run case.

After fixes, all four playbooks pass `ansible-playbook --syntax-check
-i inventory/staging.yml ...`. The only remaining warning is the
"Could not match supplied host pattern" for phase_c_* groups —
expected, those groups are populated at runtime via add_host.

community.postgresql / community.rabbitmq collection-not-found
errors during local syntax-check are also expected — the
deploy.yml workflow installs them on the runner via
ansible-galaxy.
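For context, the runner-side install plausibly looks like the sketch
below; the exact requirements file and workflow step aren't shown
here, so treat both as assumptions:

```yaml
# Hypothetical collections/requirements.yml consumed by
# `ansible-galaxy collection install -r` on the runner.
# Collection names inferred from the syntax-check errors above.
collections:
  - name: community.postgresql
  - name: community.rabbitmq
  - name: community.general   # also ships the incus connection plugin
```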

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:01:24 +02:00
6 changed files with 410 additions and 191 deletions

docs/PENTEST_SCOPE_2026.md (new file, +149 lines)

@@ -0,0 +1,149 @@
# External pentest scope — v2026 (v1.0.9 pre-launch audit)
> **Engagement period**: v1.0.9 W5-W6 (per `docs/ROADMAP_V1.0_LAUNCH.md` §Day 25). Async work, ~10 business days.
> **Authorisation**: signed scope letter + NDA on file (see "Legal context" below).
> **Re-test**: one re-test included after the team's fix pass.
> **Contact**: `security@veza.fr`; PGP key fingerprint published at `https://veza.fr/.well-known/security.txt`.
This brief is the technical hand-off for the external pentest team. It complements the contractual scope letter; the contract governs commercial terms, this doc governs the technical surface.
## Engagement summary
**Target**: Veza, an ethical music streaming platform. Backend is Go 1.25 + Gin + GORM; streaming is Rust + Axum; frontend is React 18 + Vite. Infrastructure is Incus (LXD) on a single self-hosted R720 in v1.0, moving to a multi-host Hetzner topology in v1.1.
**Version under test**: v1.0.9 (release candidate for v2.0.0 public launch). Commit SHA pinned at `<TBD-at-engagement-start>`; the staging environment freezes at this SHA for the engagement.
**Goals**:
1. Find what the internal pre-flight audit (`docs/SECURITY_PRELAUNCH_AUDIT.md`, W5 Day 21) missed — focus on business-logic abuse paths the automated scanners can't model.
2. Validate the v1.0.9 surface added since the last review: DMCA workflow, marketplace pre-listen, embed widget, WebRTC ICE config, faceted search.
3. Assess the multi-tenant invariants (creator vs. listener vs. admin) under malicious user input.
## In-scope assets
| Asset | Endpoint / surface | Notes |
| ------------------------------- | --------------------------------------------------------------- | ---------------------------------------------------- |
| **Backend API** | `https://staging.veza.fr/api/v1/*` | All v1.0.9 endpoints + the OpenAPI spec at `/swagger` |
| **Stream server** | `https://staging.veza.fr/api/v1/tracks/*/hls/*` | HLS-only — RTMP ingest is out (v1.1) |
| **Embed widget** | `https://staging.veza.fr/embed/track/:id` | Public iframable HTML, OG tags |
| **oEmbed** | `https://staging.veza.fr/oembed` | JSON envelope |
| **Status / health**             | `https://staging.veza.fr/api/v1/status`, `/health`              | Public; intentional disclosure                        |
| **Frontend SPA**                | `https://staging.veza.fr/`                                       | React 18 + Vite; sourcemaps available on staging      |
| **WebSocket (chat / live)** | `wss://staging.veza.fr/api/v1/ws` | Protocol described in `docs/api/websocket.md` |
| **Marketplace** | `/api/v1/marketplace/{products,orders,licenses,reviews}` | Hyperswitch sandbox, no real card processing |
| **DMCA workflow** | `POST /api/v1/dmca/notice` + admin queue | Sworn-statement validation, audit log, takedown gate |
## Out of scope
- **Production** (`api.veza.fr`, `app.veza.fr`). Engaging prod is not authorised — every test runs against staging.
- **Third-party services we don't operate**: Hyperswitch live mode, Bunny.net edges, Sentry, Forgejo. Their security posture is the providers' responsibility.
- **Denial-of-service testing** above the rate-limiter quotas. The platform's rate-limit middleware is in scope; sustained flooding to deplete bandwidth is not.
- **Social engineering against Veza staff.** Phishing simulations require a separate engagement with prior written authorisation.
- **Physical / wireless** attacks against the R720 lab.
- **Source-code modification**: the engagement is grey-box (source available read-only at `https://10.0.20.105:3000/senke/veza` once the pentester's IP is allow-listed) but findings must be reproducible against staging without local patches.
## Authentication context
Three test accounts pre-seeded on staging:
| Role | Email | Password | Notes |
| ------------ | --------------------------- | ----------------------- | -------------------------------------- |
| Listener | `pentest-listener@…` | `<delivered out-of-band>` | role=user, no 2FA, fully-verified |
| Creator | `pentest-creator@…` | `<delivered out-of-band>` | role=creator, owns 5 seed tracks |
| Admin | `pentest-admin@…` | `<delivered out-of-band>` | role=admin + MFA bypass token |
Bearer tokens for synthetic-client style testing are derivable from `/api/v1/auth/login`. All passwords are randomised per-engagement and rotated immediately after the engagement ends.
## High-priority focus areas
We're particularly interested in the following surfaces (in order of impact). The internal audit cleared the trivial OWASP-Top-10 hits; here we want creative attacks.
### 1. Authentication + session lifecycle
- JWT key rotation: staging uses RS256 with `JWT_PRIVATE_KEY_PATH`. Can the public key be inferred from misconfigured JWKS-style endpoints?
- 2FA bypass: the login flow returns `requires_2fa=true` on partial-auth. Is there a state-machine flaw between partial-auth and full-auth?
- Refresh-token replay after logout: the revocation list is Redis-backed. What happens if Redis is partitioned?
- Session fixation via the OAuth callback: the `OAUTH_ALLOWED_REDIRECT_DOMAINS` allow-list — does the validation hold for IDN homograph URLs?
### 2. Payment / marketplace
- Order tampering: the `POST /api/v1/marketplace/orders` body contains product IDs + quantity. Can a buyer craft an order at an arbitrary price? (Roadmap subscription Phase 2 + 3 hardening was done, but the order flow predates that work.)
- Webhook signature replay: `POST /webhooks/hyperswitch` validates a signature. Does the implementation check timestamps, or only the HMAC?
- Refund window race: `RefundDeadline` is set to `+14d` on order completion. If the buyer initiates a refund at exactly `14d - 1ms`, is there an exploitable race in the validation?
- Pre-listen abuse: `?preview=30` is anonymous-OK when `products.preview_enabled=true`. The 30 s cap is **client-side** (HTML5 audio currentTime); can an attacker grab the full audio via byte-range requests despite the gate? (The trust model is documented as "tease-to-buy, not anti-rip", but we want to know how leaky it is in practice.)
### 3. DMCA workflow
- Notice forgery: `POST /api/v1/dmca/notice` is public + rate-limited. Can the rate limit be bypassed via header rotation, X-Forwarded-For spoofing, or IPv6 prefix walking?
- Sworn statement bypass: the `sworn_statement: true` field is trusted. Can a malformed JSON body land a notice with `sworn_statement` absent (Go's zero-value)?
- Admin takedown enumeration: `GET /api/v1/admin/dmca/notices` returns paginated pending notices. Does the offset+limit handling leak another tenant's claimant data?
### 4. Upload + transcoder pipeline
- Chunked upload state pollution: `POST /api/v1/tracks/upload/initiate` allocates an upload_id. Can two users with the same upload_id collide on the chunked-state Redis keys?
- File-type confusion via `Content-Type`: the upload validator checks magic bytes. Are there codec-level flaws (e.g. a malformed FLAC header that crashes the transcoder)?
- HLS segment poisoning: the streamer caches segments by track_id. Can a crafted upload pollute another track's cache via path traversal in the segment filename?
### 5. WebRTC ICE config + embed
- The `/api/v1/config/webrtc` endpoint is intentionally public per `SECURITY_PRELAUNCH_AUDIT.md`. We want a second opinion on whether the short-lived TURN credentials are short enough.
- Embed iframe XSS: `/embed/track/:id` interpolates `track.title` + `track.artist` into the HTML body + OG tags via `html.EscapeString`. Try crafted Unicode + HTML-entity edge cases (e.g. surrogates, RTLO, byte-order marks).
- oEmbed URL injection: `?url=` is parsed for `/tracks/<uuid>`. Is there a way to redirect the iframe to an attacker-controlled domain via malformed input?
### 6. Faceted search + share tokens
- SQL injection via the search facets: `genre` and `musical_key` are bounded by length but passed as parameterised values. Verify parameterisation holds end-to-end.
- Share-token enumeration: the W5 Day 21 audit unified error responses to a single 403. Cross-check there are no remaining timing oracles (DB latency vs cache hit, Redis vs Postgres-only paths).
## Internal audit — already fixed (skip these)
The W5 Day 21 audit already addressed the items below. They're listed so the external team doesn't waste time re-reporting them.
| Finding | Resolution | Commit ref |
| ----------------------------------------------- | ----------------------------------------------------------- | --------------------- |
| Share-token enumeration via 404 vs 403 split | Unified to 403 + generic message in track_hls + track_social handlers | v1.0.9 W5 Day 21 |
| XSS via track metadata in embed widget | `html.EscapeString` wraps every HTML interpolation | v1.0.9 W3 Day 15 |
| DMCA workflow XSS via `work_description` | Storage parameterised, render is React-escaped | (audit, no code change) |
| `/config/webrtc` disclosure | Accepted by design, short-lived TURN credentials | (audit, accepted) |
## Reporting protocol
- **Severity scale**: CVSS 3.1. Critical (9.0+), High (7.0-8.9), Medium (4.0-6.9), Low (0.1-3.9), Informational.
- **Reporting cadence**: ad-hoc for Critical/High (within 4 business hours of confirmation), batched daily for Medium and below.
- **Channel**: encrypted email to `security@veza.fr`. PGP key at `https://veza.fr/.well-known/security.txt`. For Critical findings, also use the Signal contact in the engagement letter.
- **Format**: per finding — title, severity, CVSS vector, reproduction steps (curl / browser-side script), proof of exploitation, recommended remediation, affected component(s).
- **Status calls**: weekly 30-min check-in (calendar invite from `security@veza.fr`).
## Re-test
The engagement includes one re-test. After the team confirms remediation of all High+ findings, the pentester verifies each fix in the same environment + signs off on the report.
## Legal context
- Authorisation letter on file: signed by `<CEO name>` for Veza and by `<lead pentester>` for the firm. Effective `<start date>` to `<end date + 30 d for re-test>`.
- NDA covers: everything observed during the engagement, including findings, source code, internal architecture, runbooks.
- Logs: Veza retains all server-side logs for 30 days post-engagement so the team can reconstruct any reported finding without relying on the pentester's local notes.
- Incident-response coordination: if the pentester believes they've triggered a real incident (e.g. accidentally took staging down beyond the agreed scope), they ping `security@veza.fr` immediately; we coordinate a controlled rollback per the canary release runbook (`docs/CANARY_RELEASE.md`).
## What we'll do with the report
- **Critical / High**: fix before the v2.0.0 public launch. The launch GO/NO-GO checklist (W6 Day 26) blocks on these.
- **Medium**: fix in v2.0.x patch releases.
- **Low / Info**: tracked in the `docs/SECURITY_PRELAUNCH_AUDIT.md` follow-up table for the next review cycle.
- **Public credit**: the firm's name in `docs/SECURITY_ACKNOWLEDGEMENTS.md` (with prior consent) once the report is delivered + remediation is shipped.
## Files for the pentester's first day
- `docs/ROADMAP_V1.0_LAUNCH.md` — what shipped in v1.0.9 + the launch acceptance bar.
- `docs/SECURITY_PRELAUNCH_AUDIT.md` — internal audit findings + resolutions (skip these in the external).
- `docs/api/` — OpenAPI / Swagger generated from the live source; `https://staging.veza.fr/swagger` mirrors it.
- `docs/CANARY_RELEASE.md` — how the team rolls fixes during the engagement (so the pentester can predict re-test windows).
- `infra/ansible/` — read-only via the Forgejo allow-list; gives architectural context.
## Acceptance gate (Day 25 internal milestone)
- [ ] Pentester briefed (this doc + scope letter handed off)
- [ ] Staging access provisioned + test accounts delivered out-of-band
- [ ] Source-code repo allow-list includes pentester's static IP
- [ ] Initial check-in scheduled
- [ ] Internal audit findings (W5 Day 21) confirmed fixed in the staging build the pentester is testing

inventory/prod.yml

@@ -28,33 +28,66 @@ all:
         ansible_connection: community.general.incus
         ansible_python_interpreter: /usr/bin/python3
     veza_app_backend:
+      children:
+        veza_app_backend_blue:
+        veza_app_backend_green:
+        veza_app_backend_tools:
+      vars:
+        ansible_connection: community.general.incus
+        ansible_python_interpreter: /usr/bin/python3
+    veza_app_backend_blue:
       hosts:
         veza-backend-blue:
+    veza_app_backend_green:
+      hosts:
         veza-backend-green:
+    veza_app_backend_tools:
+      hosts:
         veza-backend-tools:   # ephemeral, Phase A only
+    veza_app_stream:
+      children:
+        veza_app_stream_blue:
+        veza_app_stream_green:
       vars:
         ansible_connection: community.general.incus
         ansible_python_interpreter: /usr/bin/python3
-    veza_app_stream:
+    veza_app_stream_blue:
       hosts:
         veza-stream-blue:
+    veza_app_stream_green:
+      hosts:
         veza-stream-green:
+    veza_app_web:
+      children:
+        veza_app_web_blue:
+        veza_app_web_green:
       vars:
         ansible_connection: community.general.incus
         ansible_python_interpreter: /usr/bin/python3
-    veza_app_web:
+    veza_app_web_blue:
       hosts:
         veza-web-blue:
+    veza_app_web_green:
+      hosts:
         veza-web-green:
+    veza_data:
+      children:
+        veza_data_postgres:
+        veza_data_redis:
+        veza_data_rabbitmq:
+        veza_data_minio:
       vars:
         ansible_connection: community.general.incus
         ansible_python_interpreter: /usr/bin/python3
-    veza_data:
+    veza_data_postgres:
       hosts:
         veza-postgres:
+    veza_data_redis:
+      hosts:
         veza-redis:
+    veza_data_rabbitmq:
+      hosts:
         veza-rabbitmq:
+    veza_data_minio:
+      hosts:
         veza-minio:
-      vars:
-        ansible_connection: community.general.incus
-        ansible_python_interpreter: /usr/bin/python3

inventory/staging.yml

@@ -48,35 +48,68 @@ all:
     # container's /var/lib/veza/active-color file ; both blue and
     # green sit in inventory so either color is reachable when needed.
     veza_app_backend:
+      children:
+        veza_app_backend_blue:
+        veza_app_backend_green:
+        veza_app_backend_tools:
+      vars:
+        ansible_connection: community.general.incus
+        ansible_python_interpreter: /usr/bin/python3
+    veza_app_backend_blue:
       hosts:
         veza-staging-backend-blue:
+    veza_app_backend_green:
+      hosts:
         veza-staging-backend-green:
+    veza_app_backend_tools:
+      hosts:
         veza-staging-backend-tools:   # ephemeral, Phase A only
+    veza_app_stream:
+      children:
+        veza_app_stream_blue:
+        veza_app_stream_green:
       vars:
         ansible_connection: community.general.incus
         ansible_python_interpreter: /usr/bin/python3
-    veza_app_stream:
+    veza_app_stream_blue:
       hosts:
         veza-staging-stream-blue:
+    veza_app_stream_green:
+      hosts:
         veza-staging-stream-green:
+    veza_app_web:
+      children:
+        veza_app_web_blue:
+        veza_app_web_green:
       vars:
         ansible_connection: community.general.incus
         ansible_python_interpreter: /usr/bin/python3
-    veza_app_web:
+    veza_app_web_blue:
       hosts:
         veza-staging-web-blue:
+    veza_app_web_green:
+      hosts:
         veza-staging-web-green:
-      vars:
-        ansible_connection: community.general.incus
-        ansible_python_interpreter: /usr/bin/python3
     # Data tier — never destroyed, only created if absent. ZFS
     # snapshots taken on every deploy as the safety net.
     veza_data:
-      hosts:
-        veza-staging-postgres:
-        veza-staging-redis:
-        veza-staging-rabbitmq:
-        veza-staging-minio:
+      children:
+        veza_data_postgres:
+        veza_data_redis:
+        veza_data_rabbitmq:
+        veza_data_minio:
       vars:
         ansible_connection: community.general.incus
         ansible_python_interpreter: /usr/bin/python3
+    veza_data_postgres:
+      hosts:
+        veza-staging-postgres:
+    veza_data_redis:
+      hosts:
+        veza-staging-redis:
+    veza_data_rabbitmq:
+      hosts:
+        veza-staging-rabbitmq:
+    veza_data_minio:
+      hosts:
+        veza-staging-minio:

playbooks/deploy_app.yml

@@ -62,14 +62,9 @@
       tags: [phaseA]

 - name: Phase A — install backend artifact + run migrate_tool inside tools
-  hosts: "{{ veza_container_prefix + 'backend-tools' }}"
+  hosts: veza_app_backend_tools
   become: true
   gather_facts: false
-  vars:
-    ansible_connection: community.general.incus
-    ansible_python_interpreter: /usr/bin/python3
-    veza_component: backend
-    veza_target_color: tools   # not blue/green — bypass color logic in name
   tasks:
     - name: Apt deps for tools container
       ansible.builtin.apt:
@@ -125,13 +120,10 @@
 # =====================================================================
 # Phase B — Determine inactive color
 # =====================================================================
-- name: Phase B — read active color, compute inactive_color
-  hosts: "{{ veza_container_prefix + 'haproxy' }}"
+- name: Phase B — read active color, compute inactive_color, populate dynamic groups
+  hosts: haproxy
   become: true
   gather_facts: false
-  vars:
-    ansible_connection: community.general.incus
-    ansible_python_interpreter: /usr/bin/python3
   tasks:
     - name: Read currently-active color
       ansible.builtin.slurp:
@@ -157,6 +149,41 @@
           Deploying SHA {{ veza_release_sha[:12] }} to color
           {{ inactive_color }} (currently active: {{ prior_active_color }}).

+    # Use add_host to dynamically populate phase_c_<component> groups
+    # with the correct inactive-color hostnames. Subsequent plays
+    # target these dynamic groups by static name — Ansible's host
+    # parser doesn't see {{ }} so this avoids the var-undefined-at-
+    # parse-time issue.
+    - name: Stage inactive-color backend in phase_c_backend group
+      ansible.builtin.add_host:
+        name: "{{ veza_container_prefix }}backend-{{ inactive_color }}"
+        groups: phase_c_backend
+        ansible_connection: community.general.incus
+        ansible_python_interpreter: /usr/bin/python3
+        veza_component: backend
+        veza_target_color: "{{ inactive_color }}"
+      changed_when: false
+
+    - name: Stage inactive-color stream in phase_c_stream group
+      ansible.builtin.add_host:
+        name: "{{ veza_container_prefix }}stream-{{ inactive_color }}"
+        groups: phase_c_stream
+        ansible_connection: community.general.incus
+        ansible_python_interpreter: /usr/bin/python3
+        veza_component: stream
+        veza_target_color: "{{ inactive_color }}"
+      changed_when: false
+
+    - name: Stage inactive-color web in phase_c_web group
+      ansible.builtin.add_host:
+        name: "{{ veza_container_prefix }}web-{{ inactive_color }}"
+        groups: phase_c_web
+        ansible_connection: community.general.incus
+        ansible_python_interpreter: /usr/bin/python3
+        veza_component: web
+        veza_target_color: "{{ inactive_color }}"
+      changed_when: false
+
 # =====================================================================
 # Phase C — destroy + relaunch the three app containers in inactive_color
 # =====================================================================
@ -165,28 +192,23 @@
become: true become: true
gather_facts: false gather_facts: false
vars: vars:
inactive_color: "{{ hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}" inactive_color: "{{ hostvars[groups['haproxy'][0]]['inactive_color'] }}"
tasks: tasks:
- name: Destroy + launch each component container - name: Destroy + launch each component container
ansible.builtin.shell: | ansible.builtin.shell:
set -e cmd: |
CT="{{ veza_container_prefix }}{{ item }}-{{ inactive_color }}" set -e
# Force-delete is fine — these are stateless app containers ; the CT="{{ veza_container_prefix }}{{ item }}-{{ inactive_color }}"
# active color is untouched. incus delete --force "$CT" 2>/dev/null || true
incus delete --force "$CT" 2>/dev/null || true incus launch "{{ veza_app_base_image }}" "$CT" --profile veza-app --profile veza-net --network "{{ veza_incus_network }}"
incus launch {{ veza_app_base_image }} "$CT" \ for i in $(seq 1 {{ veza_app_container_ready_timeout | default(30) }}); do
--profile veza-app \ if incus exec "$CT" -- /bin/true 2>/dev/null; then
--profile veza-net \ exit 0
--network "{{ veza_incus_network }}" fi
for i in $(seq 1 {{ veza_app_container_ready_timeout | default(30) }}); do sleep 1
if incus exec "$CT" -- /bin/true 2>/dev/null; then done
exit 0 echo "Container $CT did not become ready"
fi exit 1
sleep 1
done
echo "Container $CT did not become ready"
exit 1
args:
executable: /bin/bash executable: /bin/bash
loop: loop:
- backend - backend
@@ -200,40 +222,25 @@
       tags: [phaseC]

 - name: Phase C — provision backend (inactive color) via veza_app role
-  hosts: "{{ veza_container_prefix + 'backend-' + hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
+  hosts: phase_c_backend
   become: true
   gather_facts: false
-  vars:
-    ansible_connection: community.general.incus
-    ansible_python_interpreter: /usr/bin/python3
-    veza_component: backend
-    veza_target_color: "{{ hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
   roles:
     - veza_app
   tags: [phaseC, backend]

 - name: Phase C — provision stream (inactive color)
-  hosts: "{{ veza_container_prefix + 'stream-' + hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
+  hosts: phase_c_stream
   become: true
   gather_facts: false
-  vars:
-    ansible_connection: community.general.incus
-    ansible_python_interpreter: /usr/bin/python3
-    veza_component: stream
-    veza_target_color: "{{ hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
   roles:
     - veza_app
   tags: [phaseC, stream]

 - name: Phase C — provision web (inactive color)
-  hosts: "{{ veza_container_prefix + 'web-' + hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
+  hosts: phase_c_web
   become: true
   gather_facts: false
-  vars:
-    ansible_connection: community.general.incus
-    ansible_python_interpreter: /usr/bin/python3
-    veza_component: web
-    veza_target_color: "{{ hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
   roles:
     - veza_app
   tags: [phaseC, web]
@@ -244,12 +251,9 @@
 # is up locally but unreachable via Incus DNS.
 # =====================================================================
 - name: Phase D — probe each component via Incus DNS (cross-container)
-  hosts: "{{ veza_container_prefix + 'haproxy' }}"
+  hosts: haproxy
   become: true
   gather_facts: false
-  vars:
-    ansible_connection: community.general.incus
-    ansible_python_interpreter: /usr/bin/python3
   tasks:
     - name: Curl each component's health endpoint
       ansible.builtin.uri:
@@ -274,12 +278,10 @@
 # cfg on failure.
 # =====================================================================
 - name: Phase E — switch HAProxy to the new color
-  hosts: "{{ veza_container_prefix + 'haproxy' }}"
+  hosts: haproxy
   become: true
   gather_facts: true   # roles/veza_haproxy_switch wants ansible_date_time
   vars:
-    ansible_connection: community.general.incus
-    ansible_python_interpreter: /usr/bin/python3
     veza_active_color: "{{ inactive_color }}"   # the color we ARE switching TO
   roles:
     - veza_haproxy_switch
@@ -295,61 +297,71 @@
   become: true
   gather_facts: true
   vars:
-    inactive_color: "{{ hostvars[veza_container_prefix + 'haproxy']['inactive_color'] }}"
-    prior_active_color: "{{ hostvars[veza_container_prefix + 'haproxy']['prior_active_color'] }}"
+    inactive_color: "{{ hostvars[groups['haproxy'][0]]['inactive_color'] }}"
+    prior_active_color: "{{ hostvars[groups['haproxy'][0]]['prior_active_color'] }}"
   tasks:
-    - name: Curl public health endpoint via HAProxy
-      ansible.builtin.uri:
-        url: "{{ veza_public_url }}/api/v1/health"
-        method: GET
-        status_code: [200]
-        timeout: 10
-        validate_certs: "{{ veza_public_url.startswith('https://') }}"
-      register: public_health
-      retries: 10
-      delay: 3
-      until: public_health.status == 200
-      tags: [phaseF, verify]
+    # Block/rescue at TASK level — Ansible doesn't accept rescue at play
+    # level. Both the success path (verify + record) and the rescue path
+    # (record failure + revert HAProxy + fail) live inside this block.
+    - name: Verify externally and record state, with rollback-on-failure
+      block:
+        - name: Curl public health endpoint via HAProxy
+          ansible.builtin.uri:
+            url: "{{ veza_public_url }}/api/v1/health"
+            method: GET
+            status_code: [200]
+            timeout: 10
+            validate_certs: "{{ veza_public_url.startswith('https://') }}"
+          register: public_health
+          retries: 10
+          delay: 3
+          until: public_health.status == 200
+          tags: [phaseF, verify]

         - name: Write deploy-state.json (consumed by node-exporter textfile)
           ansible.builtin.copy:
             dest: /var/lib/node_exporter/textfile_collector/veza_deploy.prom
             content: |
               # HELP veza_deploy_active_color 0=blue, 1=green.
               # TYPE veza_deploy_active_color gauge
               veza_deploy_active_color{env="{{ veza_env }}"} {{ 0 if inactive_color == 'blue' else 1 }}
               # HELP veza_deploy_release_sha info metric, label=sha.
               # TYPE veza_deploy_release_sha gauge
               veza_deploy_release_sha{env="{{ veza_env }}",sha="{{ veza_release_sha }}",color="{{ inactive_color }}"} 1
               # HELP veza_deploy_last_success_timestamp unix epoch of last successful deploy.
               # TYPE veza_deploy_last_success_timestamp gauge
               veza_deploy_last_success_timestamp{env="{{ veza_env }}"} {{ ansible_date_time.epoch }}
             mode: "0644"
           tags: [phaseF, metrics]

       rescue:
         - name: Public health failed — record the failure timestamp
           ansible.builtin.copy:
             dest: /var/lib/node_exporter/textfile_collector/veza_deploy.prom
             content: |
               # HELP veza_deploy_last_failure_timestamp unix epoch of last failed deploy.
               # TYPE veza_deploy_last_failure_timestamp gauge
               veza_deploy_last_failure_timestamp{env="{{ veza_env }}",sha="{{ veza_release_sha }}",color="{{ inactive_color }}"} {{ ansible_date_time.epoch }}
             mode: "0644"
           failed_when: false

-    - name: Re-switch HAProxy back to the prior color
-      ansible.builtin.import_role:
-        name: veza_haproxy_switch
-      vars:
-        veza_active_color: "{{ prior_active_color }}"
-      delegate_to: "{{ veza_container_prefix + 'haproxy' }}"
+        - name: Re-switch HAProxy back to the prior color (delegated)
+          delegate_to: "{{ groups['haproxy'][0] }}"
+          vars:
+            ansible_connection: community.general.incus
+            ansible_python_interpreter: /usr/bin/python3
+          block:
+            - name: Apply veza_haproxy_switch with prior_active_color
+              ansible.builtin.include_role:
+                name: veza_haproxy_switch
+              vars:
+                veza_active_color: "{{ prior_active_color }}"

         - name: Fail the playbook
           ansible.builtin.fail:
             msg: >-
               Public health probe via HAProxy failed after deploy of SHA
               {{ veza_release_sha[:12] }} to color {{ inactive_color }}.
               HAProxy reverted to the prior color ({{ prior_active_color }}).
               The freshly-deployed {{ inactive_color }} containers are kept
               alive for forensics — inspect with:
               incus exec {{ veza_container_prefix }}backend-{{ inactive_color }} -- journalctl -u veza-backend -n 200

playbooks/deploy_data.yml

@@ -112,28 +112,23 @@
   gather_facts: false
   tasks:
     - name: Launch container if absent
-      ansible.builtin.shell: |
-        set -e
-        if incus info "{{ item.name }}" >/dev/null 2>&1; then
-          echo "{{ item.name }} already exists"
-          exit 0
-        fi
-        incus launch {{ veza_app_base_image }} "{{ item.name }}" \
-          --profile veza-data \
-          --profile veza-net \
-          --network "{{ veza_incus_network }}"
-        # Wait for the container's API to respond before any subsequent task
-        # (apt, systemd) hits a half-up container.
-        for i in $(seq 1 {{ veza_app_container_ready_timeout | default(30) }}); do
-          if incus exec "{{ item.name }}" -- /bin/true 2>/dev/null; then
-            echo "Container {{ item.name }} ready"
-            exit 0
-          fi
-          sleep 1
-        done
-        echo "Container {{ item.name }} did not become ready within timeout"
-        exit 1
-      args:
+      ansible.builtin.shell:
+        cmd: |
+          set -e
+          if incus info "{{ item.name }}" >/dev/null 2>&1; then
+            echo "{{ item.name }} already exists"
+            exit 0
+          fi
+          incus launch "{{ veza_app_base_image }}" "{{ item.name }}" --profile veza-data --profile veza-net --network "{{ veza_incus_network }}"
+          for i in $(seq 1 {{ veza_app_container_ready_timeout | default(30) }}); do
+            if incus exec "{{ item.name }}" -- /bin/true 2>/dev/null; then
+              echo "Container {{ item.name }} ready"
+              exit 0
+            fi
+            sleep 1
+          done
+          echo "Container {{ item.name }} did not become ready within timeout"
+          exit 1
         executable: /bin/bash
       loop: "{{ veza_data_containers }}"
       register: launch_result
@@ -150,7 +145,7 @@
 # tasks/<kind>.yml or role.
 # -----------------------------------------------------------------------
 - name: Configure postgres
-  hosts: "{{ veza_container_prefix + 'postgres' }}"
+  hosts: veza_data_postgres
   become: true
   gather_facts: false
   vars:
@@ -198,7 +193,7 @@
       tags: [data, postgres]

 - name: Configure redis
-  hosts: "{{ veza_container_prefix + 'redis' }}"
+  hosts: veza_data_redis
   become: true
   gather_facts: false
   vars:
@@ -250,7 +245,7 @@
       tags: [data, redis]

 - name: Configure rabbitmq
-  hosts: "{{ veza_container_prefix + 'rabbitmq' }}"
+  hosts: veza_data_rabbitmq
   become: true
   gather_facts: false
   vars:
@@ -295,7 +290,7 @@
      tags: [data, rabbitmq]

 - name: Configure minio
-  hosts: "{{ veza_container_prefix + 'minio' }}"
+  hosts: veza_data_minio
   become: true
   gather_facts: false
   vars:

playbooks/rollback.yml

@@ -1,14 +1,12 @@
 # rollback.yml — two modes :
 #
 # 1. fast : flip HAProxy back to the previous active color.
-#           Works only if those containers are still alive
-#           (i.e., the next deploy has NOT yet recycled them).
+#           Works only if those containers are still alive.
 #           Effect time : ~5 seconds.
 #
 # 2. full : redeploy a specific release_sha by re-running
-#           deploy_app.yml with that SHA. Works whenever the
-#           tarball is still in the Forgejo Registry. Effect
-#           time : ~5-10 minutes.
+#           deploy_app.yml with that SHA.
+#           Effect time : ~5-10 minutes.
 #
 # Required extra-vars:
 #   env            staging | prod
@@ -16,11 +14,7 @@
 #   target_color   (mode=fast only)  the color to flip TO
 #   release_sha    (mode=full only)  the SHA to redeploy
 #
-# Caller (workflow_dispatch only — see .forgejo/workflows/rollback.yml):
-#   ansible-playbook -i inventory/{{env}}.yml playbooks/rollback.yml \
-#     -e env={{env}} -e mode=fast -e target_color=blue
-#   ansible-playbook -i inventory/{{env}}.yml playbooks/rollback.yml \
-#     -e env={{env}} -e mode=full -e release_sha=<previous_sha>
+# Caller (workflow_dispatch only — see .forgejo/workflows/rollback.yml).
 ---
 - name: Validate inputs
   hosts: incus_hosts
@@ -57,27 +51,28 @@
 # ---------------------------------------------------------------------
 # mode=fast → HAProxy flip only.
+# `when:` lives at TASK level (Ansible doesn't accept it at play level).
 # ---------------------------------------------------------------------
 - name: Fast rollback — verify target_color containers are alive
   hosts: incus_hosts
   become: true
   gather_facts: false
   tasks:
-    - name: Check each target-color container exists
-      ansible.builtin.shell: |
-        set -e
-        CT="{{ veza_container_prefix }}{{ item }}-{{ target_color }}"
-        if ! incus info "$CT" >/dev/null 2>&1; then
-          echo "MISSING $CT"
-          exit 1
-        fi
-        STATE=$(incus list "$CT" -c s --format csv)
-        if [ "$STATE" != "RUNNING" ]; then
-          echo "$CT is $STATE (not RUNNING)"
-          exit 1
-        fi
-        echo "OK $CT"
-      args:
+    - name: Check each target-color container exists and is RUNNING
+      ansible.builtin.shell:
+        cmd: |
+          set -e
+          CT="{{ veza_container_prefix }}{{ item }}-{{ target_color }}"
+          if ! incus info "$CT" >/dev/null 2>&1; then
+            echo "MISSING $CT"
+            exit 1
+          fi
+          STATE=$(incus list "$CT" -c s --format csv)
+          if [ "$STATE" != "RUNNING" ]; then
+            echo "$CT is $STATE (not RUNNING)"
+            exit 1
+          fi
+          echo "OK $CT"
         executable: /bin/bash
       loop:
         - backend
@@ -85,29 +80,31 @@
         - web
       changed_when: false
       register: alive_check
       when: mode == 'fast'
       tags: [rollback, fast]

 - name: Fast rollback — flip HAProxy
-  hosts: "{{ veza_container_prefix + 'haproxy' }}"
+  hosts: haproxy
   become: true
   gather_facts: true
-  vars:
-    ansible_connection: community.general.incus
-    ansible_python_interpreter: /usr/bin/python3
-    veza_active_color: "{{ target_color }}"
-    # Fast rollback re-uses the previous SHA from the history file.
-    veza_release_sha: "{{ lookup('ansible.builtin.file', '/var/lib/veza/active-color.history', errors='ignore') | regex_search('sha=([0-9a-f]+)', '\\1') | default(['rollback'], true) | first }}"
-  roles:
-    - veza_haproxy_switch
-  when: mode == 'fast'
-  tags: [rollback, fast]
+  tasks:
+    - name: Apply veza_haproxy_switch with target_color
+      ansible.builtin.include_role:
+        name: veza_haproxy_switch
+      vars:
+        veza_active_color: "{{ target_color }}"
+        # Fast rollback re-uses the previous SHA from the history file.
+        # Fallback to a synthetic 40-char SHA if the file is missing —
+        # the role's assert tolerates this for the rollback case.
+        veza_release_sha: "{{ (lookup('ansible.builtin.file', '/var/lib/veza/active-color.history', errors='ignore') | default('', true) | regex_search('sha=([0-9a-f]{40})', '\\1') | default('r0llback' + '0' * 32, true)) }}"
+      when: mode == 'fast'
+      tags: [rollback, fast]

 # ---------------------------------------------------------------------
-# mode=full → re-import deploy_app.yml with the rollback SHA.
-# Functionally identical to a fresh deploy of an older release.
+# mode=full → re-run deploy_app.yml with the rollback SHA.
+# `when:` IS valid on import_playbook (unlike on a regular play).
 # ---------------------------------------------------------------------
-- name: Full rollback — delegate to deploy_app.yml with release_sha={{ veza_release_sha | default('') }}
+- name: Full rollback — delegate to deploy_app.yml
   ansible.builtin.import_playbook: deploy_app.yml
   when: mode == 'full'
   tags: [rollback, full]