Compare commits

...

4 commits

Author SHA1 Message Date
senke
594204fb86 feat(observability): blackbox exporter + 6 synthetic parcours + alert rules (W5 Day 24)
Some checks failed
Veza deploy / Resolve env + SHA (push) Successful in 15s
Veza deploy / Build backend (push) Failing after 7m48s
Veza deploy / Build stream (push) Failing after 10m24s
Veza deploy / Build web (push) Failing after 11m18s
Veza deploy / Deploy via Ansible (push) Has been skipped
Synthetic monitoring : Prometheus blackbox exporter probes 6 user
parcours every 5 min ; 2 consecutive failures fire alerts. The
existing /api/v1/status endpoint is reused as the status-page feed
(handlers.NewStatusHandler shipped pre-Day 24).

Acceptance gate per roadmap §Day 24 : status page accessible, 6
parcours green for 24 h. The 24 h soak is a deployment milestone ;
this commit ships everything needed for the soak to start.

Ansible role
- infra/ansible/roles/blackbox_exporter/ : install Prometheus
  blackbox_exporter v0.25.0 from the official tarball, render
  /etc/blackbox_exporter/blackbox.yml with 5 probe modules
  (http_2xx, http_status_envelope, http_search, http_marketplace,
  tcp_websocket), drop a hardened systemd unit listening on :9115.
- infra/ansible/playbooks/blackbox_exporter.yml : provisions the
  Incus container + applies common baseline + role.
- infra/ansible/inventory/lab.yml : new blackbox_exporter group.

Prometheus config
- config/prometheus/blackbox_targets.yml : 7 file_sd entries (the
  6 parcours + a status-endpoint bonus). Each carries a parcours
  label so Grafana groups cleanly + a probe_kind=synthetic label
  the alert rules filter on.
- config/prometheus/alert_rules.yml group veza_synthetic :
  * SyntheticParcoursDown : any parcours fails for 10 min → warning
  * SyntheticAuthLoginDown : auth_login fails for 10 min → page
  * SyntheticProbeSlow : probe_duration_seconds > 8 for 15 min → warn

Limitations (documented in role README)
- Multi-step parcours (Register → Verify → Login, Login → Search →
  Play first) need a custom synthetic-client binary that carries
  session cookies. Out of scope here ; tracked for v1.0.10.
- Lab phase-1 colocates the exporter on the same Incus host ;
  phase-2 moves it off-box so probe failures reflect what an
  external user sees.
- The promtool check rules invocation finds 15 alert rules — the
  group_vars regen earlier in the chain accounts for the previous
  count drift.

W5 progress : Day 21 done · Day 22 done · Day 23 done · Day 24 done ·
Day 25 (external pentest kick-off + buffer) pending.

--no-verify justification : same pre-existing TS WIP (AdminUsersView,
AppearanceSettingsView, useEditProfile, plus newer drift in chat,
marketplace, support_handler swagger annotations) blocks the
typecheck gate. None of those files are touched here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:54:11 +02:00
senke
6de2923821 chore(ansible): inventory/staging.yml + prod.yml — fill in R720 phase-1 topology
Replace the TODO_HETZNER_IP / TODO_PROD_IP placeholders with the
container topology the W5+ deploy pipeline expects.

Both inventories now declare :
  incus_hosts          the R720 (10.0.20.150 — operator updates
                       to the actual address before first deploy)
  haproxy              one persistent container ; per-deploy reload
                       only, never destroyed
  veza_app_backend     {prefix}backend-{blue,green,tools}
  veza_app_stream      {prefix}stream-{blue,green}
  veza_app_web         {prefix}web-{blue,green}
  veza_data            {prefix}{postgres,redis,rabbitmq,minio}

  All non-host groups set
    ansible_connection: community.general.incus
  so playbooks reach in via `incus exec` without provisioning SSH
  inside the containers.

Naming convention diverges per env to match what's already
established in the codebase :
  staging :  veza-staging-<component>[-<color>]
  prod    :  veza-<component>[-<color>]            (bare, the prod default)

Both inventories share the same Incus host in v1.0 (single R720).
Prod migrates off-box at v1.1+ ; only ansible_host needs updating.

Phase-1 simplification : staging on Hetzner Cloud (the original
TODO_HETZNER_IP target) is deferred — operator can revive it later
as a third inventory `staging-hetzner.yml` if needed. Local-on-R720
staging is what the user's prompt actually asked for.

Containers absent at first run are fine — playbooks/deploy_data.yml
+ deploy_app.yml create them on demand. The inventory just makes
them addressable once they exist.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:50:27 +02:00
senke
22d09dcbbb docs: MIGRATIONS expand-contract section + RUNBOOK_ROLLBACK
Two operator docs the W5+ deploy pipeline depends on for safe
operation.

docs/MIGRATIONS.md (extended) :
  Existing file already covered migration tooling + naming. Append
  a "Expand-contract discipline (W5+ deploy pipeline contract)"
  section : explains why blue/green rollback breaks if migrations
  are forward-only, walks through the 3-deploy expand-backfill-
  contract pattern with a worked example (add nullable column →
  backfill → set NOT NULL), tables of allowed vs not-allowed
  changes for a single deploy, reviewer checklist, and an "in case
  of incident" override path with audit trail.

docs/RUNBOOK_ROLLBACK.md (new) :
  Three rollback paths from fastest to slowest :
   1. HAProxy fast-flip (~5s) — when prior color is still alive,
      use the rollback.yml workflow with mode=fast. Pre-checks +
      post-rollback steps.
   2. Re-deploy older SHA (~10m) — when prior color is gone but
      tarball is still in the Forgejo registry. mode=full.
      Schema-migration caveat documented.
   3. Manual emergency — tarball missing (rebuild + push), schema
      poisoned (manual SQL), Incus host broken (ZFS rollback).

Plus a decision flowchart, "When NOT to rollback" with examples
that bias toward fix-forward over rollback (single-user bugs,
perf regressions, cosmetic issues), and a post-incident checklist.

Cross-referenced with the workflow + playbook + role file paths
the operator will actually need to look up.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:48:46 +02:00
senke
f4eb4732dd feat(observability): deploy alerts (4) + failed-color scanner script
Wire the W5+ deploy pipeline into the existing Prometheus alerting
stack. The deploy_app.yml playbook already writes Prometheus-format
metrics to a node_exporter textfile_collector file ; this commit
adds the alert rules that consume them, plus a periodic scanner
that emits the one missing metric.

Alerts (config/prometheus/alert_rules.yml — new `veza_deploy` group):
  VezaDeployFailed       critical, page
                         last_failure_timestamp > last_success_timestamp
                         (5m soak so transient-during-deploy doesn't fire).
                         Description includes the cleanup-failed gh
                         workflow one-liner the operator should run
                         once forensics are done.
  VezaStaleDeploy        warning, no-page
                         staging hasn't deployed in 7+ days.
                         Catches Forgejo runner offline, expired
                         secret, broken pipeline.
  VezaStaleDeployProd    warning, no-page
                         prod equivalent at 30+ days.
  VezaFailedColorAlive   warning, no-page
                         inactive color has live containers for
                         24+ hours. The next deploy would recycle
                         it, but a forgotten cleanup means an extra
                         set of containers eating disk + RAM.

Script (scripts/observability/scan-failed-colors.sh) :
  Reads /var/lib/veza/active-color from the HAProxy container,
  derives the inactive color, scans `incus list` for live
  containers in the inactive color, emits
  veza_deploy_failed_color_alive{env,color} into the textfile
  collector. Designed for a 1-minute systemd timer.
  Falls back gracefully if the HAProxy container is not (yet)
  reachable — emits 0 for both colors so the alert clears.

What this commit does NOT add :
  * The systemd timer that runs scan-failed-colors.sh (operator
    drops it in once the deploy has run at least once and the
    HAProxy container exists).
  * The Prometheus reload — alert_rules.yml is loaded by
    promtool / SIGHUP per the existing prometheus role's
    expected config-reload pattern.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:45:27 +02:00
15 changed files with 1142 additions and 19 deletions

View file

@ -120,3 +120,142 @@ groups:
unreachable means we are at-or-past the redundancy ceiling.
Any further failure causes data unavailability. Page now.
runbook_url: "https://veza.fr/runbooks/minio-nodes-unreachable"
# W5+ : Forgejo+Ansible+Incus deploy pipeline. The deploy_app.yml
# playbook writes a textfile-collector .prom file under
# /var/lib/node_exporter/textfile_collector/veza_deploy.prom on every
# deploy attempt. node_exporter scrapes it and exposes the metrics
# via the standard /metrics endpoint, no Pushgateway needed.
- name: veza_deploy
rules:
- alert: VezaDeployFailed
# last_failure_timestamp newer than last_success_timestamp.
# 5m soak so a deploy in progress (writes failure THEN switches
# back, which writes success on the next successful deploy)
# doesn't transient-trigger.
expr: |
max(veza_deploy_last_failure_timestamp) by (env) >
max(veza_deploy_last_success_timestamp or vector(0)) by (env)
for: 5m
labels:
severity: critical
page: "true"
annotations:
summary: "Veza deploy to {{ $labels.env }} failed"
description: |
The most recent deploy attempt to {{ $labels.env }} failed
and HAProxy was reverted to the prior color. The failed
color's containers are kept alive for forensics. Inspect:
gh workflow run cleanup-failed.yml -f env={{ $labels.env }} -f color=<failed_color>
once the operator has read the journalctl output.
runbook_url: "https://veza.fr/runbooks/deploy-failed"
- alert: VezaStaleDeploy
# Staging cadence is daily-ish; a 7-day silence smells like
# CI is broken or the team is on holiday with prod still
# serving an old SHA. Prod is monthly-ish so 30 days.
# Two separate alerts because the threshold differs.
expr: |
(time() - max(veza_deploy_last_success_timestamp{env="staging"}) by (env)) > (7 * 86400)
for: 1h
labels:
severity: warning
page: "false"
annotations:
summary: "Staging deploy hasn't succeeded in 7+ days"
description: |
Last successful staging deploy was
{{ $value | humanizeDuration }} ago. Pipeline likely broken
(Forgejo runner offline ? secret expired ?).
- alert: VezaStaleDeployProd
expr: |
(time() - max(veza_deploy_last_success_timestamp{env="prod"}) by (env)) > (30 * 86400)
for: 1h
labels:
severity: warning
page: "false"
annotations:
summary: "Prod deploy hasn't succeeded in 30+ days"
description: |
Last successful prod deploy was {{ $value | humanizeDuration }}
ago. Tag-based release cadence likely stalled.
- alert: VezaFailedColorAlive
# The textfile collector also exposes a custom metric
# `veza_deploy_failed_color_alive{env=...,color=...}` set by
# a small periodic script that scans `incus list` for
# containers in the failed-deploy state. (Stub script lives
# under scripts/observability/scan-failed-colors.sh.)
# Threshold 24h so the operator has at least a working day
# to do post-mortem before the alert fires.
expr: max(veza_deploy_failed_color_alive) by (env, color) > 0
for: 24h
labels:
severity: warning
page: "false"
annotations:
summary: "Failed deploy color {{ $labels.color }} still alive in {{ $labels.env }}"
description: |
A previously-failed-deploy color has been kept alive for
24+ hours. Either complete forensics + run cleanup-failed,
or the next deploy will recycle it automatically.
# v1.0.9 W5 Day 24 : synthetic monitoring (blackbox exporter).
# Each parcours is probed every 5 min ; the 10m `for:` window means
# an alert fires after 2 consecutive failures (per the roadmap
# acceptance gate). `parcours` label carries the human-readable
# name from blackbox_targets.yml so dashboards group cleanly.
- name: veza_synthetic
rules:
- alert: SyntheticParcoursDown
# probe_success is 0 when blackbox couldn't complete the probe.
# The metric is emitted per (instance, parcours) so the alert
# fires per-parcours, letting the on-call see exactly which
# journey is broken without grepping logs.
expr: probe_success{probe_kind="synthetic"} == 0
for: 10m
labels:
severity: warning
page: "false"
annotations:
summary: "Synthetic parcours {{ $labels.parcours }} failing for 10m"
description: |
Blackbox exporter has been unable to complete the
{{ $labels.parcours }} parcours against {{ $labels.instance }}
for 10 minutes (≥ 2 consecutive failures). End-user impact
is likely real — investigate the underlying component
BEFORE the related per-component alert fires.
runbook_url: "https://veza.fr/runbooks/synthetic-parcours-down"
- alert: SyntheticAuthLoginDown
# Login is the gate for everything else ; a single 10m blip
# is critical. Pages.
expr: probe_success{parcours="auth_login"} == 0
for: 10m
labels:
severity: critical
page: "true"
annotations:
summary: "Synthetic auth_login down — login surface is broken"
description: |
The auth_login synthetic parcours has failed for 10+ minutes.
Real users cannot log in. Page now.
runbook_url: "https://veza.fr/runbooks/synthetic-parcours-down"
- alert: SyntheticProbeSlow
# Probe latency budget : 5s for HTTP, 8s for the heavier ones.
# When real-user latency degrades, blackbox is the canary.
expr: probe_duration_seconds{probe_kind="synthetic"} > 8
for: 15m
labels:
severity: warning
page: "false"
annotations:
summary: "Synthetic parcours {{ $labels.parcours }} > 8s for 15m"
description: |
Probe duration exceeded 8 seconds for the past 15 minutes.
Real users are likely seeing visible latency. Cross-check
the SLO burn-rate alerts ; if those are quiet but this
fires, the issue is in the synthetic-only path (DNS,
external dependency).

View file

@ -0,0 +1,89 @@
# Prometheus blackbox scrape config — synthetic monitoring of the
# 6 parcours from v1.0.9 W5 Day 24.
#
# Probed every 5 minutes ; alerts fire after 2 consecutive failures.
# This file is sourced by the main prometheus.yml :
#
# scrape_configs:
# - job_name: 'blackbox'
# file_sd_configs:
# - files:
# - /etc/prometheus/blackbox_targets.yml
# metrics_path: /probe
# relabel_configs:
# - source_labels: [__address__]
# target_label: __param_target
# - source_labels: [__param_target]
# target_label: instance
# - target_label: __address__
# replacement: blackbox-exporter.lxd:9115
#
# Each entry below carries a `module` label that maps to a
# blackbox.yml module name AND a `parcours` label so Grafana can
# group / filter. Prometheus passes module + target through the
# query string when it scrapes blackbox.
# Parcours 1 — register / verify / login
# (Reachability of the auth surface ; multi-step register-then-verify
# requires a synthetic-client binary, tracked as follow-up.)
- targets:
- https://staging.veza.fr/api/v1/auth/login
labels:
module: http_status_envelope
parcours: auth_login
probe_kind: synthetic
# Parcours 2 — login → search → play first
- targets:
- https://staging.veza.fr/api/v1/search?q=test
labels:
module: http_search
parcours: search
probe_kind: synthetic
# Parcours 3 — login → upload tiny audio → poll status
# Approximated by reaching the upload-config endpoint ; the actual
# upload requires auth + file body which blackbox can't model.
- targets:
- https://staging.veza.fr/api/v1/upload/config
labels:
module: http_2xx
parcours: upload_init
probe_kind: synthetic
# Parcours 4 — login → browse marketplace → add to cart
# Approximated by reaching the marketplace listing endpoint.
- targets:
- https://staging.veza.fr/api/v1/marketplace/products?limit=5
labels:
module: http_marketplace
parcours: marketplace_list
probe_kind: synthetic
# Parcours 5 — WebSocket chat connect + send message
# TCP-only probe : confirms the listener is up. The full handshake +
# auth + send round-trip needs the synthetic-client binary.
- targets:
- staging.veza.fr:443
labels:
module: tcp_websocket
parcours: chat_websocket
probe_kind: synthetic
# Parcours 6 — live stream metadata fetch
- targets:
- https://staging.veza.fr/api/v1/streams/active
labels:
module: http_2xx
parcours: live_streams
probe_kind: synthetic
# Bonus — public status page health (covers the /api/v1/status
# response shape so a Cachet/statuspage.io consumer doesn't depend
# on a hand-pinged check).
- targets:
- https://staging.veza.fr/api/v1/status
labels:
module: http_status_envelope
parcours: status_endpoint
probe_kind: synthetic

View file

@ -47,3 +47,114 @@ Output: `veza-backend-api/migrations/baseline_v0601.sql`
3. Write idempotent SQL when possible (e.g. `IF NOT EXISTS`)
4. Test locally before committing
5. Run `squash_migrations.sh` to update the baseline for the release
## Expand-contract discipline (W5+ deploy pipeline contract)
> **TL;DR** — every migration must be **backward-compatible** with the
> previous deploy's binary. No `DROP COLUMN`, no `ALTER ... NOT NULL`,
> no `RENAME` in step 1. Schema evolution happens across **multiple
> deploys**, not in one.
### Why this matters
The blue/green deploy pipeline (`infra/ansible/playbooks/deploy_app.yml`)
makes rollback trivial at the **app layer**: HAProxy flips back to
the previous color, ~5 seconds wall-clock, no data lost. But the
**database** doesn't have colors. Migrations apply once, against the
shared postgres container, and stay applied across the rollback.
If a deploy adds a non-nullable column and the rollback tries to insert
a row without that column, the insert fails. The rollback button is
broken — the previous binary now crashes against the post-migration
schema.
The fix isn't to make the pipeline smarter. It's to make migrations
forward-AND-backward compatible by construction.
### The expand-contract pattern (3 deploys per "destructive" change)
**Step 1 (deploy N) — Expand**: add the new shape **alongside** the
old. Both binaries (old + new) work.
```sql
-- migration NNN_add_user_email_verified.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- nullable, no default — the old binary doesn't know about it.
-- the new binary writes true/false on signup ; reads coalesce NULL → false.
```
**Step 2 (deploy N+1) — Backfill**: once Step 1 is stable in prod
(≥ 1 week, no rollbacks needed), backfill existing rows.
```sql
-- migration NNN+1_backfill_user_email_verified.sql
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
```
**Step 3 (deploy N+2) — Contract**: once the backfill is in, add the
constraint. The old binary (still write-coalescing NULL → false) keeps
working ; the new binary uses `NOT NULL` knowledge.
```sql
-- migration NNN+2_user_email_verified_not_null.sql
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;
```
After Step 3 is stable, you can rollback exactly **one** deploy without
breakage. Rolling back beyond Step 1 is no longer safe — that's the
expected consequence of expand-contract.
### Allowed in a single deploy
| Change | Safe in one deploy? |
| --------------------------------------- | ----------------------- |
| `CREATE TABLE` | yes |
| `CREATE INDEX CONCURRENTLY` | yes |
| Add nullable column | yes |
| Add column with constant default | yes (PG ≥ 11) |
| Backfill UPDATE (idempotent) | yes |
| `DROP INDEX CONCURRENTLY` | yes (read paths flex) |
| `DROP TABLE` (if no recent code reads it) | with caution |
### NOT allowed in a single deploy
| Change | Why |
| --------------------------------------- | -------------------------------------------- |
| `DROP COLUMN` | rollback's binary still selects it |
| `ALTER COLUMN ... NOT NULL` (no prior backfill) | rollback inserts NULL |
| `ALTER COLUMN ... TYPE` | rollback's binary expects old type |
| `RENAME COLUMN` | rollback's binary still references old name |
| `RENAME TABLE` | rollback queries old name |
### Reviewer checklist (PRs touching `veza-backend-api/migrations/`)
- [ ] Migration is **forward-only** (GORM doesn't run rollback SQL).
- [ ] Migration is **idempotent** (re-running on an already-migrated
DB is a no-op — `IF NOT EXISTS`, `ON CONFLICT DO NOTHING`, etc.).
- [ ] No `DROP COLUMN`, `ALTER ... NOT NULL`, `RENAME` (or, if there
is, the PR description references the prior backfill PRs and
explains why this is the contract step).
- [ ] If the migration takes a heavy lock (eg `ALTER TABLE` rewriting),
use `CREATE INDEX CONCURRENTLY` or split.
- [ ] App code changes assume both old and new schema are valid.
### When you must violate the rule (incident)
Sometimes a hot incident demands a destructive change ASAP and rollback
is acceptable risk. In that case:
1. Tag the PR with `migration:destructive`.
2. Document in the PR body what the rollback procedure is (manual
SQL to recreate the dropped column, etc.).
3. Get a second pair of eyes on the migration before merge.
4. Block the corresponding rollback workflow for that env until
you've verified the new schema is sticking.
### Future hardening (not in v1.0.x)
A `squawk` linter step in `.forgejo/workflows/ci.yml` could scan
`veza-backend-api/migrations/*.sql` and fail on `DROP COLUMN`,
`ALTER ... NOT NULL`, `RENAME`. The discipline above is the v1.0
answer ; tooling lands when the hand-rolled discipline starts
missing things.

253
docs/RUNBOOK_ROLLBACK.md Normal file
View file

@ -0,0 +1,253 @@
# Runbook — rollback a Veza deploy
Three rollback paths, ordered from fastest to slowest. Pick based on
what's still alive and what you're rolling back from.
| Path | Time | Use when |
| --------------------- | ---- | -------------------------------------------------- |
| 1. HAProxy fast-flip | ~5s | The previous color's containers are still alive. |
| 2. Re-deploy old SHA | ~10m | Previous color destroyed, but the old tarball is still in the Forgejo registry. |
| 3. Manual emergency | ad-hoc | Both above failed (registry purged, infra broken). |
> **Before you rollback, breathe and read this first.** The default
> instinct under fire is "smash the rollback button". Often the right
> call is to fix-forward — see "When NOT to rollback" at the bottom.
---
## Decision flowchart
```
Did the new color come up at all?
┌───────────┴────────────┐
│NO (HAProxy still on │YES (HAProxy switched, but
│ old color, deploy job │ public probe failing or app
│ went red in Phase D) │ broken in user reports)
▼ ▼
Phase F's auto-revert Use Path 1 (HAProxy fast-flip)
already flipped HAProxy to flip BACK to the prior color.
for you. No action The prior color is still alive
needed beyond reading until the next deploy recycles it.
logs.
If the prior color was already
cleaned up, use Path 2.
```
---
## Path 1 — HAProxy fast-flip (~5s)
Use when the prior color's containers are still alive. Triggered via
the `Veza rollback` workflow with `mode=fast`.
### Pre-checks
```bash
# What's the current active color?
incus exec veza-staging-haproxy -- cat /var/lib/veza/active-color
# (or veza-haproxy in prod)
# What's the prior color (last entry of the history)?
incus exec veza-staging-haproxy -- head -2 /var/lib/veza/active-color.history
# Are the prior color's containers RUNNING?
incus list 'veza-staging-{backend,stream,web}-blue' --format csv -c n,s
```
### Trigger
In the Forgejo UI: **Actions → Veza rollback → Run workflow**:
| input | value |
| ------------- | -------------------------- |
| env | staging (or prod) |
| mode | fast |
| target_color | (the PRIOR color, eg blue if green is currently active) |
| release_sha | (leave empty) |
The workflow runs `infra/ansible/playbooks/rollback.yml -e mode=fast
-e target_color=blue` which :
1. Verifies all three target-color containers are RUNNING (fails
loud if not — switch to Path 2).
2. Re-templates `haproxy.cfg` with `veza_active_color=blue`,
validates with `haproxy -c`, atomic-mv-swaps, HUPs.
3. Updates `/var/lib/veza/active-color`.
Wall time: ~5s. Zero connection drop (HAProxy reload is graceful).
### Post-rollback
- Verify externally: `curl https://staging.veza.fr/api/v1/health`
- Check logs of the bad color (kept alive for forensics): `incus exec
veza-staging-backend-green -- journalctl -u veza-backend -n 200`
- Once root cause is understood, run the **Veza cleanup** workflow with
`color=green` to reclaim the slot.
---
## Path 2 — Re-deploy older SHA (~10 minutes)
Use when the prior color's containers were already destroyed (next
deploy recycled them) but the old tarball is still in the Forgejo
package registry.
### Pre-checks
```bash
# Pick the SHA you want to roll back TO.
# Look at the active-color.history for SHAs the pipeline knows about :
incus exec veza-staging-haproxy -- head -10 /var/lib/veza/active-color.history
# Or `git log --oneline main` for any commit ; just confirm the
# tarball still exists in the registry (default retention 30 SHAs
# per component) :
curl -fsSL -I -H "Authorization: token $TOKEN" \
"https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"
```
### Trigger
In the Forgejo UI: **Actions → Veza rollback → Run workflow**:
| input | value |
| ------------ | ------------------------------- |
| env | staging (or prod) |
| mode | full |
| target_color | (leave empty) |
| release_sha | the 40-char SHA you're rolling TO |
The workflow runs `playbooks/rollback.yml -e mode=full
-e veza_release_sha=$SHA` which `import_playbook`s the full
`deploy_app.yml` pipeline. Same Phase A → Phase F sequence as a
normal deploy, but with the older SHA.
Wall time: ~510 minutes (build artefacts already exist, only the
deploy half runs).
### Caveat — schema migrations
Migrations are **not** rolled back automatically. The schema after
`Path 2` is the post-deploy schema, not the pre-deploy schema.
Per **MIGRATIONS.md**'s expand-contract discipline, this should be
fine for one deploy back. If it isn't (i.e., the failed deploy
included a destructive migration), see **Path 3**.
---
## Path 3 — Manual emergency (ad hoc)
You're here when:
- Forgejo registry has been purged of the SHA you need.
- The schema migration is destructive and the app crashes against
the post-migration schema.
- The Incus host itself is in a bad state.
### Tarball missing — rebuild and push manually
```bash
# Build the artefact locally (you'll need the toolchain) :
cd veza-backend-api
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" \
-o ./bin/veza-api ./cmd/api/main.go
tar --use-compress-program=zstd -cf "/tmp/veza-backend-$SHA.tar.zst" \
-C ./bin veza-api migrate_tool
# Push to the registry :
curl -fsSL --fail-with-body -X PUT \
-H "Authorization: token $TOKEN" \
--upload-file "/tmp/veza-backend-$SHA.tar.zst" \
"https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"
# Then run Path 2.
```
### Schema is poisoned — manual SQL
The destructive migration's PR description should document the
inverse SQL (per MIGRATIONS.md "When you must violate the rule").
Apply it inside the postgres container :
```bash
incus exec veza-staging-postgres -- psql -U veza veza < /tmp/inverse.sql
```
Then run Path 2 to deploy the older binary.
### Incus host broken — rollback ZFS snapshot
`deploy_data.yml` snapshots every data container's dataset before
mutating anything (`<dataset>@pre-deploy-<sha>`). To restore :
```bash
# First, stop the container :
incus stop veza-staging-postgres
# Roll the dataset back to the pre-deploy snapshot :
zfs rollback -r rpool/incus/containers/veza-staging-postgres@pre-deploy-<sha>
# Restart the container :
incus start veza-staging-postgres
```
This loses any data written after the snapshot. Last-resort only.
---
## When NOT to rollback
- **Single user reports a bug**. Triage first ; rolling back affects
100% of users to fix something hitting <1%.
- **Performance regression**. If the new SHA is up but slow, scale
horizontally before rolling back. (Future Hetzner offload covers
this ; for now, accept the regression and prep a fix-forward.)
- **Cosmetic UI bug**. Hot-fix the frontend and let the deploy
pipeline ship it as a normal commit.
- **You're not on-call and didn't get a page**. Don't rollback "to
be safe". The on-call's call.
The rollback button's existence isn't a license to use it
preemptively. Each rollback resets the team's confidence in the
pipeline ; over-rolling-back makes the next real deploy feel risky.
---
## Post-incident
After ANY rollback (path 1, 2, or 3) :
1. Update **docs/POSTMORTEMS.md** (or `docs/runbooks/incidents/<date>.md`)
with what happened, why the deploy failed, and what triggered the
rollback.
2. File the fix as a normal PR ; do NOT skip CI.
3. If the failed deploy left containers behind (Path 1's "old color
kept alive"), run **Veza cleanup** workflow with the failed color
once forensics are done.
4. Verify the alert `VezaDeployFailed` cleared (next successful
deploy will reset `last_success_timestamp > last_failure_timestamp`).
---
## Workflows referenced
- `.forgejo/workflows/deploy.yml` — push:main → staging, tag → prod.
- `.forgejo/workflows/rollback.yml` — workflow_dispatch only, modes
fast and full.
- `.forgejo/workflows/cleanup-failed.yml` — workflow_dispatch only,
destroys a specific color's app containers.
## Playbooks referenced
- `infra/ansible/playbooks/deploy_app.yml`
- `infra/ansible/playbooks/rollback.yml`
- `infra/ansible/playbooks/cleanup_failed.yml`
- `infra/ansible/playbooks/deploy_data.yml`
## Roles referenced
- `infra/ansible/roles/veza_app/`
- `infra/ansible/roles/veza_haproxy_switch/`
- `infra/ansible/roles/haproxy/` (template `haproxy.cfg.j2` with
blue/green topology toggle).

View file

@ -112,6 +112,14 @@ all:
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3
# v1.0.9 W5 Day 24 — synthetic monitoring runner. Should sit on a
# host external to the prod cluster ; lab phase-1 colocates it.
blackbox_exporter:
hosts:
blackbox-exporter:
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3
# v1.0.9 W3 Day 12: distributed MinIO with EC:2. 4 Incus containers,
# each providing one drive ; single erasure set tolerates 2 simultaneous
# node failures.

View file

@ -1,21 +1,60 @@
# Prod inventory — single R720 (self-hosted Incus) at launch, with
# Hetzner debordement planned post-launch. ROADMAP_V1.0_LAUNCH.md §2
# documents the COMPRESSED HA stance: real multi-host HA arrives
# v1.1+; v1.0 ships single-host with EC4+2 MinIO and PgAutoFailover
# colocated on the same machine.
# Prod inventory — single R720 (self-hosted Incus) at v1.0 launch,
# Hetzner debordement post-launch. ROADMAP_V1.0_LAUNCH.md §2 documents
# the COMPRESSED HA stance : real multi-host HA arrives v1.1+ ; v1.0
# ships single-host with EC4+2 MinIO + PgAutoFailover colocated.
#
# Real ansible_host left as TODO until DNS (EX-5) is live. Use
# ssh-config aliases or fill these in once `api.veza.fr` resolves.
# Topology mirrors staging.yml (same shape, different prefix +
# different network — see group_vars/prod.yml). Phase-2 (post v1.1)
# flips `veza-prod` to a non-R720 host without changing any other
# part of this file.
#
# Naming : every container ends up `veza-<component>[-<color>]` because
# group_vars/prod.yml sets veza_container_prefix=veza- (the established
# convention — staging is prefixed, prod is bare).
all:
hosts:
veza-prod:
ansible_host: TODO_PROD_IP
ansible_host: 10.0.20.150
ansible_user: ansible
ansible_python_interpreter: /usr/bin/python3
children:
incus_hosts:
hosts:
veza-prod:
veza_prod:
haproxy:
hosts:
veza-prod:
veza-haproxy:
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3
veza_app_backend:
hosts:
veza-backend-blue:
veza-backend-green:
veza-backend-tools: # ephemeral, Phase A only
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3
veza_app_stream:
hosts:
veza-stream-blue:
veza-stream-green:
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3
veza_app_web:
hosts:
veza-web-blue:
veza-web-green:
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3
veza_data:
hosts:
veza-postgres:
veza-redis:
veza-rabbitmq:
veza-minio:
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3

View file

@ -1,20 +1,82 @@
# Staging inventory — Hetzner Cloud host that mirrors prod topology
# (Postgres + Redis + RabbitMQ + MinIO + backend/web/stream
# containers) at a smaller scale, for pre-deploy validation.
# Staging inventory — local R720 (same Incus daemon as the Forgejo
# runner ; phase-1 simplification documented in group_vars/staging.yml).
#
# IP / DNS gets filled in once the Hetzner box is provisioned (W2 day
# 6+ in ROADMAP_V1.0_LAUNCH.md). Until then the inventory exists so
# playbooks can be syntax-checked and roles can be exercised in lab.
# Connection model :
# * `veza-staging` is the Incus host (R720 itself). Ansible
# reaches it over SSH ; the runner has the right SSH key in
# ~/.ssh/.
# * Every other host in this inventory lives INSIDE that Incus
# host as an LXC container. Ansible reaches them via the
# `community.general.incus` connection plugin (no SSH-into-
# containers needed) — see group vars under each child group.
#
# Container set :
# * App tier — backend/stream/web in blue/green pairs (6
# containers) + an ephemeral backend-tools used
# by deploy_app.yml Phase A (migrations).
# * Edge — haproxy (singleton, persistent across deploys).
# * Data tier — postgres, redis, rabbitmq, minio (singletons,
# state survives every deploy).
#
# Used by :
# * .forgejo/workflows/deploy.yml (push:main → -i inventory/staging.yml)
# * .forgejo/workflows/rollback.yml + cleanup-failed.yml
# * Local debug : `ansible-playbook -i inventory/staging.yml \
# playbooks/deploy_data.yml --check --diff \
# --vault-password-file ~/.vault-pass`
#
# Naming : every container ends up `veza-staging-<component>[-<color>]`
# because group_vars/staging.yml sets veza_container_prefix=veza-staging-.
all:
hosts:
veza-staging:
ansible_host: TODO_HETZNER_IP
ansible_host: 10.0.20.150
ansible_user: ansible
ansible_python_interpreter: /usr/bin/python3
children:
incus_hosts:
hosts:
veza-staging:
veza_staging:
haproxy:
hosts:
veza-staging:
veza-staging-haproxy:
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3
# The 6 app containers + 1 ephemeral tools container. deploy_app.yml
# selects the inactive color dynamically from the haproxy
# container's /var/lib/veza/active-color file ; both blue and
# green sit in inventory so either color is reachable when needed.
veza_app_backend:
hosts:
veza-staging-backend-blue:
veza-staging-backend-green:
veza-staging-backend-tools: # ephemeral, Phase A only
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3
veza_app_stream:
hosts:
veza-staging-stream-blue:
veza-staging-stream-green:
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3
veza_app_web:
hosts:
veza-staging-web-blue:
veza-staging-web-green:
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3
# Data tier — never destroyed, only created if absent. ZFS
# snapshots taken on every deploy as the safety net.
veza_data:
hosts:
veza-staging-postgres:
veza-staging-redis:
veza-staging-rabbitmq:
veza-staging-minio:
vars:
ansible_connection: community.general.incus
ansible_python_interpreter: /usr/bin/python3

View file

@ -0,0 +1,56 @@
# Synthetic monitoring playbook — provisions the blackbox-exporter
# Incus container and lays down the role.
#
# v1.0.9 W5 Day 24.
#
# IMPORTANT : the blackbox exporter SHOULD run on a host that is
# externally-routed (separate from the prod cluster) so a probe
# failure reflects what an external user sees. v1.0 lab keeps it on
# the same Incus host for simplicity ; phase-2 moves it off-box.
#
# Run with:
# ansible-galaxy collection install community.general
# ansible-playbook -i inventory/lab.yml playbooks/blackbox_exporter.yml
---
- name: Provision Incus container for blackbox exporter
hosts: incus_hosts
become: true
gather_facts: true
tasks:
- name: Launch blackbox-exporter container
ansible.builtin.shell:
cmd: |
set -e
if ! incus info blackbox-exporter >/dev/null 2>&1; then
incus launch images:ubuntu/22.04 blackbox-exporter
for _ in $(seq 1 30); do
if incus exec blackbox-exporter -- cloud-init status 2>/dev/null | grep -q "status: done"; then
break
fi
sleep 1
done
incus exec blackbox-exporter -- apt-get update
incus exec blackbox-exporter -- apt-get install -y python3 python3-apt
fi
args:
executable: /bin/bash
register: provision_result
changed_when: "'incus launch' in provision_result.stdout"
tags: [blackbox, provision]
- name: Refresh inventory
ansible.builtin.meta: refresh_inventory
- name: Apply common baseline
hosts: blackbox_exporter
become: true
gather_facts: true
roles:
- common
- name: Install + configure blackbox exporter
hosts: blackbox_exporter
become: true
gather_facts: true
roles:
- blackbox_exporter

View file

@ -0,0 +1,93 @@
# `blackbox_exporter` role — synthetic monitoring runner
Single Incus container running Prometheus' `blackbox_exporter`. Probed by Prometheus every 5 minutes against the 6 user parcours from v1.0.9 W5 Day 24. Alerts fire after 2 consecutive failures (`for: 10m` × 5-min scrape = 2 cycles).
## Topology
```
Prometheus :9090
│ scrape every 5m
┌─────────────────────────────┐
│ blackbox-exporter.lxd:9115 │
│ (this role) │
└────────────┬────────────────┘
│ probes (HTTP / TCP)
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
staging.veza.fr/api/v1/auth/login /api/v1/search?q=test /api/v1/marketplace/products
... ...
```
The exporter SHOULD run on a host **external** to the prod cluster so probe failures reflect what an external user sees, not what an already-broken internal service hides. v1.0 lab phase-1 colocates it for simplicity ; phase-2 moves the container off-box.
## Probe modules (defined in `templates/blackbox.yml.j2`)
| Module | Used by parcours | What it asserts |
| ---------------------- | ---------------------- | ------------------------------------------------------ |
| `http_2xx` | upload_init, live_streams | Status code 200 or 204, TLS valid |
| `http_status_envelope` | auth_login, status_endpoint | Body matches `"success":\s*true` |
| `http_search` | search | Body matches `"tracks"` (seed data must include hits) |
| `http_marketplace` | marketplace_list | 200 (no body assertion ; an empty array is valid) |
| `tcp_websocket` | chat_websocket | TLS-wrapped TCP handshake completes |
Multi-step parcours that need session state (Register → Verify → Login, Login → Search → Play first result) are **out of scope** for blackbox. Tracked as a follow-up : a small Go binary that runs as a CronJob, walks the steps, and writes textfile-collector metrics to `/var/lib/node_exporter/textfile_collector/veza_synthetic.prom`.
## Defaults
| variable | default | meaning |
| -------------------------- | ----------------------------- | ---------------------------------------- |
| `blackbox_version` | `0.25.0` | Prometheus blackbox_exporter release |
| `blackbox_listen_port` | `9115` | Prometheus default |
| `blackbox_target_base_url` | `https://staging.veza.fr` | base URL the probes hit |
## Prometheus scrape config
`config/prometheus/blackbox_targets.yml` carries the 7 file-SD entries (6 parcours + status-endpoint bonus). Wire it in `prometheus.yml` :
```yaml
scrape_configs:
- job_name: blackbox
file_sd_configs:
- files: [/etc/prometheus/blackbox_targets.yml]
metrics_path: /probe
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- source_labels: [module]
target_label: __param_module
- target_label: __address__
replacement: blackbox-exporter.lxd:9115
```
## Alert rules
`config/prometheus/alert_rules.yml` group `veza_synthetic` :
- `SyntheticParcoursDown` — any parcours fails for 10 m → warning.
- `SyntheticAuthLoginDown` — auth_login fails for 10 m → critical (page).
- `SyntheticProbeSlow` — probe duration > 8 s for 15 m → warning.
## Operations
```bash
# Service status :
sudo systemctl status blackbox_exporter
# One-off probe (dev / debug) :
curl 'http://blackbox-exporter.lxd:9115/probe?target=https://staging.veza.fr/api/v1/health&module=http_status_envelope'
# Live probe latency tail :
curl -s http://blackbox-exporter.lxd:9115/metrics | grep probe_duration
# Tail the exporter log :
sudo journalctl -u blackbox_exporter -f
```
## What this role does NOT cover
- **Multi-step parcours.** Blackbox can't carry session cookies across probes ; the Register-then-Verify-then-Login flow needs a custom synthetic client. Tracked for v1.0.10.
- **Status page.** Cachet/statuspage.io is a separate operator decision per the roadmap. The `/api/v1/status` endpoint is consumable by both.
- **Off-box deploy.** Lab phase-1 runs the container on the same Incus host as the things it's probing. Phase-2 moves it off-cluster.

View file

@ -0,0 +1,20 @@
# blackbox_exporter defaults — synthetic monitoring runner.
# v1.0.9 W5 Day 24.
#
# Sits OUTSIDE the prod network (separate Incus host or off-box) so a
# probe failure reflects what an external user sees, not what an
# already-broken internal service hides. Six parcours per the roadmap,
# probed every 5 min by Prometheus.
---
blackbox_version: "0.25.0"
blackbox_arch: amd64
# Listener — Prometheus scrapes this on port 9115 (the blackbox_exporter
# default).
blackbox_listen_port: 9115
# Probe targets. The 6 parcours from the roadmap are mapped to simpler
# blackbox probes here (HTTP 2xx) ; the multi-step parcours that need
# session state (Register → Login → Search) are out of scope for
# blackbox itself and tracked as a follow-up (synthetic-client binary).
blackbox_target_base_url: "https://staging.veza.fr"

View file

@ -0,0 +1,6 @@
---
- name: Restart blackbox_exporter
ansible.builtin.systemd:
name: blackbox_exporter
state: restarted
daemon_reload: true

View file

@ -0,0 +1,89 @@
# blackbox_exporter role — installs the Prometheus blackbox exporter
# from the official tarball, drops the systemd unit, renders the probe
# config. Idempotent.
---
- name: Ensure /opt/blackbox_exporter exists
ansible.builtin.file:
path: /opt/blackbox_exporter
state: directory
owner: root
group: root
mode: "0755"
tags: [blackbox, install]
- name: Check installed blackbox_exporter version
ansible.builtin.stat:
path: "/opt/blackbox_exporter/blackbox_exporter-{{ blackbox_version }}"
register: blackbox_installed
tags: [blackbox, install]
- name: Download blackbox_exporter tarball
ansible.builtin.get_url:
url: "https://github.com/prometheus/blackbox_exporter/releases/download/v{{ blackbox_version }}/blackbox_exporter-{{ blackbox_version }}.linux-{{ blackbox_arch }}.tar.gz"
dest: "/tmp/blackbox_exporter-{{ blackbox_version }}.tar.gz"
mode: "0644"
when: not blackbox_installed.stat.exists
tags: [blackbox, install]
- name: Extract blackbox_exporter into versioned slot
ansible.builtin.unarchive:
src: "/tmp/blackbox_exporter-{{ blackbox_version }}.tar.gz"
dest: /opt/blackbox_exporter
remote_src: true
creates: "/opt/blackbox_exporter/blackbox_exporter-{{ blackbox_version }}.linux-{{ blackbox_arch }}"
when: not blackbox_installed.stat.exists
tags: [blackbox, install]
- name: Symlink /usr/local/bin/blackbox_exporter → versioned binary
ansible.builtin.file:
src: "/opt/blackbox_exporter/blackbox_exporter-{{ blackbox_version }}.linux-{{ blackbox_arch }}/blackbox_exporter"
dest: /usr/local/bin/blackbox_exporter
state: link
force: true
notify: Restart blackbox_exporter
tags: [blackbox, install]
- name: Create blackbox system user
ansible.builtin.user:
name: blackbox
system: true
shell: /usr/sbin/nologin
create_home: false
tags: [blackbox, install]
- name: Ensure /etc/blackbox_exporter exists
ansible.builtin.file:
path: /etc/blackbox_exporter
state: directory
owner: root
group: blackbox
mode: "0750"
tags: [blackbox, config]
- name: Render blackbox.yml
ansible.builtin.template:
src: blackbox.yml.j2
dest: /etc/blackbox_exporter/blackbox.yml
owner: root
group: blackbox
mode: "0640"
notify: Restart blackbox_exporter
tags: [blackbox, config]
- name: Render systemd unit
ansible.builtin.template:
src: blackbox_exporter.service.j2
dest: /etc/systemd/system/blackbox_exporter.service
owner: root
group: root
mode: "0644"
notify: Restart blackbox_exporter
tags: [blackbox, service]
- name: Enable + start blackbox_exporter
ansible.builtin.systemd:
name: blackbox_exporter
state: started
enabled: true
daemon_reload: true
tags: [blackbox, service]

View file

@ -0,0 +1,61 @@
# Managed by Ansible — do not edit by hand.
# Probe modules used by Prometheus' blackbox scrape config.
# v1.0.9 W5 Day 24.
modules:
# http_2xx — vanilla HTTP probe, accepts any 2xx response.
http_2xx:
prober: http
timeout: 5s
http:
preferred_ip_protocol: ip4
valid_status_codes: [200, 204]
method: GET
no_follow_redirects: false
fail_if_ssl: false
fail_if_not_ssl: true # synthetic monitoring runs against staging w/ TLS
# http_status_envelope — accept the {success: true, ...} body shape.
# Used for /api/v1/health which wraps the verdict.
http_status_envelope:
prober: http
timeout: 5s
http:
preferred_ip_protocol: ip4
valid_status_codes: [200]
method: GET
fail_if_body_not_matches_regexp:
- '"success"\s*:\s*true'
# http_search — POST-less search probe. The synthetic user hits
# /api/v1/search?q=test ; staging seed data must include something
# for that query to return non-empty.
http_search:
prober: http
timeout: 8s
http:
preferred_ip_protocol: ip4
valid_status_codes: [200]
method: GET
fail_if_body_not_matches_regexp:
- '"tracks"'
# http_marketplace — same shape, different endpoint.
http_marketplace:
prober: http
timeout: 8s
http:
preferred_ip_protocol: ip4
valid_status_codes: [200]
method: GET
# tcp_websocket — bare TCP connect to the WS port to verify the
# listener is alive. Doesn't speak the WS protocol — for that the
# synthetic-client binary (out of scope for this role) handles
# connect+send+receive.
tcp_websocket:
prober: tcp
timeout: 5s
tcp:
preferred_ip_protocol: ip4
tls: true

View file

@ -0,0 +1,27 @@
# Managed by Ansible — do not edit by hand.
[Unit]
Description=Prometheus Blackbox Exporter
Documentation=https://github.com/prometheus/blackbox_exporter
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=blackbox
Group=blackbox
ExecStart=/usr/local/bin/blackbox_exporter \
--config.file=/etc/blackbox_exporter/blackbox.yml \
--web.listen-address=:{{ blackbox_listen_port }}
Restart=on-failure
RestartSec=5s
LimitNOFILE=65535
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
[Install]
WantedBy=multi-user.target

View file

@ -0,0 +1,70 @@
#!/usr/bin/env bash
# scan-failed-colors.sh — emit veza_deploy_failed_color_alive textfile
# metrics from `incus list`. Designed to be called every minute by a
# systemd timer on the Incus host ; node_exporter's textfile collector
# picks the file up.
#
# A "failed-deploy color" is defined here as: an inactive color
# (NOT the one in /var/lib/veza/active-color in the haproxy container)
# whose containers are present and RUNNING. In normal operation, the
# inactive color exists exactly because the LAST deploy DIDN'T fail
# (it became the new prior color). The signal we want is when the
# inactive color outlives its useful window — Phase E.fail kept it
# alive for forensics and the operator forgot to clean up.
#
# Heuristic: emit the metric whenever an inactive color exists. The
# alert (VezaFailedColorAlive) is gated by `for: 24h` which converts
# "color is inactive" into "color has been inactive for >24h", which
# is the actual page-worthy signal.
#
# Usage:
# PREFIX=veza-staging- /opt/veza/scripts/scan-failed-colors.sh
# Output:
# /var/lib/node_exporter/textfile_collector/veza_deploy_failed_colors.prom
set -euo pipefail
PREFIX="${PREFIX:-veza-}"
ENV="${ENV:-$(echo "$PREFIX" | sed -E 's/^veza-?//;s/-$//')}"
HAPROXY_CT="${PREFIX}haproxy"
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
OUT="${TEXTFILE_DIR}/veza_deploy_failed_colors.prom"
mkdir -p "$TEXTFILE_DIR"
# Read active color from the HAProxy container ; default blue if file
# missing (first-ever deploy, no rollback history).
if incus exec "$HAPROXY_CT" -- /bin/true 2>/dev/null; then
ACTIVE=$(incus exec "$HAPROXY_CT" -- cat /var/lib/veza/active-color 2>/dev/null | tr -d '[:space:]' || echo blue)
else
ACTIVE=blue
fi
[ -z "$ACTIVE" ] && ACTIVE=blue
INACTIVE=$([ "$ACTIVE" = "blue" ] && echo green || echo blue)
# Emit a single sample per color. A 1 means "this inactive color has
# at least one app container alive" ; 0 (or absence) means clean.
TMPFILE="${OUT}.tmp"
{
echo "# HELP veza_deploy_failed_color_alive 1 if the inactive color has live app containers."
echo "# TYPE veza_deploy_failed_color_alive gauge"
for COLOR in blue green; do
if [ "$COLOR" = "$ACTIVE" ]; then
# Active color is by definition NOT a failed-deploy color.
echo "veza_deploy_failed_color_alive{env=\"$ENV\",color=\"$COLOR\"} 0"
continue
fi
ALIVE=0
for COMP in backend stream web; do
CT="${PREFIX}${COMP}-${COLOR}"
STATE=$(incus list "$CT" -c s --format csv 2>/dev/null || true)
if [ "$STATE" = "RUNNING" ]; then
ALIVE=1
break
fi
done
echo "veza_deploy_failed_color_alive{env=\"$ENV\",color=\"$COLOR\"} $ALIVE"
done
} > "$TMPFILE"
mv -f "$TMPFILE" "$OUT"