# Canary release — backend-api

Audience: the on-call engineer running a release. Trigger: a new backend-api binary signed off for prod. Owner: whoever is on the deploy rota that day.

The canary recipe ships the new binary to one backend at a time, watches the SLI for a window, and only continues to the next backend when the SLI stays green. If the SLI breaches at any point, the canary node rolls back automatically to the last-known-good binary.

## Trigger conditions

Run the canary script when one of these is true:

- A normal feature release: new code path, no schema migration that requires lockstep coordination.
- A hot-fix for a Sev-2 or lower issue. Sev-1 issues (security or data integrity) follow the all-stop rotate path documented in docs/runbooks/INCIDENT_RESPONSE.md instead.

## Pre-flight checklist

- Migration backward-compat: the latest schema migration is additive only — no `DROP COLUMN`, no `ALTER COLUMN ... TYPE`, no `ADD COLUMN ... NOT NULL` without `DEFAULT`. The script's pre-deploy hook (scripts/check-migration-backward-compat.sh) refuses to proceed when it finds one; bypass with `FORCE_MIGRATE=1` only after you have worked out how to split the migration into backward-compatible steps.
- Last-known-good binary is preserved. Either (a) the previous release's veza-api is still on the host at /opt/veza/backend-api/veza-api.previous, or (b) you have it locally and pass `ROLLBACK_BINARY=/path/to/old/veza-api` as an env var to the script.
- Prometheus is reachable from the deploy host. The SLI monitor queries `${PROM_URL}` (default http://prom.lxd:9090) every `${SLI_PROBE_INTERVAL}` seconds for the monitor window (default 1 hour).
- HAProxy admin socket is reachable: the script execs into the haproxy Incus container to drive `set server ${POOL}/${NODE} state drain|ready` via socat.
- No game day in the same window. The canary needs a quiet baseline; chaos drills will push the SLI red and trigger a false rollback.
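The core of the backward-compat hook can be sketched as a few greps over each migration file. This is a hypothetical reconstruction; scripts/check-migration-backward-compat.sh itself remains the source of truth, and `check_file` is an illustrative name.

```shell
# Sketch only: flags destructive DDL in one migration file and honours
# the FORCE_MIGRATE=1 operator bypass. The real hook is authoritative.
check_file() {
  f="$1"
  if [ "${FORCE_MIGRATE:-0}" = "1" ]; then
    return 0   # explicit operator bypass
  fi
  if grep -qiE 'DROP[[:space:]]+COLUMN' "$f"; then
    echo "forbidden: DROP COLUMN in $f" >&2; return 1
  fi
  if grep -qiE 'ALTER[[:space:]]+COLUMN.*TYPE' "$f"; then
    echo "forbidden: ALTER COLUMN ... TYPE in $f" >&2; return 1
  fi
  # ADD COLUMN ... NOT NULL is only safe when the same line carries a DEFAULT
  if grep -iE 'ADD[[:space:]]+COLUMN.*NOT[[:space:]]+NULL' "$f" | grep -qiv 'DEFAULT'; then
    echo "forbidden: ADD COLUMN ... NOT NULL without DEFAULT in $f" >&2; return 1
  fi
  return 0
}
```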

## How

### One-shot via Make

```shell
make deploy-canary ARTIFACT=/tmp/veza-api-v1.0.10
```

The Make target wraps the script with reasonable defaults. Override any env var (see the script header) by exporting it before the make call.

### Direct script invocation

```shell
ARTIFACT=/tmp/veza-api-v1.0.10 \
ROLLBACK_BINARY=/opt/veza/backend-api/veza-api.previous \
SLI_WINDOW=3600 \
PROM_URL=http://prom.lxd:9090 \
bash scripts/deploy-canary.sh
```

The script is idempotent on the steps that matter: draining an already-drained server is a no-op; pushing the same binary twice is a no-op (file mtime invariant). Re-runs after a partial failure are safe.
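The same-binary no-op can be sketched as a guard in front of the copy. A minimal sketch: a byte-for-byte comparison stands in here for the script's mtime check, and `install_if_changed` is an invented name.

```shell
# Sketch of the "same binary twice is a no-op" guard (illustrative names;
# cmp is used instead of the script's mtime invariant).
install_if_changed() {
  src="$1" dest="$2"
  if [ -f "$dest" ] && cmp -s "$src" "$dest"; then
    echo "identical binary already installed -- no-op"
    return 0
  fi
  install -m 0755 "$src" "$dest"
  echo "installed new binary"
}
```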

## What happens, in order

1. Pre-deploy hook runs scripts/check-migration-backward-compat.sh on the migration files new since origin/main. Forbidden patterns abort the deploy.
2. Drain `CANARY_NODE` (default backend-api-2) via the HAProxy admin socket. Wait until the node has 0 active connections.
3. Push the binary to /opt/veza/backend-api/veza-api on the canary container, then `systemctl restart veza-backend-api`.
4. Per-node health check: `curl http://127.0.0.1:8080/api/v1/health` from inside the container. If the node doesn't return 200 within 60 s, roll back.
5. Re-enable the canary node in HAProxy.
6. LB-side health check: `curl http://haproxy.lxd${HEALTH_PATH}` returns 200 (proves HAProxy sees the node ready and routes through it).
7. SLI monitor for `SLI_WINDOW` seconds (default 3600 = 1 h). Probes Prometheus every `SLI_PROBE_INTERVAL` (default 30 s) for:
   - p95 of veza_gin_http_request_duration_seconds_bucket < `PROM_P95_THRESHOLD_S` (0.5 s)
   - error rate (5xx ÷ total) < `PROM_ERR_RATE_THRESHOLD` (0.005 = 0.5 %)

   The first red probe triggers a rollback.
8. Roll the peers: for each `PEER_NODES` entry (default backend-api-1), repeat steps 2–6 (drain → deploy → health → re-enable → LB check). The peer roll skips the SLI monitor because the canary already proved the SLI; if a peer-specific failure happens (binary corrupt on push, container disk full), the script bails out.
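Step 7 can be sketched as a probe loop against Prometheus's HTTP query API. This is an illustrative reconstruction, not the script itself: the PromQL strings and the `veza_gin_http_requests_total` counter name are assumptions (only the histogram metric is named in this runbook), and `sli_green`/`monitor_sli` are invented names.

```shell
# Sketch of the SLI monitor loop; the real deploy-canary.sh is authoritative.
sli_green() {
  # sli_green <p95_seconds> <error_rate>: exit 0 when both SLIs are under threshold
  awk -v p="$1" -v e="$2" \
      -v pt="${PROM_P95_THRESHOLD_S:-0.5}" -v et="${PROM_ERR_RATE_THRESHOLD:-0.005}" \
      'BEGIN { exit !(p + 0 < pt + 0 && e + 0 < et + 0) }'
}

monitor_sli() {
  prom="${PROM_URL:-http://prom.lxd:9090}"
  deadline=$(( $(date +%s) + ${SLI_WINDOW:-3600} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    p95=$(curl -fsG "$prom/api/v1/query" --data-urlencode \
      'query=histogram_quantile(0.95, sum by (le) (rate(veza_gin_http_request_duration_seconds_bucket[5m])))' \
      | jq -r '.data.result[0].value[1] // "0"')
    # ASSUMED metric name for the request counter; adjust to the real one
    err=$(curl -fsG "$prom/api/v1/query" --data-urlencode \
      'query=sum(rate(veza_gin_http_requests_total{status=~"5.."}[5m])) / sum(rate(veza_gin_http_requests_total[5m]))' \
      | jq -r '.data.result[0].value[1] // "0"')
    if ! sli_green "$p95" "$err"; then
      return 1   # first red probe: the caller rolls the canary back
    fi
    sleep "${SLI_PROBE_INTERVAL:-30}"
  done
  return 0
}
```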

## Rollback path

The script handles the canary rollback automatically when:

- The pre-deploy hook fails. Nothing is changed; nothing to revert.
- The canary's health check fails after the deploy. The old binary is restored from `ROLLBACK_BINARY` and the canary re-enabled.
- The SLI breaches during the monitor window. Same as above.

The script does NOT roll back peers automatically — by the time peers are rolling, the canary has already accumulated a green-SLI window. A peer health failure is an artifact of the deploy step (corrupt push, container memory issue), not of the new binary itself, and re-running after fixing the local issue is safer than ping-ponging the binary.

## Manual rollback (full)

When the script doesn't catch the regression — say, a slow leak that surfaces after the SLI window closes — the on-call engineer drives the rollback manually:

```shell
# Find which backend is on the new binary:
incus exec backend-api-1 -- ls -la /opt/veza/backend-api/veza-api
incus exec backend-api-2 -- ls -la /opt/veza/backend-api/veza-api

# Rotate both back to the previous binary (note: the mv consumes the
# .previous copy, so it must be re-preserved on the next deploy):
for ct in backend-api-1 backend-api-2; do
  incus exec "$ct" -- mv /opt/veza/backend-api/veza-api.previous /opt/veza/backend-api/veza-api
  incus exec "$ct" -- systemctl restart veza-backend-api
done
```

The previous binary is conventionally kept at ${INSTALL_DIR}/veza-api.previous; the canary script does NOT copy the current binary there before overwriting (deliberate — that's a deploy-pipeline responsibility, not a per-canary responsibility).
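What that deploy-pipeline step might look like, as a minimal sketch: `preserve_previous` is an invented name, and only the ${INSTALL_DIR}/veza-api.previous convention comes from this runbook.

```shell
# Sketch of the pipeline-side step that keeps the last-known-good copy
# before a new binary overwrites veza-api (NOT part of the canary script).
preserve_previous() {
  install_dir="${INSTALL_DIR:-/opt/veza/backend-api}"
  if [ -f "$install_dir/veza-api" ]; then
    # -p preserves mode and timestamps on the saved copy
    cp -p "$install_dir/veza-api" "$install_dir/veza-api.previous"
  fi
}
```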

## Configuration knobs

All of these are env vars — the script header is the source of truth for defaults.

| Knob | Default | When to change |
| --- | --- | --- |
| `POOL_BACKEND` | `api_pool` | If you renamed the HAProxy backend |
| `CANARY_NODE` | `backend-api-2` | Toggle which node receives the canary first |
| `PEER_NODES` | `backend-api-1` | When the fleet grows beyond 2 nodes |
| `SLI_WINDOW` | `3600` (1 h) | Shorten for hot-fixes (300 = 5 min minimum) |
| `SLI_PROBE_INTERVAL` | `30` s | Tighter probes catch a leak faster but cost Prometheus load |
| `PROM_P95_THRESHOLD_S` | `0.5` | Match the SLO; loosening it hides regressions |
| `PROM_ERR_RATE_THRESHOLD` | `0.005` (0.5 %) | Match the SLO |
| `ROLLBACK_BINARY` | (unset) | Always set it in a real run — auto-rollback can't work without it |
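A fail-fast guard over these knobs might look like the following. This is a sketch only: `validate_knobs` is an invented name, and the warn-but-continue behaviour for an unset `ROLLBACK_BINARY` is assumed rather than taken from the script.

```shell
# Hypothetical knob validation, mirroring the table above (names invented).
validate_knobs() {
  # ARTIFACT is the one input with no sane default
  [ -n "${ARTIFACT:-}" ] || { echo "ARTIFACT is required" >&2; return 1; }
  # auto-rollback needs a binary to restore; warn loudly when it is missing
  [ -n "${ROLLBACK_BINARY:-}" ] || echo "WARN: ROLLBACK_BINARY unset -- auto-rollback disabled" >&2
  SLI_WINDOW="${SLI_WINDOW:-3600}"
  # 300 s is the documented floor for hot-fix windows
  [ "$SLI_WINDOW" -ge 300 ] || { echo "SLI_WINDOW below the 300 s minimum" >&2; return 1; }
}
```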

## Acceptance bar (Day 23)

Per docs/ROADMAP_V1.0_LAUNCH.md: three canary deploys on staging, two normal and one with a deliberate rollback (e.g. push a binary that hardcodes a 500 on /api/v1/health). The rollback exercise verifies that the script's auto-revert path actually fires.

## What this doesn't do

- Cross-LB rolls: a single haproxy instance is assumed. When phase 2 adds keepalived and a second LB, the canary script will need a `--lb-set` arg to roll the LB pair too.
- Database migrations: split-read-write migrations (e.g. a dual-write phase during a rename) need a multi-step deploy that this script doesn't model. For now, only additive migrations are supported through the canary.
- Stream-server canary: the Rust streamer follows a separate playbook (URI-hash routing means per-track-id affinity, not per-session affinity). The same principles apply, but the script is backend-api-specific.