Files originally part of the "split group_vars into all/{main,vault}"
commit were dropped during a rebase/amend when parallel session work
landed on the same area at the same time. The all/main.yml piece
ended up in the deploy workflow commit (989d8823); this
commit re-adds the rest:
infra/ansible/group_vars/all/vault.yml.example
infra/ansible/group_vars/staging.yml
infra/ansible/group_vars/prod.yml
infra/ansible/group_vars/README.md
+ delete infra/ansible/group_vars/all.yml (superseded by all/main.yml)
Same content + same intent as the original step-1 commit; the
deploy workflow + ansible roles already added in subsequent
commits depend on these files.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Canary release — backend-api

Audience: on-call engineer running a release. Trigger: a new backend-api binary signed off for prod. Owner: whoever's on the deploy rota that day.

The canary recipe ships the new binary to one backend at a time, watches the SLI for a window, and only continues to the next backend when the SLI stays green. If the SLI breaches at any point, the canary node rolls back automatically to the last-known-good binary.
## Trigger conditions

Run the canary script when one of these is true:

- A normal feature release: new code path, no schema migration that requires lockstep coordination.
- A hot-fix on a Sev-2 or below issue. Sev-1 (security or data-integrity) follows the all-stop rotate path documented in `docs/runbooks/INCIDENT_RESPONSE.md` instead.
## Pre-flight checklist

- Migration backward-compat: the latest schema migration is additive only. No `DROP COLUMN`, no `ALTER COLUMN ... TYPE`, no `ADD COLUMN ... NOT NULL` without a `DEFAULT`. The script's pre-deploy hook (`scripts/check-migration-backward-compat.sh`) refuses to proceed when it finds one; bypass with `FORCE_MIGRATE=1` only after you've split the migration in your head.
- Last-known-good binary is preserved. Either (a) the previous release's `veza-api` is still on the host at `/opt/veza/backend-api/veza-api.previous`, or (b) you have it locally and pass `ROLLBACK_BINARY=/path/to/old/veza-api` as env to the script.
- Prometheus reachable from the deploy host. The SLI monitor queries `${PROM_URL}` (default `http://prom.lxd:9090`) every `${SLI_PROBE_INTERVAL}` seconds for 1 hour.
- HAProxy admin socket reachable: the script execs into the haproxy Incus container to drive `set server ${POOL}/${NODE} state drain|ready` via socat.
- No game day in the same window. Canary needs a quiet baseline; chaos drills will push the SLI red and trigger a false rollback.
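To illustrate the additive-only rule, a pattern scan of roughly this shape could back the pre-deploy hook. The real logic lives in `scripts/check-migration-backward-compat.sh`; the function name and exact patterns here are assumptions for illustration only:

```shell
# Hypothetical sketch of the backward-compat scan: exit 0 when the
# migration file contains only additive DDL, non-zero otherwise.
check_migration() {
  local file="$1"
  # Dropping a column breaks readers still on the old binary.
  grep -Eiq 'DROP[[:space:]]+COLUMN' "$file" && return 1
  # In-place type changes are not backward compatible either.
  grep -Eiq 'ALTER[[:space:]]+COLUMN[[:space:]]+[A-Za-z_]+[[:space:]]+TYPE' "$file" && return 1
  # ADD COLUMN ... NOT NULL is only safe when a DEFAULT is supplied.
  if grep -Ei 'ADD[[:space:]]+COLUMN' "$file" | grep -Ei 'NOT[[:space:]]+NULL' | grep -Eivq 'DEFAULT'; then
    return 1
  fi
  return 0
}
```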
## How

### One-shot via Make

```shell
make deploy-canary ARTIFACT=/tmp/veza-api-v1.0.10
```

The Make target wraps the script with reasonable defaults. Override any env var (see the script header) by exporting it before the `make` call.
Direct script invocation
ARTIFACT=/tmp/veza-api-v1.0.10 \
ROLLBACK_BINARY=/opt/veza/backend-api/veza-api.previous \
SLI_WINDOW=3600 \
PROM_URL=http://prom.lxd:9090 \
bash scripts/deploy-canary.sh
The script is idempotent on the steps that matter: draining an already-drained server is a no-op; pushing the same binary twice is a no-op (file mtime invariant). Re-runs after a partial failure are safe.
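A minimal sketch of that no-op push behaviour. The helper name is hypothetical, and this sketch compares file content with `cmp` rather than the mtime invariant the script actually uses:

```shell
# Skip the copy (and the restart it would trigger) when the target
# already holds a byte-identical binary.
push_binary() {
  local src="$1" dst="$2"
  if [ -f "$dst" ] && cmp -s "$src" "$dst"; then
    echo "no-op: $dst already matches $src"
    return 0
  fi
  # install sets the executable bit and creates the file atomically enough
  # for this sketch; the real script may stage + rename instead.
  install -m 0755 "$src" "$dst"
  echo "pushed: $src -> $dst"
}
```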
## What happens, in order

1. Pre-deploy hook: runs `scripts/check-migration-backward-compat.sh` on the migration files new since `origin/main`. Forbidden patterns abort the deploy.
2. Drain `CANARY_NODE` (default `backend-api-2`) via the HAProxy admin socket. Wait until the node has 0 active connections.
3. Push the binary to `/opt/veza/backend-api/veza-api` on the canary container; `systemctl restart veza-backend-api`.
4. Per-node health check: `curl http://127.0.0.1:8080/api/v1/health` from inside the container. If the node doesn't return 200 within 60 s, rollback.
5. Re-enable the canary node in HAProxy.
6. LB-side health check: `curl http://haproxy.lxd${HEALTH_PATH}` returns 200 (proves HAProxy sees the node ready and routes through it).
7. SLI monitor for `SLI_WINDOW` seconds (default 3600 = 1 h). Probes Prometheus every `SLI_PROBE_INTERVAL` (default 30 s) for:
   - p95 of `veza_gin_http_request_duration_seconds_bucket` < `PROM_P95_THRESHOLD_S` (0.5 s)
   - error rate (5xx ÷ total) < `PROM_ERR_RATE_THRESHOLD` (0.005 = 0.5 %)

   First red probe → rollback.
8. Roll the peers: for each `PEER_NODES` entry (default `backend-api-1`), repeat steps 2–6 (drain → deploy → health → re-enable → LB check). The peer roll skips the SLI monitor because the canary already proved the SLI; if a peer-specific failure happens (binary corrupt on push, container disk full), the script bails out.
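The SLI gate in step 7 boils down to two threshold comparisons per probe. A sketch, assuming the queries look roughly like the PromQL below; the `veza_gin_http_requests_total` counter name and the 5m rate window are guesses, and the script header is authoritative:

```shell
# PromQL the monitor might issue each probe (assumed shapes, not verbatim):
P95_QUERY='histogram_quantile(0.95, sum(rate(veza_gin_http_request_duration_seconds_bucket[5m])) by (le))'
ERR_QUERY='sum(rate(veza_gin_http_requests_total{code=~"5.."}[5m])) / sum(rate(veza_gin_http_requests_total[5m]))'

# Green/red decision on the two fetched values: exit 0 when both SLIs are
# under threshold, non-zero on the first red probe (which triggers rollback).
sli_probe_green() {
  local p95="$1" err_rate="$2"
  awk -v p="$p95" -v pt="${PROM_P95_THRESHOLD_S:-0.5}" \
      -v e="$err_rate" -v et="${PROM_ERR_RATE_THRESHOLD:-0.005}" \
      'BEGIN { exit !(p < pt && e < et) }'
}
```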
## Rollback path

The script handles the canary rollback automatically when:

- The pre-deploy hook fails. Nothing is changed; nothing to revert.
- The canary's health check fails after the deploy. Old binary restored from `ROLLBACK_BINARY`, canary re-enabled.
- The SLI breaches during the monitor window. Same as above.
The script does NOT roll back peers automatically: by the time peers are rolling, the canary has already accumulated a green-SLI window. A peer health failure is an artifact of the deploy step (corrupt push, container memory issue), not of the new binary itself, and re-running after fixing the local issue is safer than ping-ponging the binary.
### Manual rollback (full)

When the script doesn't catch the regression (say, a slow leak that surfaces after the SLI window closes), the on-call drives it manually:

```shell
# Find which backend is on the new binary:
incus exec backend-api-1 -- ls -la /opt/veza/backend-api/veza-api
incus exec backend-api-2 -- ls -la /opt/veza/backend-api/veza-api

# Rotate both back to the previous binary:
for ct in backend-api-1 backend-api-2; do
  incus exec "$ct" -- mv /opt/veza/backend-api/veza-api.previous /opt/veza/backend-api/veza-api
  incus exec "$ct" -- systemctl restart veza-backend-api
done
```
The previous binary is conventionally kept at `${INSTALL_DIR}/veza-api.previous`; the canary script does NOT copy the current binary there before overwriting. That is deliberate: preserving it is a deploy-pipeline responsibility, not a per-canary responsibility.
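Since preserving the old binary is the pipeline's job, the pipeline side might look like this sketch (the function name is assumed, not the pipeline's actual helper):

```shell
# Copy the live binary aside before a release overwrites it, so the canary
# script later finds veza-api.previous to use as ROLLBACK_BINARY.
preserve_previous() {
  local install_dir="$1"
  # Nothing to preserve on a first-ever deploy.
  [ -f "$install_dir/veza-api" ] || return 1
  # -p keeps mode and mtime, so the preserved copy stays attr-identical.
  cp -p "$install_dir/veza-api" "$install_dir/veza-api.previous"
}
```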
## Configuration knobs

All of these are env vars; the script header is the source of truth for defaults.
| Knob | Default | When to change |
|---|---|---|
| `POOL_BACKEND` | `api_pool` | If you renamed the HAProxy backend |
| `CANARY_NODE` | `backend-api-2` | Toggle which node receives the canary first |
| `PEER_NODES` | `backend-api-1` | When the fleet grows beyond 2 nodes |
| `SLI_WINDOW` | `3600` (1 h) | Shorten for hot-fixes (300 = 5 min minimum) |
| `SLI_PROBE_INTERVAL` | `30` (s) | Tighter probes catch a leak faster but cost Prom load |
| `PROM_P95_THRESHOLD_S` | `0.5` | Match the SLO; loosening it hides regressions |
| `PROM_ERR_RATE_THRESHOLD` | `0.005` (0.5 %) | Match the SLO |
| `ROLLBACK_BINARY` | (unset) | Always set in a real run; auto-rollback can't work without it |
## Acceptance bar (Day 23)

Per `docs/ROADMAP_V1.0_LAUNCH.md`: 3 canary deploys on staging, 2 normal + 1 with a deliberate rollback (e.g. push a binary that hardcodes a 500 on `/api/v1/health`). The rollback exercise verifies that the script's auto-revert path actually fires.
## What this doesn't do

- Cross-LB rolls: a single haproxy is assumed. When phase-2 adds keepalived + a second LB, the canary script will need a `--lb-set` arg to roll the LB pair too.
- Database migrations: split-read-write migrations (e.g. dual-write during a rename) need a multi-step deploy that this script doesn't model. For now, only additive migrations are supported through the canary.
- Stream-server canary: the Rust streamer follows a separate playbook (URI-hash routing means per-track-id affinity, not per-session affinity). Same principles apply, but the script is backend-api-specific.