# Canary release — backend-api

> **Audience**: on-call engineer running a release.
> **Trigger**: a new backend-api binary signed off for prod.
> **Owner**: whoever's on the deploy rota that day.

The canary recipe ships the new binary to **one** backend at a time, watches the SLI for a window, and only continues to the next backend when the SLI stays green. If the SLI breaches at any point, the canary node rolls back automatically to the last-known-good binary.

## Trigger conditions

Run the canary script when one of these is true:

- A normal feature release: new code path, no schema migration that requires lockstep coordination.
- A hot-fix on a Sev-2 or below issue. Sev-1 (security or data-integrity) follows the all-stop rotate path documented in `docs/runbooks/INCIDENT_RESPONSE.md` instead.

## Pre-flight checklist

- [ ] **Migration backward-compat**: the latest schema migration is additive only — no `DROP COLUMN`, no `ALTER COLUMN ... TYPE`, no `ADD COLUMN ... NOT NULL` without `DEFAULT`. The script's pre-deploy hook (`scripts/check-migration-backward-compat.sh`) refuses to proceed when it finds one; bypass with `FORCE_MIGRATE=1` only after you've worked out how the migration splits into backward-compatible steps.
- [ ] **Last-known-good binary** is preserved. Either (a) the previous release's `veza-api` is still on the host at `/opt/veza/backend-api/veza-api.previous`, or (b) you have it locally and pass `ROLLBACK_BINARY=/path/to/old/veza-api` as env to the script.
- [ ] **Prometheus reachable** from the deploy host. The SLI monitor queries `${PROM_URL}` (default `http://prom.lxd:9090`) every `${SLI_PROBE_INTERVAL}` seconds for the whole `${SLI_WINDOW}` (1 hour by default).
- [ ] **HAProxy admin socket reachable**: the script execs into the haproxy Incus container to drive `set server ${POOL}/${NODE} state drain|ready` via socat (see the sketch after this list).
- [ ] **No game day in the same window.** Canary needs a quiet baseline; chaos drills will push the SLI red and trigger a false rollback.
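For orientation: the drain/ready flip the script drives reduces to two admin-socket commands. Below is a minimal hand-run sketch, not the script itself. The container name `haproxy`, the socket path `/var/run/haproxy/admin.sock`, and the CSV field position are assumptions; the script and your `haproxy.cfg` are authoritative.

```bash
# Hand-run sketch of the drain/ready flip. Assumed: container "haproxy",
# admin socket at /var/run/haproxy/admin.sock. The script's env vars win.
POOL=${POOL_BACKEND:-api_pool}
NODE=${CANARY_NODE:-backend-api-2}
SOCK=/var/run/haproxy/admin.sock

hap() { incus exec haproxy -- socat stdio "unix-connect:${SOCK}"; }

# Stop new connections to the node; in-flight requests finish normally.
echo "set server ${POOL}/${NODE} state drain" | hap

# Poll active sessions (scur, 5th CSV field of "show stat") until 0.
echo "show stat" | hap \
  | awk -F, -v p="$POOL" -v n="$NODE" '$1 == p && $2 == n { print "scur=" $5 }'

# After the binary push + health check, put the node back in rotation.
echo "set server ${POOL}/${NODE} state ready" | hap
```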
## How

### One-shot via Make

```bash
make deploy-canary ARTIFACT=/tmp/veza-api-v1.0.10
```

The Make target wraps the script with reasonable defaults. Override any env var (see the script header) by exporting it before the `make` call.

### Direct script invocation

```bash
ARTIFACT=/tmp/veza-api-v1.0.10 \
ROLLBACK_BINARY=/opt/veza/backend-api/veza-api.previous \
SLI_WINDOW=3600 \
PROM_URL=http://prom.lxd:9090 \
bash scripts/deploy-canary.sh
```

The script is idempotent on the steps that matter: draining an already-drained server is a no-op; pushing the same binary twice is a no-op (file mtime invariant). Re-runs after a partial failure are safe.

## What happens, in order

1. **Pre-deploy hook** runs `scripts/check-migration-backward-compat.sh` on the migration files added since `origin/main`. Forbidden patterns abort the deploy.
2. **Drain `CANARY_NODE`** (default `backend-api-2`) via the HAProxy admin socket. Wait until the node has 0 active connections.
3. **Push the binary** to `/opt/veza/backend-api/veza-api` on the canary container, then `systemctl restart veza-backend-api`.
4. **Per-node health check**: `curl http://127.0.0.1:8080/api/v1/health` from inside the container. If the node doesn't return 200 within 60 s, roll back.
5. **Re-enable** the canary node in HAProxy.
6. **LB-side health check**: `curl http://haproxy.lxd${HEALTH_PATH}` must return 200 (proves HAProxy sees the node ready and routes through it).
7. **SLI monitor** for `SLI_WINDOW` seconds (default 3600 = 1 h). Probes Prometheus every `SLI_PROBE_INTERVAL` (default 30 s) for:
   - p95 of `veza_gin_http_request_duration_seconds_bucket` < `PROM_P95_THRESHOLD_S` (0.5 s)
   - error rate (5xx ÷ total) < `PROM_ERR_RATE_THRESHOLD` (0.005 = 0.5 %)

   First red probe → rollback. (Hand-run equivalents of the probes are sketched after this list.)
8. **Roll the peers**: for each `PEER_NODES` entry (default `backend-api-1`), repeat steps 2–6 (drain → deploy → health → re-enable → LB check). The peer roll skips the SLI monitor because the canary already proved the SLI; if a peer-specific failure happens (binary corrupt on push, container disk full), the script bails out.
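To eyeball the SLI by hand during the window, the two probes reduce to a pair of instant queries. A sketch follows; the histogram metric is named in this runbook, but the request-counter name `veza_gin_http_requests_total` and its `status` label are assumptions, so treat the script's query strings as authoritative.

```bash
# Hand-run equivalents of the two SLI probes. The histogram metric is from
# this runbook; the counter "veza_gin_http_requests_total" and its "status"
# label are assumptions. Check scripts/deploy-canary.sh for the real queries.
PROM_URL=${PROM_URL:-http://prom.lxd:9090}

q() { curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"; }

# p95 latency over 5 min; green while < PROM_P95_THRESHOLD_S (0.5 s).
q 'histogram_quantile(0.95,
     sum by (le) (rate(veza_gin_http_request_duration_seconds_bucket[5m])))'

# 5xx error rate; green while < PROM_ERR_RATE_THRESHOLD (0.005).
q 'sum(rate(veza_gin_http_requests_total{status=~"5.."}[5m]))
   / sum(rate(veza_gin_http_requests_total[5m]))'
```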
## Rollback path

The script handles the canary rollback automatically when:

- The pre-deploy hook fails. Nothing has changed; nothing to revert.
- The canary's health check fails after the deploy. The old binary is restored from `ROLLBACK_BINARY` and the canary is re-enabled.
- The SLI breaches during the monitor window. Same as above.

The script does **NOT** roll back peers automatically — by the time the peers are rolling, the canary has already accumulated a green-SLI window. A peer health failure is an artifact of the deploy step (corrupt push, container memory issue), not of the new binary itself, and re-running after fixing the local issue is safer than ping-ponging the binary.

## Manual rollback (full)

When the script doesn't catch the regression — say, a slow leak that surfaces after the SLI window closes — the on-call drives the rollback manually:

```bash
# Find which backend is on the new binary:
incus exec backend-api-1 -- ls -la /opt/veza/backend-api/veza-api
incus exec backend-api-2 -- ls -la /opt/veza/backend-api/veza-api

# Rotate both back to the previous binary:
for ct in backend-api-1 backend-api-2; do
  incus exec "$ct" -- mv /opt/veza/backend-api/veza-api.previous /opt/veza/backend-api/veza-api
  incus exec "$ct" -- systemctl restart veza-backend-api
done
```

The previous binary is conventionally kept at `${INSTALL_DIR}/veza-api.previous`; the canary script does NOT copy the current binary there before overwriting it (deliberate — that's a deploy-pipeline responsibility, not a per-canary responsibility).

## Configuration knobs

All of these are env vars — the script header is the source of truth for defaults.

| Knob                      | Default         | When to change                                                    |
| ------------------------- | --------------- | ----------------------------------------------------------------- |
| `POOL_BACKEND`            | `api_pool`      | If you renamed the HAProxy backend                                 |
| `CANARY_NODE`             | `backend-api-2` | Toggle which node receives the canary first                        |
| `PEER_NODES`              | `backend-api-1` | When the fleet grows beyond 2 nodes                                |
| `SLI_WINDOW`              | `3600` (1 h)    | Shorten for hot-fixes (300 = 5 min minimum)                        |
| `SLI_PROBE_INTERVAL`      | `30` (s)        | Tighter probes catch a leak faster but cost Prometheus load        |
| `PROM_P95_THRESHOLD_S`    | `0.5`           | Match the SLO; loosening it hides regressions                      |
| `PROM_ERR_RATE_THRESHOLD` | `0.005` (0.5 %) | Match the SLO                                                      |
| `ROLLBACK_BINARY`         | (unset)         | Always set it in a real run — auto-rollback can't work without it  |

## Acceptance bar (Day 23)

Per `docs/ROADMAP_V1.0_LAUNCH.md`: 3 canary deploys on staging, 2 normal + 1 with a deliberate rollback (e.g. push a binary that hardcodes a 500 on `/api/v1/health`). The rollback exercise verifies that the script's auto-revert path actually fires (a verification sketch closes this page).

## What this doesn't do

- **Cross-LB rolls**: a single haproxy is assumed. When phase 2 adds keepalived and a second LB, the canary script will need a `--lb-set` arg to roll the LB pair too.
- **Database migrations**: split-read-write migrations (e.g. dual-write during a rename) need a multi-step deploy that this script doesn't model. For now, only additive migrations go through the canary.
- **Stream-server canary**: the Rust streamer follows a separate playbook (URI-hash routing means per-track-id affinity, not per-session affinity). The same principles apply, but the script is backend-api-specific.
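A quick way to prove the acceptance drill's auto-revert actually fired: compare checksums of the live binary against the known-good and poisoned artifacts. A minimal sketch, assuming this page's paths; `/tmp/veza-api-poisoned` is an illustrative name.

```bash
# Verify the auto-revert fired during the rollback drill: the canary must be
# back on the last-known-good binary, not the poisoned one. Paths follow this
# page's conventions; /tmp/veza-api-poisoned is an illustrative name.
GOOD_SUM=$(sha256sum "${ROLLBACK_BINARY}" | cut -d' ' -f1)
BAD_SUM=$(sha256sum /tmp/veza-api-poisoned | cut -d' ' -f1)
LIVE_SUM=$(incus exec backend-api-2 -- \
  sha256sum /opt/veza/backend-api/veza-api | cut -d' ' -f1)

[ "$LIVE_SUM" = "$GOOD_SUM" ] && echo "revert OK"
[ "$LIVE_SUM" = "$BAD_SUM" ]  && echo "STILL ON POISONED BINARY"
```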