veza/docs/CANARY_RELEASE.md

# Canary release — backend-api
> **Audience** : on-call engineer running a release.
> **Trigger** : a new backend-api binary signed-off for prod.
> **Owner** : whoever's on the deploy rota that day.

The canary recipe ships the new binary to **one** backend at a time, watches the SLI for a window, and only continues to the next backend when the SLI stays green. If the SLI breaches at any point, the canary node rolls back automatically to the last-known-good binary.
## Trigger conditions
Run the canary script when one of these is true :
- A normal feature release. New code path, no schema migration that requires lockstep coordination.
- A hot-fix on a Sev-2 or below issue. Sev-1 (security or data-integrity) follows the all-stop rotate path documented in `docs/runbooks/INCIDENT_RESPONSE.md` instead.
## Pre-flight checklist
- [ ] **Migration backward-compat** : the latest schema migration is additive only — no `DROP COLUMN`, no `ALTER COLUMN ... TYPE`, no `ADD COLUMN ... NOT NULL` without `DEFAULT`. The script's pre-deploy hook (`scripts/check-migration-backward-compat.sh`) refuses to proceed when it finds one ; bypass with `FORCE_MIGRATE=1` only once you've worked out how the migration splits into backward-compatible steps.
- [ ] **Last-known-good binary** is preserved. Either : (a) the previous release's `veza-api` is still on the host at `/opt/veza/backend-api/veza-api.previous`, OR (b) you have it locally and pass `ROLLBACK_BINARY=/path/to/old/veza-api` as env to the script.
- [ ] **Prometheus reachable** from the deploy host. The SLI monitor queries `${PROM_URL}` (default `http://prom.lxd:9090`) every `${SLI_PROBE_INTERVAL}` seconds for the full `${SLI_WINDOW}` (default 1 h). A quick verification sketch follows this checklist.
- [ ] **HAProxy admin socket reachable** : the script execs into the haproxy Incus container to drive `set server ${POOL}/${NODE} state drain|ready` via socat.
- [ ] **No game day in the same window.** Canary needs a quiet baseline ; chaos drills will push the SLI red and trigger a false rollback.
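
A minimal pre-flight spot-check sketch. The HAProxy container name (`haproxy`) and admin-socket path (`/run/haproxy/admin.sock`) are assumptions, not fixed by this doc ; adjust them to your environment.
```bash
# Pre-flight spot checks, run from the deploy host.
# Assumed: HAProxy container is named "haproxy", admin socket at /run/haproxy/admin.sock.

# (a) last-known-good binary still present on the canary node?
incus exec backend-api-2 -- test -f /opt/veza/backend-api/veza-api.previous \
  && echo "rollback binary present" || echo "MISSING: set ROLLBACK_BINARY"

# Prometheus reachable from here?
curl -fsS "${PROM_URL:-http://prom.lxd:9090}/-/healthy"

# HAProxy admin socket answers?
incus exec haproxy -- sh -c 'echo "show servers state" | socat stdio /run/haproxy/admin.sock' | head -5
```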
## How
### One-shot via Make
```bash
make deploy-canary ARTIFACT=/tmp/veza-api-v1.0.10
```
The Make target wraps the script with reasonable defaults. Override any env (see the script header) by exporting before the `make` call.
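
For example, exporting a couple of knobs before the Make call (values illustrative ; the script header stays the source of truth) :
```bash
# Illustrative overrides; any variable from the script header can be exported the same way.
export PROM_URL=http://prom.lxd:9090     # point at a non-default Prometheus if needed
export SLI_PROBE_INTERVAL=15             # probe twice as often as the default 30 s
make deploy-canary ARTIFACT=/tmp/veza-api-v1.0.10
```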
### Direct script invocation
```bash
ARTIFACT=/tmp/veza-api-v1.0.10 \
ROLLBACK_BINARY=/opt/veza/backend-api/veza-api.previous \
SLI_WINDOW=3600 \
PROM_URL=http://prom.lxd:9090 \
bash scripts/deploy-canary.sh
```
The script is idempotent on the steps that matter : draining an already-drained server is a no-op ; pushing the same binary twice is a no-op (file mtime invariant). Re-runs after a partial failure are safe.
## What happens, in order
1. **Pre-deploy hook** runs `scripts/check-migration-backward-compat.sh` on the new-since-`origin/main` migration files. Forbidden patterns abort the deploy.
2. **Drain `CANARY_NODE`** (default `backend-api-2`) via the HAProxy admin socket. Wait until the node has 0 active connections. (A manual drain sketch follows this list.)
3. **Push the binary** to `/opt/veza/backend-api/veza-api` on the canary container. `systemctl restart veza-backend-api`.
4. **Per-node health check** : `curl http://127.0.0.1:8080/api/v1/health` from inside the container. If the node doesn't return 200 within 60 s, rollback.
5. **Re-enable** the canary node in HAProxy.
6. **LB-side health check** : `curl http://haproxy.lxd${HEALTH_PATH}` returns 200 (proves HAProxy sees the node ready and routes through it).
7. **SLI monitor** for `SLI_WINDOW` seconds (default 3600 = 1 h). Probes Prometheus every `SLI_PROBE_INTERVAL` (default 30 s) for :
   - p95 of `veza_gin_http_request_duration_seconds_bucket` < `PROM_P95_THRESHOLD_S` (0.5 s)
   - error rate (5xx ÷ total) < `PROM_ERR_RATE_THRESHOLD` (0.005 = 0.5%)
   First red probe → rollback. (A probe sketch follows this list.)
8. **Roll the peers** : for each `PEER_NODES` entry (default `backend-api-1`), repeat steps 2-6 (drain → deploy → health → re-enable → LB check). The peer roll skips the SLI monitor because the canary already proved the SLI ; if a peer-specific failure happens (binary corrupt on push, container disk full), the script bails out.
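
For reference, a manual equivalent of the drain / re-enable that steps 2 and 5 automate. The pool and node names come from this doc ; the `haproxy` container name and socket path are the same assumptions as in the pre-flight sketch.
```bash
# Manual drain of the canary node (what step 2 does).
incus exec haproxy -- sh -c \
  'echo "set server api_pool/backend-api-2 state drain" | socat stdio /run/haproxy/admin.sock'

# scur (current sessions) for the drained server; wait for it to hit 0.
incus exec haproxy -- sh -c 'echo "show stat" | socat stdio /run/haproxy/admin.sock' \
  | awk -F, '$1 == "api_pool" && $2 == "backend-api-2" {print $2, "scur=" $5}'

# Put it back in rotation (what step 5 does).
incus exec haproxy -- sh -c \
  'echo "set server api_pool/backend-api-2 state ready" | socat stdio /run/haproxy/admin.sock'
```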
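And a sketch of what a single SLI probe boils down to : two instant queries against the Prometheus HTTP API. The latency metric name is the one quoted above ; the request-counter metric and `status` label in the error-rate query are assumptions, so check the script for the exact expressions it uses.
```bash
PROM_URL=${PROM_URL:-http://prom.lxd:9090}

# p95 latency over the last 5 min; must stay under PROM_P95_THRESHOLD_S (0.5 s).
curl -fsS -G "${PROM_URL}/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.95, sum by (le) (rate(veza_gin_http_request_duration_seconds_bucket[5m])))'

# 5xx / total over the last 5 min; must stay under PROM_ERR_RATE_THRESHOLD (0.005).
# NOTE: metric and label names in this query are assumptions, not taken from the script.
curl -fsS -G "${PROM_URL}/api/v1/query" --data-urlencode \
  'query=sum(rate(veza_gin_http_requests_total{status=~"5.."}[5m])) / sum(rate(veza_gin_http_requests_total[5m]))'
```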
## Rollback path
The script handles the canary rollback automatically when :
- The pre-deploy hook fails. Nothing is changed ; nothing to revert.
- The canary's health check fails after the deploy. Old binary restored from `ROLLBACK_BINARY`, canary re-enabled.
- The SLI breaches during the monitor window. Same as above.
The script does **NOT** roll back peers automatically — by the time peers are rolling, the canary has already accumulated a green-SLI window. A peer health failure is an artifact of the deploy step (corrupt push, container memory issue), not of the new binary itself, and re-running after fixing the local issue is safer than ping-ponging the binary.
## Manual rollback (full)
When the script doesn't catch the regression — say a slow leak that surfaces after the SLI window closes — the on-call manually drives :
```bash
# Find which backend is on the new binary :
incus exec backend-api-1 -- ls -la /opt/veza/backend-api/veza-api
incus exec backend-api-2 -- ls -la /opt/veza/backend-api/veza-api
# Rotate both back to the previous binary :
for ct in backend-api-1 backend-api-2; do
  # mv (not cp): this consumes the .previous copy, so stash a new one before the next deploy
  incus exec "$ct" -- mv /opt/veza/backend-api/veza-api.previous /opt/veza/backend-api/veza-api
  incus exec "$ct" -- systemctl restart veza-backend-api
done
```
The previous binary is conventionally kept at `${INSTALL_DIR}/veza-api.previous` ; the canary script does NOT copy the current binary there before overwriting (deliberate — that's a deploy-pipeline responsibility, not a per-canary responsibility).
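
If you'd rather guarantee pre-flight item (a) yourself than lean on the deploy pipeline, stashing the running binary before a canary run is one command per backend (same paths and container names as above) :
```bash
# Stash the currently running binary as the rollback copy on both backends.
for ct in backend-api-1 backend-api-2; do
  incus exec "$ct" -- cp -p /opt/veza/backend-api/veza-api /opt/veza/backend-api/veza-api.previous
done
```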
## Configuration knobs
All of these are env vars — the script header is the source of truth for defaults.
| Knob | Default | When to change |
| ----------------------------- | ----------------------------- | ----------------------------------------------------- |
| `POOL_BACKEND` | `api_pool` | If you renamed the HAProxy backend |
| `CANARY_NODE` | `backend-api-2` | Toggle which node receives the canary first |
| `PEER_NODES` | `backend-api-1` | When the fleet grows beyond 2 nodes |
| `SLI_WINDOW` | `3600` (1 h) | Shorten for hot-fixes (300 = 5 min minimum) |
| `SLI_PROBE_INTERVAL` | `30` s | Tighter probes catch a leak faster but cost Prom load |
| `PROM_P95_THRESHOLD_S` | `0.5` | Match the SLO ; loosening it hides regressions |
| `PROM_ERR_RATE_THRESHOLD` | `0.005` (0.5 %) | Match the SLO |
| `ROLLBACK_BINARY` | (unset) | Always set in a real run — auto-rollback can't work without it |
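
For instance, a hot-fix canary with the shortest sanctioned window could look like this (artifact path illustrative) :
```bash
# Hot-fix canary with the shortest allowed window (artifact path illustrative).
ARTIFACT=/tmp/veza-api-hotfix \
ROLLBACK_BINARY=/opt/veza/backend-api/veza-api.previous \
SLI_WINDOW=300 \
SLI_PROBE_INTERVAL=15 \
bash scripts/deploy-canary.sh
```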
## Acceptance bar (Day 23)
Per `docs/ROADMAP_V1.0_LAUNCH.md` : 3 canary deploys on staging, 2 normal + 1 with a deliberate rollback (e.g. push a binary that hardcodes a 500 on `/api/v1/health`). The rollback exercise verifies the script's auto-revert path actually fires.
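
During the deliberate-rollback drill you can watch the broken health endpoint directly, outside the script (same URL and container as step 4) :
```bash
# Expect 500 while the deliberately broken binary is live, 200 again after the auto-revert.
incus exec backend-api-2 -- curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/api/v1/health
```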
## What this doesn't do
- **Cross-LB rolls** : single haproxy assumed. When phase-2 adds keepalived + a second LB, the canary script will need a `--lb-set` arg to roll the LB pair too.
- **Database migrations** : split-read-write migrations (e.g. dual-write during a rename) need a multi-step deploy that this script doesn't model. For now, only additive migrations are supported through the canary.
- **Stream-server canary** : the Rust streamer follows a separate playbook (URI-hash routing means a per-track-id affinity, not a per-session affinity). Same principles apply but the script is backend-api-specific.