# Canary release — backend-api
> **Audience**: on-call engineer running a release.
> **Trigger**: a new backend-api binary signed off for prod.
> **Owner**: whoever's on the deploy rota that day.

The canary recipe ships the new binary to **one** backend at a time, watches the SLI for a window, and only continues to the next backend when the SLI stays green. If the SLI breaches at any point, the canary node rolls back automatically to the last-known-good binary.
## Trigger conditions
Run the canary script when one of these is true:
- A normal feature release. New code path, no schema migration that requires lockstep coordination.
- A hot-fix on a Sev-2 or below issue. Sev-1 (security or data-integrity) follows the all-stop rotate path documented in `docs/runbooks/INCIDENT_RESPONSE.md` instead.
## Pre-flight checklist
- [ ] **Migration backward-compat**: the latest schema migration is additive only — no `DROP COLUMN`, no `ALTER COLUMN ... TYPE`, no `ADD COLUMN ... NOT NULL` without `DEFAULT`. The script's pre-deploy hook (`scripts/check-migration-backward-compat.sh`) refuses to proceed when it finds one; bypass with `FORCE_MIGRATE=1` only after you've confirmed the migration can be split into backward-compatible steps (see the sketch after this checklist).
- [ ] **Last-known-good binary** is preserved: either (a) the previous release's `veza-api` is still on the host at `/opt/veza/backend-api/veza-api.previous`, or (b) you have it locally and pass `ROLLBACK_BINARY=/path/to/old/veza-api` as env to the script.
- [ ] **Prometheus reachable** from the deploy host. The SLI monitor queries `${PROM_URL}` (default `http://prom.lxd:9090`) every `${SLI_PROBE_INTERVAL}` seconds for `${SLI_WINDOW}` seconds (default 1 hour).
- [ ] **HAProxy admin socket reachable**: the script execs into the haproxy Incus container to drive `set server ${POOL}/${NODE} state drain|ready` via socat.
- [ ] **No game day in the same window.** Canary needs a quiet baseline; chaos drills will push the SLI red and trigger a false rollback.
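For reference, the scan the hook performs looks roughly like this (a minimal sketch; the `migrations/*.sql` glob and the exact patterns are assumptions, `scripts/check-migration-backward-compat.sh` is the source of truth):

```bash
# Sketch only: reject migrations changed since origin/main that contain
# DDL the canary can't run under. Paths and patterns are assumptions.
set -euo pipefail

while IFS= read -r file; do
  [ -f "$file" ] || continue   # skip files deleted on this branch
  # DROP COLUMN and column type changes are always forbidden.
  if grep -Eiq 'DROP[[:space:]]+COLUMN|ALTER[[:space:]]+COLUMN[^;]*TYPE' "$file"; then
    echo "forbidden DDL in $file (split the migration or set FORCE_MIGRATE=1)" >&2
    exit 1
  fi
  # ADD COLUMN ... NOT NULL is only acceptable with a DEFAULT on the same line.
  if grep -Ei 'ADD[[:space:]]+COLUMN[^;]*NOT[[:space:]]+NULL' "$file" | grep -Eivq 'DEFAULT'; then
    echo "NOT NULL without DEFAULT in $file" >&2
    exit 1
  fi
done < <(git diff --name-only origin/main -- 'migrations/*.sql')
```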
## How
### One-shot via Make
```bash
make deploy-canary ARTIFACT=/tmp/veza-api-v1.0.10
```
The Make target wraps the script with reasonable defaults. Override any env (see the script header) by exporting before the `make` call.
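For example, to run a hot-fix with a 5-minute SLI window (see the knobs table below):

```bash
export SLI_WINDOW=300   # 5 min, the hot-fix minimum
make deploy-canary ARTIFACT=/tmp/veza-api-v1.0.10
```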
### Direct script invocation
```bash
ARTIFACT=/tmp/veza-api-v1.0.10 \
ROLLBACK_BINARY=/opt/veza/backend-api/veza-api.previous \
SLI_WINDOW=3600 \
PROM_URL=http://prom.lxd:9090 \
bash scripts/deploy-canary.sh
```
The script is idempotent on the steps that matter: draining an already-drained server is a no-op; pushing the same binary twice is a no-op (file mtime invariant). Re-runs after a partial failure are safe.
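The push-side no-op could look like this internally (a sketch only, using a byte comparison for illustration where the real script keys on the mtime invariant):

```bash
# Sketch: skip the push when the installed binary already matches the artifact.
INSTALL=/opt/veza/backend-api/veza-api
if incus exec "$CANARY_NODE" -- cmp -s "$INSTALL" - < "$ARTIFACT"; then
  echo "binary already deployed on $CANARY_NODE, skipping push"
else
  incus file push "$ARTIFACT" "$CANARY_NODE$INSTALL"
  incus exec "$CANARY_NODE" -- systemctl restart veza-backend-api
fi
```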
## What happens, in order
1. **Pre-deploy hook** runs `scripts/check-migration-backward-compat.sh` on the new-since-`origin/main` migration files. Forbidden patterns abort the deploy.
2. **Drain `CANARY_NODE`** (default `backend-api-2`) via the HAProxy admin socket; a drain sketch follows this list. Wait until the node has 0 active connections.
3. **Push the binary** to `/opt/veza/backend-api/veza-api` on the canary container. `systemctl restart veza-backend-api`.
4. **Per-node health check**: `curl http://127.0.0.1:8080/api/v1/health` from inside the container. If the node doesn't return 200 within 60 s, the script rolls back.
5. **Re-enable** the canary node in HAProxy.
6. **LB-side health check**: `curl http://haproxy.lxd${HEALTH_PATH}` returns 200 (proves HAProxy sees the node ready and routes through it).
7. **SLI monitor** for `SLI_WINDOW` seconds (default 3600 = 1 h). Probes Prometheus every `SLI_PROBE_INTERVAL` (default 30 s); a probe sketch follows this list. Each probe checks:
- p95 of `veza_gin_http_request_duration_seconds_bucket` < `PROM_P95_THRESHOLD_S` (0.5 s)
- error rate (5xx ÷ total) < `PROM_ERR_RATE_THRESHOLD` (0.005 = 0.5%)
First red probe → rollback.
8. **Roll the peers**: for each `PEER_NODES` entry (default `backend-api-1`), repeat steps 2–6 (drain → deploy → health → re-enable → LB check). The peer roll skips the SLI monitor because the canary already proved the SLI; if a peer-specific failure happens (binary corrupt on push, container disk full), the script bails out.
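The drain/re-enable primitive in steps 2 and 5 boils down to the admin-socket commands from the pre-flight checklist. A minimal sketch, assuming the LB container is named `haproxy` and the socket sits at `/run/haproxy/admin.sock` (both assumptions; the script header has the real values):

```bash
# Send one command to the HAProxy admin socket inside the LB container.
hap() {
  incus exec haproxy -- sh -c "echo '$1' | socat stdio /run/haproxy/admin.sock"
}

hap "set server ${POOL_BACKEND}/${CANARY_NODE} state drain"

# Wait for active sessions (scur, CSV field 5 of 'show stat') to reach 0.
until [ "$(hap 'show stat' \
           | awk -F, -v p="$POOL_BACKEND" -v s="$CANARY_NODE" \
                 '$1 == p && $2 == s { print $5 }')" = "0" ]; do
  sleep 2
done

# ... push binary, restart, per-node health check (steps 3-4) ...

hap "set server ${POOL_BACKEND}/${CANARY_NODE} state ready"
```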
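And one SLI probe from step 7 as PromQL. The p95 query uses the histogram named in this runbook; the error-rate query assumes a companion counter `veza_gin_http_requests_total` with a `status` label, and the 5m rate window is also an assumption (check the script for the real queries). Requires `jq` on the deploy host.

```bash
# Sketch of a single probe; the script repeats this every SLI_PROBE_INTERVAL.
PROM_URL=${PROM_URL:-http://prom.lxd:9090}

q() {
  curl -sf "${PROM_URL}/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1]'
}

p95=$(q 'histogram_quantile(0.95, sum by (le) (rate(veza_gin_http_request_duration_seconds_bucket[5m])))')
err=$(q 'sum(rate(veza_gin_http_requests_total{status=~"5.."}[5m])) / sum(rate(veza_gin_http_requests_total[5m]))')

# First red probe triggers the rollback.
awk -v p="$p95" -v e="$err" \
    -v pt="${PROM_P95_THRESHOLD_S:-0.5}" -v et="${PROM_ERR_RATE_THRESHOLD:-0.005}" \
  'BEGIN { exit !(p < pt && e < et) }' || echo "SLI red: roll back"
```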
## Rollback path
The script handles the canary rollback automatically when:
- The pre-deploy hook fails. Nothing is changed ; nothing to revert.
- The canary's health check fails after the deploy. Old binary restored from `ROLLBACK_BINARY`, canary re-enabled.
- The SLI breaches during the monitor window. Same as above.
The script does **NOT** roll back peers automatically — by the time peers are rolling, the canary has already accumulated a green-SLI window. A peer health failure is an artifact of the deploy step (corrupt push, container memory issue), not of the new binary itself, and re-running after fixing the local issue is safer than ping-ponging the binary.
## Manual rollback (full)
When the script doesn't catch the regression — say, a slow leak that surfaces after the SLI window closes — the on-call manually drives:
```bash
# Find which backend is on the new binary:
incus exec backend-api-1 -- ls -la /opt/veza/backend-api/veza-api
incus exec backend-api-2 -- ls -la /opt/veza/backend-api/veza-api

# Rotate both back to the previous binary (the mv consumes veza-api.previous,
# so this rotation can only be done once per release):
for ct in backend-api-1 backend-api-2; do
  incus exec "$ct" -- mv /opt/veza/backend-api/veza-api.previous /opt/veza/backend-api/veza-api
  incus exec "$ct" -- systemctl restart veza-backend-api
done
```
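After the rotation, re-run the health checks from steps 4 and 6 to confirm the fleet is back on a good binary:

```bash
# Per-node, from inside each container:
incus exec backend-api-1 -- curl -fsS http://127.0.0.1:8080/api/v1/health
# Through the LB (same HEALTH_PATH as the LB-side check):
curl -fsS "http://haproxy.lxd${HEALTH_PATH}"
```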
The previous binary is conventionally kept at `${INSTALL_DIR}/veza-api.previous`; the canary script does NOT copy the current binary there before overwriting (deliberate — that's a deploy-pipeline responsibility, not a per-canary responsibility).
## Configuration knobs
All of these are env vars — the script header is the source of truth for defaults.
| Knob | Default | When to change |
| ----------------------------- | ----------------------------- | ----------------------------------------------------- |
| `POOL_BACKEND` | `api_pool` | If you renamed the HAProxy backend |
| `CANARY_NODE` | `backend-api-2` | Toggle which node receives the canary first |
| `PEER_NODES` | `backend-api-1` | When the fleet grows beyond 2 nodes |
| `SLI_WINDOW` | `3600` (1 h) | Shorten for hot-fixes (300 = 5 min minimum) |
| `SLI_PROBE_INTERVAL` | `30` s | Tighter probes catch a leak faster but cost Prom load |
| `PROM_P95_THRESHOLD_S` | `0.5` | Match the SLO ; loosening it hides regressions |
| `PROM_ERR_RATE_THRESHOLD` | `0.005` (0.5 %) | Match the SLO |
| `ROLLBACK_BINARY` | (unset) | Always set in a real run — auto-rollback can't work without it |
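Putting a few of these together, a direct hot-fix run on a fleet grown to three nodes might look like this (`backend-api-3` is hypothetical, and the space-separated `PEER_NODES` format is an assumption; the script header is authoritative):

```bash
CANARY_NODE=backend-api-2 \
PEER_NODES="backend-api-1 backend-api-3" \
SLI_WINDOW=300 \
ARTIFACT=/tmp/veza-api-v1.0.10 \
ROLLBACK_BINARY=/opt/veza/backend-api/veza-api.previous \
bash scripts/deploy-canary.sh
```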
## Acceptance bar (Day 23)
Per `docs/ROADMAP_V1.0_LAUNCH.md`: 3 canary deploys on staging: 2 normal + 1 with a deliberate rollback (e.g. push a binary that hardcodes a 500 on `/api/v1/health`). The rollback exercise verifies the script's auto-revert path actually fires.
## What this doesn't do
- **Cross-LB rolls**: a single haproxy is assumed. When phase 2 adds keepalived + a second LB, the canary script will need a `--lb-set` arg to roll the LB pair too.
- **Database migrations**: split-read-write migrations (e.g. dual-write during a rename) need a multi-step deploy that this script doesn't model. For now, only additive migrations are supported through the canary.
- **Stream-server canary**: the Rust streamer follows a separate playbook (URI-hash routing means a per-track-id affinity, not a per-session affinity). The same principles apply, but the script is backend-api-specific.