veza/docs/CANARY_RELEASE.md

# Canary release — backend-api
> **Audience** : on-call engineer running a release.
> **Trigger** : a new backend-api binary signed-off for prod.
> **Owner** : whoever's on the deploy rota that day.

The canary recipe ships the new binary to **one** backend at a time, watches the SLI for a window, and only continues to the next backend when the SLI stays green. If the SLI breaches at any point, the canary node rolls back automatically to the last-known-good binary.
## Trigger conditions
Run the canary script when one of these is true :
- A normal feature release. New code path, no schema migration that requires lockstep coordination.
- A hot-fix on a Sev-2 or below issue. Sev-1 (security or data-integrity) follows the all-stop rotate path documented in `docs/runbooks/INCIDENT_RESPONSE.md` instead.
## Pre-flight checklist
- [ ] **Migration backward-compat** : the latest schema migration is additive only — no `DROP COLUMN`, no `ALTER COLUMN ... TYPE`, no `ADD COLUMN ... NOT NULL` without `DEFAULT`. The script's pre-deploy hook (`scripts/check-migration-backward-compat.sh`) refuses to proceed when it finds one ; bypass with `FORCE_MIGRATE=1` only once you've worked out how the migration splits into backward-compatible steps.
- [ ] **Last-known-good binary** is preserved. Either : (a) the previous release's `veza-api` is still on the host at `/opt/veza/backend-api/veza-api.previous`, OR (b) you have it locally and pass `ROLLBACK_BINARY=/path/to/old/veza-api` as env to the script.
- [ ] **Prometheus reachable** from the deploy host. The SLI monitor queries `${PROM_URL}` (default `http://prom.lxd:9090`) every `${SLI_PROBE_INTERVAL}` seconds for the full `${SLI_WINDOW}` (default 1 h). A quick verification sketch follows this checklist.
- [ ] **HAProxy admin socket reachable** : the script execs into the haproxy Incus container to drive `set server ${POOL}/${NODE} state drain|ready` via socat.
- [ ] **No game day in the same window.** Canary needs a quiet baseline ; chaos drills will push the SLI red and trigger a false rollback.
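
A minimal pre-flight spot-check sketch. The HAProxy container name (`haproxy`) and admin-socket path (`/run/haproxy/admin.sock`) are assumptions, not fixed by this doc ; adjust them to your environment.
```bash
# Pre-flight spot checks, run from the deploy host.
# Assumed: HAProxy container is named "haproxy", admin socket at /run/haproxy/admin.sock.

# (a) last-known-good binary still present on the canary node?
incus exec backend-api-2 -- test -f /opt/veza/backend-api/veza-api.previous \
  && echo "rollback binary present" || echo "MISSING: set ROLLBACK_BINARY"

# Prometheus reachable from here?
curl -fsS "${PROM_URL:-http://prom.lxd:9090}/-/healthy"

# HAProxy admin socket answers?
incus exec haproxy -- sh -c 'echo "show servers state" | socat stdio /run/haproxy/admin.sock' | head -5
```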
## How
### One-shot via Make
```bash
make deploy-canary ARTIFACT=/tmp/veza-api-v1.0.10
```
The Make target wraps the script with reasonable defaults. Override any env (see the script header) by exporting before the `make` call.
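
For example, exporting a couple of knobs before the Make call (values illustrative ; the script header stays the source of truth) :
```bash
# Illustrative overrides; any variable from the script header can be exported the same way.
export PROM_URL=http://prom.lxd:9090     # point at a non-default Prometheus if needed
export SLI_PROBE_INTERVAL=15             # probe twice as often as the default 30 s
make deploy-canary ARTIFACT=/tmp/veza-api-v1.0.10
```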
### Direct script invocation
```bash
ARTIFACT=/tmp/veza-api-v1.0.10 \
ROLLBACK_BINARY=/opt/veza/backend-api/veza-api.previous \
SLI_WINDOW=3600 \
PROM_URL=http://prom.lxd:9090 \
bash scripts/deploy-canary.sh
```
The script is idempotent on the steps that matter : draining an already-drained server is a no-op ; pushing the same binary twice is a no-op (file mtime invariant). Re-runs after a partial failure are safe.
## What happens, in order
1. **Pre-deploy hook** runs `scripts/check-migration-backward-compat.sh` on the new-since-`origin/main` migration files. Forbidden patterns abort the deploy.
2. **Drain `CANARY_NODE`** (default `backend-api-2`) via the HAProxy admin socket. Wait until the node has 0 active connections. (A manual drain sketch follows this list.)
3. **Push the binary** to `/opt/veza/backend-api/veza-api` on the canary container. `systemctl restart veza-backend-api`.
4. **Per-node health check** : `curl http://127.0.0.1:8080/api/v1/health` from inside the container. If the node doesn't return 200 within 60 s, rollback.
5. **Re-enable** the canary node in HAProxy.
6. **LB-side health check** : `curl http://haproxy.lxd${HEALTH_PATH}` returns 200 (proves HAProxy sees the node ready and routes through it).
7. **SLI monitor** for `SLI_WINDOW` seconds (default 3600 = 1 h). Probes Prometheus every `SLI_PROBE_INTERVAL` (default 30 s) for :
   - p95 of `veza_gin_http_request_duration_seconds_bucket` < `PROM_P95_THRESHOLD_S` (0.5 s)
   - error rate (5xx ÷ total) < `PROM_ERR_RATE_THRESHOLD` (0.005 = 0.5%)
   First red probe → rollback. (A probe sketch follows this list.)
8. **Roll the peers** : for each `PEER_NODES` entry (default `backend-api-1`), repeat steps 2-6 (drain → deploy → health → re-enable → LB check). The peer roll skips the SLI monitor because the canary already proved the SLI ; if a peer-specific failure happens (binary corrupt on push, container disk full), the script bails out.
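
For reference, a manual equivalent of the drain / re-enable that steps 2 and 5 automate. The pool and node names come from this doc ; the `haproxy` container name and socket path are the same assumptions as in the pre-flight sketch.
```bash
# Manual drain of the canary node (what step 2 does).
incus exec haproxy -- sh -c \
  'echo "set server api_pool/backend-api-2 state drain" | socat stdio /run/haproxy/admin.sock'

# scur (current sessions) for the drained server; wait for it to hit 0.
incus exec haproxy -- sh -c 'echo "show stat" | socat stdio /run/haproxy/admin.sock' \
  | awk -F, '$1 == "api_pool" && $2 == "backend-api-2" {print $2, "scur=" $5}'

# Put it back in rotation (what step 5 does).
incus exec haproxy -- sh -c \
  'echo "set server api_pool/backend-api-2 state ready" | socat stdio /run/haproxy/admin.sock'
```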
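And a sketch of what a single SLI probe boils down to : two instant queries against the Prometheus HTTP API. The latency metric name is the one quoted above ; the request-counter metric and `status` label in the error-rate query are assumptions, so check the script for the exact expressions it uses.
```bash
PROM_URL=${PROM_URL:-http://prom.lxd:9090}

# p95 latency over the last 5 min; must stay under PROM_P95_THRESHOLD_S (0.5 s).
curl -fsS -G "${PROM_URL}/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.95, sum by (le) (rate(veza_gin_http_request_duration_seconds_bucket[5m])))'

# 5xx / total over the last 5 min; must stay under PROM_ERR_RATE_THRESHOLD (0.005).
# NOTE: metric and label names in this query are assumptions, not taken from the script.
curl -fsS -G "${PROM_URL}/api/v1/query" --data-urlencode \
  'query=sum(rate(veza_gin_http_requests_total{status=~"5.."}[5m])) / sum(rate(veza_gin_http_requests_total[5m]))'
```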
## Rollback path
The script handles the canary rollback automatically when :
- The pre-deploy hook fails. Nothing is changed ; nothing to revert.
- The canary's health check fails after the deploy. Old binary restored from `ROLLBACK_BINARY`, canary re-enabled.
- The SLI breaches during the monitor window. Same as above.
The script does **NOT** roll back peers automatically — by the time peers are rolling, the canary has already accumulated a green-SLI window. A peer health failure is an artifact of the deploy step (corrupt push, container memory issue), not of the new binary itself, and re-running after fixing the local issue is safer than ping-ponging the binary.
## Manual rollback (full)
When the script doesn't catch the regression — say a slow leak that surfaces after the SLI window closes — the on-call manually drives :
```bash
# Find which backend is on the new binary :
incus exec backend-api-1 -- ls -la /opt/veza/backend-api/veza-api
incus exec backend-api-2 -- ls -la /opt/veza/backend-api/veza-api
# Rotate both back to the previous binary :
for ct in backend-api-1 backend-api-2; do
  # mv (not cp): this consumes the .previous copy, so stash a new one before the next deploy
  incus exec "$ct" -- mv /opt/veza/backend-api/veza-api.previous /opt/veza/backend-api/veza-api
  incus exec "$ct" -- systemctl restart veza-backend-api
done
```
The previous binary is conventionally kept at `${INSTALL_DIR}/veza-api.previous` ; the canary script does NOT copy the current binary there before overwriting (deliberate — that's a deploy-pipeline responsibility, not a per-canary responsibility).
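
If you'd rather guarantee pre-flight item (a) yourself than lean on the deploy pipeline, stashing the running binary before a canary run is one command per backend (same paths and container names as above) :
```bash
# Stash the currently running binary as the rollback copy on both backends.
for ct in backend-api-1 backend-api-2; do
  incus exec "$ct" -- cp -p /opt/veza/backend-api/veza-api /opt/veza/backend-api/veza-api.previous
done
```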
## Configuration knobs
All of these are env vars — the script header is the source of truth for defaults.
| Knob | Default | When to change |
| ----------------------------- | ----------------------------- | ----------------------------------------------------- |
| `POOL_BACKEND` | `api_pool` | If you renamed the HAProxy backend |
| `CANARY_NODE` | `backend-api-2` | Toggle which node receives the canary first |
| `PEER_NODES` | `backend-api-1` | When the fleet grows beyond 2 nodes |
| `SLI_WINDOW` | `3600` (1 h) | Shorten for hot-fixes (300 = 5 min minimum) |
| `SLI_PROBE_INTERVAL` | `30` s | Tighter probes catch a leak faster but cost Prom load |
| `PROM_P95_THRESHOLD_S` | `0.5` | Match the SLO ; loosening it hides regressions |
| `PROM_ERR_RATE_THRESHOLD` | `0.005` (0.5 %) | Match the SLO |
| `ROLLBACK_BINARY` | (unset) | Always set in a real run — auto-rollback can't work without it |
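
For instance, a hot-fix canary with the shortest sanctioned window could look like this (artifact path illustrative) :
```bash
# Hot-fix canary with the shortest allowed window (artifact path illustrative).
ARTIFACT=/tmp/veza-api-hotfix \
ROLLBACK_BINARY=/opt/veza/backend-api/veza-api.previous \
SLI_WINDOW=300 \
SLI_PROBE_INTERVAL=15 \
bash scripts/deploy-canary.sh
```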
## Acceptance bar (Day 23)
Per `docs/ROADMAP_V1.0_LAUNCH.md` : 3 canary deploys on staging, 2 normal + 1 with a deliberate rollback (e.g. push a binary that hardcodes a 500 on `/api/v1/health`). The rollback exercise verifies the script's auto-revert path actually fires.
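
During the deliberate-rollback drill you can watch the broken health endpoint directly, outside the script (same URL and container as step 4) :
```bash
# Expect 500 while the deliberately broken binary is live, 200 again after the auto-revert.
incus exec backend-api-2 -- curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/api/v1/health
```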
## What this doesn't do
- **Cross-LB rolls** : single haproxy assumed. When phase-2 adds keepalived + a second LB, the canary script will need a `--lb-set` arg to roll the LB pair too.
- **Database migrations** : split-read-write migrations (e.g. dual-write during a rename) need a multi-step deploy that this script doesn't model. For now, only additive migrations are supported through the canary.
- **Stream-server canary** : the Rust streamer follows a separate playbook (URI-hash routing means a per-track-id affinity, not a per-session affinity). Same principles apply but the script is backend-api-specific.