diff --git a/docs/MIGRATIONS.md b/docs/MIGRATIONS.md
index 9edd26afa..ef0019763 100644
--- a/docs/MIGRATIONS.md
+++ b/docs/MIGRATIONS.md
@@ -47,3 +47,114 @@ Output: `veza-backend-api/migrations/baseline_v0601.sql`
3. Write idempotent SQL when possible (e.g. `IF NOT EXISTS`)
4. Test locally before committing
5. Run `squash_migrations.sh` to update the baseline for the release

## Expand-contract discipline (W5+ deploy pipeline contract)

> **TL;DR** — every migration must be **backward-compatible** with the previous deploy's binary. No `DROP COLUMN`, no `ALTER ... NOT NULL`, no `RENAME` in step 1. Schema evolution happens across **multiple deploys**, not in one.

### Why this matters

The blue/green deploy pipeline (`infra/ansible/playbooks/deploy_app.yml`) makes rollback trivial at the **app layer**: HAProxy flips back to the previous color, ~5 seconds wall-clock, no data lost. But the **database** doesn't have colors. Migrations apply once, against the shared postgres container, and stay applied across the rollback.

If a deploy adds a non-nullable column and the rolled-back binary then tries to insert a row without that column, the insert fails. The rollback button is broken — the previous binary now crashes against the post-migration schema.

The fix isn't to make the pipeline smarter. It's to make migrations forward- and backward-compatible by construction.

### The expand-contract pattern (3 deploys per "destructive" change)

**Step 1 (deploy N) — Expand**: add the new shape **alongside** the old. Both binaries (old + new) keep working.

```sql
-- migration NNN_add_user_email_verified.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- nullable, no default — the old binary doesn't know about it.
-- the new binary writes true/false on signup; reads coalesce NULL → false.
```

**Step 2 (deploy N+1) — Backfill**: once Step 1 is stable in prod (≥ 1 week, no rollbacks needed), backfill existing rows.

```sql
-- migration NNN+1_backfill_user_email_verified.sql
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
```

**Step 3 (deploy N+2) — Contract**: once the backfill is in (a quick completeness check is sketched at the end of this section), add the constraint. The old binary (which still coalesces NULL → false) keeps working; the new binary can rely on the `NOT NULL` guarantee.

```sql
-- migration NNN+2_user_email_verified_not_null.sql
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;
```

After Step 3 is stable, you can roll back exactly **one** deploy without breakage. Rolling back beyond Step 1 is no longer safe — that's the expected consequence of expand-contract.
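Before merging the Step 3 (contract) migration, confirm the Step 2 backfill actually finished. A minimal check sketch, reusing the example column above and the staging postgres container naming from the rollback runbook; adjust the table, column, and container to the real migration and environment:

```bash
# Hypothetical pre-contract check: count the rows the backfill missed.
# Container name and psql invocation follow RUNBOOK_ROLLBACK.md's conventions.
incus exec veza-staging-postgres -- \
  psql -U veza -d veza -tAc \
  "SELECT count(*) FROM users WHERE email_verified IS NULL;"
# Expected output: 0. Anything else means Step 2 hasn't fully landed yet.
```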
### Allowed in a single deploy

| Change                                     | Safe in one deploy?    |
| ------------------------------------------ | ---------------------- |
| `CREATE TABLE`                             | yes                    |
| `CREATE INDEX CONCURRENTLY`                | yes                    |
| Add nullable column                        | yes                    |
| Add column with constant default           | yes (PG ≥ 11)          |
| Backfill `UPDATE` (idempotent)             | yes                    |
| `DROP INDEX CONCURRENTLY`                  | yes (read paths flex)  |
| `DROP TABLE` (if no recent code reads it)  | with caution           |

### NOT allowed in a single deploy

| Change                                           | Why                                              |
| ------------------------------------------------ | ------------------------------------------------ |
| `DROP COLUMN`                                    | rollback's binary still selects it               |
| `ALTER COLUMN ... NOT NULL` (no prior backfill)  | rollback inserts NULL                            |
| `ALTER COLUMN ... TYPE`                          | rollback's binary expects the old type           |
| `RENAME COLUMN`                                  | rollback's binary still references the old name  |
| `RENAME TABLE`                                   | rollback queries the old name                    |

### Reviewer checklist (PRs touching `veza-backend-api/migrations/`)

- [ ] Migration is **forward-only** (GORM doesn't run rollback SQL).
- [ ] Migration is **idempotent** (re-running on an already-migrated DB is a no-op — `IF NOT EXISTS`, `ON CONFLICT DO NOTHING`, etc.).
- [ ] No `DROP COLUMN`, `ALTER ... NOT NULL`, `RENAME` (or, if there is one, the PR description references the prior backfill PRs and explains why this is the contract step).
- [ ] If the migration takes a heavy lock (e.g. an `ALTER TABLE` that rewrites the table), use `CREATE INDEX CONCURRENTLY` or split the change.
- [ ] App code changes assume both the old and the new schema are valid.

### When you must violate the rule (incident)

Sometimes a hot incident demands a destructive change ASAP and rollback is an acceptable risk. In that case:

1. Tag the PR with `migration:destructive`.
2. Document in the PR body what the rollback procedure is (manual SQL to recreate the dropped column, etc.).
3. Get a second pair of eyes on the migration before merge.
4. Block the corresponding rollback workflow for that env until you've verified the new schema is sticking.

### Future hardening (not in v1.0.x)

A `squawk` linter step in `.forgejo/workflows/ci.yml` could scan `veza-backend-api/migrations/*.sql` and fail on `DROP COLUMN`, `ALTER ... NOT NULL`, `RENAME`. The discipline above is the v1.0 answer; tooling lands when the hand-rolled discipline starts missing things.
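Until that linter lands, a crude grep-based guard captures most of the value. The following is only a sketch, not an existing workflow step; it would also flag a legitimate contract-step migration, which is exactly the case a reviewer (or the `migration:destructive` label) is meant to handle:

```bash
#!/usr/bin/env bash
# Sketch of a CI guard against single-deploy destructive migrations.
# Not wired into .forgejo/workflows/ci.yml yet; treat as illustrative.
set -euo pipefail

forbidden='DROP[[:space:]]+COLUMN|SET[[:space:]]+NOT[[:space:]]+NULL|RENAME[[:space:]]+(COLUMN|TO)'

if grep -RniE "$forbidden" veza-backend-api/migrations/; then
  echo "Destructive migration pattern found, see docs/MIGRATIONS.md (expand-contract)." >&2
  exit 1
fi
echo "No destructive migration patterns detected."
```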
diff --git a/docs/RUNBOOK_ROLLBACK.md b/docs/RUNBOOK_ROLLBACK.md
new file mode 100644
index 000000000..be87730ba
--- /dev/null
+++ b/docs/RUNBOOK_ROLLBACK.md
@@ -0,0 +1,253 @@

# Runbook — rollback a Veza deploy

Three rollback paths, ordered from fastest to slowest. Pick based on what's still alive and what you're rolling back from.

| Path                  | Time   | Use when                                                                          |
| --------------------- | ------ | --------------------------------------------------------------------------------- |
| 1. HAProxy fast-flip  | ~5s    | The previous color's containers are still alive.                                   |
| 2. Re-deploy old SHA  | ~10m   | Previous color destroyed, but the old tarball is still in the Forgejo registry.    |
| 3. Manual emergency   | ad hoc | Both of the above failed (registry purged, infra broken).                          |

> **Before you roll back, breathe and read this first.** The default instinct under fire is "smash the rollback button". Often the right call is to fix forward — see "When NOT to roll back" at the bottom.

---

## Decision flowchart

```
  Did the new color come up at all?
                  │
      ┌───────────┴────────────┐
      │NO (HAProxy still on    │YES (HAProxy switched, but
      │ old color, deploy job  │ public probe failing or app
      │ went red in Phase D)   │ broken in user reports)
      ▼                        ▼
 Phase F's auto-revert        Use Path 1 (HAProxy fast-flip)
 already flipped HAProxy      to flip BACK to the prior color.
 for you. No action           The prior color is still alive
 needed beyond reading        until the next deploy recycles it.
 logs.
                              If the prior color was already
                              cleaned up, use Path 2.
```

---

## Path 1 — HAProxy fast-flip (~5s)

Use when the prior color's containers are still alive. Triggered via the `Veza rollback` workflow with `mode=fast`.

### Pre-checks

```bash
# What's the current active color?
incus exec veza-staging-haproxy -- cat /var/lib/veza/active-color
# (or veza-haproxy in prod)

# What's the prior color (last entry of the history)?
incus exec veza-staging-haproxy -- head -2 /var/lib/veza/active-color.history

# Are the prior color's containers RUNNING?
incus list 'veza-staging-{backend,stream,web}-blue' --format csv -c n,s
```

### Trigger

In the Forgejo UI: **Actions → Veza rollback → Run workflow**:

| input        | value                                                     |
| ------------ | --------------------------------------------------------- |
| env          | staging (or prod)                                          |
| mode         | fast                                                       |
| target_color | (the PRIOR color, e.g. blue if green is currently active)  |
| release_sha  | (leave empty)                                              |

The workflow runs `infra/ansible/playbooks/rollback.yml -e mode=fast -e target_color=blue`, which:

1. Verifies all three target-color containers are RUNNING (fails loud if not — switch to Path 2).
2. Re-templates `haproxy.cfg` with `veza_active_color=blue`, validates it with `haproxy -c`, swaps it in with an atomic `mv`, and HUPs HAProxy.
3. Updates `/var/lib/veza/active-color`.

Wall time: ~5s. Zero connection drop (the HAProxy reload is graceful).

### Post-rollback

- Verify externally: `curl https://staging.veza.fr/api/v1/health`
- Check logs of the bad color (kept alive for forensics): `incus exec veza-staging-backend-green -- journalctl -u veza-backend -n 200`
- Once the root cause is understood, run the **Veza cleanup** workflow with `color=green` to reclaim the slot.

---

## Path 2 — Re-deploy an older SHA (~10 minutes)

Use when the prior color's containers were already destroyed (the next deploy recycled them) but the old tarball is still in the Forgejo package registry.

### Pre-checks

```bash
# Pick the SHA you want to roll back TO.
# Look at the active-color.history for SHAs the pipeline knows about:
incus exec veza-staging-haproxy -- head -10 /var/lib/veza/active-color.history

# Or `git log --oneline main` for any commit; just confirm the
# tarball still exists in the registry (default retention: 30 SHAs
# per component):
curl -fsSL -I -H "Authorization: token $TOKEN" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"
```
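The check above only covers the backend artefact. A small loop can confirm all three components in one pass; note that the stream and web package names are assumed here to mirror the backend's `veza-<component>` naming, which should be verified against the registry:

```bash
# Hypothetical pre-check: confirm the rollback target exists for every
# component. Package naming for stream/web is assumed to mirror the
# backend's; adjust if the registry uses different names.
for comp in backend stream web; do
  url="https://forgejo.veza.fr/api/packages/talas/generic/veza-${comp}/${SHA}/veza-${comp}-${SHA}.tar.zst"
  if curl -fsSL -I -H "Authorization: token $TOKEN" "$url" >/dev/null 2>&1; then
    echo "OK       veza-${comp} @ ${SHA}"
  else
    echo "MISSING  veza-${comp} @ ${SHA} (Path 2 will fail for this component)"
  fi
done
```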
### Trigger

In the Forgejo UI: **Actions → Veza rollback → Run workflow**:

| input        | value                              |
| ------------ | ---------------------------------- |
| env          | staging (or prod)                  |
| mode         | full                               |
| target_color | (leave empty)                      |
| release_sha  | the 40-char SHA you're rolling TO  |

The workflow runs `playbooks/rollback.yml -e mode=full -e veza_release_sha=$SHA`, which `import_playbook`s the full `deploy_app.yml` pipeline. Same Phase A → Phase F sequence as a normal deploy, but with the older SHA.

Wall time: ~5–10 minutes (the build artefacts already exist; only the deploy half runs).

### Caveat — schema migrations

Migrations are **not** rolled back automatically. The schema after Path 2 is the post-deploy schema, not the pre-deploy schema. Per **MIGRATIONS.md**'s expand-contract discipline, this should be fine for one deploy back. If it isn't (i.e., the failed deploy included a destructive migration), see **Path 3**.

---

## Path 3 — Manual emergency (ad hoc)

You're here when:

- The Forgejo registry has been purged of the SHA you need.
- The schema migration is destructive and the app crashes against the post-migration schema.
- The Incus host itself is in a bad state.

### Tarball missing — rebuild and push manually

```bash
# Build the artefact locally (you'll need the toolchain):
cd veza-backend-api
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" \
  -o ./bin/veza-api ./cmd/api/main.go
tar --use-compress-program=zstd -cf "/tmp/veza-backend-$SHA.tar.zst" \
  -C ./bin veza-api migrate_tool

# Push to the registry:
curl -fsSL --fail-with-body -X PUT \
  -H "Authorization: token $TOKEN" \
  --upload-file "/tmp/veza-backend-$SHA.tar.zst" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"

# Then run Path 2.
```

### Schema is poisoned — manual SQL

The destructive migration's PR description should document the inverse SQL (per MIGRATIONS.md, "When you must violate the rule"). Apply it inside the postgres container:

```bash
incus exec veza-staging-postgres -- psql -U veza veza < /tmp/inverse.sql
```

Then run Path 2 to deploy the older binary.

### Incus host broken — roll back the ZFS snapshot

`deploy_data.yml` snapshots every data container's dataset before mutating anything (`@pre-deploy-`). To restore:

```bash
# First, stop the container:
incus stop veza-staging-postgres

# Roll the dataset back to the pre-deploy snapshot:
zfs rollback -r rpool/incus/containers/veza-staging-postgres@pre-deploy-

# Restart the container:
incus start veza-staging-postgres
```

This loses any data written after the snapshot. Last resort only.

---

## When NOT to roll back

- **A single user reports a bug.** Triage first; rolling back affects 100% of users to fix something hitting <1%.
- **Performance regression.** If the new SHA is up but slow, scale horizontally before rolling back. (A future Hetzner offload covers this; for now, accept the regression and prep a fix-forward.)
- **Cosmetic UI bug.** Hot-fix the frontend and let the deploy pipeline ship it as a normal commit.
- **You're not on-call and didn't get a page.** Don't roll back "to be safe". That's the on-call's call.

The rollback button's existence isn't a license to use it preemptively. Each rollback resets the team's confidence in the pipeline; rolling back too often makes the next real deploy feel risky.

---

## Post-incident

After ANY rollback (Path 1, 2, or 3):

1. Update **docs/POSTMORTEMS.md** (or `docs/runbooks/incidents/.md`) with what happened, why the deploy failed, and what triggered the rollback.
2. File the fix as a normal PR; do NOT skip CI.
3. If the failed deploy left containers behind (Path 1's "old color kept alive"), run the **Veza cleanup** workflow with the failed color once forensics are done.
4. Verify the alert `VezaDeployFailed` cleared (the next successful deploy resets `last_success_timestamp > last_failure_timestamp`).

---

## Workflows referenced

- `.forgejo/workflows/deploy.yml` — push:main → staging, tag → prod.
- `.forgejo/workflows/rollback.yml` — workflow_dispatch only, modes fast and full.
- `.forgejo/workflows/cleanup-failed.yml` — workflow_dispatch only, destroys a specific color's app containers.

## Playbooks referenced

- `infra/ansible/playbooks/deploy_app.yml`
- `infra/ansible/playbooks/rollback.yml`
- `infra/ansible/playbooks/cleanup_failed.yml`
- `infra/ansible/playbooks/deploy_data.yml`

## Roles referenced

- `infra/ansible/roles/veza_app/`
- `infra/ansible/roles/veza_haproxy_switch/`
- `infra/ansible/roles/haproxy/` (template `haproxy.cfg.j2` with the blue/green topology toggle).