docs: MIGRATIONS expand-contract section + RUNBOOK_ROLLBACK
Two operator docs the W5+ deploy pipeline depends on for safe
operation.
docs/MIGRATIONS.md (extended):
Existing file already covered migration tooling + naming. Append
an "Expand-contract discipline (W5+ deploy pipeline contract)"
section: explains why blue/green rollback breaks if migrations
are forward-only, walks through the 3-deploy expand-backfill-
contract pattern with a worked example (add nullable column →
backfill → set NOT NULL), tables of allowed vs not-allowed
changes for a single deploy, a reviewer checklist, and an "in case
of incident" override path with audit trail.
docs/RUNBOOK_ROLLBACK.md (new):
Three rollback paths, from fastest to slowest:
1. HAProxy fast-flip (~5s) — when the prior color is still alive,
use the rollback.yml workflow with mode=fast. Pre-checks +
post-rollback steps.
2. Re-deploy older SHA (~10m) — when the prior color is gone but
the tarball is still in the Forgejo registry. mode=full.
Schema-migration caveat documented.
3. Manual emergency — tarball missing (rebuild + push), schema
poisoned (manual SQL), Incus host broken (ZFS rollback).
Plus a decision flowchart, "When NOT to rollback" with examples
that bias toward fix-forward over rollback (single-user bugs,
perf regressions, cosmetic issues), and a post-incident checklist.
Cross-referenced with the workflow + playbook + role file paths
the operator will actually need to look up.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent f4eb4732dd, commit 22d09dcbbb
2 changed files with 364 additions and 0 deletions
docs/MIGRATIONS.md
@@ -47,3 +47,114 @@ Output: `veza-backend-api/migrations/baseline_v0601.sql`
3. Write idempotent SQL when possible (e.g. `IF NOT EXISTS`)
4. Test locally before committing
5. Run `squash_migrations.sh` to update the baseline for the release
## Expand-contract discipline (W5+ deploy pipeline contract)

> **TL;DR** — every migration must be **backward-compatible** with the
> previous deploy's binary. No `DROP COLUMN`, no `ALTER ... NOT NULL`,
> no `RENAME` in step 1. Schema evolution happens across **multiple
> deploys**, not in one.

### Why this matters

The blue/green deploy pipeline (`infra/ansible/playbooks/deploy_app.yml`)
makes rollback trivial at the **app layer**: HAProxy flips back to
the previous color, ~5 seconds wall-clock, no data lost. But the
**database** doesn't have colors. Migrations apply once, against the
shared postgres container, and stay applied across the rollback.

If a deploy adds a non-nullable column and the rollback tries to insert
a row without that column, the insert fails. The rollback button is
broken — the previous binary now crashes against the post-migration
schema.

The fix isn't to make the pipeline smarter. It's to make migrations
forward-AND-backward compatible by construction.

### The expand-contract pattern (3 deploys per "destructive" change)

**Step 1 (deploy N) — Expand**: add the new shape **alongside** the
old. Both binaries (old + new) work.

```sql
-- migration NNN_add_user_email_verified.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- nullable, no default — the old binary doesn't know about it.
-- the new binary writes true/false on signup; reads coalesce NULL → false.
```
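
During Step 1 the new binary has to read both shapes. A coalescing
query is enough (an illustrative sketch; the actual query lives in the
app code, not in the migration, and the `email` column is assumed):

```sql
-- NULL means "row predates the feature"; the app treats it as false.
SELECT id, email, COALESCE(email_verified, false) AS email_verified
FROM users
WHERE id = $1;
```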

**Step 2 (deploy N+1) — Backfill**: once Step 1 is stable in prod
(≥ 1 week, no rollbacks needed), backfill existing rows.

```sql
-- migration NNN+1_backfill_user_email_verified.sql
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
```
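
On a large table, that single UPDATE holds its row locks and writes its
WAL in one burst. A batched variant keeps each pass short (a sketch,
assuming an operator or the runner re-runs it until it reports
`UPDATE 0`; the batch size is arbitrary):

```sql
-- Each pass touches at most 10 000 rows; repeat until UPDATE 0.
UPDATE users SET email_verified = false
WHERE id IN (
    SELECT id FROM users
    WHERE email_verified IS NULL
    LIMIT 10000
);
```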

**Step 3 (deploy N+2) — Contract**: once the backfill is in, add the
constraint. The old binary (still coalescing NULL → false on reads)
keeps working; the new binary can rely on `NOT NULL`.

```sql
-- migration NNN+2_user_email_verified_not_null.sql
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;
```
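
`SET NOT NULL` normally scans the whole table under an `ACCESS
EXCLUSIVE` lock. On PostgreSQL 12+ a pre-validated CHECK constraint
lets that scan happen without blocking writes (a sketch; the
constraint name is illustrative):

```sql
-- VALIDATE scans with only a light lock; SET NOT NULL then reuses the
-- proven constraint instead of re-scanning, and the CHECK can go.
ALTER TABLE users ADD CONSTRAINT users_email_verified_not_null
    CHECK (email_verified IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_email_verified_not_null;
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_verified_not_null;
```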

After Step 3 is stable, you can rollback exactly **one** deploy without
breakage. Rolling back beyond Step 1 is no longer safe — that's the
expected consequence of expand-contract.

### Allowed in a single deploy

| Change                                     | Safe in one deploy?   |
| ------------------------------------------ | --------------------- |
| `CREATE TABLE`                             | yes                   |
| `CREATE INDEX CONCURRENTLY`                | yes                   |
| Add nullable column                        | yes                   |
| Add column with constant default           | yes (PG ≥ 11)         |
| Backfill UPDATE (idempotent)               | yes                   |
| `DROP INDEX CONCURRENTLY`                  | yes (read paths flex) |
| `DROP TABLE` (if no recent code reads it)  | with caution          |
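
For instance, a migration composed only of "yes" rows is rollback-safe
in one deploy, because the previous binary never touches the new
objects (a sketch with illustrative names):

```sql
-- New table + index: invisible to the old binary, so rollback is safe.
CREATE TABLE IF NOT EXISTS audit_events (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id    BIGINT,
    kind       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- CONCURRENTLY can't run inside a transaction; put it in its own
-- migration file if the runner wraps each file in one.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_audit_events_user_id
    ON audit_events (user_id);
```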

### NOT allowed in a single deploy

| Change                                          | Why                                          |
| ----------------------------------------------- | -------------------------------------------- |
| `DROP COLUMN`                                   | rollback's binary still selects it           |
| `ALTER COLUMN ... NOT NULL` (no prior backfill) | rollback inserts NULL                        |
| `ALTER COLUMN ... TYPE`                         | rollback's binary expects old type           |
| `RENAME COLUMN`                                 | rollback's binary still references old name  |
| `RENAME TABLE`                                  | rollback queries old name                    |
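
Every row in this table decomposes into the expand-contract steps
above. `RENAME COLUMN`, for example (illustrative column names, one
statement per deploy):

```sql
-- deploy N (expand): app dual-writes, reads COALESCE(full_name, display_name)
ALTER TABLE users ADD COLUMN full_name TEXT;
-- deploy N+1 (backfill):
UPDATE users SET full_name = display_name WHERE full_name IS NULL;
-- deploy N+2 (contract): only after no deployed binary reads display_name
ALTER TABLE users DROP COLUMN display_name;
```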

### Reviewer checklist (PRs touching `veza-backend-api/migrations/`)

- [ ] Migration is **forward-only** (GORM doesn't run rollback SQL).
- [ ] Migration is **idempotent** (re-running on an already-migrated
      DB is a no-op — `IF NOT EXISTS`, `ON CONFLICT DO NOTHING`, etc.;
      see the sketch after this list).
- [ ] No `DROP COLUMN`, `ALTER ... NOT NULL`, `RENAME` (or, if there
      is, the PR description references the prior backfill PRs and
      explains why this is the contract step).
- [ ] If the migration takes a heavy lock (e.g. a table-rewriting
      `ALTER TABLE`), use `CREATE INDEX CONCURRENTLY` or split it up.
- [ ] App code changes assume both old and new schema are valid.
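
The idempotency patterns the checklist expects, in one place
(illustrative; the `feature_flags` table is made up for the example):

```sql
-- Each statement no-ops when its effect already exists, so re-running
-- the migration against an already-migrated DB is harmless.
ALTER TABLE users ADD COLUMN IF NOT EXISTS email_verified BOOLEAN;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email_verified
    ON users (email_verified);
INSERT INTO feature_flags (name, enabled)
VALUES ('email_verification', false)
ON CONFLICT (name) DO NOTHING;
```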

### When you must violate the rule (incident)

Sometimes a hot incident demands a destructive change ASAP and rollback
is acceptable risk. In that case:

1. Tag the PR with `migration:destructive`.
2. Document in the PR body what the rollback procedure is (manual
   SQL to recreate the dropped column, etc.).
3. Get a second pair of eyes on the migration before merge.
4. Block the corresponding rollback workflow for that env until
   you've verified the new schema is sticking.
### Future hardening (not in v1.0.x)

A `squawk` linter step in `.forgejo/workflows/ci.yml` could scan
`veza-backend-api/migrations/*.sql` and fail on `DROP COLUMN`,
`ALTER ... NOT NULL`, `RENAME`. The discipline above is the v1.0
answer; tooling lands when the hand-rolled discipline starts
missing things.
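
Until then, even a crude grep gate would catch the worst offenders (a
sketch, not a committed workflow step; tune the pattern to taste):

```bash
# Fail the job if any migration contains an obviously destructive
# statement. Crude on purpose: a false positive costs a review comment,
# a missed DROP COLUMN costs the rollback button.
if grep -niE 'DROP +COLUMN|SET +NOT +NULL|RENAME +(COLUMN|TO)' \
    veza-backend-api/migrations/*.sql; then
  echo "destructive migration statement found; see docs/MIGRATIONS.md" >&2
  exit 1
fi
```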

docs/RUNBOOK_ROLLBACK.md (new file, 253 lines)
@@ -0,0 +1,253 @@
# Runbook — rollback a Veza deploy

Three rollback paths, ordered from fastest to slowest. Pick based on
what's still alive and what you're rolling back from.

| Path                 | Time   | Use when                                                                         |
| -------------------- | ------ | -------------------------------------------------------------------------------- |
| 1. HAProxy fast-flip | ~5s    | The previous color's containers are still alive.                                 |
| 2. Re-deploy old SHA | ~10m   | Previous color destroyed, but the old tarball is still in the Forgejo registry.  |
| 3. Manual emergency  | ad hoc | Both above failed (registry purged, infra broken).                               |

> **Before you rollback, breathe and read this first.** The default
> instinct under fire is "smash the rollback button". Often the right
> call is to fix-forward — see "When NOT to rollback" at the bottom.
---

## Decision flowchart

```
      Did the new color come up at all?
                    │
        ┌───────────┴────────────┐
        │ NO (HAProxy still on   │ YES (HAProxy switched, but
        │ old color, deploy job  │ public probe failing or app
        │ went red in Phase D)   │ broken in user reports)
        ▼                        ▼
Phase F's auto-revert      Use Path 1 (HAProxy fast-flip)
already flipped HAProxy    to flip BACK to the prior color.
for you. No action         The prior color is still alive
needed beyond reading      until the next deploy recycles it.
logs.
                           If the prior color was already
                           cleaned up, use Path 2.
```

---

## Path 1 — HAProxy fast-flip (~5s)

Use when the prior color's containers are still alive. Triggered via
the `Veza rollback` workflow with `mode=fast`.

### Pre-checks

```bash
# What's the current active color?
incus exec veza-staging-haproxy -- cat /var/lib/veza/active-color
# (or veza-haproxy in prod)

# What's the prior color (last entry of the history)?
incus exec veza-staging-haproxy -- head -2 /var/lib/veza/active-color.history

# Are the prior color's containers RUNNING?
incus list 'veza-staging-{backend,stream,web}-blue' --format csv -c n,s
```

### Trigger

In the Forgejo UI: **Actions → Veza rollback → Run workflow**:

| input        | value                                                     |
| ------------ | --------------------------------------------------------- |
| env          | staging (or prod)                                         |
| mode         | fast                                                      |
| target_color | (the PRIOR color, e.g. blue if green is currently active) |
| release_sha  | (leave empty)                                             |

The workflow runs `infra/ansible/playbooks/rollback.yml -e mode=fast
-e target_color=blue`, which:

1. Verifies all three target-color containers are RUNNING (fails
   loud if not — switch to Path 2).
2. Re-templates `haproxy.cfg` with `veza_active_color=blue`,
   validates with `haproxy -c`, atomic-mv-swaps, HUPs.
3. Updates `/var/lib/veza/active-color`.

Wall time: ~5s. Zero connection drop (HAProxy reload is graceful).
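
Step 2 condensed, for the curious (a sketch of the idea only; the real
tasks live in the `veza_haproxy_switch` role):

```bash
# Validate the candidate config before it can take effect, then swap
# atomically and reload; HAProxy finishes in-flight connections.
haproxy -c -f /etc/haproxy/haproxy.cfg.new \
  && mv /etc/haproxy/haproxy.cfg.new /etc/haproxy/haproxy.cfg \
  && systemctl reload haproxy
```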

### Post-rollback

- Verify externally: `curl https://staging.veza.fr/api/v1/health`
- Check logs of the bad color (kept alive for forensics): `incus exec
  veza-staging-backend-green -- journalctl -u veza-backend -n 200`
- Once root cause is understood, run the **Veza cleanup** workflow with
  `color=green` to reclaim the slot.
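
If the Forgejo UI itself is unreachable, the same flip can be run
straight from a workstation with Ansible access (a sketch; the
inventory path is an assumption, check `infra/ansible/` for the real
one):

```bash
# Same playbook the workflow wraps; mode=fast skips the re-deploy half.
ansible-playbook -i infra/ansible/inventories/staging \
  infra/ansible/playbooks/rollback.yml \
  -e mode=fast -e target_color=blue
```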

---

## Path 2 — Re-deploy older SHA (~10 minutes)

Use when the prior color's containers were already destroyed (next
deploy recycled them) but the old tarball is still in the Forgejo
package registry.

### Pre-checks

```bash
# Pick the SHA you want to roll back TO.
# Look at the active-color.history for SHAs the pipeline knows about:
incus exec veza-staging-haproxy -- head -10 /var/lib/veza/active-color.history

# Or `git log --oneline main` for any commit; just confirm the
# tarball still exists in the registry (default retention 30 SHAs
# per component):
curl -fsSL -I -H "Authorization: token $TOKEN" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"
```

### Trigger

In the Forgejo UI: **Actions → Veza rollback → Run workflow**:

| input        | value                             |
| ------------ | --------------------------------- |
| env          | staging (or prod)                 |
| mode         | full                              |
| target_color | (leave empty)                     |
| release_sha  | the 40-char SHA you're rolling TO |

The workflow runs `playbooks/rollback.yml -e mode=full
-e veza_release_sha=$SHA`, which `import_playbook`s the full
`deploy_app.yml` pipeline. Same Phase A → Phase F sequence as a
normal deploy, but with the older SHA.

Wall time: ~5–10 minutes (build artefacts already exist, only the
deploy half runs).

### Caveat — schema migrations

Migrations are **not** rolled back automatically. The schema after
Path 2 is the post-deploy schema, not the pre-deploy schema.
Per **MIGRATIONS.md**'s expand-contract discipline, this should be
fine for one deploy back. If it isn't (i.e., the failed deploy
included a destructive migration), see **Path 3**.
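
To see exactly which migrations the failed deploy applied, inspect the
tracking table (a sketch; the table name depends on the migration
tooling in `veza-backend-api`, so verify it before trusting it):

```bash
# Compare against the migrations/ directory at the SHA you roll back to.
incus exec veza-staging-postgres -- psql -U veza veza \
  -c "SELECT id FROM migrations ORDER BY id DESC LIMIT 5;"
```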

---

## Path 3 — Manual emergency (ad hoc)

You're here when:

- Forgejo registry has been purged of the SHA you need.
- The schema migration is destructive and the app crashes against
  the post-migration schema.
- The Incus host itself is in a bad state.
### Tarball missing — rebuild and push manually

```bash
# Build the artefact locally (you'll need the toolchain):
cd veza-backend-api
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" \
  -o ./bin/veza-api ./cmd/api/main.go
# The tarball also ships migrate_tool; build it into ./bin the same way
# first if it isn't there already (its cmd path is repo-specific).
tar --use-compress-program=zstd -cf "/tmp/veza-backend-$SHA.tar.zst" \
  -C ./bin veza-api migrate_tool

# Push to the registry (--fail-with-body replaces -f: the two flags are
# mutually exclusive, and this one keeps the error body for debugging):
curl -sSL --fail-with-body -X PUT \
  -H "Authorization: token $TOKEN" \
  --upload-file "/tmp/veza-backend-$SHA.tar.zst" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"

# Then run Path 2.
```

### Schema is poisoned — manual SQL

The destructive migration's PR description should document the
inverse SQL (per MIGRATIONS.md "When you must violate the rule").
Apply it inside the postgres container:

```bash
# ON_ERROR_STOP aborts at the first failing statement instead of
# ploughing on with a half-applied inverse.
incus exec veza-staging-postgres -- psql -v ON_ERROR_STOP=1 -U veza veza < /tmp/inverse.sql
```

Then run Path 2 to deploy the older binary.

### Incus host broken — rollback ZFS snapshot

`deploy_data.yml` snapshots every data container's dataset before
mutating anything (`<dataset>@pre-deploy-<sha>`). To restore:

```bash
# Confirm the snapshot actually exists before stopping anything:
zfs list -t snapshot -o name | grep pre-deploy

# First, stop the container:
incus stop veza-staging-postgres

# Roll the dataset back to the pre-deploy snapshot (-r also destroys
# any snapshots taken after it):
zfs rollback -r rpool/incus/containers/veza-staging-postgres@pre-deploy-<sha>

# Restart the container:
incus start veza-staging-postgres
```

This loses any data written after the snapshot. Last-resort only.
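
If postgres still starts, dump the current state first so the
post-snapshot writes aren't gone forever (assumes `pg_dump` is
available inside the container):

```bash
# Custom-format dump of everything written since the snapshot.
incus exec veza-staging-postgres -- pg_dump -U veza -Fc veza \
  > /tmp/pre-zfs-rollback.dump
```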

---

## When NOT to rollback

- **Single user reports a bug**. Triage first; rolling back affects
  100% of users to fix something hitting <1%.
- **Performance regression**. If the new SHA is up but slow, scale
  horizontally before rolling back. (Future Hetzner offload covers
  this; for now, accept the regression and prep a fix-forward.)
- **Cosmetic UI bug**. Hot-fix the frontend and let the deploy
  pipeline ship it as a normal commit.
- **You're not on-call and didn't get a page**. Don't rollback "to
  be safe". The on-call's call.

The rollback button's existence isn't a license to use it
preemptively. Each rollback resets the team's confidence in the
pipeline; over-rolling-back makes the next real deploy feel risky.

---

## Post-incident

After ANY rollback (Path 1, 2, or 3):

1. Update **docs/POSTMORTEMS.md** (or `docs/runbooks/incidents/<date>.md`)
   with what happened, why the deploy failed, and what triggered the
   rollback.
2. File the fix as a normal PR; do NOT skip CI.
3. If the failed deploy left containers behind (Path 1's "old color
   kept alive"), run the **Veza cleanup** workflow with the failed color
   once forensics are done.
4. Verify the alert `VezaDeployFailed` cleared (the next successful
   deploy resets `last_success_timestamp > last_failure_timestamp`).

---

## Workflows referenced

- `.forgejo/workflows/deploy.yml` — push:main → staging, tag → prod.
- `.forgejo/workflows/rollback.yml` — workflow_dispatch only, modes
  fast and full.
- `.forgejo/workflows/cleanup-failed.yml` — workflow_dispatch only,
  destroys a specific color's app containers.

## Playbooks referenced

- `infra/ansible/playbooks/deploy_app.yml`
- `infra/ansible/playbooks/rollback.yml`
- `infra/ansible/playbooks/cleanup_failed.yml`
- `infra/ansible/playbooks/deploy_data.yml`

## Roles referenced

- `infra/ansible/roles/veza_app/`
- `infra/ansible/roles/veza_haproxy_switch/`
- `infra/ansible/roles/haproxy/` (template `haproxy.cfg.j2` with
  blue/green topology toggle).