docs: MIGRATIONS expand-contract section + RUNBOOK_ROLLBACK

Two operator docs the W5+ deploy pipeline depends on for safe
operation.

docs/MIGRATIONS.md (extended):
  The existing file already covered migration tooling + naming. Append
  an "Expand-contract discipline (W5+ deploy pipeline contract)"
  section: it explains why blue/green rollback breaks if migrations
  are forward-only, walks through the 3-deploy expand-backfill-
  contract pattern with a worked example (add nullable column →
  backfill → set NOT NULL), and adds tables of allowed vs not-allowed
  changes for a single deploy, a reviewer checklist, and an "in case
  of incident" override path with an audit trail.

docs/RUNBOOK_ROLLBACK.md (new):
  Three rollback paths, from fastest to slowest:
   1. HAProxy fast-flip (~5s) — when prior color is still alive,
      use the rollback.yml workflow with mode=fast. Pre-checks +
      post-rollback steps.
   2. Re-deploy older SHA (~10m) — when prior color is gone but
      tarball is still in the Forgejo registry. mode=full.
      Schema-migration caveat documented.
   3. Manual emergency — tarball missing (rebuild + push), schema
      poisoned (manual SQL), Incus host broken (ZFS rollback).

Plus a decision flowchart, "When NOT to roll back" with examples
that bias toward fix-forward over rollback (single-user bugs,
perf regressions, cosmetic issues), and a post-incident checklist.

Cross-referenced with the workflow + playbook + role file paths
the operator will actually need to look up.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
senke 2026-04-29 14:48:46 +02:00
parent f4eb4732dd
commit 22d09dcbbb
2 changed files with 364 additions and 0 deletions

docs/MIGRATIONS.md

@@ -47,3 +47,114 @@ Output: `veza-backend-api/migrations/baseline_v0601.sql`
3. Write idempotent SQL when possible (e.g. `IF NOT EXISTS`)
4. Test locally before committing
5. Run `squash_migrations.sh` to update the baseline for the release
## Expand-contract discipline (W5+ deploy pipeline contract)
> **TL;DR** — every migration must be **backward-compatible** with the
> previous deploy's binary. No `DROP COLUMN`, no `ALTER ... NOT NULL`,
> no `RENAME` in step 1. Schema evolution happens across **multiple
> deploys**, not in one.
### Why this matters
The blue/green deploy pipeline (`infra/ansible/playbooks/deploy_app.yml`)
makes rollback trivial at the **app layer**: HAProxy flips back to
the previous color, ~5 seconds wall-clock, no data lost. But the
**database** doesn't have colors. Migrations apply once, against the
shared postgres container, and stay applied across the rollback.
If a deploy adds a non-nullable column and, after rollback, the previous
binary tries to insert a row without that column, the insert fails. The
rollback button is broken — the previous binary now crashes against the
post-migration schema.
The fix isn't to make the pipeline smarter. It's to make migrations
forward-AND-backward compatible by construction.
### The expand-contract pattern (3 deploys per "destructive" change)
**Step 1 (deploy N) — Expand**: add the new shape **alongside** the
old. Both binaries (old + new) work.
```sql
-- migration NNN_add_user_email_verified.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- nullable, no default — the old binary doesn't know about it.
-- the new binary writes true/false on signup; reads coalesce NULL → false.
```
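A rough sketch of the read/write pattern that keeps both binaries happy during the expand window (illustrative queries only; the `email` column and the exact statements are assumptions, not the real schema):
```sql
-- Reads tolerate the NULLs left behind by the old binary:
SELECT id, COALESCE(email_verified, false) AS email_verified FROM users;

-- The new binary writes an explicit value; the old binary omits the column
-- entirely and leaves it NULL, which the read above handles.
INSERT INTO users (email, email_verified) VALUES ('someone@example.com', false);
```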
**Step 2 (deploy N+1) — Backfill**: once Step 1 is stable in prod
(≥ 1 week, no rollbacks needed), backfill existing rows.
```sql
-- migration NNN+1_backfill_user_email_verified.sql
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
```
**Step 3 (deploy N+2) — Contract**: once the backfill is in, add the
constraint. The old binary (which already writes explicit values and
coalesces NULL → false on reads) keeps working; the new binary can rely on `NOT NULL`.
```sql
-- migration NNN+2_user_email_verified_not_null.sql
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;
```
After Step 3 is stable, you can roll back exactly **one** deploy without
breakage. Rolling back past Step 1 (to a binary that predates the expand)
is no longer safe — that's the expected consequence of expand-contract.
### Allowed in a single deploy
| Change | Safe in one deploy? |
| --------------------------------------- | ----------------------- |
| `CREATE TABLE` | yes |
| `CREATE INDEX CONCURRENTLY` | yes |
| Add nullable column | yes |
| Add column with constant default | yes (PG ≥ 11) |
| Backfill UPDATE (idempotent) | yes |
| `DROP INDEX CONCURRENTLY`               | yes (queries just stop using it) |
| `DROP TABLE` (if neither the current nor the previous binary reads it) | with caution |
### NOT allowed in a single deploy
| Change | Why |
| --------------------------------------- | -------------------------------------------- |
| `DROP COLUMN` | rollback's binary still selects it |
| `ALTER COLUMN ... NOT NULL` (no prior backfill) | rollback inserts NULL |
| `ALTER COLUMN ... TYPE` | rollback's binary expects old type |
| `RENAME COLUMN` | rollback's binary still references old name |
| `RENAME TABLE` | rollback queries old name |
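For instance, the expand-contract version of a column rename starts with a purely additive step; a sketch, assuming a hypothetical `users.username` → `users.handle` rename (dropping `username` waits for the contract deploy):
```sql
-- Step 1 (expand): add the new column and copy values; keep the old column.
ALTER TABLE users ADD COLUMN IF NOT EXISTS handle TEXT;
UPDATE users SET handle = username WHERE handle IS NULL;
-- App code dual-writes username + handle until the contract deploy drops username.
```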
### Reviewer checklist (PRs touching `veza-backend-api/migrations/`)
- [ ] Migration is **forward-only** (GORM doesn't run rollback SQL).
- [ ] Migration is **idempotent** (re-running on an already-migrated
      DB is a no-op — `IF NOT EXISTS`, `ON CONFLICT DO NOTHING`, etc.;
      see the sketch below this checklist).
- [ ] No `DROP COLUMN`, `ALTER ... NOT NULL`, `RENAME` (or, if there
is, the PR description references the prior backfill PRs and
explains why this is the contract step).
- [ ] If the migration takes a heavy lock (e.g. an `ALTER TABLE` that
      rewrites the table), split it into smaller steps and build indexes
      with `CREATE INDEX CONCURRENTLY`.
- [ ] App code changes assume both old and new schema are valid.
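A minimal example of what "idempotent" means here (illustrative tables and columns; the guard patterns are the point, and the `feature_flags` table with a unique `name` column is an assumption):
```sql
-- Safe to re-run against a DB that already has the column / the row:
ALTER TABLE users ADD COLUMN IF NOT EXISTS email_verified BOOLEAN;
INSERT INTO feature_flags (name, enabled)
VALUES ('email_verification', false)
ON CONFLICT (name) DO NOTHING;   -- requires a unique constraint on name
```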
### When you must violate the rule (incident)
Sometimes a hot incident demands a destructive change ASAP and losing
the rollback path is an acceptable risk. In that case:
1. Tag the PR with `migration:destructive`.
2. Document in the PR body what the rollback procedure is (manual
SQL to recreate the dropped column, etc.).
3. Get a second pair of eyes on the migration before merge.
4. Block the corresponding rollback workflow for that env until
you've verified the new schema is sticking.
### Future hardening (not in v1.0.x)
A `squawk` linter step in `.forgejo/workflows/ci.yml` could scan
`veza-backend-api/migrations/*.sql` and fail on `DROP COLUMN`,
`ALTER ... NOT NULL`, `RENAME`. The discipline above is the v1.0
answer; tooling lands when the hand-rolled discipline starts
missing things.
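Until that lands, a hand-rolled guard is cheap to run locally or from a CI step; a sketch only (plain grep, no SQL parsing, so expect the occasional false positive on a legitimate contract-step PR):
```bash
#!/usr/bin/env bash
# Fail when a migration contains a pattern that breaks one-deploy rollback.
set -euo pipefail
forbidden='DROP COLUMN|SET NOT NULL|RENAME COLUMN|RENAME TO|ALTER COLUMN .* TYPE'
if grep -nriE "$forbidden" veza-backend-api/migrations/*.sql; then
  echo "Destructive migration pattern found; see docs/MIGRATIONS.md (expand-contract)." >&2
  exit 1
fi
```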

docs/RUNBOOK_ROLLBACK.md (new file)

@@ -0,0 +1,253 @@
# Runbook — rollback a Veza deploy
Three rollback paths, ordered from fastest to slowest. Pick based on
what's still alive and what you're rolling back from.
| Path | Time | Use when |
| --------------------- | ---- | -------------------------------------------------- |
| 1. HAProxy fast-flip | ~5s | The previous color's containers are still alive. |
| 2. Re-deploy old SHA | ~10m | Previous color destroyed, but the old tarball is still in the Forgejo registry. |
| 3. Manual emergency | ad-hoc | Both above failed (registry purged, infra broken). |
> **Before you roll back, breathe and read this first.** The default
> instinct under fire is "smash the rollback button". Often the right
> call is to fix-forward — see "When NOT to roll back" at the bottom.
---
## Decision flowchart
```
        Did the new color come up at all?
            ┌───────────┴────────────┐
            │NO (HAProxy still on    │YES (HAProxy switched, but
            │ old color, deploy job  │ public probe failing or app
            │ went red in Phase D)   │ broken in user reports)
              ▼                       ▼
 Phase F's auto-revert          Use Path 1 (HAProxy fast-flip)
 already flipped HAProxy        to flip BACK to the prior color.
 for you. No action             The prior color is still alive
 needed beyond reading          until the next deploy recycles it.
 logs.
                                If the prior color was already
                                cleaned up, use Path 2.
```
---
## Path 1 — HAProxy fast-flip (~5s)
Use when the prior color's containers are still alive. Triggered via
the `Veza rollback` workflow with `mode=fast`.
### Pre-checks
```bash
# What's the current active color?
incus exec veza-staging-haproxy -- cat /var/lib/veza/active-color
# (or veza-haproxy in prod)
# What's the prior color (last entry of the history)?
incus exec veza-staging-haproxy -- head -2 /var/lib/veza/active-color.history
# Are the prior color's containers RUNNING?
incus list 'veza-staging-{backend,stream,web}-blue' --format csv -c n,s
```
### Trigger
In the Forgejo UI: **Actions → Veza rollback → Run workflow**:
| input | value |
| ------------- | -------------------------- |
| env | staging (or prod) |
| mode | fast |
| target_color | (the PRIOR color, e.g. blue if green is currently active) |
| release_sha | (leave empty) |
The workflow runs `infra/ansible/playbooks/rollback.yml -e mode=fast
-e target_color=blue` which:
1. Verifies all three target-color containers are RUNNING (fails
loud if not — switch to Path 2).
2. Re-templates `haproxy.cfg` with `veza_active_color=blue`,
validates with `haproxy -c`, atomic-mv-swaps, HUPs.
3. Updates `/var/lib/veza/active-color`.
Wall time: ~5s. Zero connection drop (HAProxy reload is graceful).
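If the workflow runner itself is down and you have to do the flip by hand, the rough manual equivalent of steps 2-3 looks like this (a sketch only: the `.new` config path and the systemd unit name are assumptions; check `infra/ansible/roles/veza_haproxy_switch/` for the real paths):
```bash
# Rough manual equivalent, inside the HAProxy container, after hand-editing
# a copy of the config to point at the prior color's backends:
incus exec veza-staging-haproxy -- sh -c '
  set -e
  haproxy -c -f /etc/haproxy/haproxy.cfg.new                  # validate before swapping
  mv /etc/haproxy/haproxy.cfg.new /etc/haproxy/haproxy.cfg    # atomic swap
  systemctl reload haproxy                                    # graceful reload (the "HUP")
  echo blue > /var/lib/veza/active-color                      # record the active color
'
```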
### Post-rollback
- Verify externally: `curl https://staging.veza.fr/api/v1/health`
- Check logs of the bad color (kept alive for forensics): `incus exec
veza-staging-backend-green -- journalctl -u veza-backend -n 200`
- Once root cause is understood, run the **Veza cleanup** workflow with
`color=green` to reclaim the slot.
---
## Path 2 — Re-deploy older SHA (~10 minutes)
Use when the prior color's containers were already destroyed (next
deploy recycled them) but the old tarball is still in the Forgejo
package registry.
### Pre-checks
```bash
# Pick the SHA you want to roll back TO.
# Look at the active-color.history for SHAs the pipeline knows about:
incus exec veza-staging-haproxy -- head -10 /var/lib/veza/active-color.history
# Or `git log --oneline main` for any commit; just confirm the
# tarball still exists in the registry (default retention 30 SHAs
# per component):
curl -fsSL -I -H "Authorization: token $TOKEN" \
"https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"
```
### Trigger
In the Forgejo UI: **Actions → Veza rollback → Run workflow**:
| input | value |
| ------------ | ------------------------------- |
| env | staging (or prod) |
| mode | full |
| target_color | (leave empty) |
| release_sha | the 40-char SHA you're rolling TO |
The workflow runs `playbooks/rollback.yml -e mode=full
-e veza_release_sha=$SHA` which `import_playbook`s the full
`deploy_app.yml` pipeline. Same Phase A → Phase F sequence as a
normal deploy, but with the older SHA.
Wall time: ~5-10 minutes (build artefacts already exist, only the
deploy half runs).
### Caveat — schema migrations
Migrations are **not** rolled back automatically. The schema after
`Path 2` is the post-deploy schema, not the pre-deploy schema.
Per **MIGRATIONS.md**'s expand-contract discipline, this should be
fine for one deploy back. If it isn't (i.e., the failed deploy
included a destructive migration), see **Path 3**.
---
## Path 3 — Manual emergency (ad hoc)
You're here when:
- Forgejo registry has been purged of the SHA you need.
- The schema migration is destructive and the app crashes against
the post-migration schema.
- The Incus host itself is in a bad state.
### Tarball missing — rebuild and push manually
```bash
# Build the artefacts locally (you'll need the toolchain):
cd veza-backend-api
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" \
  -o ./bin/veza-api ./cmd/api/main.go
# (migrate_tool ships in the same tarball — build it the same way first)
tar --use-compress-program=zstd -cf "/tmp/veza-backend-$SHA.tar.zst" \
  -C ./bin veza-api migrate_tool
# Push to the registry:
curl -fsSL --fail-with-body -X PUT \
-H "Authorization: token $TOKEN" \
--upload-file "/tmp/veza-backend-$SHA.tar.zst" \
"https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"
# Then run Path 2.
```
### Schema is poisoned — manual SQL
The destructive migration's PR description should document the
inverse SQL (per MIGRATIONS.md "When you must violate the rule").
Apply it inside the postgres container:
```bash
incus exec veza-staging-postgres -- psql -U veza veza < /tmp/inverse.sql
```
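For illustration only, if the destructive step was a `DROP COLUMN`, the inverse in `/tmp/inverse.sql` might look like the following (hypothetical column; the dropped data itself is gone, so the best you can do is restore the shape):
```sql
-- Hypothetical inverse for a dropped column; the real one lives in the PR body.
ALTER TABLE users ADD COLUMN IF NOT EXISTS email_verified BOOLEAN DEFAULT false;
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
```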
Then run Path 2 to deploy the older binary.
### Incus host broken — rollback ZFS snapshot
`deploy_data.yml` snapshots every data container's dataset before
mutating anything (`<dataset>@pre-deploy-<sha>`). To restore:
```bash
# First, stop the container:
incus stop veza-staging-postgres
# Roll the dataset back to the pre-deploy snapshot
# (-r also destroys any snapshots newer than it):
zfs rollback -r rpool/incus/containers/veza-staging-postgres@pre-deploy-<sha>
# Restart the container:
incus start veza-staging-postgres
```
This loses any data written after the snapshot. Last-resort only.
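To confirm the snapshot actually exists (and when it was taken) before rolling back, list the dataset's snapshots first; same dataset path as above:
```bash
# Pre-deploy snapshots for the postgres dataset, oldest to newest.
zfs list -t snapshot -o name,creation -s creation \
  rpool/incus/containers/veza-staging-postgres
```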
---
## When NOT to roll back
- **Single user reports a bug**. Triage first; rolling back affects
100% of users to fix something hitting <1%.
- **Performance regression**. If the new SHA is up but slow, scale
horizontally before rolling back. (Future Hetzner offload covers
this; for now, accept the regression and prep a fix-forward.)
- **Cosmetic UI bug**. Hot-fix the frontend and let the deploy
pipeline ship it as a normal commit.
- **You're not on-call and didn't get a page**. Don't roll back "to
be safe". It's the on-call's call.
The rollback button's existence isn't a license to use it
preemptively. Each rollback resets the team's confidence in the
pipeline; rolling back too often makes the next real deploy feel risky.
---
## Post-incident
After ANY rollback (Path 1, 2, or 3):
1. Update **docs/POSTMORTEMS.md** (or `docs/runbooks/incidents/<date>.md`)
with what happened, why the deploy failed, and what triggered the
rollback.
2. File the fix as a normal PR; do NOT skip CI.
3. If the failed deploy left containers behind (Path 1 keeps the failed
color alive for forensics), run the **Veza cleanup** workflow on that
color once forensics are done.
4. Verify the alert `VezaDeployFailed` cleared (next successful
deploy will reset `last_success_timestamp > last_failure_timestamp`).
---
## Workflows referenced
- `.forgejo/workflows/deploy.yml` — push:main → staging, tag → prod.
- `.forgejo/workflows/rollback.yml` — workflow_dispatch only, modes
fast and full.
- `.forgejo/workflows/cleanup-failed.yml` — workflow_dispatch only,
destroys a specific color's app containers.
## Playbooks referenced
- `infra/ansible/playbooks/deploy_app.yml`
- `infra/ansible/playbooks/rollback.yml`
- `infra/ansible/playbooks/cleanup_failed.yml`
- `infra/ansible/playbooks/deploy_data.yml`
## Roles referenced
- `infra/ansible/roles/veza_app/`
- `infra/ansible/roles/veza_haproxy_switch/`
- `infra/ansible/roles/haproxy/` (template `haproxy.cfg.j2` with
blue/green topology toggle).