docs: MIGRATIONS expand-contract section + RUNBOOK_ROLLBACK
Two operator docs the W5+ deploy pipeline depends on for safe
operation.
docs/MIGRATIONS.md (extended):
Existing file already covered migration tooling + naming. Append
an "Expand-contract discipline (W5+ deploy pipeline contract)"
section: explains why blue/green rollback breaks if migrations
are forward-only, walks through the 3-deploy expand-backfill-
contract pattern with a worked example (add nullable column →
backfill → set NOT NULL), tables of allowed vs not-allowed
changes for a single deploy, a reviewer checklist, and an "in case
of incident" override path with audit trail.
docs/RUNBOOK_ROLLBACK.md (new):
Three rollback paths, from fastest to slowest:
1. HAProxy fast-flip (~5s) — when the prior color is still alive,
use the rollback.yml workflow with mode=fast. Pre-checks +
post-rollback steps.
2. Re-deploy older SHA (~10m) — when the prior color is gone but
the tarball is still in the Forgejo registry. mode=full.
Schema-migration caveat documented.
3. Manual emergency — tarball missing (rebuild + push), schema
poisoned (manual SQL), Incus host broken (ZFS rollback).
Plus a decision flowchart, "When NOT to rollback" with examples
that bias toward fix-forward over rollback (single-user bugs,
perf regressions, cosmetic issues), and a post-incident checklist.
Cross-referenced with the workflow + playbook + role file paths
the operator will actually need to look up.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent f4eb4732dd, commit 22d09dcbbb
2 changed files with 364 additions and 0 deletions
docs/MIGRATIONS.md
@@ -47,3 +47,114 @@ Output: `veza-backend-api/migrations/baseline_v0601.sql`
3. Write idempotent SQL when possible (e.g. `IF NOT EXISTS`)
4. Test locally before committing
5. Run `squash_migrations.sh` to update the baseline for the release
## Expand-contract discipline (W5+ deploy pipeline contract)

> **TL;DR** — every migration must be **backward-compatible** with the
> previous deploy's binary. No `DROP COLUMN`, no `ALTER ... NOT NULL`,
> no `RENAME` in step 1. Schema evolution happens across **multiple
> deploys**, not in one.

### Why this matters

The blue/green deploy pipeline (`infra/ansible/playbooks/deploy_app.yml`)
makes rollback trivial at the **app layer**: HAProxy flips back to
the previous color, ~5 seconds wall-clock, no data lost. But the
**database** doesn't have colors. Migrations apply once, against the
shared postgres container, and stay applied across the rollback.

If a deploy adds a non-nullable column and the rollback tries to insert
a row without that column, the insert fails. The rollback button is
broken — the previous binary now crashes against the post-migration
schema.

The fix isn't to make the pipeline smarter. It's to make migrations
forward-AND-backward compatible by construction.

### The expand-contract pattern (3 deploys per "destructive" change)

**Step 1 (deploy N) — Expand**: add the new shape **alongside** the
old. Both binaries (old + new) work.

```sql
-- migration NNN_add_user_email_verified.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- nullable, no default — the old binary doesn't know about it.
-- the new binary writes true/false on signup; reads coalesce NULL → false.
```
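
During Step 1 the new binary has to read both shapes. A coalescing
query is enough (an illustrative sketch; the actual query lives in the
app code, not in the migration, and the `email` column is assumed):

```sql
-- NULL means "row predates the feature"; the app treats it as false.
SELECT id, email, COALESCE(email_verified, false) AS email_verified
FROM users
WHERE id = $1;
```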

**Step 2 (deploy N+1) — Backfill**: once Step 1 is stable in prod
(≥ 1 week, no rollbacks needed), backfill existing rows.

```sql
-- migration NNN+1_backfill_user_email_verified.sql
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
```
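
On a large table, that single UPDATE holds its row locks and writes its
WAL in one burst. A batched variant keeps each pass short (a sketch,
assuming an operator or the runner re-runs it until it reports
`UPDATE 0`; the batch size is arbitrary):

```sql
-- Each pass touches at most 10 000 rows; repeat until UPDATE 0.
UPDATE users SET email_verified = false
WHERE id IN (
    SELECT id FROM users
    WHERE email_verified IS NULL
    LIMIT 10000
);
```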

**Step 3 (deploy N+2) — Contract**: once the backfill is in, add the
constraint. The old binary (still coalescing NULL → false on reads)
keeps working; the new binary can rely on `NOT NULL`.

```sql
-- migration NNN+2_user_email_verified_not_null.sql
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;
```
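
`SET NOT NULL` normally scans the whole table under an `ACCESS
EXCLUSIVE` lock. On PostgreSQL 12+ a pre-validated CHECK constraint
lets that scan happen without blocking writes (a sketch; the
constraint name is illustrative):

```sql
-- VALIDATE scans with only a light lock; SET NOT NULL then reuses the
-- proven constraint instead of re-scanning, and the CHECK can go.
ALTER TABLE users ADD CONSTRAINT users_email_verified_not_null
    CHECK (email_verified IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_email_verified_not_null;
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_verified_not_null;
```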

After Step 3 is stable, you can rollback exactly **one** deploy without
breakage. Rolling back beyond Step 1 is no longer safe — that's the
expected consequence of expand-contract.

### Allowed in a single deploy

| Change                                     | Safe in one deploy?   |
| ------------------------------------------ | --------------------- |
| `CREATE TABLE`                             | yes                   |
| `CREATE INDEX CONCURRENTLY`                | yes                   |
| Add nullable column                        | yes                   |
| Add column with constant default           | yes (PG ≥ 11)         |
| Backfill UPDATE (idempotent)               | yes                   |
| `DROP INDEX CONCURRENTLY`                  | yes (read paths flex) |
| `DROP TABLE` (if no recent code reads it)  | with caution          |
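
For instance, a migration composed only of "yes" rows is rollback-safe
in one deploy, because the previous binary never touches the new
objects (a sketch with illustrative names):

```sql
-- New table + index: invisible to the old binary, so rollback is safe.
CREATE TABLE IF NOT EXISTS audit_events (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id    BIGINT,
    kind       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- CONCURRENTLY can't run inside a transaction; put it in its own
-- migration file if the runner wraps each file in one.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_audit_events_user_id
    ON audit_events (user_id);
```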

### NOT allowed in a single deploy

| Change                                          | Why                                          |
| ----------------------------------------------- | -------------------------------------------- |
| `DROP COLUMN`                                   | rollback's binary still selects it           |
| `ALTER COLUMN ... NOT NULL` (no prior backfill) | rollback inserts NULL                        |
| `ALTER COLUMN ... TYPE`                         | rollback's binary expects old type           |
| `RENAME COLUMN`                                 | rollback's binary still references old name  |
| `RENAME TABLE`                                  | rollback queries old name                    |
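
Every row in this table decomposes into the expand-contract steps
above. `RENAME COLUMN`, for example (illustrative column names, one
statement per deploy):

```sql
-- deploy N (expand): app dual-writes, reads COALESCE(full_name, display_name)
ALTER TABLE users ADD COLUMN full_name TEXT;
-- deploy N+1 (backfill):
UPDATE users SET full_name = display_name WHERE full_name IS NULL;
-- deploy N+2 (contract): only after no deployed binary reads display_name
ALTER TABLE users DROP COLUMN display_name;
```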

### Reviewer checklist (PRs touching `veza-backend-api/migrations/`)

- [ ] Migration is **forward-only** (GORM doesn't run rollback SQL).
- [ ] Migration is **idempotent** (re-running on an already-migrated
      DB is a no-op — `IF NOT EXISTS`, `ON CONFLICT DO NOTHING`, etc.;
      see the sketch after this list).
- [ ] No `DROP COLUMN`, `ALTER ... NOT NULL`, `RENAME` (or, if there
      is, the PR description references the prior backfill PRs and
      explains why this is the contract step).
- [ ] If the migration takes a heavy lock (e.g. a table-rewriting
      `ALTER TABLE`), use `CREATE INDEX CONCURRENTLY` or split it up.
- [ ] App code changes assume both old and new schema are valid.
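
The idempotency patterns the checklist expects, in one place
(illustrative; the `feature_flags` table is made up for the example):

```sql
-- Each statement no-ops when its effect already exists, so re-running
-- the migration against an already-migrated DB is harmless.
ALTER TABLE users ADD COLUMN IF NOT EXISTS email_verified BOOLEAN;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email_verified
    ON users (email_verified);
INSERT INTO feature_flags (name, enabled)
VALUES ('email_verification', false)
ON CONFLICT (name) DO NOTHING;
```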

### When you must violate the rule (incident)

Sometimes a hot incident demands a destructive change ASAP and rollback
is acceptable risk. In that case:

1. Tag the PR with `migration:destructive`.
2. Document in the PR body what the rollback procedure is (manual
   SQL to recreate the dropped column, etc.).
3. Get a second pair of eyes on the migration before merge.
4. Block the corresponding rollback workflow for that env until
   you've verified the new schema is sticking.
### Future hardening (not in v1.0.x)

A `squawk` linter step in `.forgejo/workflows/ci.yml` could scan
`veza-backend-api/migrations/*.sql` and fail on `DROP COLUMN`,
`ALTER ... NOT NULL`, `RENAME`. The discipline above is the v1.0
answer; tooling lands when the hand-rolled discipline starts
missing things.
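
Until then, even a crude grep gate would catch the worst offenders (a
sketch, not a committed workflow step; tune the pattern to taste):

```bash
# Fail the job if any migration contains an obviously destructive
# statement. Crude on purpose: a false positive costs a review comment,
# a missed DROP COLUMN costs the rollback button.
if grep -niE 'DROP +COLUMN|SET +NOT +NULL|RENAME +(COLUMN|TO)' \
    veza-backend-api/migrations/*.sql; then
  echo "destructive migration statement found; see docs/MIGRATIONS.md" >&2
  exit 1
fi
```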

docs/RUNBOOK_ROLLBACK.md (new file, 253 lines)
@@ -0,0 +1,253 @@
# Runbook — rollback a Veza deploy

Three rollback paths, ordered from fastest to slowest. Pick based on
what's still alive and what you're rolling back from.

| Path                 | Time   | Use when                                                                         |
| -------------------- | ------ | -------------------------------------------------------------------------------- |
| 1. HAProxy fast-flip | ~5s    | The previous color's containers are still alive.                                 |
| 2. Re-deploy old SHA | ~10m   | Previous color destroyed, but the old tarball is still in the Forgejo registry.  |
| 3. Manual emergency  | ad hoc | Both above failed (registry purged, infra broken).                               |

> **Before you rollback, breathe and read this first.** The default
> instinct under fire is "smash the rollback button". Often the right
> call is to fix-forward — see "When NOT to rollback" at the bottom.
---

## Decision flowchart

```
      Did the new color come up at all?
                    │
        ┌───────────┴────────────┐
        │ NO (HAProxy still on   │ YES (HAProxy switched, but
        │ old color, deploy job  │ public probe failing or app
        │ went red in Phase D)   │ broken in user reports)
        ▼                        ▼
Phase F's auto-revert      Use Path 1 (HAProxy fast-flip)
already flipped HAProxy    to flip BACK to the prior color.
for you. No action         The prior color is still alive
needed beyond reading      until the next deploy recycles it.
logs.
                           If the prior color was already
                           cleaned up, use Path 2.
```

---

## Path 1 — HAProxy fast-flip (~5s)

Use when the prior color's containers are still alive. Triggered via
the `Veza rollback` workflow with `mode=fast`.

### Pre-checks

```bash
# What's the current active color?
incus exec veza-staging-haproxy -- cat /var/lib/veza/active-color
# (or veza-haproxy in prod)

# What's the prior color (last entry of the history)?
incus exec veza-staging-haproxy -- head -2 /var/lib/veza/active-color.history

# Are the prior color's containers RUNNING?
incus list 'veza-staging-{backend,stream,web}-blue' --format csv -c n,s
```

### Trigger

In the Forgejo UI: **Actions → Veza rollback → Run workflow**:

| input        | value                                                     |
| ------------ | --------------------------------------------------------- |
| env          | staging (or prod)                                         |
| mode         | fast                                                      |
| target_color | (the PRIOR color, e.g. blue if green is currently active) |
| release_sha  | (leave empty)                                             |

The workflow runs `infra/ansible/playbooks/rollback.yml -e mode=fast
-e target_color=blue`, which:

1. Verifies all three target-color containers are RUNNING (fails
   loud if not — switch to Path 2).
2. Re-templates `haproxy.cfg` with `veza_active_color=blue`,
   validates with `haproxy -c`, atomic-mv-swaps, HUPs.
3. Updates `/var/lib/veza/active-color`.

Wall time: ~5s. Zero connection drop (HAProxy reload is graceful).
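
Step 2 condensed, for the curious (a sketch of the idea only; the real
tasks live in the `veza_haproxy_switch` role):

```bash
# Validate the candidate config before it can take effect, then swap
# atomically and reload; HAProxy finishes in-flight connections.
haproxy -c -f /etc/haproxy/haproxy.cfg.new \
  && mv /etc/haproxy/haproxy.cfg.new /etc/haproxy/haproxy.cfg \
  && systemctl reload haproxy
```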

### Post-rollback

- Verify externally: `curl https://staging.veza.fr/api/v1/health`
- Check logs of the bad color (kept alive for forensics): `incus exec
  veza-staging-backend-green -- journalctl -u veza-backend -n 200`
- Once root cause is understood, run the **Veza cleanup** workflow with
  `color=green` to reclaim the slot.
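
If the Forgejo UI itself is unreachable, the same flip can be run
straight from a workstation with Ansible access (a sketch; the
inventory path is an assumption, check `infra/ansible/` for the real
one):

```bash
# Same playbook the workflow wraps; mode=fast skips the re-deploy half.
ansible-playbook -i infra/ansible/inventories/staging \
  infra/ansible/playbooks/rollback.yml \
  -e mode=fast -e target_color=blue
```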

---

## Path 2 — Re-deploy older SHA (~10 minutes)

Use when the prior color's containers were already destroyed (next
deploy recycled them) but the old tarball is still in the Forgejo
package registry.

### Pre-checks

```bash
# Pick the SHA you want to roll back TO.
# Look at the active-color.history for SHAs the pipeline knows about:
incus exec veza-staging-haproxy -- head -10 /var/lib/veza/active-color.history

# Or `git log --oneline main` for any commit; just confirm the
# tarball still exists in the registry (default retention 30 SHAs
# per component):
curl -fsSL -I -H "Authorization: token $TOKEN" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"
```

### Trigger

In the Forgejo UI: **Actions → Veza rollback → Run workflow**:

| input        | value                             |
| ------------ | --------------------------------- |
| env          | staging (or prod)                 |
| mode         | full                              |
| target_color | (leave empty)                     |
| release_sha  | the 40-char SHA you're rolling TO |

The workflow runs `playbooks/rollback.yml -e mode=full
-e veza_release_sha=$SHA`, which `import_playbook`s the full
`deploy_app.yml` pipeline. Same Phase A → Phase F sequence as a
normal deploy, but with the older SHA.

Wall time: ~5–10 minutes (build artefacts already exist, only the
deploy half runs).

### Caveat — schema migrations

Migrations are **not** rolled back automatically. The schema after
Path 2 is the post-deploy schema, not the pre-deploy schema.
Per **MIGRATIONS.md**'s expand-contract discipline, this should be
fine for one deploy back. If it isn't (i.e., the failed deploy
included a destructive migration), see **Path 3**.
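
To see exactly which migrations the failed deploy applied, inspect the
tracking table (a sketch; the table name depends on the migration
tooling in `veza-backend-api`, so verify it before trusting it):

```bash
# Compare against the migrations/ directory at the SHA you roll back to.
incus exec veza-staging-postgres -- psql -U veza veza \
  -c "SELECT id FROM migrations ORDER BY id DESC LIMIT 5;"
```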

---

## Path 3 — Manual emergency (ad hoc)

You're here when:

- Forgejo registry has been purged of the SHA you need.
- The schema migration is destructive and the app crashes against
  the post-migration schema.
- The Incus host itself is in a bad state.
### Tarball missing — rebuild and push manually

```bash
# Build the artefact locally (you'll need the toolchain):
cd veza-backend-api
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" \
  -o ./bin/veza-api ./cmd/api/main.go
# The tarball also ships migrate_tool; build it into ./bin the same way
# first if it isn't there already (its cmd path is repo-specific).
tar --use-compress-program=zstd -cf "/tmp/veza-backend-$SHA.tar.zst" \
  -C ./bin veza-api migrate_tool

# Push to the registry (--fail-with-body replaces -f: the two flags are
# mutually exclusive, and this one keeps the error body for debugging):
curl -sSL --fail-with-body -X PUT \
  -H "Authorization: token $TOKEN" \
  --upload-file "/tmp/veza-backend-$SHA.tar.zst" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"

# Then run Path 2.
```

### Schema is poisoned — manual SQL

The destructive migration's PR description should document the
inverse SQL (per MIGRATIONS.md "When you must violate the rule").
Apply it inside the postgres container:

```bash
# ON_ERROR_STOP aborts at the first failing statement instead of
# ploughing on with a half-applied inverse.
incus exec veza-staging-postgres -- psql -v ON_ERROR_STOP=1 -U veza veza < /tmp/inverse.sql
```

Then run Path 2 to deploy the older binary.

### Incus host broken — rollback ZFS snapshot

`deploy_data.yml` snapshots every data container's dataset before
mutating anything (`<dataset>@pre-deploy-<sha>`). To restore:

```bash
# Confirm the snapshot actually exists before stopping anything:
zfs list -t snapshot -o name | grep pre-deploy

# First, stop the container:
incus stop veza-staging-postgres

# Roll the dataset back to the pre-deploy snapshot (-r also destroys
# any snapshots taken after it):
zfs rollback -r rpool/incus/containers/veza-staging-postgres@pre-deploy-<sha>

# Restart the container:
incus start veza-staging-postgres
```

This loses any data written after the snapshot. Last-resort only.
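
If postgres still starts, dump the current state first so the
post-snapshot writes aren't gone forever (assumes `pg_dump` is
available inside the container):

```bash
# Custom-format dump of everything written since the snapshot.
incus exec veza-staging-postgres -- pg_dump -U veza -Fc veza \
  > /tmp/pre-zfs-rollback.dump
```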

---

## When NOT to rollback

- **Single user reports a bug**. Triage first; rolling back affects
  100% of users to fix something hitting <1%.
- **Performance regression**. If the new SHA is up but slow, scale
  horizontally before rolling back. (Future Hetzner offload covers
  this; for now, accept the regression and prep a fix-forward.)
- **Cosmetic UI bug**. Hot-fix the frontend and let the deploy
  pipeline ship it as a normal commit.
- **You're not on-call and didn't get a page**. Don't rollback "to
  be safe". The on-call's call.

The rollback button's existence isn't a license to use it
preemptively. Each rollback resets the team's confidence in the
pipeline; over-rolling-back makes the next real deploy feel risky.

---

## Post-incident

After ANY rollback (Path 1, 2, or 3):

1. Update **docs/POSTMORTEMS.md** (or `docs/runbooks/incidents/<date>.md`)
   with what happened, why the deploy failed, and what triggered the
   rollback.
2. File the fix as a normal PR; do NOT skip CI.
3. If the failed deploy left containers behind (Path 1's "old color
   kept alive"), run the **Veza cleanup** workflow with the failed color
   once forensics are done.
4. Verify the alert `VezaDeployFailed` cleared (the next successful
   deploy resets `last_success_timestamp > last_failure_timestamp`).

---

## Workflows referenced

- `.forgejo/workflows/deploy.yml` — push:main → staging, tag → prod.
- `.forgejo/workflows/rollback.yml` — workflow_dispatch only, modes
  fast and full.
- `.forgejo/workflows/cleanup-failed.yml` — workflow_dispatch only,
  destroys a specific color's app containers.

## Playbooks referenced

- `infra/ansible/playbooks/deploy_app.yml`
- `infra/ansible/playbooks/rollback.yml`
- `infra/ansible/playbooks/cleanup_failed.yml`
- `infra/ansible/playbooks/deploy_data.yml`

## Roles referenced

- `infra/ansible/roles/veza_app/`
- `infra/ansible/roles/veza_haproxy_switch/`
- `infra/ansible/roles/haproxy/` (template `haproxy.cfg.j2` with
  blue/green topology toggle).