veza/docs/MIGRATIONS.md
senke 22d09dcbbb docs: MIGRATIONS expand-contract section + RUNBOOK_ROLLBACK
Two operator docs the W5+ deploy pipeline depends on for safe
operation.

docs/MIGRATIONS.md (extended) :
  Existing file already covered migration tooling + naming. Append
  a "Expand-contract discipline (W5+ deploy pipeline contract)"
  section : explains why blue/green rollback breaks if migrations
  are forward-only, walks through the 3-deploy expand-backfill-
  contract pattern with a worked example (add nullable column →
  backfill → set NOT NULL), tables of allowed vs not-allowed
  changes for a single deploy, reviewer checklist, and an "in case
  of incident" override path with audit trail.

docs/RUNBOOK_ROLLBACK.md (new) :
  Three rollback paths from fastest to slowest :
   1. HAProxy fast-flip (~5s) — when prior color is still alive,
      use the rollback.yml workflow with mode=fast. Pre-checks +
      post-rollback steps.
   2. Re-deploy older SHA (~10m) — when prior color is gone but
      tarball is still in the Forgejo registry. mode=full.
      Schema-migration caveat documented.
   3. Manual emergency — tarball missing (rebuild + push), schema
      poisoned (manual SQL), Incus host broken (ZFS rollback).

Plus a decision flowchart, "When NOT to rollback" with examples
that bias toward fix-forward over rollback (single-user bugs,
perf regressions, cosmetic issues), and a post-incident checklist.

Cross-referenced with the workflow + playbook + role file paths
the operator will actually need to look up.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:48:46 +02:00


# Database Migrations
## Overview
Veza uses SQL migrations stored in `veza-backend-api/migrations/`. Migrations are applied in order by filename (lexicographic sort).
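Because the order is a plain string sort, the zero-padded prefix matters. A small illustration with made-up filenames (not real Veza migrations); `LC_ALL=C` is used here only to pin byte-wise ordering for the demo, and whether the real runner sorts the same way is an assumption:

```bash
# Unpadded prefixes sort as strings, not as numbers (hypothetical filenames).
printf '%s\n' 9_a.sql 10_b.sql 100_c.sql | LC_ALL=C sort
# 100_c.sql
# 10_b.sql
# 9_a.sql

# Zero-padded prefixes sort in the intended numeric order.
printf '%s\n' 100_c.sql 009_a.sql 010_b.sql | LC_ALL=C sort
# 009_a.sql
# 010_b.sql
# 100_c.sql
```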
## Migration Naming
- Format: `NNN_description.sql` (e.g. `101_product_reviews.sql`)
- Use snake_case for descriptions
- Down migrations (rollback): `NNN_description_down.sql` when needed
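
The naming rules above could be checked mechanically. The following is a minimal sketch; `is_valid_migration_name` is a hypothetical helper, not something that exists in the repo:

```bash
# Hypothetical helper: succeeds when the file name follows NNN_description.sql
# (three digits, then a snake_case description).
# A down migration such as 101_product_reviews_down.sql also passes,
# because "_down" is itself valid snake_case.
is_valid_migration_name() {
  printf '%s\n' "${1##*/}" | grep -Eq '^[0-9]{3}_[a-z0-9]+(_[a-z0-9]+)*\.sql$'
}

is_valid_migration_name 101_product_reviews.sql && echo "ok"
is_valid_migration_name 101-ProductReviews.sql || echo "rejected"
```

A generated baseline such as `baseline_v0601.sql` fails this check on purpose, so a caller iterating over the whole directory would need to skip it.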
## Squash Script
The `scripts/squash_migrations.sh` script concatenates all migrations into a single baseline SQL file. This is useful for:
- Fresh database setup
- Creating a clean baseline for new environments
- Versioned releases (e.g. `baseline_v0601.sql`)
### Usage
```bash
# From project root
./scripts/squash_migrations.sh
```
Output: `veza-backend-api/migrations/baseline_v0601.sql`
### Procedure
1. Run the script after adding new migrations
2. Update the version in the script (e.g. `baseline_v0601.sql`) for each release
3. Update the migration range comment (e.g. `001-113`) to reflect the latest migration number
4. The baseline file is auto-generated; do not edit it manually
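
To make "concatenates all migrations" concrete, here is an illustrative sketch of what such a squash could look like. It is not the contents of `scripts/squash_migrations.sh`; the function name, the `-- source:` marker, and the choice to skip `_down.sql` files are all assumptions:

```bash
# Illustrative only: NOT the real scripts/squash_migrations.sh.
# $1 = migrations directory, $2 = output file.
squash_sketch() {
  out=$2
  : > "$out"                                   # start from an empty file
  for f in "$1"/[0-9][0-9][0-9]_*.sql; do      # glob expands in sorted order
    [ -e "$f" ] || continue                    # glob matched nothing
    case "$f" in *_down.sql) continue ;; esac  # assumption: skip down migrations
    printf '%s\n' "-- source: ${f##*/}" >> "$out"
    cat "$f" >> "$out"
    printf '\n' >> "$out"
  done
}
```

Because the glob only matches names starting with three digits, an existing `baseline_*.sql` in the same directory is not folded back into the output.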
## Recent Migrations
| # | File | Description |
|---|------|-------------|
| 116 | `116_seller_transfers_retry.sql` | v0.701: Add `retry_count`, `next_retry_at` to `seller_transfers`; index for failed retries |
## Adding New Migrations
1. Create a new file: `veza-backend-api/migrations/NNN_description.sql`
2. Use the next available number (check existing migrations)
3. Write idempotent SQL when possible (e.g. `IF NOT EXISTS`)
4. Test locally before committing
5. Run `squash_migrations.sh` to update the baseline for the release
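
Steps 1 to 3 can be scripted. The helpers below are a hypothetical sketch (neither function exists in the repo, and `example_table` / `example_column` are placeholders): one finds the next free number, the other creates the file with an idempotent placeholder statement.

```bash
# Hypothetical helpers, not part of the repo.

# Print the next free three-digit migration number in directory $1.
next_migration_number() {
  max=0
  for f in "$1"/[0-9][0-9][0-9]_*.sql; do
    [ -e "$f" ] || continue                   # glob matched nothing
    n=${f##*/}                                # strip the directory
    n=${n%%_*}                                # keep the NNN prefix
    n=$(printf '%s' "$n" | sed 's/^0*//')     # drop leading zeros (avoid octal)
    if [ "${n:-0}" -gt "$max" ]; then max=${n:-0}; fi
  done
  printf '%03d\n' "$((max + 1))"
}

# Create NNN_<description>.sql in $1 and print its path.
new_migration() {
  path="$1/$(next_migration_number "$1")_$2.sql"
  cat > "$path" <<'SQL'
-- Placeholder: replace with the real change. IF NOT EXISTS keeps re-runs a no-op.
ALTER TABLE example_table ADD COLUMN IF NOT EXISTS example_column TEXT;
SQL
  printf '%s\n' "$path"
}
```

For example, if the highest existing file were `116_seller_transfers_retry.sql`, `new_migration veza-backend-api/migrations add_user_email_verified` would create and print `veza-backend-api/migrations/117_add_user_email_verified.sql`.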
## Expand-contract discipline (W5+ deploy pipeline contract)
> **TL;DR** — every migration must be **backward-compatible** with the
> previous deploy's binary. No `DROP COLUMN`, no `ALTER ... SET NOT NULL`,
> no `RENAME` in a single deploy. Schema evolution happens across **multiple
> deploys**, not in one.
### Why this matters
The blue/green deploy pipeline (`infra/ansible/playbooks/deploy_app.yml`)
makes rollback trivial at the **app layer**: HAProxy flips back to
the previous color, ~5 seconds wall-clock, no data lost. But the
**database** doesn't have colors. Migrations apply once, against the
shared postgres container, and stay applied across the rollback.
If a deploy adds a non-nullable column without a default and the rolled-back
binary then inserts a row that omits that column, the insert fails. The
rollback button is broken: the previous binary now crashes against the
post-migration schema.
The fix isn't to make the pipeline smarter. It's to make migrations
forward-AND-backward compatible by construction.
### The expand-contract pattern (3 deploys per "destructive" change)
**Step 1 (deploy N) — Expand**: add the new shape **alongside** the
old. Both binaries (old + new) work.
```sql
-- migration NNN_add_user_email_verified.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- Nullable, no default: the old binary does not know about the column,
-- so its inserts simply leave it NULL.
-- The new binary writes true/false on signup and treats NULL as false on read.
```
**Step 2 (deploy N+1) — Backfill**: once Step 1 is stable in prod
(≥ 1 week, no rollbacks needed), backfill existing rows.
```sql
-- migration NNN+1_backfill_user_email_verified.sql
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
```
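
The single `UPDATE` above touches every remaining row in one statement. On a large table an operator might prefer to backfill in batches instead. The sketch below is one way to do that and is not something the source prescribes; it assumes `psql` is installed, that a `DATABASE_URL` variable points at the target database, and that `users` has a primary key column named `id`. The function name and `BACKFILL_PAUSE` are hypothetical.

```bash
# Sketch only. Assumptions: psql is installed, DATABASE_URL points at the
# target database, and users has a primary key column named id.
backfill_email_verified() {
  batch_size=${1:-5000}
  while :; do
    # One batch per statement; the final SELECT reports how many rows changed.
    updated=$(psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -qAt -c "
      WITH batch AS (
        SELECT id FROM users WHERE email_verified IS NULL LIMIT $batch_size
      ), updated_rows AS (
        UPDATE users u SET email_verified = false
        FROM batch WHERE u.id = batch.id
        RETURNING 1
      )
      SELECT count(*) FROM updated_rows;") || return 1
    [ "$updated" -gt 0 ] || break             # nothing left to backfill
    echo "backfilled $updated rows"
    sleep "${BACKFILL_PAUSE:-1}"              # pause between batches
  done
}
```

Like the single-statement version, this is idempotent: re-running it after completion updates nothing, because only rows where `email_verified IS NULL` are selected.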
**Step 3 (deploy N+2) — Contract**: once the backfill is in, add the
constraint. The previous binary already writes true/false on signup and
coalesces NULL to false on read, so it keeps working; the new binary can
rely on the column being `NOT NULL`.
```sql
-- migration NNN+2_user_email_verified_not_null.sql
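-- Set the default before the constraint, so an insert that omits the column
-- is covered as soon as NOT NULL applies.
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;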
```
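
Before shipping the contract step, it may be worth confirming that the backfill really left no NULL rows, since `SET NOT NULL` fails if any remain. A minimal pre-check sketch, assuming `psql` is installed and a `DATABASE_URL` variable points at the target database (both assumptions, as is the function name):

```bash
# Sketch only; assert_backfill_complete is a hypothetical helper.
assert_backfill_complete() {
  remaining=$(psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -qAt -c \
    "SELECT count(*) FROM users WHERE email_verified IS NULL;") || return 2
  if [ "$remaining" -ne 0 ]; then
    echo "refusing to contract: $remaining rows still NULL" >&2
    return 1
  fi
  echo "no NULL rows left"
}
```

A non-zero exit status here means the contract migration should wait until the backfill has been re-run.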
After Step 3 is stable, you can roll back exactly **one** deploy (to the
Step 2 binary) without breakage. Rolling back further, to a binary that
predates Step 1, is no longer safe; that is the expected consequence of
expand-contract.
### Allowed in a single deploy
| Change | Safe in one deploy? |
| --------------------------------------- | ----------------------- |
| `CREATE TABLE` | yes |
| `CREATE INDEX CONCURRENTLY` | yes |
| Add nullable column | yes |
| Add column with constant default | yes (PG ≥ 11) |
| Backfill UPDATE (idempotent) | yes |
| `DROP INDEX CONCURRENTLY` | yes (queries still run, possibly slower) |
| `DROP TABLE` (if no recent code reads it) | with caution |
### NOT allowed in a single deploy
| Change | Why |
| --------------------------------------- | -------------------------------------------- |
| `DROP COLUMN` | rollback's binary still selects it |
| `ALTER COLUMN ... SET NOT NULL` (no prior backfill) | rollback's binary may insert NULL |
| `ALTER COLUMN ... TYPE` | rollback's binary expects old type |
| `RENAME COLUMN` | rollback's binary still references old name |
| `RENAME TABLE` | rollback queries old name |
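
The statements in the table above are easy to spot with a plain text search. The function below is a crude, hand-rolled sketch for local use, not a real SQL linter; `lint_migrations` is a hypothetical name and the pattern is an approximation that also matches inside SQL comments:

```bash
# Crude text search (hypothetical helper, not in the repo).
# Prints file:line:text for each risky statement and returns 1 if any is found.
lint_migrations() {
  [ "$#" -gt 0 ] || return 0                  # no files given, nothing to check
  pattern='DROP[[:space:]]+COLUMN|SET[[:space:]]+NOT[[:space:]]+NULL|RENAME[[:space:]]|ALTER[[:space:]]+COLUMN[[:space:]]+[^[:space:]]+[[:space:]]+(SET[[:space:]]+DATA[[:space:]]+)?TYPE'
  if grep -nHiE "$pattern" "$@"; then
    return 1
  fi
  return 0
}
```

Usage would look like `lint_migrations veza-backend-api/migrations/*.sql`. A legitimate contract-step migration trips this check by design, so a hit is a prompt to confirm the prior expand and backfill steps, not an automatic rejection.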
### Reviewer checklist (PRs touching `veza-backend-api/migrations/`)
- [ ] Migration is **forward-only** (GORM doesn't run rollback SQL).
- [ ] Migration is **idempotent** (re-running on an already-migrated
DB is a no-op — `IF NOT EXISTS`, `ON CONFLICT DO NOTHING`, etc.).
- [ ] No `DROP COLUMN`, `ALTER ... SET NOT NULL`, `RENAME` (or, if there
is one, the PR description references the prior backfill PRs and
explains why this is the contract step).
- [ ] If the migration takes a heavy lock (e.g. an `ALTER TABLE` that
rewrites the table, or a plain `CREATE INDEX`), use the concurrent variant
where one exists (`CREATE INDEX CONCURRENTLY`) or split the change.
- [ ] App code changes assume both old and new schema are valid.
### When you must violate the rule (incident)
Sometimes a hot incident demands a destructive change ASAP and losing the
rollback path is an acceptable risk. In that case:
1. Tag the PR with `migration:destructive`.
2. Document in the PR body what the rollback procedure is (manual
SQL to recreate the dropped column, etc.).
3. Get a second pair of eyes on the migration before merge.
4. Block the corresponding rollback workflow for that env until
you've verified the new schema is sticking.
### Future hardening (not in v1.0.x)
A `squawk` linter step in `.forgejo/workflows/ci.yml` could scan
`veza-backend-api/migrations/*.sql` and fail on `DROP COLUMN`,
`ALTER ... SET NOT NULL`, `RENAME`. The discipline above is the v1.0
answer; tooling lands when the hand-rolled discipline starts
missing things.