veza/docs/MIGRATIONS.md
senke 22d09dcbbb docs: MIGRATIONS expand-contract section + RUNBOOK_ROLLBACK
Two operator docs the W5+ deploy pipeline depends on for safe
operation.

docs/MIGRATIONS.md (extended) :
  Existing file already covered migration tooling + naming. Append
  a "Expand-contract discipline (W5+ deploy pipeline contract)"
  section : explains why blue/green rollback breaks if migrations
  are forward-only, walks through the 3-deploy expand-backfill-
  contract pattern with a worked example (add nullable column →
  backfill → set NOT NULL), tables of allowed vs not-allowed
  changes for a single deploy, reviewer checklist, and an "in case
  of incident" override path with audit trail.

docs/RUNBOOK_ROLLBACK.md (new) :
  Three rollback paths from fastest to slowest :
   1. HAProxy fast-flip (~5s) — when prior color is still alive,
      use the rollback.yml workflow with mode=fast. Pre-checks +
      post-rollback steps.
   2. Re-deploy older SHA (~10m) — when prior color is gone but
      tarball is still in the Forgejo registry. mode=full.
      Schema-migration caveat documented.
   3. Manual emergency — tarball missing (rebuild + push), schema
      poisoned (manual SQL), Incus host broken (ZFS rollback).

Plus a decision flowchart, "When NOT to rollback" with examples
that bias toward fix-forward over rollback (single-user bugs,
perf regressions, cosmetic issues), and a post-incident checklist.

Cross-referenced with the workflow + playbook + role file paths
the operator will actually need to look up.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:48:46 +02:00


Database Migrations

Overview

Veza uses SQL migrations stored in veza-backend-api/migrations/. Migrations are applied in order by filename (lexicographic sort).
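Lexicographic ordering only matches numeric ordering while every prefix has the same width, which is why the NNN prefix is zero-padded. The sketch below illustrates this with hypothetical filenames; it assumes the runner sorts names as plain strings, which is what "lexicographic sort" usually means but is not confirmed here.

```python
# Hypothetical filenames, used only to illustrate string ordering.
unpadded = ["9_add_tags.sql", "10_add_likes.sql", "100_add_reviews.sql"]
padded = ["009_add_tags.sql", "010_add_likes.sql", "100_add_reviews.sql"]


def apply_order(filenames):
    """Return the filenames in plain lexicographic (string) order."""
    return sorted(filenames)


# Without padding, migration 100 sorts before 10, and 9 comes last.
print(apply_order(unpadded))
# With three-digit padding, string order and numeric order agree.
print(apply_order(padded))
```

Once the numbering passes 999, the same problem would reappear unless every prefix is widened together.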

Migration Naming

  • Format: NNN_description.sql (e.g. 101_product_reviews.sql)
  • Use snake_case for descriptions
  • Down migrations (rollback): NNN_description_down.sql when needed

Squash Script

The scripts/squash_migrations.sh script concatenates all migrations into a single baseline SQL file. This is useful for:

  • Fresh database setup
  • Creating a clean baseline for new environments
  • Versioned releases (e.g. baseline_v0601.sql)

Usage

# From project root
./scripts/squash_migrations.sh

Output: veza-backend-api/migrations/baseline_v0601.sql

Procedure

  1. Run the script after adding new migrations
  2. Update the version in the script (e.g. baseline_v0601.sql) for each release
  3. Update the migration range comment (e.g. 001-113) to reflect the latest migration number
  4. The baseline file is auto-generated; do not edit it manually

Recent Migrations

| # | File | Description |
|---|------|-------------|
| 116 | 116_seller_transfers_retry.sql | v0.701: Add retry_count, next_retry_at to seller_transfers; index for failed retries |

Adding New Migrations

  1. Create a new file: veza-backend-api/migrations/NNN_description.sql
  2. Use the next available number (check existing migrations)
  3. Write idempotent SQL when possible (e.g. IF NOT EXISTS)
  4. Test locally before committing
  5. Run squash_migrations.sh to update the baseline for the release
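Step 2 above ("use the next available number") can be checked mechanically. The helper below is a minimal sketch, not part of the repository: the function name and the regular expression are assumptions derived from the NNN_description.sql naming rule, and it takes a list of filenames rather than reading the directory.

```python
import re

# Matches NNN_description.sql, including NNN_description_down.sql.
# Baseline files such as baseline_v0601.sql do not match and are ignored.
MIGRATION_RE = re.compile(r"^(\d{3})_[a-z0-9_]+\.sql$")


def next_migration_number(filenames):
    """Return the next free NNN prefix as a zero-padded string."""
    numbers = []
    for name in filenames:
        match = MIGRATION_RE.match(name)
        if match:
            numbers.append(int(match.group(1)))
    return f"{max(numbers, default=0) + 1:03d}"


existing = [
    "101_product_reviews.sql",
    "116_seller_transfers_retry.sql",
    "baseline_v0601.sql",
]
print(next_migration_number(existing))
```

In practice the list would come from something like `os.listdir("veza-backend-api/migrations")`.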

Expand-contract discipline (W5+ deploy pipeline contract)

TL;DR: every migration must be backward-compatible with the previous deploy's binary. No DROP COLUMN, no ALTER ... NOT NULL, no RENAME in the first deploy of a change. Schema evolution happens across multiple deploys, not in one.

Why this matters

The blue/green deploy pipeline (infra/ansible/playbooks/deploy_app.yml) makes rollback trivial at the app layer: HAProxy flips back to the previous color, ~5 seconds wall-clock, no data lost. But the database doesn't have colors. Migrations apply once, against the shared postgres container, and stay applied across the rollback.

If a deploy adds a non-nullable column without a default and the rolled-back binary then inserts a row that omits that column, the insert fails. The rollback button is broken: the previous binary now crashes against the post-migration schema.
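This failure can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres, so it shows the constraint behaviour only, not Postgres specifics. The users table shape (id, email) and the helper names are assumptions for illustration. The table is created directly in its post-migration shape because SQLite cannot add a NOT NULL column without a default.

```python
import sqlite3


def old_binary_insert(conn):
    # The previous release only knows about the original columns.
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")


def try_rollback_insert(new_column_ddl):
    """Create the post-migration table, then insert as the old binary would."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL, "
        f"{new_column_ddl})"
    )
    try:
        old_binary_insert(conn)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"failed: {exc}"
    finally:
        conn.close()


# Non-nullable, no default: the old binary's insert is rejected.
print(try_rollback_insert("email_verified BOOLEAN NOT NULL"))
# Nullable: the old binary's insert succeeds and leaves NULL behind.
print(try_rollback_insert("email_verified BOOLEAN"))
```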

The fix isn't to make the pipeline smarter. It's to make migrations forward-AND-backward compatible by construction.

The expand-contract pattern (3 deploys per "destructive" change)

Step 1 (deploy N) — Expand: add the new shape alongside the old. Both binaries (old + new) work.

-- migration NNN_add_user_email_verified.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- Nullable, no default: the old binary doesn't know about it.
-- The new binary writes true/false on signup; on read it coalesces NULL → false.
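To see that both binaries coexist after the expand step, here is a small simulation. It is a sketch under stated assumptions: sqlite3 stands in for Postgres, the initial users table shape is invented, and the two INSERT statements merely imitate what an old and a new binary might issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed pre-migration shape of the table.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

# Step 1 migration: nullable column, no default.
conn.execute("ALTER TABLE users ADD COLUMN email_verified BOOLEAN")

# Old binary: does not mention the new column, so it stores NULL.
conn.execute("INSERT INTO users (email) VALUES ('old@example.com')")
# New binary: writes an explicit value on signup.
conn.execute(
    "INSERT INTO users (email, email_verified) VALUES ('new@example.com', 1)"
)

# New binary's read path: treat NULL as false.
rows = conn.execute(
    "SELECT email, COALESCE(email_verified, 0) FROM users ORDER BY id"
).fetchall()
print(rows)
```

Neither insert fails, and the read path never sees NULL, which is the property that keeps a rollback to the old binary safe at this stage.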

Step 2 (deploy N+1) — Backfill: once Step 1 is stable in prod (≥ 1 week, no rollbacks needed), backfill existing rows.

-- migration NNN+1_backfill_user_email_verified.sql
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
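The WHERE email_verified IS NULL clause is what makes this backfill idempotent: a second run finds nothing left to update. The sketch below checks that property, again with sqlite3 as a stand-in for Postgres. The helper names and sample rows are hypothetical, and 0/1 replace false/true because SQLite has no native boolean type.

```python
import sqlite3

BACKFILL = "UPDATE users SET email_verified = 0 WHERE email_verified IS NULL"


def run_backfill(conn):
    """Apply the backfill and return the number of rows it changed."""
    cursor = conn.execute(BACKFILL)
    conn.commit()
    return cursor.rowcount


def null_count(conn):
    """Count rows that would still violate a future NOT NULL constraint."""
    query = "SELECT COUNT(*) FROM users WHERE email_verified IS NULL"
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_verified BOOLEAN)"
)
conn.executemany(
    "INSERT INTO users (email, email_verified) VALUES (?, ?)",
    [("a@example.com", None), ("b@example.com", 1), ("c@example.com", None)],
)

first = run_backfill(conn)   # updates the two NULL rows
second = run_backfill(conn)  # nothing left to update
print(first, second, null_count(conn))
```

A zero result from a query like the one in `null_count` is a reasonable precondition to verify before shipping the contract step.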

Step 3 (deploy N+2) — Contract: once the backfill has run and no NULLs remain, add the constraint and the default. The previous binary, which already writes true/false and coalesces NULL → false on read, keeps working; the new binary can rely on NOT NULL.

-- migration NNN+2_user_email_verified_not_null.sql
-- Set the default first so inserts that omit the column never hit the constraint.
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;

After Step 3 is stable, you can roll back exactly one deploy without breakage. Rolling back beyond Step 1 is no longer safe; that is the expected consequence of expand-contract.

Allowed in a single deploy

| Change | Safe in one deploy? |
|--------|---------------------|
| CREATE TABLE | yes |
| CREATE INDEX CONCURRENTLY | yes |
| Add nullable column | yes |
| Add column with constant default | yes (PG ≥ 11) |
| Backfill UPDATE (idempotent) | yes |
| DROP INDEX CONCURRENTLY | yes (read paths flex) |
| DROP TABLE (if no recent code reads it) | with caution |

NOT allowed in a single deploy

| Change | Why |
|--------|-----|
| DROP COLUMN | rollback's binary still selects it |
| ALTER COLUMN ... NOT NULL (no prior backfill) | rollback inserts NULL |
| ALTER COLUMN ... TYPE | rollback's binary expects old type |
| RENAME COLUMN | rollback's binary still references old name |
| RENAME TABLE | rollback queries old name |

Reviewer checklist (PRs touching veza-backend-api/migrations/)

  • Migration is forward-only (GORM doesn't run rollback SQL).
  • Migration is idempotent (re-running on an already-migrated DB is a no-op — IF NOT EXISTS, ON CONFLICT DO NOTHING, etc.).
  • No DROP COLUMN, ALTER ... NOT NULL, RENAME (or, if there is one, the PR description references the prior backfill PRs and explains why this is the contract step).
  • If the migration takes a heavy lock (e.g. an ALTER TABLE that rewrites the table), use a non-blocking form such as CREATE INDEX CONCURRENTLY, or split the migration.
  • App code changes assume both old and new schema are valid.

When you must violate the rule (incident)

Sometimes a hot incident demands a destructive change ASAP, and losing the ability to roll back is an acceptable risk. In that case:

  1. Tag the PR with migration:destructive.
  2. Document in the PR body what the rollback procedure is (manual SQL to recreate the dropped column, etc.).
  3. Get a second pair of eyes on the migration before merge.
  4. Block the corresponding rollback workflow for that env until you've verified the new schema is sticking.

Future hardening (not in v1.0.x)

A squawk linter step in .forgejo/workflows/ci.yml could scan veza-backend-api/migrations/*.sql and fail on DROP COLUMN, ALTER ... NOT NULL, RENAME. The discipline above is the v1.0 answer; tooling lands when the hand-rolled discipline starts missing things.
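Until such a step exists, the kind of check described here can be approximated with a few regular expressions. The sketch below is not squawk and does not reproduce its rules; it is a deliberately naive stand-in with hypothetical names. It ignores `--` line comments but does not parse SQL, so block comments and string literals can produce false positives, and it does not catch ADD COLUMN ... NOT NULL.

```python
import re

# Statement patterns named in this document as unsafe in a single deploy.
FORBIDDEN = [
    ("DROP COLUMN", re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE)),
    ("SET NOT NULL", re.compile(r"\bSET\s+NOT\s+NULL\b", re.IGNORECASE)),
    ("RENAME", re.compile(r"\bRENAME\b", re.IGNORECASE)),
]


def find_violations(sql):
    """Return the names of forbidden patterns found outside line comments."""
    code = re.sub(r"--[^\n]*", "", sql)  # naive: strips '--' to end of line
    return [name for name, pattern in FORBIDDEN if pattern.search(code)]


expand = (
    "ALTER TABLE users ADD COLUMN email_verified BOOLEAN;\n"
    "-- no DROP COLUMN here\n"
)
contract = "ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;"

print(find_violations(expand))
print(find_violations(contract))
```

A CI wrapper would read each file under veza-backend-api/migrations/, call `find_violations`, and exit non-zero on any hit unless the PR carries the migration:destructive tag.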