senke/veza

senke 22d09dcbbb docs: MIGRATIONS expand-contract section + RUNBOOK_ROLLBACK

Two operator docs the W5+ deploy pipeline depends on for safe
operation.

docs/MIGRATIONS.md (extended) :
  Existing file already covered migration tooling + naming. Append
  a "Expand-contract discipline (W5+ deploy pipeline contract)"
  section : explains why blue/green rollback breaks if migrations
  are forward-only, walks through the 3-deploy expand-backfill-
  contract pattern with a worked example (add nullable column →
  backfill → set NOT NULL), tables of allowed vs not-allowed
  changes for a single deploy, reviewer checklist, and an "in case
  of incident" override path with audit trail.

docs/RUNBOOK_ROLLBACK.md (new) :
  Three rollback paths from fastest to slowest :
   1. HAProxy fast-flip (~5s) — when prior color is still alive,
      use the rollback.yml workflow with mode=fast. Pre-checks +
      post-rollback steps.
   2. Re-deploy older SHA (~10m) — when prior color is gone but
      tarball is still in the Forgejo registry. mode=full.
      Schema-migration caveat documented.
   3. Manual emergency — tarball missing (rebuild + push), schema
      poisoned (manual SQL), Incus host broken (ZFS rollback).

Plus a decision flowchart, "When NOT to rollback" with examples
that bias toward fix-forward over rollback (single-user bugs,
perf regressions, cosmetic issues), and a post-incident checklist.

Cross-referenced with the workflow + playbook + role file paths
the operator will actually need to look up.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 14:48:46 +02:00

6.5 KiB


Database Migrations

Overview

Veza uses SQL migrations stored in veza-backend-api/migrations/. Migrations are applied in order by filename (lexicographic sort).
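
Because ordering is by filename rather than by numeric value, the prefix only sorts correctly when it is zero-padded to a fixed width. The sketch below assumes the runner uses a plain string sort, as "lexicographic" suggests; `099_example.sql` and `99_example.sql` are hypothetical filenames used only to show the pitfall.

```python
# Hypothetical filenames; only the sorting behaviour is the point here.
padded = ["101_product_reviews.sql", "099_example.sql", "116_seller_transfers_retry.sql"]
unpadded = ["101_product_reviews.sql", "99_example.sql", "116_seller_transfers_retry.sql"]


def apply_order(filenames):
    """Return filenames in plain lexicographic order, as a string sort would."""
    return sorted(filenames)


padded_order = apply_order(padded)
unpadded_order = apply_order(unpadded)

print(padded_order)
# ['099_example.sql', '101_product_reviews.sql', '116_seller_transfers_retry.sql']
print(unpadded_order)
# ['101_product_reviews.sql', '116_seller_transfers_retry.sql', '99_example.sql']
```

With the unpadded name, migration 99 would be applied after 116, which is why the `NNN` prefix keeps a fixed width.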

Migration Naming

Format: NNN_description.sql (e.g. 101_product_reviews.sql)
Use snake_case for descriptions
Down migrations (rollback): NNN_description_down.sql when needed

Squash Script

The scripts/squash_migrations.sh script generates a baseline SQL file that concatenates all migrations into a single file. This is useful for:

Fresh database setup
Creating a clean baseline for new environments
Versioned releases (e.g. baseline_v0601.sql)

Usage

# From project root
./scripts/squash_migrations.sh

Output: veza-backend-api/migrations/baseline_v0601.sql

Procedure

Run the script after adding new migrations
Update the version in the script (e.g. baseline_v0601.sql) for each release
Update the migration range comment (e.g. 001-113) to reflect the latest migration number
The baseline file is auto-generated; do not edit it manually

Recent Migrations

| # | File | Description |
|---|------|-------------|
| 116 | `116_seller_transfers_retry.sql` | v0.701: Add `retry_count`, `next_retry_at` to `seller_transfers`; index for failed retries |

Adding New Migrations

Create a new file: veza-backend-api/migrations/NNN_description.sql
Use the next available number (check existing migrations)
Write idempotent SQL when possible (e.g. IF NOT EXISTS)
Test locally before committing
Run squash_migrations.sh to update the baseline for the release
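
The "next available number" step can be sketched as a small helper. This is an illustration, not a script that exists in the repository: `next_migration_number` is a hypothetical name, and it assumes every migration starts with a numeric prefix followed by an underscore, while files such as `baseline_v0601.sql` do not.

```python
import re

# A numeric prefix followed by an underscore, e.g. "116_seller_transfers_retry.sql".
_PREFIX = re.compile(r"^(\d+)_.+\.sql$")


def next_migration_number(filenames):
    """Return the next free prefix, zero-padded to three digits.

    Files without a numeric prefix (e.g. baseline_v0601.sql) are ignored.
    """
    numbers = []
    for name in filenames:
        match = _PREFIX.match(name)
        if match:
            numbers.append(int(match.group(1)))
    following = max(numbers, default=0) + 1
    return f"{following:03d}"


existing = [
    "101_product_reviews.sql",
    "116_seller_transfers_retry.sql",
    "baseline_v0601.sql",
]
print(next_migration_number(existing))  # 117
```

In practice the list would come from something like `os.listdir("veza-backend-api/migrations")`; it is hard-coded here so the example stays self-contained.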

Expand-contract discipline (W5+ deploy pipeline contract)

TL;DR — every migration must be backward-compatible with the previous deploy's binary. No DROP COLUMN, no ALTER ... NOT NULL, no RENAME in step 1. Schema evolution happens across multiple deploys, not in one.

Why this matters

The blue/green deploy pipeline (infra/ansible/playbooks/deploy_app.yml) makes rollback trivial at the app layer: HAProxy flips back to the previous color, ~5 seconds wall-clock, no data lost. But the database doesn't have colors. Migrations apply once, against the shared postgres container, and stay applied across the rollback.

If a deploy adds a non-nullable column without a default and the rolled-back binary then inserts a row that omits that column, the insert fails. The rollback button is broken: the previous binary now errors against the post-migration schema.

The fix isn't to make the pipeline smarter. It's to make migrations forward-AND-backward compatible by construction.
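
The failure mode described above can be reproduced in a few lines. The sketch uses an in-memory SQLite database purely as a stand-in for the shared Postgres container (an assumption made so the example runs anywhere); the `users` table shape and the `old_binary_insert` helper are hypothetical. Only the constraint behaviour matters, and it is the same in both engines: an `INSERT` that omits a `NOT NULL` column with no default is rejected.

```python
import sqlite3


def old_binary_insert(conn):
    """The pre-deploy binary's INSERT: it has never heard of email_verified."""
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")


def try_rollback_against(column_ddl):
    """Build the post-migration schema, then run the old binary's insert."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL, " + column_ddl + ")"
    )
    try:
        old_binary_insert(conn)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"failed: {exc}"
    finally:
        conn.close()


# Forward-only migration: the column is NOT NULL from day one.
forward_only = try_rollback_against("email_verified BOOLEAN NOT NULL")
# Expand step: the column is nullable, so the old insert still succeeds.
expand_step = try_rollback_against("email_verified BOOLEAN")

print(forward_only)
print(expand_step)
```

The first call reports a constraint failure and the second succeeds, which is the whole argument for adding the column as nullable first.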

The expand-contract pattern (3 deploys per "destructive" change)

Step 1 (deploy N) — Expand: add the new shape alongside the old. Both binaries (old + new) work.

-- migration NNN_add_user_email_verified.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- Nullable, no default: the old binary doesn't know about it.
-- The new binary writes true/false on signup and coalesces NULL → false on read.

Step 2 (deploy N+1) — Backfill: once Step 1 is stable in prod (≥ 1 week, no rollbacks needed), backfill existing rows.

-- migration NNN+1_backfill_user_email_verified.sql
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
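
Because the `WHERE email_verified IS NULL` filter only touches rows that still need a value, the backfill is safe to re-run. A minimal sketch of that property, again using in-memory SQLite as a stand-in for Postgres (an assumption; SQLite stores booleans as integers, so `false` is written as `0` here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email_verified BOOLEAN)")
# Two pre-expand rows (NULL) and one row written by the new binary (true).
conn.executemany(
    "INSERT INTO users (email_verified) VALUES (?)",
    [(None,), (None,), (True,)],
)

BACKFILL = "UPDATE users SET email_verified = 0 WHERE email_verified IS NULL"

first = conn.execute(BACKFILL).rowcount   # rows touched on the first run
second = conn.execute(BACKFILL).rowcount  # rows touched on the re-run

remaining_nulls = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email_verified IS NULL"
).fetchone()[0]
verified_rows = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email_verified = 1"
).fetchone()[0]

print(first, second, remaining_nulls, verified_rows)  # 2 0 0 1
```

The second run touches no rows and the row already set to true is left alone. A `remaining_nulls` count of zero is also the condition that must hold before a later migration can add `NOT NULL`.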

Step 3 (deploy N+2) - Contract: once the backfill is in, add the constraint. The previous binary (which always writes true/false and still coalesces NULL → false on read) keeps working; the new binary can rely on the column being NOT NULL.

-- migration NNN+2_user_email_verified_not_null.sql
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;

After Step 3 is stable, you can still roll back exactly one deploy without breakage. Rolling back past Step 1 is no longer safe; that is the expected consequence of expand-contract.

Allowed in a single deploy

| Change | Safe in one deploy? |
|--------|---------------------|
| `CREATE TABLE` | yes |
| `CREATE INDEX CONCURRENTLY` | yes |
| Add nullable column | yes |
| Add column with constant default | yes (PG ≥ 11) |
| Backfill UPDATE (idempotent) | yes |
| `DROP INDEX CONCURRENTLY` | yes (read paths flex) |
| `DROP TABLE` (if no recent code reads it) | with caution |

NOT allowed in a single deploy

| Change | Why |
|--------|-----|
| `DROP COLUMN` | rollback's binary still selects it |
| `ALTER COLUMN ... NOT NULL` (no prior backfill) | rollback inserts NULL |
| `ALTER COLUMN ... TYPE` | rollback's binary expects old type |
| `RENAME COLUMN` | rollback's binary still references old name |
| `RENAME TABLE` | rollback queries old name |

Reviewer checklist (PRs touching `veza-backend-api/migrations/`)

Migration is forward-only (GORM doesn't run rollback SQL).
Migration is idempotent (re-running on an already-migrated DB is a no-op — IF NOT EXISTS, ON CONFLICT DO NOTHING, etc.).
No DROP COLUMN, ALTER ... NOT NULL, RENAME (or, if there is, the PR description references the prior backfill PRs and explains why this is the contract step).
If the migration takes a heavy lock (e.g. an ALTER TABLE that rewrites the table, or a plain CREATE INDEX), split it into smaller steps and build indexes with CREATE INDEX CONCURRENTLY.
App code changes assume both old and new schema are valid.
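
The idempotency item in the checklist can be checked mechanically by applying a migration twice and comparing the result. The sketch below is one way to do that, assuming in-memory SQLite (3.24 or newer, for `ON CONFLICT`) as a stand-in for Postgres; the `feature_flags` table and its contents are hypothetical.

```python
import sqlite3

# Hypothetical migration body; table and column names are illustrative only.
MIGRATION = """
CREATE TABLE IF NOT EXISTS feature_flags (name TEXT PRIMARY KEY, enabled INTEGER);
INSERT INTO feature_flags (name, enabled) VALUES ('example', 0)
    ON CONFLICT (name) DO NOTHING;
"""


def snapshot(conn):
    """Return the table contents in a stable order for comparison."""
    return conn.execute(
        "SELECT name, enabled FROM feature_flags ORDER BY name"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(MIGRATION)
after_first = snapshot(conn)

conn.executescript(MIGRATION)  # the re-run must neither fail nor change data
after_second = snapshot(conn)

print(after_first == after_second)  # True
```

Without `IF NOT EXISTS` the second run would fail on the `CREATE TABLE`, and without `ON CONFLICT ... DO NOTHING` it would fail on the duplicate key.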

When you must violate the rule (incident)

Sometimes a hot incident demands a destructive change ASAP and a broken rollback path is an acceptable risk. In that case:

Tag the PR with migration:destructive.
Document in the PR body what the rollback procedure is (manual SQL to recreate the dropped column, etc.).
Get a second pair of eyes on the migration before merge.
Block the corresponding rollback workflow for that env until you've verified the new schema is sticking.

Future hardening (not in v1.0.x)

A squawk linter step in .forgejo/workflows/ci.yml could scan veza-backend-api/migrations/*.sql and fail on DROP COLUMN, ALTER ... NOT NULL, RENAME. The discipline above is the v1.0 answer; tooling lands when the hand-rolled discipline starts missing things.
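
Until such a step exists, the kind of check it would perform can be approximated with a few regular expressions. The sketch below is a hand-rolled illustration, not squawk and not its rule set; the pattern names and the `lint` helper are hypothetical, and the comment stripping is naive (it would also cut a `--` that appears inside a string literal).

```python
import re

# Patterns for the changes listed under "NOT allowed in a single deploy".
# A hand-rolled approximation, not squawk's rule set.
DESTRUCTIVE = {
    "DROP COLUMN": re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    "SET NOT NULL": re.compile(
        r"\bALTER\s+COLUMN\s+\w+\s+SET\s+NOT\s+NULL\b", re.IGNORECASE
    ),
    "ALTER TYPE": re.compile(
        r"\bALTER\s+COLUMN\s+\w+\s+(SET\s+DATA\s+)?TYPE\b", re.IGNORECASE
    ),
    "RENAME": re.compile(r"\bRENAME\s+(COLUMN|TO)\b", re.IGNORECASE),
}


def strip_comments(sql):
    """Drop `-- ...` line comments so commented-out SQL is not flagged."""
    return "\n".join(line.split("--", 1)[0] for line in sql.splitlines())


def lint(sql):
    """Return the sorted names of destructive patterns found in `sql`."""
    body = strip_comments(sql)
    return sorted(name for name, pattern in DESTRUCTIVE.items() if pattern.search(body))


expand = "ALTER TABLE users ADD COLUMN email_verified BOOLEAN;\n-- no DROP COLUMN here"
contract = "ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;"
rename = "ALTER TABLE users RENAME COLUMN name TO full_name;"

print(lint(expand))    # []
print(lint(contract))  # ['SET NOT NULL']
print(lint(rename))    # ['RENAME']
```

A check like this cannot tell a legitimate contract step from an unsafe one, so a non-empty result would still need the reviewer checklist above: the PR has to point at the prior backfill.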
