Two operator docs the W5+ deploy pipeline depends on for safe
operation.
docs/MIGRATIONS.md (extended) :
Existing file already covered migration tooling + naming. Append
a "Expand-contract discipline (W5+ deploy pipeline contract)"
section : explains why blue/green rollback breaks if migrations
are forward-only, walks through the 3-deploy expand-backfill-
contract pattern with a worked example (add nullable column →
backfill → set NOT NULL), tables of allowed vs not-allowed
changes for a single deploy, reviewer checklist, and an "in case
of incident" override path with audit trail.
docs/RUNBOOK_ROLLBACK.md (new) :
Three rollback paths from fastest to slowest :
1. HAProxy fast-flip (~5s) — when prior color is still alive,
use the rollback.yml workflow with mode=fast. Pre-checks +
post-rollback steps.
2. Re-deploy older SHA (~10m) — when prior color is gone but
tarball is still in the Forgejo registry. mode=full.
Schema-migration caveat documented.
3. Manual emergency — tarball missing (rebuild + push), schema
poisoned (manual SQL), Incus host broken (ZFS rollback).
Plus a decision flowchart, "When NOT to rollback" with examples
that bias toward fix-forward over rollback (single-user bugs,
perf regressions, cosmetic issues), and a post-incident checklist.
Cross-referenced with the workflow + playbook + role file paths
the operator will actually need to look up.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Database Migrations
Overview
Veza uses SQL migrations stored in veza-backend-api/migrations/. Migrations are applied in order by filename (lexicographic sort).
Migration Naming
- Format: NNN_description.sql (e.g. 101_product_reviews.sql)
- Use snake_case for descriptions
- Down migrations (rollback): NNN_description_down.sql when needed
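Because ordering is a plain lexicographic sort, an unpadded number can land in the wrong place. The sketch below illustrates this under the assumption that NNN means exactly three digits; the pattern and the helper names are hypothetical, not part of the repository tooling.

```python
import re

# Assumption: "NNN" means exactly three digits, followed by a snake_case description.
# A down migration (NNN_description_down.sql) matches the same pattern.
MIGRATION_NAME = re.compile(r"^\d{3}_[a-z0-9]+(?:_[a-z0-9]+)*\.sql$")


def is_valid_name(filename):
    """Return True if the filename follows the assumed naming convention."""
    return MIGRATION_NAME.match(filename) is not None


def apply_order(filenames):
    """Return the filenames in the order a lexicographic sort would apply them."""
    return sorted(filenames)


# 099_example.sql and 99_unpadded.sql are made-up names for illustration.
names = ["101_product_reviews.sql", "099_example.sql", "99_unpadded.sql"]
for name in apply_order(names):
    print(name, "ok" if is_valid_name(name) else "INVALID NAME")
```

The unpadded 99_unpadded.sql sorts after 101_product_reviews.sql, which is why zero-padding to a fixed width matters.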
Squash Script
The scripts/squash_migrations.sh script generates a baseline SQL file that concatenates all migrations into a single file. This is useful for:
- Fresh database setup
- Creating a clean baseline for new environments
- Versioned releases (e.g. baseline_v0601.sql)
Usage
# From project root
./scripts/squash_migrations.sh
Output: veza-backend-api/migrations/baseline_v0601.sql
Procedure
- Run the script after adding new migrations
- Update the version in the script (e.g. baseline_v0601.sql) for each release
- Update the migration range comment (e.g. 001-113) to reflect the latest migration number
- The baseline file is auto-generated; do not edit it manually
Recent Migrations
| # | File | Description |
|---|---|---|
| 116 | 116_seller_transfers_retry.sql | v0.701: Add retry_count, next_retry_at to seller_transfers; index for failed retries |
Adding New Migrations
- Create a new file: veza-backend-api/migrations/NNN_description.sql
- Use the next available number (check existing migrations)
- Write idempotent SQL when possible (e.g. IF NOT EXISTS)
- Test locally before committing
- Run squash_migrations.sh to update the baseline for the release
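Finding the next available number can be done by hand, or with a small helper along these lines. This is a minimal sketch: next_migration_number is a hypothetical function, and it assumes every numbered migration starts with digits followed by an underscore.

```python
import re


def next_migration_number(filenames):
    """Return the next free migration number, zero-padded to three digits.

    Files without a leading numeric prefix (such as a baseline file) are ignored.
    """
    numbers = [
        int(match.group(1))
        for name in filenames
        if (match := re.match(r"^(\d+)_", name))
    ]
    return f"{max(numbers, default=0) + 1:03d}"


existing = [
    "101_product_reviews.sql",
    "116_seller_transfers_retry.sql",
    "baseline_v0601.sql",
]
print(next_migration_number(existing))
```

In practice the input would be the listing of veza-backend-api/migrations/, for example from os.listdir.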
Expand-contract discipline (W5+ deploy pipeline contract)
TL;DR: every migration must be backward-compatible with the previous deploy's binary. No DROP COLUMN, no ALTER ... NOT NULL, no RENAME in step 1. Schema evolution happens across multiple deploys, not in one.
Why this matters
The blue/green deploy pipeline (infra/ansible/playbooks/deploy_app.yml)
makes rollback trivial at the app layer: HAProxy flips back to
the previous color, ~5 seconds wall-clock, no data lost. But the
database doesn't have colors. Migrations apply once, against the
shared postgres container, and stay applied across the rollback.
If a deploy adds a NOT NULL column without a default and the rolled-back binary then inserts a row without that column, the insert fails. The rollback button is broken: the previous binary now errors against the post-migration schema.
The fix isn't to make the pipeline smarter. It's to make migrations forward-AND-backward compatible by construction.
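The failure mode can be reproduced in miniature. The sketch below uses an in-memory SQLite database as a stand-in for the shared postgres container (an assumption made so the example is self-contained; error classes and messages differ on PostgreSQL), and old_binary_signup is a hypothetical name for the INSERT the previous binary would issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Post-migration schema: the new deploy added email_verified as NOT NULL, no default.
conn.execute(
    "CREATE TABLE users ("
    " id INTEGER PRIMARY KEY,"
    " email TEXT NOT NULL,"
    " email_verified BOOLEAN NOT NULL)"
)


def old_binary_signup(conn, email):
    """INSERT as issued by the previous binary, which predates email_verified."""
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


# After a color flip back to the old binary, signups hit the new constraint.
try:
    old_binary_signup(conn, "a@example.com")
    rollback_safe = True
    failure = ""
except sqlite3.IntegrityError as exc:
    rollback_safe = False
    failure = str(exc)

print("rollback safe:", rollback_safe, "|", failure)
```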
The expand-contract pattern (3 deploys per "destructive" change)
Step 1 (deploy N) — Expand: add the new shape alongside the old. Both binaries (old + new) work.
-- migration NNN_add_user_email_verified.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- Nullable, no default: the old binary doesn't know about the column.
-- The new binary writes true/false on signup; reads coalesce NULL → false.
Step 2 (deploy N+1) — Backfill: once Step 1 is stable in prod (≥ 1 week, no rollbacks needed), backfill existing rows.
-- migration NNN+1_backfill_user_email_verified.sql
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
Step 3 (deploy N+2) — Contract: once the backfill is in, add the
constraint. The old binary (which always writes true/false and still coalesces NULL → false on read) keeps
working; the new binary can rely on the NOT NULL guarantee.
-- migration NNN+2_user_email_verified_not_null.sql
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;
After Step 3 is stable, you can roll back exactly one deploy without breakage. Rolling back beyond Step 1 is no longer safe; that's the expected consequence of expand-contract.
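Steps 1 and 2 above can be exercised end to end in a few lines. This is a sketch against in-memory SQLite, used here only as a stand-in for PostgreSQL (so booleans are stored as 0/1), and Step 3 is left out because SQLite has no ALTER COLUMN ... SET NOT NULL. The read_verified helper and the sample emails are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("INSERT INTO users (email) VALUES ('old@example.com')")

# Step 1 (expand): nullable column, no default.
conn.execute("ALTER TABLE users ADD COLUMN email_verified BOOLEAN")

# The old binary's INSERT still succeeds because the column is nullable.
conn.execute("INSERT INTO users (email) VALUES ('rollback@example.com')")
# The new binary writes the column explicitly.
conn.execute(
    "INSERT INTO users (email, email_verified) VALUES ('new@example.com', 1)"
)


def read_verified(conn, email):
    """Read path of the new binary: NULL is coalesced to false."""
    row = conn.execute(
        "SELECT COALESCE(email_verified, 0) FROM users WHERE email = ?", (email,)
    ).fetchone()
    return bool(row[0])


# Step 2 (backfill): only touches rows that are still NULL, so it is idempotent.
BACKFILL = "UPDATE users SET email_verified = 0 WHERE email_verified IS NULL"
first_pass = conn.execute(BACKFILL).rowcount
second_pass = conn.execute(BACKFILL).rowcount
remaining_nulls = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email_verified IS NULL"
).fetchone()[0]

print(first_pass, second_pass, remaining_nulls)
```

The second backfill pass updating zero rows is the property that makes re-running the migration harmless.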
Allowed in a single deploy
| Change | Safe in one deploy? |
|---|---|
| CREATE TABLE | yes |
| CREATE INDEX CONCURRENTLY | yes |
| Add nullable column | yes |
| Add column with constant default | yes (PG ≥ 11) |
| Backfill UPDATE (idempotent) | yes |
| DROP INDEX CONCURRENTLY | yes (read paths flex) |
| DROP TABLE (if no recent code reads it) | with caution |
NOT allowed in a single deploy
| Change | Why |
|---|---|
| DROP COLUMN | rollback's binary still selects it |
| ALTER COLUMN ... NOT NULL (no prior backfill) | rollback inserts NULL |
| ALTER COLUMN ... TYPE | rollback's binary expects old type |
| RENAME COLUMN | rollback's binary still references old name |
| RENAME TABLE | rollback queries old name |
Reviewer checklist (PRs touching veza-backend-api/migrations/)
- Migration is forward-only (GORM doesn't run rollback SQL).
- Migration is idempotent (re-running on an already-migrated DB is a no-op: IF NOT EXISTS, ON CONFLICT DO NOTHING, etc.).
- No DROP COLUMN, ALTER ... NOT NULL, RENAME (or, if there is, the PR description references the prior backfill PRs and explains why this is the contract step).
- If the migration takes a heavy lock (e.g. ALTER TABLE rewriting), use CREATE INDEX CONCURRENTLY or split.
- App code changes assume both old and new schema are valid.
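The idempotency item in the checklist above can be checked mechanically by applying a migration twice and comparing the resulting schema. The sketch below does this against in-memory SQLite as a stand-in for PostgreSQL. The is_idempotent helper and the product_reviews columns are hypothetical, and a real check would run against a disposable PostgreSQL instance.

```python
import sqlite3

# Hypothetical migration body written with IF NOT EXISTS guards.
GUARDED = """
CREATE TABLE IF NOT EXISTS product_reviews (
    id INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL,
    body TEXT
);
CREATE INDEX IF NOT EXISTS idx_product_reviews_product_id
    ON product_reviews (product_id);
"""

# Same idea without the guard: the second run fails.
UNGUARDED = "CREATE TABLE product_reviews (id INTEGER PRIMARY KEY);"


def schema_snapshot(conn):
    """Return the schema objects as a sorted list of (type, name, sql) rows."""
    return conn.execute(
        "SELECT type, name, sql FROM sqlite_master ORDER BY name"
    ).fetchall()


def is_idempotent(sql):
    """Apply the SQL twice to a fresh database; True if the second run is a no-op."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql)
        first = schema_snapshot(conn)
        conn.executescript(sql)
        return schema_snapshot(conn) == first
    except sqlite3.Error:
        return False
    finally:
        conn.close()


print(is_idempotent(GUARDED), is_idempotent(UNGUARDED))
```

This only compares schema objects. A data migration would also need its row contents compared between the two runs.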
When you must violate the rule (incident)
Sometimes a hot incident demands a destructive change ASAP and rollback is acceptable risk. In that case:
- Tag the PR with migration:destructive.
- Document in the PR body what the rollback procedure is (manual SQL to recreate the dropped column, etc.).
- Get a second pair of eyes on the migration before merge.
- Block the corresponding rollback workflow for that env until you've verified the new schema is sticking.
Future hardening (not in v1.0.x)
A squawk linter step in .forgejo/workflows/ci.yml could scan
veza-backend-api/migrations/*.sql and fail on DROP COLUMN,
ALTER ... NOT NULL, RENAME. The discipline above is the v1.0
answer; tooling lands when the hand-rolled discipline starts
missing things.
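Until such a step exists, a rough approximation of the check is easy to picture. The sketch below is a hand-rolled regex scan, not squawk and not a SQL parser. The rule names and the lint function are hypothetical, and it can be fooled by string literals or block comments.

```python
import re

# Patterns for the three statement families named above (case-insensitive).
RULES = [
    ("DROP COLUMN", re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE)),
    ("SET NOT NULL", re.compile(r"\bSET\s+NOT\s+NULL\b", re.IGNORECASE)),
    ("RENAME", re.compile(r"\bRENAME\b", re.IGNORECASE)),
]


def strip_line_comments(sql):
    """Remove '-- ...' comments so commented-out statements are not flagged."""
    return re.sub(r"--[^\n]*", "", sql)


def lint(sql):
    """Return the names of the rules that the migration text violates."""
    body = strip_line_comments(sql)
    return [name for name, pattern in RULES if pattern.search(body)]


expand = "ALTER TABLE users ADD COLUMN email_verified BOOLEAN;"
contract = "ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;"
print(lint(expand), lint(contract))
```

A contract-step migration is expected to trip the check, so a CI step built on this idea would need an opt-out, such as the migration:destructive tag described earlier. The scan also does not catch ADD COLUMN ... NOT NULL without a default.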