veza/docs/MIGRATIONS.md

# Database Migrations

## Overview

Veza uses SQL migrations stored in `veza-backend-api/migrations/`. Migrations are applied in order by filename (lexicographic sort).

## Migration Naming

- Format: `NNN_description.sql` (e.g. `101_product_reviews.sql`)
- Use snake_case for descriptions
- Down migrations (rollback): `NNN_description_down.sql` when needed
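
Because ordering is purely lexicographic, a name that drops the zero padding (say `99_foo.sql`) would sort after `100_bar.sql`. As a minimal sketch, a check like the one below could catch off-format names before commit. The function `is_valid_migration_name` is hypothetical and not part of the repo, and it assumes exactly three digits, as in the examples above.

```bash
# Hypothetical helper, not part of the repo: succeeds if the file name
# matches NNN_description.sql (three digits, snake_case description).
# Down migrations (NNN_description_down.sql) match the same pattern.
is_valid_migration_name() (
  name=$(basename "$1")
  printf '%s\n' "$name" | grep -Eq '^[0-9]{3}_[a-z0-9]+(_[a-z0-9]+)*\.sql$'
)
```

For example, `is_valid_migration_name 101_product_reviews.sql && echo ok` prints `ok`, while a name such as `101-ProductReviews.sql` makes the function return non-zero.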

## Squash Script

The `scripts/squash_migrations.sh` script generates a baseline SQL file that concatenates all migrations into a single file. This is useful for:

- Fresh database setup
- Creating a clean baseline for new environments
- Versioned releases (e.g. `baseline_v0601.sql`)

### Usage

```bash
# From project root
./scripts/squash_migrations.sh
```

Output: `veza-backend-api/migrations/baseline_v0601.sql`

### Procedure

1. Run the script after adding new migrations
2. Update the version in the script (e.g. `baseline_v0601.sql`) for each release
3. Update the migration range comment (e.g. `001-113`) to reflect the latest migration number
4. The baseline file is auto-generated; do not edit it manually
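
To make the idea of "concatenate in filename order" concrete, here is a rough sketch of what such a squash could look like. It is not the contents of `scripts/squash_migrations.sh`, which may differ; the function name `squash_preview`, the `-- >>>` separator comment, and the choice to skip `*_down.sql` files are all assumptions made for illustration.

```bash
# Rough sketch only; the real scripts/squash_migrations.sh may differ.
# Writes all forward migrations from the given directory to stdout,
# in lexicographic (glob) order, each preceded by a marker comment.
squash_preview() (
  dir=${1:-veza-backend-api/migrations}
  for f in "$dir"/[0-9][0-9][0-9]_*.sql; do
    [ -e "$f" ] || continue                    # glob matched nothing
    case "$f" in *_down.sql) continue ;; esac  # assumption: skip rollbacks
    printf '%s\n' "-- >>> $(basename "$f")"
    cat "$f"
    printf '\n'
  done
)
```

Running `squash_preview veza-backend-api/migrations | less` gives a quick read-only preview without touching the generated baseline file.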

## Recent Migrations

| # | File | Description |
|---|------|-------------|
| 116 | `116_seller_transfers_retry.sql` | v0.701: Add `retry_count`, `next_retry_at` to `seller_transfers`; index for failed retries |

## Adding New Migrations

1. Create a new file: `veza-backend-api/migrations/NNN_description.sql`
2. Use the next available number (check existing migrations)
3. Write idempotent SQL when possible (e.g. `IF NOT EXISTS`)
4. Test locally before committing
5. Run `squash_migrations.sh` to update the baseline for the release
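
Step 2 can be scripted instead of eyeballed. The sketch below is one possible approach, assuming the three-digit `NNN_` prefix described above; `next_migration_number` is a hypothetical helper name, not an existing script in the repo.

```bash
# Hypothetical helper: print the next free three-digit migration number
# for the directory given as $1 (default: veza-backend-api/migrations).
next_migration_number() (
  dir=${1:-veza-backend-api/migrations}
  max=0
  for f in "$dir"/[0-9][0-9][0-9]_*.sql; do
    [ -e "$f" ] || continue      # glob matched nothing
    n=$(basename "$f" | cut -c1-3)
    n=${n#0}; n=${n#0}           # drop leading zeros so 008 is not read as octal
    if [ "$n" -gt "$max" ]; then
      max=$n
    fi
  done
  printf '%03d\n' "$((max + 1))"
)
```

Files that do not start with three digits, such as a generated `baseline_v0601.sql`, are ignored by the glob. Two branches can still pick the same number independently, so the result is worth re-checking after a rebase.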

## Expand-contract discipline (W5+ deploy pipeline contract)

> **TL;DR** — every migration must be **backward-compatible** with the
> previous deploy's binary. No `DROP COLUMN`, no `ALTER ... NOT NULL`,
> no `RENAME` in step 1. Schema evolution happens across **multiple
> deploys**, not in one.

### Why this matters

The blue/green deploy pipeline (`infra/ansible/playbooks/deploy_app.yml`)
makes rollback trivial at the **app layer**: HAProxy flips back to
the previous color, ~5 seconds wall-clock, no data lost. But the
**database** doesn't have colors. Migrations apply once, against the
shared postgres container, and stay applied across the rollback.

If a deploy adds a non-nullable column without a default and is then
rolled back, the previous binary inserts rows without that column and
every such insert fails. The rollback button is broken: the previous
binary now errors against the post-migration schema.

The fix isn't to make the pipeline smarter. It's to make migrations
forward-AND-backward compatible by construction.

### The expand-contract pattern (3 deploys per "destructive" change)

**Step 1 (deploy N) — Expand**: add the new shape **alongside** the
old. Both binaries (old + new) work.

```sql
-- migration NNN_add_user_email_verified.sql
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- Nullable, no default: the old binary doesn't know about the column.
-- The new binary writes true/false on signup and reads NULL as false (COALESCE).
```

**Step 2 (deploy N+1) — Backfill**: once Step 1 is stable in prod
(≥ 1 week, no rollbacks needed), backfill existing rows.

```sql
-- migration NNN+1_backfill_user_email_verified.sql
UPDATE users SET email_verified = false WHERE email_verified IS NULL;
```

**Step 3 (deploy N+2) — Contract**: once the backfill is in, add the
constraint. The previous binary already writes true/false and reads
NULL as false, so it keeps working; the new binary can rely on `NOT NULL`.

```sql
-- migration NNN+2_user_email_verified_not_null.sql
-- Set the default first so inserts that omit the column still succeed.
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
```

After Step 3 is stable, you can roll back exactly **one** deploy without
breakage. Rolling back beyond Step 1 is no longer safe; that is the
expected consequence of expand-contract.

### Allowed in a single deploy

| Change                                  | Safe in one deploy?     |
| --------------------------------------- | ----------------------- |
| `CREATE TABLE`                          | yes                     |
| `CREATE INDEX CONCURRENTLY`             | yes                     |
| Add nullable column                     | yes                     |
| Add column with constant default        | yes (PG ≥ 11)           |
| Backfill UPDATE (idempotent)            | yes                     |
| `DROP INDEX CONCURRENTLY`               | yes (read paths flex)   |
| `DROP TABLE` (if no recent code reads it) | with caution           |

### NOT allowed in a single deploy

| Change                                  | Why                                          |
| --------------------------------------- | -------------------------------------------- |
| `DROP COLUMN`                           | rollback's binary still selects it           |
| `ALTER COLUMN ... SET NOT NULL` (no prior backfill) | rollback's binary inserts NULL    |
| `ALTER COLUMN ... TYPE`                 | rollback's binary expects old type           |
| `RENAME COLUMN`                         | rollback's binary still references old name  |
| `ALTER TABLE ... RENAME TO`             | rollback's binary queries old table name     |

### Reviewer checklist (PRs touching `veza-backend-api/migrations/`)

- [ ] Migration is **forward-only** (GORM doesn't run rollback SQL).
- [ ] Migration is **idempotent** (re-running on an already-migrated
      DB is a no-op — `IF NOT EXISTS`, `ON CONFLICT DO NOTHING`, etc.).
- [ ] No `DROP COLUMN`, `ALTER ... NOT NULL`, `RENAME` (or, if there
      is, the PR description references the prior backfill PRs and
      explains why this is the contract step).
- [ ] If the migration takes a heavy lock (e.g. an `ALTER TABLE` that
      rewrites the table, or a plain `CREATE INDEX`), use the
      `CONCURRENTLY` variant for indexes or split the change.
- [ ] App code changes assume both old and new schema are valid.
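
As a reviewer aid for the third item, a plain `grep` pass can surface candidate statements before a human reads the diff. This is a hedged sketch, not an existing repo script: `check_destructive` is a hypothetical name, the patterns are line-based approximations (they also match SQL comments and miss statements split across lines), and a flagged line in a legitimate contract step still needs the PR description check above.

```bash
# Hypothetical pre-review check, a stand-in for a real SQL linter.
# Prints file:line:text for statements that are not allowed in a
# single deploy and returns non-zero if any were found.
check_destructive() (
  dir=${1:-veza-backend-api/migrations}
  # grep exits 0 when something matched, so negate it for the caller.
  ! grep -EinH \
      -e 'DROP[[:space:]]+COLUMN' \
      -e 'SET[[:space:]]+NOT[[:space:]]+NULL' \
      -e 'RENAME[[:space:]]' \
      -e 'ALTER[[:space:]]+COLUMN[[:space:]].*[[:space:]]TYPE[[:space:]]' \
      "$dir"/*.sql
)
```

A typical call would be `check_destructive veza-backend-api/migrations || echo "needs expand-contract review"`. In practice it probably makes sense to restrict the scan to files added in the PR, since older contract-step migrations will match by design.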

### When you must violate the rule (incident)

Sometimes a hot incident demands a destructive change ASAP and losing
the rollback path is an acceptable risk. In that case:

1. Tag the PR with `migration:destructive`.
2. Document in the PR body what the rollback procedure is (manual
   SQL to recreate the dropped column, etc.).
3. Get a second pair of eyes on the migration before merge.
4. Block the corresponding rollback workflow for that env until
   you've verified the new schema is sticking.

### Future hardening (not in v1.0.x)

A `squawk` linter step in `.forgejo/workflows/ci.yml` could scan
`veza-backend-api/migrations/*.sql` and fail on `DROP COLUMN`,
`ALTER ... NOT NULL`, `RENAME`. The discipline above is the v1.0
answer; tooling lands when the hand-rolled discipline starts
missing things.