veza/docs/RUNBOOK_ROLLBACK.md
senke 22d09dcbbb docs: MIGRATIONS expand-contract section + RUNBOOK_ROLLBACK
Two operator docs the W5+ deploy pipeline depends on for safe
operation.

docs/MIGRATIONS.md (extended):
  Existing file already covered migration tooling + naming. Appends
  an "Expand-contract discipline (W5+ deploy pipeline contract)"
  section: explains why blue/green rollback breaks if migrations
  are forward-only, walks through the 3-deploy expand-backfill-
  contract pattern with a worked example (add nullable column →
  backfill → set NOT NULL), tables of allowed vs. not-allowed
  changes for a single deploy, a reviewer checklist, and an "in case
  of incident" override path with an audit trail.

docs/RUNBOOK_ROLLBACK.md (new):
  Three rollback paths, from fastest to slowest:
   1. HAProxy fast-flip (~5s) — when prior color is still alive,
      use the rollback.yml workflow with mode=fast. Pre-checks +
      post-rollback steps.
   2. Re-deploy older SHA (~10m) — when prior color is gone but
      tarball is still in the Forgejo registry. mode=full.
      Schema-migration caveat documented.
   3. Manual emergency — tarball missing (rebuild + push), schema
      poisoned (manual SQL), Incus host broken (ZFS rollback).

Plus a decision flowchart, "When NOT to rollback" with examples
that bias toward fix-forward over rollback (single-user bugs,
perf regressions, cosmetic issues), and a post-incident checklist.

Cross-referenced with the workflow + playbook + role file paths
the operator will actually need to look up.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:48:46 +02:00


# Runbook — rollback a Veza deploy

Three rollback paths, ordered from fastest to slowest. Pick based on what's still alive and what you're rolling back from.

| Path | Time | Use when |
|------|------|----------|
| 1. HAProxy fast-flip | ~5s | The previous color's containers are still alive. |
| 2. Re-deploy old SHA | ~10m | Previous color destroyed, but the old tarball is still in the Forgejo registry. |
| 3. Manual emergency | ad-hoc | Both of the above failed (registry purged, infra broken). |

Before you roll back, breathe and read this first. The default instinct under fire is to smash the rollback button. Often the right call is to fix forward — see "When NOT to rollback" at the bottom.


## Decision flowchart

```
        Did the new color come up at all?
                    │
        ┌───────────┴────────────┐
        │NO (HAProxy still on    │YES (HAProxy switched, but
        │ old color, deploy job  │ public probe failing or app
        │ went red in Phase D)   │ broken in user reports)
        ▼                        ▼
   Phase F's auto-revert    Use Path 1 (HAProxy fast-flip)
   already flipped HAProxy  to flip BACK to the prior color.
   for you. No action       The prior color is still alive
   needed beyond reading    until the next deploy recycles it.
   logs.
                            If the prior color was already
                            cleaned up, use Path 2.
```
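The flowchart boils down to two yes/no facts, which the pre-checks under each path establish. A sketch, illustrative only (the function and its inputs are hypothetical; the real decision is the operator's):

```shell
# rollback_path: pick a rollback path from two facts established by the
# pre-checks: is the prior color still alive, is the old tarball present.
rollback_path() {
  local prior_alive=$1 tarball_present=$2   # "yes" or "no"
  if [ "$prior_alive" = yes ]; then
    echo "1-fast-flip"
  elif [ "$tarball_present" = yes ]; then
    echo "2-redeploy"
  else
    echo "3-manual"
  fi
}
```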

## Path 1 — HAProxy fast-flip (~5s)

Use when the prior color's containers are still alive. Triggered via the Veza rollback workflow with mode=fast.

### Pre-checks

```shell
# What's the current active color?
incus exec veza-staging-haproxy -- cat /var/lib/veza/active-color
# (or veza-haproxy in prod)

# What's the prior color (second line of the newest-first history)?
incus exec veza-staging-haproxy -- head -2 /var/lib/veza/active-color.history

# Are the prior color's containers RUNNING? (the name filter is a regex;
# shell brace expansion would not survive the quotes)
incus list 'veza-staging-(backend|stream|web)-blue' --format csv -c n,s
```
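If you'd rather derive the prior color than eyeball the history, a sketch (ASSUMPTION: history lines are newest-first and the color is the first whitespace-separated field — verify against your actual file format before trusting it):

```shell
# prior_color: print the color to roll back TO, from a local copy of
# active-color.history. Field layout is an assumption; adjust the awk
# if your format differs.
prior_color() {
  sed -n '2p' "$1" | awk '{print $1}'
}

# usage (hypothetical):
#   incus exec veza-staging-haproxy -- cat /var/lib/veza/active-color.history >/tmp/hist
#   prior_color /tmp/hist
```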

### Trigger

In the Forgejo UI: Actions → Veza rollback → Run workflow:

| input | value |
|-------|-------|
| env | staging (or prod) |
| mode | fast |
| target_color | the PRIOR color, e.g. blue if green is currently active |
| release_sha | (leave empty) |

The workflow runs `infra/ansible/playbooks/rollback.yml -e mode=fast -e target_color=blue`, which:

  1. Verifies all three target-color containers are RUNNING (fails loudly if not — switch to Path 2).
  2. Re-templates haproxy.cfg with `veza_active_color=blue`, validates it with `haproxy -c`, swaps it in with an atomic mv, and HUPs HAProxy.
  3. Updates /var/lib/veza/active-color.

Wall time: ~5s. Zero connection drop (HAProxy reload is graceful).

### Post-rollback

- Verify externally: `curl https://staging.veza.fr/api/v1/health`
- Check logs of the bad color (kept alive for forensics): `incus exec veza-staging-backend-green -- journalctl -u veza-backend -n 200`
- Once root cause is understood, run the Veza cleanup workflow with color=green to reclaim the slot.
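The external verification can be wrapped in a small retry loop so a single cold-cache failure doesn't read as a bad rollback. A sketch; the probe is whatever command you'd run by hand:

```shell
# wait_healthy: run a probe command up to N times, one second apart,
# until it succeeds; non-zero exit if it never does.
wait_healthy() {
  local tries=$1 i
  shift
  for ((i = 1; i <= tries; i++)); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

# usage:
#   wait_healthy 30 curl -fsS https://staging.veza.fr/api/v1/health
```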

## Path 2 — Re-deploy older SHA (~10 minutes)

Use when the prior color's containers were already destroyed (next deploy recycled them) but the old tarball is still in the Forgejo package registry.

### Pre-checks

```shell
# Pick the SHA you want to roll back TO.
# Look at the active-color.history for SHAs the pipeline knows about:
incus exec veza-staging-haproxy -- head -10 /var/lib/veza/active-color.history

# Or `git log --oneline main` for any commit; just confirm the
# tarball still exists in the registry (default retention: 30 SHAs
# per component):
curl -fsSL -I -H "Authorization: token $TOKEN" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"
```
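When hunting for a SHA that still has its artefacts, it helps to build the registry URL once instead of retyping it. A sketch using the URL layout from the curl check above:

```shell
# tarball_url: registry URL for a component's artefact at a given SHA
# (layout copied from the existing registry check).
tarball_url() {
  local component=$1 sha=$2
  echo "https://forgejo.veza.fr/api/packages/talas/generic/${component}/${sha}/${component}-${sha}.tar.zst"
}

# usage (requires $TOKEN):
#   curl -fsSL -I -H "Authorization: token $TOKEN" "$(tarball_url veza-backend "$SHA")"
```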

### Trigger

In the Forgejo UI: Actions → Veza rollback → Run workflow:

| input | value |
|-------|-------|
| env | staging (or prod) |
| mode | full |
| target_color | (leave empty) |
| release_sha | the 40-char SHA you're rolling TO |

The workflow runs `playbooks/rollback.yml -e mode=full -e veza_release_sha=$SHA`, which imports (via `import_playbook`) the full deploy_app.yml pipeline: the same Phase A → Phase F sequence as a normal deploy, but with the older SHA.

Wall time: ~5–10 minutes (the build artefacts already exist, so only the deploy half runs).

### Caveat — schema migrations

Migrations are not rolled back automatically. The schema after Path 2 is the post-deploy schema, not the pre-deploy schema. Per MIGRATIONS.md's expand-contract discipline, this should be fine for one deploy back. If it isn't (i.e., the failed deploy included a destructive migration), see Path 3.
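For reference, the expand-contract discipline spreads a destructive change over three deploys so that any single deploy stays one-step reversible. An illustrative sketch matching the worked example in MIGRATIONS.md, with a hypothetical `users.phone` column:

```sql
-- Deploy 1 (expand): additive only; old and new binaries both run fine.
ALTER TABLE users ADD COLUMN phone text;            -- nullable on purpose

-- Deploy 2 (backfill): data only; still safe to roll one deploy back.
UPDATE users SET phone = '' WHERE phone IS NULL;

-- Deploy 3 (contract): only once no deployable SHA still writes NULLs.
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;
```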


## Path 3 — Manual emergency (ad hoc)

You're here when:

- The Forgejo registry has been purged of the SHA you need.
- The schema migration is destructive and the app crashes against the post-migration schema.
- The Incus host itself is in a bad state.

### Tarball missing — rebuild and push manually

```shell
# Build the artefact locally (you'll need the toolchain):
cd veza-backend-api
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" \
  -o ./bin/veza-api ./cmd/api/main.go
# (build migrate_tool into ./bin the same way if it isn't already there)
tar --use-compress-program=zstd -cf "/tmp/veza-backend-$SHA.tar.zst" \
  -C ./bin veza-api migrate_tool

# Push to the registry (curl refuses -f combined with --fail-with-body):
curl -sSL --fail-with-body -X PUT \
  -H "Authorization: token $TOKEN" \
  --upload-file "/tmp/veza-backend-$SHA.tar.zst" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"

# Then run Path 2.
```
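Before pushing, it's worth guarding against a short SHA sneaking into the URL, since the rollback workflow's release_sha input expects the full 40 characters. A tiny illustrative check:

```shell
# is_full_sha: true only for a full 40-character lowercase-hex commit SHA.
is_full_sha() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;   # any non-hex character
  esac
  [ "${#1}" -eq 40 ]
}

# usage:
#   is_full_sha "$SHA" || { echo "need the full 40-char SHA" >&2; exit 1; }
```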

### Schema is poisoned — manual SQL

The destructive migration's PR description should document the inverse SQL (per MIGRATIONS.md "When you must violate the rule"). Apply it inside the postgres container:

```shell
incus exec veza-staging-postgres -- psql -U veza veza < /tmp/inverse.sql
```

Then run Path 2 to deploy the older binary.

### Incus host broken — rollback ZFS snapshot

deploy_data.yml snapshots every data container's dataset before mutating anything (`<dataset>@pre-deploy-<sha>`). To restore:

```shell
# First, stop the container:
incus stop veza-staging-postgres

# Roll the dataset back to the pre-deploy snapshot:
zfs rollback -r rpool/incus/containers/veza-staging-postgres@pre-deploy-<sha>

# Restart the container:
incus start veza-staging-postgres
```

This loses any data written after the snapshot. Last resort only.


## When NOT to rollback

- Single user reports a bug. Triage first; rolling back affects 100% of users to fix something hitting <1%.
- Performance regression. If the new SHA is up but slow, scale horizontally before rolling back. (A future Hetzner offload covers this; for now, accept the regression and prep a fix-forward.)
- Cosmetic UI bug. Hot-fix the frontend and let the deploy pipeline ship it as a normal commit.
- You're not on-call and didn't get a page. Don't roll back "to be safe". It's the on-call's call.

The rollback button's existence isn't a license to use it preemptively. Each rollback erodes the team's confidence in the pipeline; rolling back too often makes the next real deploy feel risky.


## Post-incident

After ANY rollback (Path 1, 2, or 3):

1. Update docs/POSTMORTEMS.md (or docs/runbooks/incidents/<date>.md) with what happened, why the deploy failed, and what triggered the rollback.
2. File the fix as a normal PR; do NOT skip CI.
3. If the failed deploy left containers behind (Path 1's "old color kept alive"), run the Veza cleanup workflow with the failed color once forensics are done.
4. Verify the VezaDeployFailed alert cleared (the next successful deploy resets last_success_timestamp > last_failure_timestamp).

## Workflows referenced

- .forgejo/workflows/deploy.yml — push:main → staging, tag → prod.
- .forgejo/workflows/rollback.yml — workflow_dispatch only, modes fast and full.
- .forgejo/workflows/cleanup-failed.yml — workflow_dispatch only, destroys a specific color's app containers.

## Playbooks referenced

- infra/ansible/playbooks/deploy_app.yml
- infra/ansible/playbooks/rollback.yml
- infra/ansible/playbooks/cleanup_failed.yml
- infra/ansible/playbooks/deploy_data.yml

## Roles referenced

- infra/ansible/roles/veza_app/
- infra/ansible/roles/veza_haproxy_switch/
- infra/ansible/roles/haproxy/ (templates haproxy.cfg.j2 with the blue/green topology toggle).