Two operator docs the W5+ deploy pipeline depends on for safe
operation.
docs/MIGRATIONS.md (extended):
The existing file already covered migration tooling + naming. Append
an "Expand-contract discipline (W5+ deploy pipeline contract)"
section: explains why blue/green rollback breaks if migrations
are forward-only, walks through the 3-deploy expand-backfill-
contract pattern with a worked example (add nullable column →
backfill → set NOT NULL), tables of allowed vs not-allowed
changes for a single deploy, a reviewer checklist, and an "in case
of incident" override path with audit trail.
docs/RUNBOOK_ROLLBACK.md (new):
Three rollback paths, from fastest to slowest:
1. HAProxy fast-flip (~5s) — when the prior color is still alive,
use the rollback.yml workflow with mode=fast. Pre-checks +
post-rollback steps.
2. Re-deploy older SHA (~10m) — when the prior color is gone but
the tarball is still in the Forgejo registry. mode=full.
Schema-migration caveat documented.
3. Manual emergency — tarball missing (rebuild + push), schema
poisoned (manual SQL), Incus host broken (ZFS rollback).
Plus a decision flowchart, "When NOT to rollback" with examples
that bias toward fix-forward over rollback (single-user bugs,
perf regressions, cosmetic issues), and a post-incident checklist.
Cross-referenced with the workflow + playbook + role file paths
the operator will actually need to look up.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Runbook — rollback a Veza deploy

Three rollback paths, ordered from fastest to slowest. Pick based on
what's still alive and what you're rolling back from.

| Path                 | Time   | Use when                                                                        |
| -------------------- | ------ | ------------------------------------------------------------------------------- |
| 1. HAProxy fast-flip | ~5s    | The previous color's containers are still alive.                                |
| 2. Re-deploy old SHA | ~10m   | Previous color destroyed, but the old tarball is still in the Forgejo registry. |
| 3. Manual emergency  | ad hoc | Both of the above failed (registry purged, infra broken).                       |

> **Before you roll back, breathe and read this first.** The default
> instinct under fire is "smash the rollback button". Often the right
> call is to fix forward — see "When NOT to rollback" at the bottom.

---
## Decision flowchart

```
Did the new color come up at all?
            │
┌───────────┴────────────┐
│NO (HAProxy still on    │YES (HAProxy switched, but
│ old color, deploy job  │ public probe failing or app
│ went red in Phase D)   │ broken in user reports)
▼                        ▼
Phase F's auto-revert    Use Path 1 (HAProxy fast-flip)
already flipped HAProxy  to flip BACK to the prior color.
for you. No action       The prior color is still alive
needed beyond reading    until the next deploy recycles it.
logs.
                         If the prior color was already
                         cleaned up, use Path 2.
```

---
## Path 1 — HAProxy fast-flip (~5s)

Use when the prior color's containers are still alive. Triggered via
the `Veza rollback` workflow with `mode=fast`.

### Pre-checks

```bash
# What's the current active color?
incus exec veza-staging-haproxy -- cat /var/lib/veza/active-color
# (or veza-haproxy in prod)

# What's the prior color (second line of the history)?
incus exec veza-staging-haproxy -- head -2 /var/lib/veza/active-color.history

# Are the prior color's containers RUNNING?
incus list 'veza-staging-{backend,stream,web}-blue' --format csv -c n,s
```

### Trigger

In the Forgejo UI: **Actions → Veza rollback → Run workflow**:

| input        | value                                                     |
| ------------ | --------------------------------------------------------- |
| env          | staging (or prod)                                         |
| mode         | fast                                                      |
| target_color | the PRIOR color (e.g. blue if green is currently active)  |
| release_sha  | (leave empty)                                             |
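
If you'd rather trigger this from a terminal, recent Forgejo releases
expose a GitHub-style workflow-dispatch endpoint — an assumption, so
verify against your instance's `/api/swagger`; the UI path above is
the canonical route:

```bash
# Hypothetical API dispatch of the rollback workflow; the endpoint
# shape mirrors GitHub's. Replace <repo> with the actual repo name.
curl -fsS -X POST \
  -H "Authorization: token $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"ref":"main","inputs":{"env":"staging","mode":"fast","target_color":"blue"}}' \
  "https://forgejo.veza.fr/api/v1/repos/talas/<repo>/actions/workflows/rollback.yml/dispatches"
```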

The workflow runs `infra/ansible/playbooks/rollback.yml -e mode=fast
-e target_color=blue`, which:

1. Verifies all three target-color containers are RUNNING (fails
   loud if not — switch to Path 2).
2. Re-templates `haproxy.cfg` with `veza_active_color=blue`,
   validates with `haproxy -c`, atomic-mv-swaps, HUPs (sketched
   below).
3. Updates `/var/lib/veza/active-color`.

Wall time: ~5s. Zero connection drop (HAProxy reload is graceful).
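
For orientation, step 2 amounts to something like this (a minimal
sketch — file names and the reload mechanism are assumptions; the
`veza_haproxy_switch` role's tasks are authoritative):

```bash
# Validate the re-templated candidate config before touching anything:
haproxy -c -f /etc/haproxy/haproxy.cfg.new

# Atomic swap: rename on the same filesystem, so readers never see a
# half-written file:
mv /etc/haproxy/haproxy.cfg.new /etc/haproxy/haproxy.cfg

# Graceful reload: the old worker drains existing connections:
systemctl reload haproxy
```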

### Post-rollback

- Verify externally: `curl https://staging.veza.fr/api/v1/health`
  (loop sketch below).
- Check logs of the bad color (kept alive for forensics): `incus exec
  veza-staging-backend-green -- journalctl -u veza-backend -n 200`
- Once the root cause is understood, run the **Veza cleanup** workflow
  with `color=green` to reclaim the slot.
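
A plain curl loop works for the external check (swap the host for
prod):

```bash
# Poll the public health endpoint and watch status code + latency:
for i in $(seq 1 10); do
  curl -fsS -o /dev/null -w '%{http_code} %{time_total}s\n' \
    "https://staging.veza.fr/api/v1/health"
  sleep 1
done
```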
---

## Path 2 — Re-deploy older SHA (~10 minutes)

Use when the prior color's containers were already destroyed (the
next deploy recycled them) but the old tarball is still in the
Forgejo package registry.

### Pre-checks

```bash
# Pick the SHA you want to roll back TO.
# Look at active-color.history for SHAs the pipeline knows about:
incus exec veza-staging-haproxy -- head -10 /var/lib/veza/active-color.history

# Or `git log --oneline main` for any commit; just confirm the
# tarball still exists in the registry (default retention: 30 SHAs
# per component):
curl -fsSL -I -H "Authorization: token $TOKEN" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"
```

### Trigger

In the Forgejo UI: **Actions → Veza rollback → Run workflow**:

| input        | value                             |
| ------------ | --------------------------------- |
| env          | staging (or prod)                 |
| mode         | full                              |
| target_color | (leave empty)                     |
| release_sha  | the 40-char SHA you're rolling TO |

The workflow runs `playbooks/rollback.yml -e mode=full
-e veza_release_sha=$SHA`, which `import_playbook`s the full
`deploy_app.yml` pipeline: the same Phase A → Phase F sequence as a
normal deploy, but with the older SHA.

Wall time: ~5–10 minutes (the build artefacts already exist, so only
the deploy half runs).

### Caveat — schema migrations

Migrations are **not** rolled back automatically. The schema after
Path 2 is the post-deploy schema, not the pre-deploy schema.
Per **MIGRATIONS.md**'s expand-contract discipline, this should be
fine for one deploy back. If it isn't (i.e., the failed deploy
included a destructive migration), see **Path 3**.
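
For context, the three-deploy pattern MIGRATIONS.md walks through
looks like this (a sketch — `users.locale` is an invented example;
the worked example in MIGRATIONS.md is authoritative):

```bash
# Deploy 1 — expand: additive, nullable; the old binary keeps working.
psql -U veza veza -c "ALTER TABLE users ADD COLUMN locale text;"

# Deploy 2 — backfill: idempotent (batch it in the real migration).
psql -U veza veza -c "UPDATE users SET locale = 'fr' WHERE locale IS NULL;"

# Deploy 3 — contract: only once no deployable SHA relies on NULLs.
psql -U veza veza -c "ALTER TABLE users ALTER COLUMN locale SET NOT NULL;"
```

Rolling back one deploy is safe at any point in that sequence,
because each step keeps the previous binary's assumptions valid.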
---

## Path 3 — Manual emergency (ad hoc)

You're here when:

- The Forgejo registry has been purged of the SHA you need.
- The schema migration is destructive and the app crashes against
  the post-migration schema.
- The Incus host itself is in a bad state.

### Tarball missing — rebuild and push manually

```bash
# Build the artefact locally (you'll need the toolchain):
cd veza-backend-api
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" \
  -o ./bin/veza-api ./cmd/api/main.go
# (build migrate_tool the same way if it isn't already in ./bin)
tar --use-compress-program=zstd -cf "/tmp/veza-backend-$SHA.tar.zst" \
  -C ./bin veza-api migrate_tool

# Push to the registry:
curl -sSL --fail-with-body -X PUT \
  -H "Authorization: token $TOKEN" \
  --upload-file "/tmp/veza-backend-$SHA.tar.zst" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"

# Then run Path 2.
```

### Schema is poisoned — manual SQL

The destructive migration's PR description should document the
inverse SQL (per MIGRATIONS.md, "When you must violate the rule").
Apply it inside the postgres container:

```bash
incus exec veza-staging-postgres -- psql -U veza veza < /tmp/inverse.sql
```

Then run Path 2 to deploy the older binary.
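
If the inverse script has more than one statement, it's worth running
it atomically so a mid-script error can't leave the schema
half-reverted (standard psql flags):

```bash
# Abort on the first error and wrap the whole script in one transaction:
incus exec veza-staging-postgres -- psql -U veza veza \
  -v ON_ERROR_STOP=1 --single-transaction < /tmp/inverse.sql
```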

### Incus host broken — rollback ZFS snapshot

`deploy_data.yml` snapshots every data container's dataset before
mutating anything (`<dataset>@pre-deploy-<sha>`).
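
First confirm the snapshot you want actually exists (plain `zfs list`;
the dataset path matches the restore commands below):

```bash
# List snapshots for the postgres dataset, with creation times:
zfs list -t snapshot -o name,creation \
  rpool/incus/containers/veza-staging-postgres
```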

To restore:

```bash
# First, stop the container:
incus stop veza-staging-postgres

# Roll the dataset back to the pre-deploy snapshot:
zfs rollback -r rpool/incus/containers/veza-staging-postgres@pre-deploy-<sha>

# Restart the container:
incus start veza-staging-postgres
```

This loses any data written after the snapshot. Last resort only.
---

## When NOT to rollback

- **A single user reports a bug.** Triage first; rolling back affects
  100% of users to fix something hitting <1%.
- **Performance regression.** If the new SHA is up but slow, scale
  horizontally before rolling back. (A future Hetzner offload covers
  this; for now, accept the regression and prep a fix-forward.)
- **Cosmetic UI bug.** Hot-fix the frontend and let the deploy
  pipeline ship it as a normal commit.
- **You're not on-call and didn't get a page.** Don't roll back "to
  be safe". It's the on-call's call.

The rollback button's existence isn't a license to use it
preemptively. Each rollback resets the team's confidence in the
pipeline; over-eager rollbacks make the next real deploy feel risky.
---

## Post-incident

After ANY rollback (Path 1, 2, or 3):

1. Update **docs/POSTMORTEMS.md** (or `docs/runbooks/incidents/<date>.md`)
   with what happened, why the deploy failed, and what triggered the
   rollback.
2. File the fix as a normal PR; do NOT skip CI.
3. If the failed deploy left containers behind (Path 1's "old color
   kept alive"), run the **Veza cleanup** workflow with the failed
   color once forensics are done.
4. Verify the alert `VezaDeployFailed` cleared (the next successful
   deploy resets `last_success_timestamp > last_failure_timestamp`;
   query sketch below).
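
To check item 4 without waiting on the alert console, query Prometheus
directly (a sketch — the Prometheus host and the metric names are
placeholders; use whatever `VezaDeployFailed`'s alert rule actually
references):

```bash
# Expect a non-empty result once the deploy timestamps are healthy again:
curl -s "http://prometheus.example.internal:9090/api/v1/query" \
  --data-urlencode "query=veza_deploy_last_success_timestamp > veza_deploy_last_failure_timestamp"
```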
---

## Workflows referenced

- `.forgejo/workflows/deploy.yml` — push:main → staging, tag → prod.
- `.forgejo/workflows/rollback.yml` — workflow_dispatch only, modes
  fast and full.
- `.forgejo/workflows/cleanup-failed.yml` — workflow_dispatch only,
  destroys a specific color's app containers.

## Playbooks referenced

- `infra/ansible/playbooks/deploy_app.yml`
- `infra/ansible/playbooks/rollback.yml`
- `infra/ansible/playbooks/cleanup_failed.yml`
- `infra/ansible/playbooks/deploy_data.yml`

## Roles referenced

- `infra/ansible/roles/veza_app/`
- `infra/ansible/roles/veza_haproxy_switch/`
- `infra/ansible/roles/haproxy/` (template `haproxy.cfg.j2` with
  blue/green topology toggle).