# Runbook — rollback a Veza deploy

Three rollback paths, ordered from fastest to slowest. Pick based on
what's still alive and what you're rolling back from.

| Path                 | Time   | Use when                                                                         |
| -------------------- | ------ | -------------------------------------------------------------------------------- |
| 1. HAProxy fast-flip | ~5s    | The previous color's containers are still alive.                                 |
| 2. Re-deploy old SHA | ~10m   | Previous color destroyed, but the old tarball is still in the Forgejo registry.  |
| 3. Manual emergency  | ad hoc | Both of the above failed (registry purged, infra broken).                        |

> **Before you roll back, breathe and read this first.** The default
> instinct under fire is "smash the rollback button". Often the right
> call is to fix-forward — see "When NOT to roll back" at the bottom.
---
## Decision flowchart
```
Did the new color come up at all?
            │
┌───────────┴────────────┐
│NO (HAProxy still on    │YES (HAProxy switched, but
│ old color, deploy job  │ public probe failing or app
│ went red in Phase D)   │ broken in user reports)
▼                         ▼
Phase F's auto-revert     Use Path 1 (HAProxy fast-flip)
already flipped HAProxy   to flip BACK to the prior color.
for you. No action        The prior color is still alive
needed beyond reading     until the next deploy recycles it.
logs.
                          If the prior color was already
                          cleaned up, use Path 2.
```
---
## Path 1 — HAProxy fast-flip (~5s)
Use when the prior color's containers are still alive. Triggered via
the `Veza rollback` workflow with `mode=fast`.
### Pre-checks

```bash
# What's the current active color?
incus exec veza-staging-haproxy -- cat /var/lib/veza/active-color
# (or veza-haproxy in prod)

# What's the prior color (the previous entry in the history)?
incus exec veza-staging-haproxy -- head -2 /var/lib/veza/active-color.history

# Are the prior color's containers RUNNING?
incus list 'veza-staging-(backend|stream|web)-blue' --format csv -c n,s
```
### Trigger

In the Forgejo UI: **Actions → Veza rollback → Run workflow**:

| input        | value                                                      |
| ------------ | ---------------------------------------------------------- |
| env          | staging (or prod)                                          |
| mode         | fast                                                       |
| target_color | (the PRIOR color, e.g. blue if green is currently active)  |
| release_sha  | (leave empty)                                              |

The workflow runs `infra/ansible/playbooks/rollback.yml -e mode=fast
-e target_color=blue`, which:

1. Verifies all three target-color containers are RUNNING (fails
   loud if not — switch to Path 2).
2. Re-templates `haproxy.cfg` with `veza_active_color=blue`,
   validates it with `haproxy -c`, swaps it into place with an
   atomic `mv`, and HUPs HAProxy to reload (sketched below).
3. Updates `/var/lib/veza/active-color`.

Wall time: ~5s. Zero connection drop (HAProxy reload is graceful).
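
For orientation, the flip boils down to roughly the following on the HAProxy
container — a sketch only; `rollback.yml` is the source of truth, and the
temp-file name and reload command here are assumptions:

```bash
# Illustrative fast-flip sequence (not the playbook's literal tasks).
haproxy -c -f /etc/haproxy/haproxy.cfg.new                 # validate the re-templated config
mv /etc/haproxy/haproxy.cfg.new /etc/haproxy/haproxy.cfg   # atomic swap into place
systemctl reload haproxy                                   # graceful reload (the "HUP" step)
echo blue > /var/lib/veza/active-color                     # record the new active color
```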
### Post-rollback

- Verify externally: `curl https://staging.veza.fr/api/v1/health`
- Check logs of the bad color (kept alive for forensics): `incus exec
  veza-staging-backend-green -- journalctl -u veza-backend -n 200`.
  The loop below grabs all three components at once.
- Once root cause is understood, run the **Veza cleanup** workflow with
  `color=green` to reclaim the slot.
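
To capture forensics from every container of the bad color in one pass, a
loop like this works (the `veza-stream` and `veza-web` unit names are assumed
to mirror `veza-backend`):

```bash
# Pull recent logs from all three green containers before reclaiming the slot.
for c in backend stream web; do
  incus exec "veza-staging-${c}-green" -- journalctl -u "veza-${c}" -n 200 \
    > "/tmp/rollback-forensics-${c}.log"
done
```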
---
## Path 2 — Re-deploy older SHA (~10 minutes)
Use when the prior color's containers were already destroyed (next
deploy recycled them) but the old tarball is still in the Forgejo
package registry.

### Pre-checks
```bash
# Pick the SHA you want to roll back TO.
# Look at the active-color.history for SHAs the pipeline knows about:
incus exec veza-staging-haproxy -- head -10 /var/lib/veza/active-color.history

# Or `git log --oneline main` for any commit; just confirm the
# tarball still exists in the registry (default retention 30 SHAs
# per component):
curl -fsSL -I -H "Authorization: token $TOKEN" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"
```
### Trigger

In the Forgejo UI: **Actions → Veza rollback → Run workflow**:

| input        | value                             |
| ------------ | --------------------------------- |
| env          | staging (or prod)                 |
| mode         | full                              |
| target_color | (leave empty)                     |
| release_sha  | the 40-char SHA you're rolling TO |

The workflow runs `playbooks/rollback.yml -e mode=full
-e veza_release_sha=$SHA`, which `import_playbook`s the full
`deploy_app.yml` pipeline. Same Phase A → Phase F sequence as a
normal deploy, but with the older SHA.
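
If the Forgejo runner is itself part of the outage, the same rollback can be
driven by hand from any machine with Ansible and inventory access — a sketch,
with the inventory path assumed:

```bash
# Manual equivalent of the "full" rollback dispatch (inventory path is a guess;
# use whatever the deploy workflow actually passes to ansible-playbook).
ansible-playbook -i infra/ansible/inventories/staging \
  infra/ansible/playbooks/rollback.yml \
  -e mode=full -e veza_release_sha="$SHA"
```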
Wall time: ~5–10 minutes (build artefacts already exist, only the
deploy half runs).
### Caveat — schema migrations

Migrations are **not** rolled back automatically. The schema after
Path 2 is the post-deploy schema, not the pre-deploy schema.
Per **MIGRATIONS.md**'s expand-contract discipline, this should be
fine for one deploy back. If it isn't (i.e., the failed deploy
included a destructive migration), see **Path 3**.
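
A quick way to judge which case you're in is to look at what migrations the
failed release actually shipped — a sketch, assuming migrations live in a
`migrations/` directory of the backend repo:

```bash
# Migration files that differ between the SHA you're rolling back TO ($GOOD_SHA)
# and the one you're rolling back FROM ($BAD_SHA); directory name is an assumption.
git diff --stat "$GOOD_SHA..$BAD_SHA" -- migrations/
```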
---
## Path 3 — Manual emergency (ad hoc)
You're here when:

- The Forgejo registry has been purged of the SHA you need.
- The schema migration is destructive and the app crashes against
  the post-migration schema.
- The Incus host itself is in a bad state.
### Tarball missing — rebuild and push manually

```bash
# Build the artefact locally (you'll need the toolchain):
cd veza-backend-api
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" \
  -o ./bin/veza-api ./cmd/api/main.go
# The tarball below also expects migrate_tool in ./bin — build or copy it there first.
tar --use-compress-program=zstd -cf "/tmp/veza-backend-$SHA.tar.zst" \
  -C ./bin veza-api migrate_tool

# Push to the registry:
curl -sSL --fail-with-body -X PUT \
  -H "Authorization: token $TOKEN" \
  --upload-file "/tmp/veza-backend-$SHA.tar.zst" \
  "https://forgejo.veza.fr/api/packages/talas/generic/veza-backend/$SHA/veza-backend-$SHA.tar.zst"

# Then run Path 2.
```
### Schema is poisoned — manual SQL

The destructive migration's PR description should document the
inverse SQL (per MIGRATIONS.md "When you must violate the rule").
Apply it inside the postgres container:
```bash
# inverse.sql lives on the Incus host; incus exec forwards stdin into the container.
incus exec veza-staging-postgres -- psql -U veza veza < /tmp/inverse.sql
```
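Hand-written SQL deserves its own undo plan; the cheapest one is a dump taken
just before applying it (a sketch — the custom-format dump and the host-side
path are choices, not project conventions):

```bash
# Safety dump of the veza database; restore with pg_restore if the inverse SQL goes wrong.
incus exec veza-staging-postgres -- pg_dump -U veza -Fc veza \
  > "/tmp/veza-pre-inverse-$(date +%Y%m%dT%H%M%S).dump"
```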
Then run Path 2 to deploy the older binary.
### Incus host broken — roll back the ZFS snapshot

`deploy_data.yml` snapshots every data container's dataset before
mutating anything (`<dataset>@pre-deploy-<sha>`). To restore:
```bash
# First, stop the container:
incus stop veza-staging-postgres

# Roll the dataset back to the pre-deploy snapshot:
zfs rollback -r rpool/incus/containers/veza-staging-postgres@pre-deploy-<sha>

# Restart the container:
incus start veza-staging-postgres
```
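If you're not sure which snapshot name to pass, list what exists on the
dataset first (same dataset path as in the rollback command):

```bash
# Available snapshots with creation times, oldest first.
zfs list -t snapshot -o name,creation -s creation \
  rpool/incus/containers/veza-staging-postgres
```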
This loses any data written after the snapshot. Last-resort only.

---
## When NOT to roll back
- **Single user reports a bug**. Triage first; rolling back affects
  100% of users to fix something hitting <1%.
- **Performance regression**. If the new SHA is up but slow, scale
  horizontally before rolling back. (Future Hetzner offload covers
  this; for now, accept the regression and prep a fix-forward.)
- **Cosmetic UI bug**. Hot-fix the frontend and let the deploy
  pipeline ship it as a normal commit.
- **You're not on-call and didn't get a page**. Don't roll back "to
  be safe". It's the on-call's call.

The rollback button's existence isn't a license to use it
preemptively. Each rollback resets the team's confidence in the
pipeline; rolling back too often makes the next real deploy feel risky.

---
## Post-incident
After ANY rollback (Path 1, 2, or 3):

1. Update **docs/POSTMORTEMS.md** (or `docs/runbooks/incidents/<date>.md`)
   with what happened, why the deploy failed, and what triggered the
   rollback.
2. File the fix as a normal PR; do NOT skip CI.
3. If the failed deploy left containers behind (Path 1's "old color
   kept alive"), run the **Veza cleanup** workflow with the failed color
   once forensics are done.
4. Verify the alert `VezaDeployFailed` cleared (the next successful
   deploy makes `last_success_timestamp > last_failure_timestamp` true again).

---
## Workflows referenced
- `.forgejo/workflows/deploy.yml` — push:main → staging, tag → prod.
- `.forgejo/workflows/rollback.yml` — workflow_dispatch only, modes
  fast and full.
- `.forgejo/workflows/cleanup-failed.yml` — workflow_dispatch only,
  destroys a specific color's app containers.
## Playbooks referenced
- `infra/ansible/playbooks/deploy_app.yml`
- `infra/ansible/playbooks/rollback.yml`
- `infra/ansible/playbooks/cleanup_failed.yml`
- `infra/ansible/playbooks/deploy_data.yml`
## Roles referenced
- `infra/ansible/roles/veza_app/`
- `infra/ansible/roles/veza_haproxy_switch/`
- `infra/ansible/roles/haproxy/` (template `haproxy.cfg.j2` with
  blue/green topology toggle).
|