Two operator-only playbooks (workflow_dispatch in Forgejo) for the
escape hatches docs/RUNBOOK_ROLLBACK.md will document.

playbooks/cleanup_failed.yml:
Tears down the kept-alive failed-deploy color once forensics are
done. Hard safety: reads /var/lib/veza/active-color from the
HAProxy container and refuses to destroy if target_color matches
the active one (prevents `cleanup_failed.yml -e target_color=blue`
when blue is what's serving traffic).
Loops over {backend,stream,web}-{target_color} and runs
`incus delete --force` on each; no-op if the container is absent.
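
The safety check and the container loop described above could be sketched as follows. This is a minimal Python illustration of the logic only, not the playbook itself: `plan_cleanup` and `COMPONENTS` are hypothetical names, and `active_color` stands in for whatever the playbook reads from /var/lib/veza/active-color in the HAProxy container.

```python
COMPONENTS = ("backend", "stream", "web")  # component prefixes from the loop above


def plan_cleanup(target_color: str, active_color: str) -> list[str]:
    """Return the container names to delete, or refuse if the target is live.

    Hypothetical helper: active_color is assumed to be the raw contents of
    the active-color file, so surrounding whitespace is stripped first.
    """
    if target_color.strip() == active_color.strip():
        raise ValueError(
            f"refusing to destroy {target_color!r}: it is the active color"
        )
    # One container per component, named <component>-<color>.
    return [f"{component}-{target_color}" for component in COMPONENTS]
```

With blue active, `plan_cleanup("green", "blue")` yields the three green container names, while `plan_cleanup("blue", "blue")` raises before anything is deleted.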

playbooks/rollback.yml:
Two modes selected by `-e mode=`:
fast — HAProxy-only flip. Pre-checks that every target-color
container exists AND is RUNNING; if any is missing or down,
fails loudly (caller should use mode=full instead). Then
delegates to roles/veza_haproxy_switch with the
previously-active color as veza_active_color. ~5s wall
time.
full — Re-runs the full deploy_app.yml pipeline with
-e veza_release_sha=<previous_sha>. The artefact is
fetched from the Forgejo Registry (immutable, addressed
by SHA), Phase A re-runs migrations (a no-op if already
applied, given expand-contract discipline), Phase C
recreates containers, Phase E switches HAProxy. ~5-10
min wall time.

Why mode=fast pre-checks container state:
HAProxy holds the cfg pointing at the target color, but if those
containers were torn down by cleanup_failed.yml or by a more
recent deploy, the flip would land on dead backends. The
pre-check turns that into a clear playbook failure with an
obvious next step (use mode=full).
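
The pre-check reasoning above can be illustrated with a small Python sketch. It is an assumption-laden model, not the playbook's tasks: `precheck_fast` is a hypothetical name, and `statuses` is a hypothetical mapping of container name to status string that the real playbook would obtain from Incus.

```python
COMPONENTS = ("backend", "stream", "web")  # same component set as cleanup_failed


def precheck_fast(target_color: str, statuses: dict[str, str]) -> list[str]:
    """Return one problem string per container that is missing or not running.

    Hypothetical input: statuses maps container name -> status string.
    A missing key means the container does not exist.
    """
    problems = []
    for component in COMPONENTS:
        name = f"{component}-{target_color}"
        status = statuses.get(name)
        if status is None:
            problems.append(f"{name}: missing")
        elif status.upper() != "RUNNING":
            problems.append(f"{name}: {status}")
    return problems
```

An empty list means the HAProxy flip can proceed; any entry is a reason to fail the play and point the operator at mode=full.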

Idempotency:
cleanup_failed re-runs are no-ops once the target color is
destroyed (the per-component `incus info` check short-circuits).
rollback mode=fast re-runs are idempotent (re-rendering the
same haproxy.cfg is a no-op, and the handler doesn't refire
when there is no diff).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>