veza/infra/ansible/playbooks/cleanup_failed.yml
senke 3a67763d6f feat(ansible): playbooks/{cleanup_failed,rollback}.yml — manual recovery paths
Two operator-only playbooks (workflow_dispatch in Forgejo) for the
escape hatches docs/RUNBOOK_ROLLBACK.md will document.

playbooks/cleanup_failed.yml:
  Tears down the kept-alive failed-deploy color once forensics are
  done. Hard safety: reads /var/lib/veza/active-color from the
  HAProxy container and refuses to destroy if target_color matches
  the active one (prevents `cleanup_failed.yml -e target_color=blue`
  when blue is what's serving traffic).
  Loops over {backend,stream,web}-{target_color} and runs `incus delete
  --force` on each; a no-op if the container is absent.
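
The guard described above boils down to one comparison. A minimal shell sketch, assuming a hypothetical helper name `safe_to_destroy` (not part of this commit), that mirrors the playbook's whitespace stripping and its fallback to blue when the active-color file is empty or unreadable:

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; the playbook does this with
# set_fact + fail rather than a shell function.
# Usage: safe_to_destroy <target_color> <raw contents of active-color>
# Returns 0 when the target may be destroyed, 1 when it is the active color.
safe_to_destroy() {
  local target="$1" active_raw="$2" active
  # Strip all whitespace, as the playbook does with tr -d '[:space:]'.
  active="$(printf '%s' "$active_raw" | tr -d '[:space:]')"
  # The playbook falls back to 'blue' when nothing could be read.
  active="${active:-blue}"
  [ "$target" != "$active" ]
}
```

One consequence of the fallback: if the active-color file cannot be read, blue is assumed to be live, so a teardown of green would proceed.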

playbooks/rollback.yml:
  Two modes selected by `-e mode=`:

  fast  — HAProxy-only flip. Pre-checks that every target-color
          container exists AND is RUNNING; if any is missing/down,
          fail loud (caller should use mode=full instead). Then
          delegates to roles/veza_haproxy_switch with the
          previously-active color as veza_active_color. ~5s wall
          time.

  full  — Re-runs the full deploy_app.yml pipeline with
          -e veza_release_sha=<previous_sha>. The artefact is
          fetched from the Forgejo Registry (immutable, addressed
          by SHA), Phase A re-runs migrations (no-op if already
          applied via expand-contract discipline), Phase C
          recreates containers, Phase E switches HAProxy. ~5-10
          min wall time.

Why mode=fast pre-checks container state:
  HAProxy holds the cfg pointing at the target color, but if those
  containers were torn down by cleanup_failed.yml or by a more
  recent deploy, the flip would land on dead backends. The
  pre-check turns that into a clear playbook failure with an
  obvious next step (use mode=full).
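
The mode=fast pre-check can be sketched in shell as follows. rollback.yml itself is not shown here, so the function name `fast_rollback_precheck`, the `INCUS` override variable, and the reliance on the `Status:` line of `incus info` output are assumptions for illustration, not the playbook's actual tasks.

```shell
#!/usr/bin/env bash
# Sketch only; names below are hypothetical.
# INCUS is overridable so the check can be exercised without a real Incus host.
INCUS="${INCUS:-incus}"

# Usage: fast_rollback_precheck <container_prefix> <target_color>
# Succeeds only if the backend, stream and web containers of that color
# all exist and report a RUNNING status.
fast_rollback_precheck() {
  local prefix="$1" color="$2" component ct status rc=0
  for component in backend stream web; do
    ct="${prefix}${component}-${color}"
    # If the instance is missing, incus info prints nothing on stdout,
    # so status stays empty and is reported as "missing".
    status="$("$INCUS" info "$ct" 2>/dev/null \
      | awk '/^Status:/ {print toupper($2); exit}')"
    if [ "$status" != "RUNNING" ]; then
      echo "FAIL: $ct is ${status:-missing}; use mode=full" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Checking all three components before returning, rather than stopping at the first failure, lets the operator see every dead container in a single run.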

Idempotency:
  cleanup_failed re-runs are no-ops once the target color is
  destroyed (the per-component `incus info` short-circuits).
  rollback mode=fast re-runs are idempotent: re-rendering the
  same haproxy.cfg is a no-op, and the handler doesn't refire when
  nothing changed.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:36:40 +02:00


# cleanup_failed.yml — destroy the app containers of a specific color.
# Used when a deploy_app.yml run failed Phase D or Phase F and the
# operator has finished forensics on the kept-alive failed color.
#
# Required extra-vars:
#   env           staging | prod
#   target_color  blue | green   (the color to tear down)
#
# Safety: refuses to destroy the CURRENTLY-ACTIVE color. Active color
# is read from the HAProxy container's /var/lib/veza/active-color.
#
# Caller (workflow_dispatch only):
#   ansible-playbook -i inventory/{{env}}.yml playbooks/cleanup_failed.yml \
#     -e env={{env}} -e target_color={{color}}
---
- name: Validate inputs and refuse to nuke the active color
  hosts: incus_hosts
  become: true
  gather_facts: false
  tasks:
    - name: Assert required vars
      ansible.builtin.assert:
        that:
          - veza_env is defined
          - veza_env in ['staging', 'prod']
          - target_color is defined
          - target_color in ['blue', 'green']
        fail_msg: cleanup_failed.yml requires veza_env + target_color.
        quiet: true

    - name: Read active color from HAProxy container
      ansible.builtin.shell: |
        incus exec "{{ veza_container_prefix }}haproxy" -- \
          cat /var/lib/veza/active-color 2>/dev/null | tr -d '[:space:]'
      args:
        executable: /bin/bash
      register: active_color_raw
      changed_when: false
      failed_when: false

    - name: Resolve current_active_color
      ansible.builtin.set_fact:
        current_active_color: "{{ active_color_raw.stdout if active_color_raw.stdout else 'blue' }}"

    - name: Refuse if target_color matches the active color
      ansible.builtin.fail:
        msg: >-
          target_color={{ target_color }} matches the currently-active
          color in HAProxy. Refusing to destroy live containers.
          Switch HAProxy first via rollback.yml or a re-deploy.
      when: target_color == current_active_color

- name: Destroy the inactive-color app containers
  hosts: incus_hosts
  become: true
  gather_facts: false
  tasks:
    - name: Force-delete each component container
      ansible.builtin.shell: |
        set -e
        CT="{{ veza_container_prefix }}{{ item }}-{{ target_color }}"
        if incus info "$CT" >/dev/null 2>&1; then
          incus delete --force "$CT"
          echo "Destroyed $CT"
        else
          echo "$CT does not exist, skip"
        fi
      args:
        executable: /bin/bash
      loop:
        - backend
        - stream
        - web
      register: cleanup_result
      changed_when: "'Destroyed' in (cleanup_result.stdout | default(''))"
      tags: [cleanup]

    - name: Report what was destroyed
      ansible.builtin.debug:
        msg: |
          Cleanup of color {{ target_color }} in env {{ veza_env }} complete.
          Active color unchanged: {{ current_active_color }}.
          Next deploy will recreate {{ target_color }} containers from scratch.