End-to-end CI deploy workflow.

Triggers:
  * push to branch main        → env=staging
  * push of a 'v*' tag         → env=prod
  * workflow_dispatch          → operator-supplied env + release_sha
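A rough sketch of the trigger block; the dispatch input names follow the
text above but are still assumptions about the actual workflow:

```yaml
on:
  push:
    branches: [main]     # resolves to env=staging
    tags: ['v*']         # resolves to env=prod
  workflow_dispatch:
    inputs:
      env:               # operator-supplied target environment (assumed name)
        required: true
      release_sha:       # full 40-char commit SHA to deploy (assumed name)
        required: true
```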
Jobs:
  * resolve (ubuntu-latest): compute the target env + 40-char release
    SHA from the trigger; exposed as job outputs for the downstream jobs.
  * build-backend (ubuntu-latest): Go tests + CGO_ENABLED=0 static build
    of veza-api and migrate_tool; stage, pack as tar.zst, PUT to the
    Forgejo Package Registry.
  * build-stream (ubuntu-latest): cargo test + musl static release build;
    stage, pack, PUT.
  * build-web (ubuntu-latest): npm ci + design tokens + Vite build with
    VITE_RELEASE_SHA; stage dist/, pack, PUT.
  * deploy ([self-hosted, incus]): ansible-playbook deploy_data.yml then
    deploy_app.yml against the resolved env's inventory. Vault password
    flows secret → tmpfile → --vault-password-file → shred in
    `if: always()`. Ansible logs uploaded as an artifact (30d retention)
    for forensics.
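A hedged sketch of the deploy job's vault-password handling. The inventory
layout, the `env` output name and RUNNER_TEMP availability are assumptions;
only the overall secret → tmpfile → shred flow is what the workflow does:

```yaml
  deploy:
    runs-on: [self-hosted, incus]
    needs: [resolve, build-backend, build-stream, build-web]
    steps:
      - name: Materialise vault password          # 0600 tmpfile, never echoed
        env:
          ANSIBLE_VAULT_PASSWORD: ${{ secrets.ANSIBLE_VAULT_PASSWORD }}
        run: |
          install -m 0600 /dev/null "$RUNNER_TEMP/vault-pass"
          printf '%s' "$ANSIBLE_VAULT_PASSWORD" > "$RUNNER_TEMP/vault-pass"
      - name: Run playbooks                        # inventory path is assumed
        run: |
          ansible-playbook -i "inventories/${{ needs.resolve.outputs.env }}" \
            --vault-password-file "$RUNNER_TEMP/vault-pass" deploy_data.yml
          ansible-playbook -i "inventories/${{ needs.resolve.outputs.env }}" \
            --vault-password-file "$RUNNER_TEMP/vault-pass" deploy_app.yml
      - name: Shred vault password                 # runs even when the play fails
        if: always()
        run: shred -u "$RUNNER_TEMP/vault-pass"
```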
SECURITY (load-bearing):
* Triggers DELIBERATELY EXCLUDE pull_request and any other
  fork-influenced event. The `incus` self-hosted runner has
  root-equivalent access to the host via the bind-mounted unix socket;
  opening PR-from-fork triggers would let arbitrary fork code run
  `incus exec` on the host.
* concurrency.group keys on the target env so two pushes can't race the
  same deploy; cancel-in-progress cancels the older run (the newer
  commit is what the operator wanted). See the sketch after this list.
* FORGEJO_REGISTRY_TOKEN + ANSIBLE_VAULT_PASSWORD are repo secrets,
  exposed to steps only via env vars and (for the vault password) a
  tmpfile; they are never echoed to the logs.
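A minimal sketch of the concurrency block; the exact keying expression is
an assumption (the `env` input name matches the dispatch sketch above):

```yaml
concurrency:
  # One in-flight pipeline per target env: the dispatch input when present,
  # otherwise the ref (main → staging, v* tag → prod).
  group: deploy-${{ inputs.env || github.ref }}
  cancel-in-progress: true
```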
Prerequisite Forgejo Variables/Secrets the operator sets up:
Variables:
  FORGEJO_REGISTRY_URL     base URL for generic packages,
                           e.g. https://forgejo.veza.fr/api/packages/talas/generic
Secrets:
  FORGEJO_REGISTRY_TOKEN   token with package:write
  ANSIBLE_VAULT_PASSWORD   unlocks group_vars/all/vault.yml
Self-hosted runner expectation:
Runs in the srv-102v container, which has /var/lib/incus/unix.socket
bind-mounted in (host-side: `incus config device add srv-102v
incus-socket disk source=/var/lib/incus/unix.socket
path=/var/lib/incus/unix.socket`). The runner is registered with the
`incus` label so the deploy job pins to it.
Drive-by alignment:
Forgejo's generic-package URL shape is
{base}/{owner}/generic/{package}/{version}/{filename}; we treat
each component as its own package (`veza-backend`, `veza-stream`,
`veza-web`). Updated three references (group_vars/all/main.yml's
veza_artifact_base_url, veza_app/defaults/main.yml's
veza_app_artifact_url, and deploy_app.yml's tools-container fetch)
to the `veza-<component>` package naming so the URLs the workflow
uploads to match the ones Ansible downloads from.
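Illustrative only, to show how the two sides of the URL line up; the
filenames, the output plumbing and veza_app_component/veza_release_sha are
assumptions (only the package names and the two URL variables are from the
text above):

```yaml
# Workflow side (upload step):
- name: Publish backend artifact
  env:
    FORGEJO_REGISTRY_TOKEN: ${{ secrets.FORGEJO_REGISTRY_TOKEN }}
  run: >
    curl -fsS -H "Authorization: token $FORGEJO_REGISTRY_TOKEN"
    --upload-file veza-backend.tar.zst
    "${{ vars.FORGEJO_REGISTRY_URL }}/veza-backend/${{ needs.resolve.outputs.release_sha }}/veza-backend.tar.zst"

# Ansible side (defaults), same package/version/filename components:
veza_app_artifact_url: >-
  {{ veza_artifact_base_url }}/veza-{{ veza_app_component }}/{{ veza_release_sha }}/veza-{{ veza_app_component }}.tar.zst
```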
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

End-to-end orchestrator for the app-tier deploy. Ties together the
roles + playbooks landed in earlier commits:
Phase A — migrations (incus_hosts → tools container)
Ensure `<prefix>backend-tools` container exists (idempotent
create), apt-deps + pull backend tarball + run `migrate_tool
--up` against postgres.lxd. no_log on the DATABASE_URL line
(carries vault_postgres_password).
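A sketch of the migration task; the binary path, container prefix and the
exact "no changes" string are assumptions (postgres.lxd and
vault_postgres_password come from the text above):

```yaml
- name: Run migrations from the tools container
  ansible.builtin.command: >-
    incus exec {{ veza_container_prefix | default('') }}backend-tools -- env
    DATABASE_URL=postgres://veza:{{ vault_postgres_password }}@postgres.lxd/veza
    /usr/local/bin/migrate_tool --up
  register: migrate_result
  changed_when: "'no changes' not in migrate_result.stdout"
  no_log: true   # the command line carries vault_postgres_password
```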
Phase B — determine inactive color (haproxy container)
slurp /var/lib/veza/active-color, default 'blue' if absent.
inactive_color = the OTHER one — the one we deploy TO.
Both prior_active_color and inactive_color exposed as
cacheable hostvars for downstream phases.
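A sketch of the color derivation; the file path is from the text above,
the task shape is an assumption:

```yaml
- name: Read the currently active color, if any
  ansible.builtin.slurp:
    src: /var/lib/veza/active-color
  register: active_color_file
  failed_when: false           # absent on a first-ever deploy

- name: Record prior active color (default blue)
  ansible.builtin.set_fact:
    prior_active_color: "{{ (active_color_file.content | default('') | b64decode | trim) or 'blue' }}"
    cacheable: true

- name: Derive the color we deploy to
  ansible.builtin.set_fact:
    inactive_color: "{{ 'green' if prior_active_color == 'blue' else 'blue' }}"
    cacheable: true
```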
Phase C — recreate inactive containers (host-side + per-container roles)
Host play: incus delete --force + incus launch for each
of {backend,stream,web}-{inactive}; refresh_inventory.
Then three per-container plays apply roles/veza_app with
component-specific vars (the `tools` container shape was
designed for this). Each role pass ends with an in-container
health probe — failure here fails the playbook before HAProxy
is touched.
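A host-side sketch for one component; container naming and the image alias
are assumptions, and the real play covers backend/stream/web:

```yaml
- name: Remove the stale inactive container, if present
  ansible.builtin.command: "incus delete --force backend-{{ inactive_color }}"
  register: delete_result
  failed_when: false            # nothing to delete on a first-ever deploy
  changed_when: delete_result.rc == 0

- name: Launch a fresh inactive container
  ansible.builtin.command: "incus launch images:debian/12 backend-{{ inactive_color }}"

- name: Pick the new containers up in inventory
  ansible.builtin.meta: refresh_inventory
```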
Phase D — cross-container probes (haproxy container)
Curl each component's Incus DNS name from inside the HAProxy
container. Catches the "service is up but unreachable via
Incus DNS" failure mode the in-container probe misses.
Phase E — switch HAProxy (haproxy container)
Apply roles/veza_haproxy_switch with veza_active_color =
inactive_color. The role's block/rescue handles validate-fail
or HUP-fail by restoring the previous cfg.
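The switch play itself is small (the haproxy group name is an assumption):

```yaml
- name: Switch HAProxy to the freshly deployed color
  hosts: haproxy
  roles:
    - role: veza_haproxy_switch
      vars:
        veza_active_color: "{{ inactive_color }}"
```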
Phase F — verify externally + record deploy state
Curl {{ veza_public_url }}/api/v1/health through HAProxy with
retries (10×3s). On success, write a Prometheus textfile-
collector file (active_color, release_sha, last_success_ts).
On failure: write a failure_ts file, re-switch HAProxy back
to prior_active_color via a second invocation of the switch
role, and fail the playbook with a journalctl one-liner the
operator can paste to inspect logs.
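A sketch of the success path; metric names, the textfile path and
veza_release_sha are assumptions (the failure path re-runs the switch role
with prior_active_color and then fails, as described above):

```yaml
- name: Verify the public health endpoint through HAProxy
  ansible.builtin.uri:
    url: "{{ veza_public_url }}/api/v1/health"
    status_code: 200
  register: public_health
  retries: 10
  delay: 3
  until: public_health.status == 200

- name: Record deploy state for the Prometheus textfile collector
  ansible.builtin.copy:
    dest: /var/lib/prometheus/node-exporter/veza_deploy.prom
    content: |
      veza_deploy_info{active_color="{{ inactive_color }}",release_sha="{{ veza_release_sha }}"} 1
      veza_deploy_last_success_timestamp_seconds {{ ansible_date_time.epoch }}
  # ansible_date_time assumes fact gathering is enabled for this play
```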
Why phase F doesn't destroy the failed inactive containers:
  per the user's choice (asked earlier in the design memo), failed
  containers are kept alive for `incus exec ... journalctl`. The
  manual cleanup_failed.yml workflow tears them down explicitly.
Edge cases this handles:
* No prior active-color file (first-ever deploy) → defaults
to blue, deploys to green.
* Tools container missing (first-ever deploy or someone
deleted it) → recreate idempotently.
* Migration that returns "no changes" (already-applied) →
changed=false, no spurious notifications.
* inactive_color spelled differently across plays → all derive
from a single hostvar set in Phase B.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>