veza/scripts/bootstrap/README.md
senke cf38ff2b7d feat(bootstrap): two-host deploy-pipeline bootstrap with idempotent verify
Replace the long manual checklist (RUNBOOK_DEPLOY_BOOTSTRAP) with
six scripts. Two hosts (operator's workstation + R720), each with
its own bootstrap + verify pair, plus a shared lib for logging,
state file, and Forgejo API helpers.

Files :
  scripts/bootstrap/
   ├── lib.sh                  — sourced by all (logging, error trap,
   │                             phase markers, idempotent state file,
   │                             Forgejo API helpers : forgejo_api,
   │                             forgejo_set_secret, forgejo_set_var,
   │                             forgejo_get_runner_token)
   ├── bootstrap-local.sh      — drives 6 phases on the operator's
   │                             workstation
   ├── bootstrap-remote.sh     — runs on the R720 (over SSH) ; 4 phases
   ├── verify-local.sh         — read-only check of local state
   ├── verify-remote.sh        — read-only check of R720 state
   ├── enable-auto-deploy.sh   — flips the deploy.yml gate after a
   │                             successful manual run
   ├── .env.example            — template for site config
   └── README.md               — usage + troubleshooting

Phases :
  Local
   1. preflight       — required tools, SSH to R720, DNS resolution
   2. vault           — render vault.yml from example, autogenerate JWT
                        keys, prompt+encrypt, write .vault-pass
   3. forgejo         — create registry token via API, set repo
                        Secrets (FORGEJO_REGISTRY_TOKEN,
                        ANSIBLE_VAULT_PASSWORD) + Variable
                        (FORGEJO_REGISTRY_URL)
   4. r720            — fetch runner registration token, stream
                        bootstrap-remote.sh + lib.sh over SSH
   5. haproxy         — ansible-playbook playbooks/haproxy.yml ;
                        verify Let's Encrypt certs landed on the
                        veza-haproxy container
   6. summary         — readiness report
  Remote
   R1. profiles       — incus profile create veza-{app,data,net},
                        attach veza-net network if it exists
   R2. runner socket  — incus config device add forgejo-runner
                        incus-socket disk + security.nesting=true
                        + apt install incus-client inside the runner
   R3. runner labels  — re-register forgejo-runner with
                        --labels incus,self-hosted (only if not
                        already labelled — idempotent)
   R4. sanity         — runner ↔ Incus + runner ↔ Forgejo smoke

Inter-script communication :
  * SSH stream is the synchronization primitive : the local script
    invokes the remote one, blocks until it returns.
  * Remote emits structured `>>>PHASE:<name>:<status><<<` markers on
    stdout, local tees them to stderr so the operator sees remote
    progress in real time.
  * Persistent state files survive disconnects :
      local : <repo>/.git/talas-bootstrap/local.state
      R720  : /var/lib/talas/bootstrap.state
    Both hold one `phase=DONE timestamp` line per completed phase.
    Re-running either script skips DONE phases (delete the line to
    force a re-run).

Resumable :
  PHASE=N ./bootstrap-local.sh    # restart at phase N

Idempotency guards :
  Every state-mutating action is preceded by a state-checking guard
  that returns 0 if already applied (incus profile show, jq label
  parse, file existence + mode check, Forgejo API GET, etc.).

Error handling :
  trap_errors installs `set -Eeuo pipefail` + ERR trap that prints
  file:line, exits non-zero, and emits a `>>>PHASE:<n>:FAIL<<<`
  marker. Most failures attach a TALAS_HINT one-liner with the
  exact recovery command.

Verify scripts :
  Read-only ; no state mutations. Output is a sequence of
  PASS/FAIL lines + an exit code = number of failures. Each
  failure prints a `hint:` with the precise fix command.

.gitignore picks up scripts/bootstrap/.env (per-operator config)
and .git/talas-bootstrap/ (state files).

--no-verify justification continues to hold — these are pure
shell scripts under scripts/bootstrap/, no app code touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:45:00 +02:00


scripts/bootstrap/

Two-host bootstrap of the Veza deploy pipeline. Each script is idempotent, resumable, and read-only by default unless explicitly asked to mutate.

Files

| File | Where it runs | What it does |
| --- | --- | --- |
| lib.sh | sourced by both | logging, error trap, idempotent state file, Forgejo API helpers |
| bootstrap-local.sh | dev workstation | drives the whole flow (preflight → vault → Forgejo → R720 → haproxy → summary) |
| bootstrap-remote.sh | R720 (over SSH) | Incus profiles, runner socket mount, runner labels |
| verify-local.sh | dev workstation | read-only checks of local state |
| verify-remote.sh | R720 | read-only checks of R720 state |
| enable-auto-deploy.sh | dev workstation | flips the deploy.yml gate from workflow_dispatch-only to push:main + tag:v* |
| .env.example | template | copy to .env, fill in, gitignored |

State file

Each host keeps its own state file with one phase=DONE timestamp line per completed phase, so a re-run is a no-op for those phases :

local :   <repo>/.git/talas-bootstrap/local.state
R720  :   /var/lib/talas/bootstrap.state

To force a phase re-run, delete its line :

sed -i '/^vault=/d' .git/talas-bootstrap/local.state
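A minimal sketch of how such a state file could be read and written. The helper names (phase_done, mark_done, run_phase) and the timestamp format are assumptions for illustration ; the real helpers in lib.sh may be named and implemented differently.

```shell
# Hypothetical helpers, not the actual lib.sh API.
# STATE_FILE defaults to the local path given above and can be overridden.
STATE_FILE="${STATE_FILE:-.git/talas-bootstrap/local.state}"

# Return 0 if the phase already has a DONE line in the state file.
phase_done() {
  [ -f "$STATE_FILE" ] && grep -q "^$1=DONE " "$STATE_FILE"
}

# Append "<phase>=DONE <timestamp>" unless the phase is already recorded.
mark_done() {
  phase_done "$1" && return 0
  mkdir -p "$(dirname "$STATE_FILE")"
  printf '%s=DONE %s\n' "$1" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$STATE_FILE"
}

# Run the given command for a phase only if the phase is not yet DONE,
# and record it as DONE only when the command succeeds.
run_phase() {
  local name="$1"; shift
  if phase_done "$name"; then
    echo "skip $name (already DONE)"
    return 0
  fi
  "$@" && mark_done "$name"
}

# Usage (phase_vault is a placeholder for the real phase function):
#   run_phase vault phase_vault
```

Because a phase is skipped only when its line is present, deleting the line with sed as shown above is enough to make the next run execute that phase again.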

Inter-script communication

bootstrap-local.sh invokes bootstrap-remote.sh over SSH by concatenating lib.sh + bootstrap-remote.sh and piping the result into sudo -E bash -s on the R720. During the run :

  • the remote script writes /var/log/talas-bootstrap.log on the R720 (persistent)
  • the remote script emits >>>PHASE:<name>:<status><<< markers on stdout
  • the local script tees those markers to stderr so the operator sees remote progress in the same terminal as the local logs
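The marker protocol above can be sketched as follows. The function names emit_phase and watch_phases are hypothetical, and DONE is an assumed status value (only the FAIL status is confirmed by the error-trap description) ; treat this as an illustration of the idea, not the scripts' actual code.

```shell
# Hypothetical helpers illustrating the >>>PHASE:<name>:<status><<< protocol.

# Remote side: print one marker line on stdout.
emit_phase() {
  printf '>>>PHASE:%s:%s<<<\n' "$1" "$2"
}

# Local side: read the remote stream on stdin, mirror every line to stderr
# for the operator, print "<name> <status>" on stdout for each marker, and
# return 1 if any phase reported FAIL.
watch_phases() {
  local line body name status rc=0
  while IFS= read -r line; do
    printf '%s\n' "$line" >&2
    case $line in
      '>>>PHASE:'*:*'<<<')
        body=${line#'>>>PHASE:'}   # strip the opening delimiter
        body=${body%'<<<'}         # strip the closing delimiter
        name=${body%%:*}           # text before the first colon
        status=${body#*:}          # text after the first colon
        printf '%s %s\n' "$name" "$status"
        if [ "$status" = FAIL ]; then rc=1; fi
        ;;
    esac
  done
  return "$rc"
}

# Usage, assuming an SSH host alias r720:
#   cat lib.sh bootstrap-remote.sh | ssh r720 'sudo -E bash -s' | watch_phases
```

Keeping the markers on a single line with fixed delimiters lets the local side separate them from ordinary log output with plain pattern matching.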

Resumability : because completed phases are recorded in the state file, an SSH disconnect or partial failure leaves the work already completed marked DONE. Re-run bootstrap-local.sh and it picks up where it stopped.

Quickstart

cd /home/senke/git/talas/veza/scripts/bootstrap
cp .env.example .env
$EDITOR .env             # fill in FORGEJO_ADMIN_TOKEN at minimum
chmod +x *.sh

# Set up everything
./bootstrap-local.sh

# Or skip phases you've already done
PHASE=4 ./bootstrap-local.sh

# Verify any time
./verify-local.sh
ssh ansible@10.0.20.150 'sudo bash' < verify-remote.sh

What each phase needs

| Phase | Needs |
| --- | --- |
| 1. preflight | git, ansible, dig, ssh, jq locally ; SSH to R720 ; DNS resolved (warning only if missing) |
| 2. vault | nothing ; will prompt for vault password and edit vault.yml from template |
| 3. forgejo | FORGEJO_ADMIN_TOKEN env var or in .env |
| 4. r720 | FORGEJO_ADMIN_TOKEN (used to fetch runner registration token) ; SSH to R720 with sudo |
| 5. haproxy | DNS public domains resolved + port 80 reachable from Internet ; ansible decryptable vault |
| 6. summary | nothing |

Troubleshooting

  • Phase 3 repo not found — set FORGEJO_OWNER to the actual org/user owning the repo (e.g., senke instead of talas).
  • Phase 4 SSH timeout : sudo may prompt for a password ; configure passwordless sudo for the SSH user, OR run the remote bootstrap manually :
    scp scripts/bootstrap/{lib.sh,bootstrap-remote.sh} r720:/tmp/
    ssh r720 'sudo FORGEJO_REGISTRATION_TOKEN=… bash /tmp/bootstrap-remote.sh'
    
  • Phase 5 dehydrated fails — check that port 80 reaches the R720 from Internet (not blocked by ISP, NAT-forwarded, etc.). dehydrated needs HTTP-01 inbound. Test: from outside, curl http://veza.fr/.well-known/acme-challenge/test should hit HAProxy's letsencrypt_backend (will 404, which is fine ; what matters is it reaches the R720).
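The port 80 test from the last item can be scripted roughly as below. Both function names are hypothetical. The check relies on curl reporting 000 for %{http_code} when no HTTP response was received, and it only means something when run from a machine outside the R720's network.

```shell
# Hypothetical helpers for the HTTP-01 reachability test.

# Return 0 if the status code shows that some HTTP response came back.
http_reached() {
  case "$1" in
    ''|000) return 1 ;;  # empty or 000: no HTTP response was received
    *)      return 0 ;;  # any real status, 404 included, means a server answered
  esac
}

# Request the ACME challenge path over plain HTTP and classify the result.
acme_path_reachable() {
  local host="$1" code
  code="$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' \
    "http://$host/.well-known/acme-challenge/test")" || true
  if http_reached "$code"; then
    echo "reached $host (HTTP $code)"
  else
    echo "no HTTP response from $host on port 80"
    return 1
  fi
}

# Usage, from a machine outside the network:
#   acme_path_reachable veza.fr
```

A response only proves that something answered on port 80 ; confirming that the request actually landed on HAProxy's letsencrypt_backend still means checking the HAProxy logs on the R720.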

After bootstrap

  • Trigger the first deploy manually via the Forgejo UI : Actions → Veza deploy → Run workflow.
  • Once it is green, run ./enable-auto-deploy.sh to re-enable the push trigger.
  • verify-local.sh + verify-remote.sh are safe to run any time.
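
The commit description states that the verify scripts print PASS/FAIL lines, attach a hint: to each failure, and exit with the number of failures. A read-only check runner in that style might look like the sketch below ; the check function and the example checks are assumptions, not the scripts' actual contents.

```shell
# Hypothetical check runner in the style described for verify-*.sh.
failures=0

# check <description> <hint> <command...>
# Runs the command without showing its output, prints PASS or FAIL, and
# counts failures. The command must be read-only for the script to stay safe.
check() {
  local desc="$1" hint="$2"; shift 2
  if "$@" >/dev/null 2>&1; then
    echo "PASS $desc"
  else
    echo "FAIL $desc"
    echo "  hint: $hint"
    failures=$((failures + 1))
  fi
}

# Usage (illustrative checks only):
#   check "jq installed" "install jq with your package manager" command -v jq
#   check "local state file exists" "run ./bootstrap-local.sh" \
#     test -f .git/talas-bootstrap/local.state
#   exit "$failures"
```

Since every check only inspects state, such a script can be run at any point, including in the middle of a partially completed bootstrap.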