veza/scripts/bootstrap/README.md
senke e004e18738 fix(bootstrap): handle workflows.disabled/ + self-signed Forgejo + better .env defaults
After running the new bootstrap on a fresh machine, three issues
surfaced that block phase 1–3 :

1. .forgejo/workflows/ may live under workflows.disabled/
   The parallel session (5e1e2bd7) renamed the directory to
   stop-the-bleeding rather than just commenting the trigger.
   verify-local.sh now reports both states correctly.
   enable-auto-deploy.sh does `git mv workflows.disabled
   workflows` first, then proceeds to uncomment if needed.

2. Forgejo on 10.0.20.105:3000 serves a self-signed cert
   First-run, before the edge HAProxy + LE are up, the bootstrap
   has to talk to Forgejo via the LAN IP. lib.sh's forgejo_api
   helper now honours FORGEJO_INSECURE=1 (passes -k to curl).
   verify-local.sh's API checks pick up the same flag.
   .env.example documents the swap : FORGEJO_INSECURE=1 with
   https://10.0.20.105:3000 first ; flip to https://forgejo.talas.group
   + FORGEJO_INSECURE=0 once the edge HAProxy + LE cert are up.

3. SSH defaults wrong for the actual environment
   .env.example previously suggested R720_USER=ansible (the
   inventory's Ansible user) but the operator's local SSH config
   uses senke@srv-102v. Updated defaults : R720_HOST=srv-102v,
   R720_USER=senke. Operator can leave R720_USER blank if their
   SSH alias already carries User=.

Plus two new helper scripts :

  reset-vault.sh — recovery path when the vault password in
  .vault-pass doesn't match what encrypted vault.yml. Confirms
  destructively, removes vault.yml + .vault-pass, clears the
  vault=DONE marker in local.state, points operator at PHASE=2.

  verify-remote-ssh.sh — wrapper that scp's lib.sh +
  verify-remote.sh to the R720 and runs verify-remote.sh under
  sudo. Removes the need to clone the repo on the R720.

bootstrap-local.sh's phase 2 vault-decrypt failure now hints at
reset-vault.sh.

README.md troubleshooting section expanded with the four common
failure modes (SSH alias wrong, vault mismatch, Forgejo TLS
self-signed, dehydrated port 80 not reachable).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:01:05 +02:00

scripts/bootstrap/

Two-host bootstrap of the Veza deploy pipeline. Each script is idempotent, resumable, and read-only by default unless explicitly asked to mutate.

Files

| File | Where it runs | What it does |
| --- | --- | --- |
| lib.sh | sourced by all | logging, error trap, idempotent state file, Forgejo API helpers (honours FORGEJO_INSECURE=1) |
| bootstrap-local.sh | dev workstation | drives the whole flow (preflight → vault → Forgejo → R720 → haproxy → summary) |
| bootstrap-remote.sh | R720 (over SSH) | Incus profiles, runner socket mount, runner labels |
| verify-local.sh | dev workstation | read-only checks of local state |
| verify-remote.sh | R720 | read-only checks of R720 state (run via verify-remote-ssh.sh) |
| verify-remote-ssh.sh | dev workstation | scp+ssh wrapper that runs verify-remote.sh on R720 |
| enable-auto-deploy.sh | dev workstation | restores .forgejo/workflows/ if disabled, uncomments push: trigger |
| reset-vault.sh | dev workstation | recovery from a vault password mismatch (destructive; re-prompts) |
| .env.example | template | copy to .env, fill in, gitignored |

State file

Each host keeps its own state file, with one phase=DONE line (plus a timestamp) per completed phase, so a re-run skips phases that are already done:

local :   <repo>/.git/talas-bootstrap/local.state
R720  :   /var/lib/talas/bootstrap.state

To force a phase re-run, delete its line :

sed -i '/^vault=/d' .git/talas-bootstrap/local.state
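
The state-file convention above could be implemented with a few small helpers. This is a minimal sketch, not the actual lib.sh code: the function names (state_is_done, state_mark_done, run_phase), the STATE_FILE variable, and the timestamp format are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: lib.sh's real function names and line format may differ.
STATE_FILE="${STATE_FILE:-.git/talas-bootstrap/local.state}"

# Succeed if the given phase already has a DONE line in the state file.
state_is_done() {
  grep -q "^$1=DONE" "$STATE_FILE" 2>/dev/null
}

# Mark a phase as done (no-op if it is already marked).
state_mark_done() {
  mkdir -p "$(dirname "$STATE_FILE")"
  if ! state_is_done "$1"; then
    printf '%s=DONE %s\n' "$1" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$STATE_FILE"
  fi
}

# Run a phase command only if the state file does not list it as done,
# and record it as done only when the command succeeds.
run_phase() {
  local phase="$1"; shift
  if state_is_done "$phase"; then
    echo "skip $phase (already DONE)"
  else
    "$@" && state_mark_done "$phase"
  fi
}
```

With helpers like these, deleting a phase's line (as in the sed command above) is enough to make the next run execute that phase again.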

Inter-script communication

bootstrap-local.sh invokes bootstrap-remote.sh over SSH by concatenating lib.sh + bootstrap-remote.sh and piping the result into sudo -E bash -s on the R720. During the run:

  • the remote script writes /var/log/talas-bootstrap.log on the R720 (persistent)
  • the remote script emits >>>PHASE:<name>:<status><<< markers on stdout
  • the local script tees those markers to stderr, so the operator sees remote progress in the same terminal as the local logs

Resumability: because of the state file, an SSH disconnect or partial failure leaves the work that did complete marked DONE. Re-run bootstrap-local.sh and it picks up where it stopped.
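
The pipe-over-SSH mechanism and the marker relay described above might look roughly like this. It is a sketch under assumptions: parse_marker, relay_remote_output, and run_remote are hypothetical names, and the real bootstrap-local.sh may relay remote output differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the real bootstrap-local.sh may parse and relay markers differently.

# Print "<name> <status>" for a >>>PHASE:<name>:<status><<< marker; print nothing otherwise.
parse_marker() {
  local prefix='>>>PHASE:' suffix='<<<' body
  case "$1" in
    "$prefix"*:*"$suffix")
      body=${1#"$prefix"}      # drop the leading >>>PHASE:
      body=${body%"$suffix"}   # drop the trailing <<<
      printf '%s %s\n' "${body%%:*}" "${body#*:}"
      ;;
  esac
}

# Read remote stdout line by line and copy it to stderr,
# rewriting phase markers into a short progress line.
relay_remote_output() {
  local line parsed
  while IFS= read -r line; do
    parsed=$(parse_marker "$line")
    if [ -n "$parsed" ]; then
      printf '[remote] phase %s\n' "$parsed" >&2
    else
      printf '%s\n' "$line" >&2
    fi
  done
}

# Concatenate lib.sh and bootstrap-remote.sh and run the result as root on the remote host.
# $1 is the SSH target, e.g. srv-102v or senke@srv-102v.
# Note: without "set -o pipefail" the exit status is the relay's, not ssh's.
run_remote() {
  cat lib.sh bootstrap-remote.sh | ssh "$1" 'sudo -E bash -s' | relay_remote_output
}
```

Because the script arrives on stdin, nothing has to be cloned or copied onto the R720 beforehand.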

Quickstart

cd /home/senke/git/talas/veza/scripts/bootstrap
cp .env.example .env
$EDITOR .env             # fill in FORGEJO_ADMIN_TOKEN at minimum
chmod +x *.sh

# Set up everything
./bootstrap-local.sh

# Or skip phases you've already done
PHASE=4 ./bootstrap-local.sh

# Verify any time
./verify-local.sh
./verify-remote-ssh.sh   # copies lib.sh + verify-remote.sh to the R720 and runs verify-remote.sh under sudo

What each phase needs

| Phase | Needs |
| --- | --- |
| 1. preflight | git, ansible, dig, ssh, jq locally; SSH to R720; DNS resolved (warning only if missing) |
| 2. vault | nothing; will prompt for the vault password and edit vault.yml from the template |
| 3. forgejo | FORGEJO_ADMIN_TOKEN as an env var or in .env |
| 4. r720 | FORGEJO_ADMIN_TOKEN (used to fetch the runner registration token); SSH to R720 with sudo |
| 5. haproxy | public DNS names resolve + port 80 reachable from the Internet; vault decryptable by ansible |
| 6. summary | nothing |
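
The phase 1 requirements can be checked by hand with something like the following. This is an illustrative sketch, not the bootstrap's own preflight code: missing_commands and ssh_target are hypothetical helper names, and only R720_HOST and R720_USER come from .env.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; the real phase 1 checks may differ.

# Print each required command that is not on PATH; return non-zero if any is missing.
missing_commands() {
  local cmd rc=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
      rc=1
    fi
  done
  return "$rc"
}

# Build the SSH target: "user@host" when R720_USER is set,
# bare "host" otherwise (so the User= of an SSH config alias applies).
ssh_target() {
  printf '%s\n' "${R720_USER:+${R720_USER}@}${R720_HOST}"
}

# Example use (commented out so that sourcing this file has no side effects):
#   missing_commands git ansible dig ssh jq || echo "install the tools listed above"
#   ssh -o BatchMode=yes -o ConnectTimeout=5 "$(ssh_target)" /bin/true
```

BatchMode=yes makes ssh fail instead of prompting for a password, which is closer to how a non-interactive run behaves.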

Troubleshooting

  • Phase 1 SSH fails: verify R720_HOST + R720_USER in .env. If you use an SSH config alias (e.g. Host srv-102v in ~/.ssh/config), set R720_HOST=srv-102v and either set R720_USER= (empty, alias's User= wins) or match the alias's user. Test manually: ssh "${R720_USER:+${R720_USER}@}${R720_HOST}" /bin/true (the expansion drops the user@ prefix when R720_USER is empty).
  • Phase 2 cannot decrypt vault.yml — the password in .vault-pass doesn't match what was used to encrypt vault.yml.
    • If you remember the original password, edit .vault-pass (echo "<correct password>" > infra/ansible/.vault-pass ; chmod 0400 …).
    • Otherwise : ./reset-vault.sh — destructive, re-prompts for everything.
  • Phase 3 Forgejo API unreachable — Forgejo on https://10.0.20.105:3000 serves a self-signed cert. Set FORGEJO_INSECURE=1 in .env. Once the edge HAProxy is up + LE has issued forgejo.talas.group, switch to that URL and clear FORGEJO_INSECURE.
  • Phase 3 repo not found — set FORGEJO_OWNER to the actual org/user owning the repo. Confirm with git remote -v (the path segment after host:port/).
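  • Phase 4 SSH timeout / sudo prompt: the SSH user needs passwordless sudo. Add to /etc/sudoers.d/talas-bootstrap (the SETENV tag is needed because the bootstrap calls sudo -E and the manual fallback passes a variable on the sudo command line):
    senke ALL=(ALL) NOPASSWD: SETENV: /usr/bin/bash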
    
    Or run the remote half manually :
    scp scripts/bootstrap/{lib.sh,bootstrap-remote.sh} srv-102v:/tmp/
    ssh srv-102v 'sudo FORGEJO_REGISTRATION_TOKEN=<token> bash /tmp/bootstrap-remote.sh'
    
  • Phase 5 dehydrated fails: port 80 must be reachable from the Internet for HTTP-01 (NAT-forwarded and not blocked by the ISP). Test from outside: curl http://veza.fr/.well-known/acme-challenge/test should hit HAProxy's letsencrypt_backend (it will return 404, which is fine; what matters is reaching the R720).
  • .forgejo/workflows/ is missing, only workflows.disabled/ present — expected when the auto-trigger has been gated by renaming the dir. enable-auto-deploy.sh restores it.
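
For the Phase 3 self-signed-certificate case above, the effect of FORGEJO_INSECURE=1 can be pictured as follows. This is a minimal sketch and not the real lib.sh helper: the FORGEJO_URL variable name and the function body are assumptions, while FORGEJO_INSECURE and FORGEJO_ADMIN_TOKEN are the settings named in this README.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a Forgejo API helper; lib.sh's real forgejo_api may differ.
# FORGEJO_URL is an assumed variable name, e.g. https://10.0.20.105:3000

# $1 is the API path, e.g. /api/v1/version
forgejo_api() {
  # Replace the positional parameters with the full URL (a trailing slash on the base is dropped).
  set -- "${FORGEJO_URL%/}$1"
  if [ "${FORGEJO_INSECURE:-0}" = "1" ]; then
    # -k tells curl to skip TLS certificate verification (self-signed cert on the LAN IP).
    set -- -k "$@"
  fi
  curl -fsS -H "Authorization: token ${FORGEJO_ADMIN_TOKEN}" "$@"
}

# Example use (commented out so that sourcing this file makes no network call):
#   FORGEJO_URL=https://10.0.20.105:3000 FORGEJO_INSECURE=1 forgejo_api /api/v1/version
```

Keeping the flag in one helper means the switch to https://forgejo.talas.group only requires changing .env, not the call sites.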

After bootstrap

  • Trigger the first deploy manually via the Forgejo UI: Actions → Veza deploy → Run workflow.
  • Once green, run ./enable-auto-deploy.sh to re-enable the push trigger.
  • verify-local.sh and verify-remote-ssh.sh (which runs verify-remote.sh on the R720) are read-only and safe to run at any time.
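
The directory handling behind enable-auto-deploy.sh can be approximated as below. This is a sketch under assumptions: workflows_state and restore_workflows are hypothetical names, and the real script additionally uncomments the push: trigger, which is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the real verify-local.sh / enable-auto-deploy.sh logic may differ.

# Report whether the workflows directory is enabled, disabled, or missing.
# $1 is the repository root (defaults to the current directory).
workflows_state() {
  local root="${1:-.}"
  if [ -d "$root/.forgejo/workflows" ]; then
    echo enabled
  elif [ -d "$root/.forgejo/workflows.disabled" ]; then
    echo disabled
  else
    echo missing
  fi
}

# Rename workflows.disabled/ back to workflows/ when needed; do nothing otherwise.
restore_workflows() {
  local root="${1:-.}"
  if [ "$(workflows_state "$root")" = "disabled" ]; then
    git -C "$root" mv .forgejo/workflows.disabled .forgejo/workflows
  fi
}
```

Using git mv rather than a plain mv keeps the rename staged, so it can be committed together with the trigger change.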