veza/scripts/bootstrap
senke 9d63e249fe fix(bootstrap): phase 3 secret-exists check + phase 4 scp+ssh -t for sudo prompt
Two follow-up fixes from a real run :

1. Phase 3 re-prompts even when secret exists
   GET /actions/secrets/<name> isn't a Forgejo endpoint — values
   are write-only. Listing /actions/secrets returns the metadata
   (incl. names but not values), so we list + jq-grep instead.
   The check correctly short-circuits the create-or-prompt flow
   on subsequent runs.

2. Phase 4 fails because sudo wants a password and there's no TTY
   The previous shape :
     ssh user@host 'sudo -E bash -s' < <(cat lib.sh remote.sh)
   pipes the script through stdin while sudo needs a terminal to
   prompt for the password, so sudo refuses without a TTY. Fix : scp the two files
   to /tmp/talas-bootstrap/ on the R720, then `ssh -t` (allocate
   TTY) and run `sudo env ... bash /tmp/.../bootstrap-remote.sh`.
   sudo gets a real TTY, prompts the operator once, runs the
   script, returns. Cleanup task removes /tmp/talas-bootstrap/
   regardless of outcome.
   The hint on failure suggests setting up NOPASSWD sudo for
   automation : `<user> ALL=(ALL) NOPASSWD: /usr/bin/bash` in
   /etc/sudoers.d/talas-bootstrap.

Also handles the case where R720_USER is empty in .env (ssh
config alias's User= line wins) — the SSH target becomes the
host alone, no user@ prefix.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:28:22 +02:00
.env.example | fix(bootstrap): handle workflows.disabled/ + self-signed Forgejo + better .env defaults | 2026-04-29 23:01:05 +02:00
bootstrap-local.sh | fix(bootstrap): phase 3 secret-exists check + phase 4 scp+ssh -t for sudo prompt | 2026-04-29 23:28:22 +02:00
bootstrap-remote.sh | feat(bootstrap): two-host deploy-pipeline bootstrap with idempotent verify | 2026-04-29 22:45:00 +02:00
enable-auto-deploy.sh | fix(bootstrap): handle workflows.disabled/ + self-signed Forgejo + better .env defaults | 2026-04-29 23:01:05 +02:00
lib.sh | fix(bootstrap): Forgejo variable URL shape + skip-if-exists registry token | 2026-04-29 23:16:50 +02:00
README.md | fix(bootstrap): handle workflows.disabled/ + self-signed Forgejo + better .env defaults | 2026-04-29 23:01:05 +02:00
reset-vault.sh | fix(bootstrap): handle workflows.disabled/ + self-signed Forgejo + better .env defaults | 2026-04-29 23:01:05 +02:00
verify-local.sh | fix(bootstrap): phase 3 reachability uses /version (no auth) + registry token fallback | 2026-04-29 23:11:44 +02:00
verify-remote-ssh.sh | fix(bootstrap): handle workflows.disabled/ + self-signed Forgejo + better .env defaults | 2026-04-29 23:01:05 +02:00
verify-remote.sh | feat(bootstrap): two-host deploy-pipeline bootstrap with idempotent verify | 2026-04-29 22:45:00 +02:00

scripts/bootstrap/

Two-host bootstrap of the Veza deploy pipeline. Each script is idempotent, resumable, and read-only by default unless explicitly asked to mutate.

Files

| File | Where it runs | What it does |
|---|---|---|
| lib.sh | sourced by all | logging, error trap, idempotent state file, Forgejo API helpers (honours FORGEJO_INSECURE=1) |
| bootstrap-local.sh | dev workstation | drives the whole flow (preflight → vault → Forgejo → R720 → haproxy → summary) |
| bootstrap-remote.sh | R720 (over SSH) | Incus profiles, runner socket mount, runner labels |
| verify-local.sh | dev workstation | read-only checks of local state |
| verify-remote.sh | R720 | read-only checks of R720 state (run via verify-remote-ssh.sh) |
| verify-remote-ssh.sh | dev workstation | scp+ssh wrapper that runs verify-remote.sh on R720 |
| enable-auto-deploy.sh | dev workstation | restores .forgejo/workflows/ if disabled, uncomments push: trigger |
| reset-vault.sh | dev workstation | recovery from a vault password mismatch (destructive; re-prompts) |
| .env.example | template | copy to .env, fill in, gitignored |
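
As an illustration of what "honours FORGEJO_INSECURE=1" could mean in practice, here is a minimal sketch of a helper that assembles curl options for a Forgejo API call. The function name forgejo_curl_args and the FORGEJO_URL variable are hypothetical and not taken from lib.sh; the real helpers may be structured differently.

```shell
# Hypothetical helper (not the actual lib.sh code): print one curl option
# per line for a Forgejo API request.
forgejo_curl_args() {
  printf '%s\n' --silent --show-error --fail
  # FORGEJO_INSECURE=1 makes curl accept a self-signed certificate.
  if [ "${FORGEJO_INSECURE:-0}" = "1" ]; then
    printf '%s\n' --insecure
  fi
  # Forgejo accepts API tokens in an "Authorization: token ..." header.
  if [ -n "${FORGEJO_ADMIN_TOKEN:-}" ]; then
    printf '%s\n' --header "Authorization: token ${FORGEJO_ADMIN_TOKEN}"
  fi
}

# Possible usage from bash (FORGEJO_URL is an assumed variable name):
#   mapfile -t opts < <(forgejo_curl_args)
#   curl "${opts[@]}" "${FORGEJO_URL}/api/v1/version"
```

Emitting one option per line keeps the header value intact when the caller reads the output into an array.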

State file

Each host keeps a per-host state file with phase=DONE timestamp lines so a re-run is a no-op for completed phases :

local :   <repo>/.git/talas-bootstrap/local.state
R720  :   /var/lib/talas/bootstrap.state

To force a phase re-run, delete its line :

sed -i '/^vault=/d' .git/talas-bootstrap/local.state
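
The skip-if-done behaviour can be sketched as below. This is an assumption-laden illustration, not the code in lib.sh: the function names (phase_done, mark_done, run_phase), the STATE_FILE variable, and the exact timestamp format are all hypothetical. Only the idea of a `<phase>=DONE` line per completed phase comes from the description above.

```shell
# Hypothetical state-file helpers; names and timestamp format are assumptions.
STATE_FILE="${STATE_FILE:-.git/talas-bootstrap/local.state}"

# Succeeds if the phase already has a DONE line in the state file.
phase_done() {
  [ -f "$STATE_FILE" ] && grep -q "^$1=DONE" "$STATE_FILE"
}

# Records a phase as done, replacing any earlier line for the same phase.
mark_done() {
  mkdir -p "$(dirname "$STATE_FILE")"
  if [ -f "$STATE_FILE" ]; then
    # grep -v exits non-zero when nothing is left; that is not an error here.
    grep -v "^$1=" "$STATE_FILE" > "$STATE_FILE.tmp" || true
  else
    : > "$STATE_FILE.tmp"
  fi
  printf '%s=DONE %s\n' "$1" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$STATE_FILE.tmp"
  mv "$STATE_FILE.tmp" "$STATE_FILE"
}

# Runs the function phase_<name> unless the phase is already recorded,
# and records it only when the function succeeds.
run_phase() {
  if phase_done "$1"; then
    echo "skip $1 (already DONE)"
    return 0
  fi
  "phase_$1" && mark_done "$1"
}
```

With helpers like these, deleting the `vault=` line as shown above is enough to make `run_phase vault` execute again on the next run.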

Inter-script communication

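bootstrap-local.sh invokes bootstrap-remote.sh over SSH. It copies lib.sh and bootstrap-remote.sh to /tmp/talas-bootstrap/ on the R720 with scp, then runs the script under sudo through ssh -t so that sudo has a TTY to prompt on; the copied files are removed afterwards regardless of outcome. (The earlier approach, concatenating lib.sh + bootstrap-remote.sh and piping them into sudo -E bash -s, fails when sudo needs a password.) The remote script :

  • writes /var/log/talas-bootstrap.log on R720 (persistent)
  • emits >>>PHASE:<name>:<status><<< markers on stdout

The local script tees those markers to stderr so the operator sees remote progress in the same terminal as the local logs.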
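
The marker protocol described above could look roughly like the following sketch. The function names emit_phase and relay_remote_output are hypothetical, as are the START and DONE status values; only the `>>>PHASE:<name>:<status><<<` shape comes from this README.

```shell
# Remote side (hypothetical name): print one progress marker on stdout.
emit_phase() {
  printf '>>>PHASE:%s:%s<<<\n' "$1" "$2"
}

# Local side (hypothetical name): read the remote output on stdin, report
# markers on stderr, and pass every other line through on stdout.
relay_remote_output() {
  local line body name status
  local prefix='>>>PHASE:' suffix='<<<'
  while IFS= read -r line; do
    case "$line" in
      "$prefix"*"$suffix")
        body=${line#"$prefix"}      # strip the leading marker text
        body=${body%"$suffix"}      # strip the trailing marker text
        name=${body%%:*}            # text before the first colon
        status=${body#*:}           # text after the first colon
        printf '[remote] phase %s: %s\n' "$name" "$status" >&2
        ;;
      *)
        printf '%s\n' "$line"
        ;;
    esac
  done
}
```

For example, `{ emit_phase incus START; echo "creating profile"; } | relay_remote_output` would report the phase change on stderr and leave the ordinary log line on stdout.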

Resumability : the state file means an SSH disconnect or partial failure leaves the work it managed to complete marked DONE. Re-run bootstrap-local.sh and it picks up where it stopped.
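
The scp + ssh -t flow described in the latest commit message can be sketched as follows. This is not the code from bootstrap-local.sh: the function name run_remote_bootstrap and the RUN dry-run switch are hypothetical, the mkdir step is an assumption, and passing FORGEJO_REGISTRATION_TOKEN through `sudo env` is inferred from the manual command in Troubleshooting rather than confirmed.

```shell
# Hypothetical sketch of the remote invocation. Set RUN=echo to print the
# commands instead of executing them.
run_remote_bootstrap() {
  local target="$1" token="$2"
  local remote_dir=/tmp/talas-bootstrap
  local run="${RUN:-}"
  local rc=0

  # Stage both files, then run the script under sudo. ssh -t allocates a
  # TTY so sudo can prompt the operator for a password.
  $run ssh "$target" "mkdir -p $remote_dir" &&
    $run scp lib.sh bootstrap-remote.sh "$target:$remote_dir/" &&
    $run ssh -t "$target" \
      "sudo env FORGEJO_REGISTRATION_TOKEN=$token bash $remote_dir/bootstrap-remote.sh" ||
    rc=$?

  # Remove the staged files whether or not the run succeeded.
  $run ssh "$target" "rm -rf $remote_dir" || true
  return "$rc"
}
```

Running `RUN=echo run_remote_bootstrap srv-102v <token>` prints the four commands without touching the R720, which is a cheap way to check the target and paths first. Note that a token passed on the command line is visible in the remote process list while the script runs.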

Quickstart

cd /home/senke/git/talas/veza/scripts/bootstrap
cp .env.example .env
$EDITOR .env             # fill in FORGEJO_ADMIN_TOKEN at minimum
chmod +x *.sh

# Set up everything
./bootstrap-local.sh

# Or skip phases you've already done
PHASE=4 ./bootstrap-local.sh

# Verify any time
./verify-local.sh
./verify-remote-ssh.sh   # scp+ssh wrapper that runs verify-remote.sh on the R720

What each phase needs

| Phase | Needs |
|---|---|
| 1. preflight | git, ansible, dig, ssh, jq locally ; SSH to R720 ; DNS resolved (warning only if missing) |
| 2. vault | nothing ; will prompt for vault password and edit vault.yml from template |
| 3. forgejo | FORGEJO_ADMIN_TOKEN env var or in .env |
| 4. r720 | FORGEJO_ADMIN_TOKEN (used to fetch runner registration token) ; SSH to R720 with sudo |
| 5. haproxy | DNS public domains resolved + port 80 reachable from Internet ; ansible decryptable vault |
| 6. summary | nothing |
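
The tool check in phase 1 could be done along these lines. The helper names missing_tools and preflight are hypothetical; the list of tools is the one given for phase 1 above, and the real preflight also checks SSH and DNS, which this sketch leaves out.

```shell
# Hypothetical helper: print each required command that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Fails with a single message listing everything that is missing.
preflight() {
  local missing
  missing="$(missing_tools git ansible dig ssh jq)"
  if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names join on one line.
    echo "missing tools:" $missing >&2
    return 1
  fi
}
```

Collecting all missing tools before failing saves the operator from fixing them one re-run at a time.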

Troubleshooting

  • Phase 1 SSH fails: verify R720_HOST + R720_USER in .env. If you use an SSH config alias (e.g. Host srv-102v in ~/.ssh/config), set R720_HOST=srv-102v and either set R720_USER= (empty, so the alias's User= wins) or match the alias's user. Test manually : ssh ${R720_USER}@${R720_HOST} /bin/true, or ssh ${R720_HOST} /bin/true when R720_USER is empty.
  • Phase 2 cannot decrypt vault.yml: the password in .vault-pass doesn't match the one used to encrypt vault.yml.
    • If you remember the original password, edit .vault-pass (echo "<correct password>" > infra/ansible/.vault-pass ; chmod 0400 …).
    • Otherwise : ./reset-vault.sh (destructive, re-prompts for everything).
  • Phase 3 Forgejo API unreachable: Forgejo on https://10.0.20.105:3000 serves a self-signed cert. Set FORGEJO_INSECURE=1 in .env. Once the edge HAProxy is up and Let's Encrypt has issued a certificate for forgejo.talas.group, switch to that URL and clear FORGEJO_INSECURE.
  • Phase 3 repo not found: set FORGEJO_OWNER to the actual org/user owning the repo. Confirm with git remote -v (the path segment after host:port/).
  • Phase 4 SSH timeout / sudo prompt: sudo on the R720 needs a TTY to ask for a password. For unattended runs, give the SSH user passwordless sudo by adding to /etc/sudoers.d/talas-bootstrap :
    senke ALL=(ALL) NOPASSWD: /usr/bin/bash
    
    Or run the remote half manually (ssh -t allocates the TTY sudo needs to prompt) :
    scp scripts/bootstrap/{lib.sh,bootstrap-remote.sh} srv-102v:/tmp/
    ssh -t srv-102v 'sudo FORGEJO_REGISTRATION_TOKEN=<token> bash /tmp/bootstrap-remote.sh'
    
  • Phase 5 dehydrated fails: port 80 must be reachable from the Internet for HTTP-01 (NAT-forwarded, not blocked by the ISP). Test from outside : curl http://veza.fr/.well-known/acme-challenge/test should hit HAProxy's letsencrypt_backend (it will 404, which is fine ; what matters is reaching the R720).
  • .forgejo/workflows/ is missing, only workflows.disabled/ present: expected when the auto-trigger has been gated by renaming the dir. enable-auto-deploy.sh restores it.
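
For the Phase 1 item above, the rule "empty R720_USER means the alias's User= wins" can be expressed as a small helper. The name ssh_target is hypothetical; R720_USER and R720_HOST are the .env variables mentioned above.

```shell
# Hypothetical helper: build the SSH destination from the .env variables.
# With an empty R720_USER the host (or ssh config alias) is used alone,
# so the User= line of the alias applies.
ssh_target() {
  if [ -n "${R720_USER:-}" ]; then
    printf '%s@%s\n' "$R720_USER" "$R720_HOST"
  else
    printf '%s\n' "$R720_HOST"
  fi
}
```

A manual connectivity test then becomes `ssh "$(ssh_target)" /bin/true`, which works for both the explicit-user and the alias-only setups.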

After bootstrap

  • Trigger the first deploy manually via the Forgejo UI : Actions → Veza deploy → Run workflow.
  • Once it is green, run ./enable-auto-deploy.sh to re-enable the push trigger.
  • verify-local.sh + verify-remote.sh are safe to run any time.