senke 7ca9c15514 fix(bootstrap): phase 5 auto-detects Incus network from forgejo container
The playbook hardcoded `--network "veza-net"` (matching the
group_vars default), but the operator's R720 has no network with
that name — Forgejo lives on whatever managed bridge the host was
originally set up with. Result: `incus launch` fails with
`Failed loading network "veza-net": Network not found`.

Phase 5 now probes:
  1. `incus config device get forgejo eth0 network` — the network
     the existing forgejo container is on. Most reliable.
  2. Fallback: the first managed bridge from `incus network list`.

The detected name is passed to ansible-playbook as
`--extra-vars veza_incus_network=<name>`, overriding the
group_vars default for this run only (no file changes).

If detection fails entirely (no forgejo container, no managed
bridge), the playbook falls through to the group_vars default and
the failure surface is the same as before — but with a clearer
hint mentioning network mismatch.
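The fallback step could be sketched as a filter over `incus network list` CSV output; the column layout (NAME,TYPE,MANAGED,…) and the helper name here are assumptions, not code lifted from the playbook:

```shell
# Hypothetical sketch of the phase-5 fallback: pick the first managed
# bridge from `incus network list -f csv` (assumed NAME,TYPE,MANAGED,... columns).
first_managed_bridge() {
  awk -F, '$2 == "bridge" && $3 == "YES" { print $1; exit }'
}

# Against canned output resembling a real host:
printf 'eno1,physical,NO\nincusbr0,bridge,YES\n' | first_managed_bridge
# → incusbr0
```

The detected name would then be handed to ansible-playbook via `--extra-vars veza_incus_network=<name>` as described above.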

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:54:52 +02:00

scripts/bootstrap/

Two-host bootstrap of the Veza deploy pipeline. Each script is idempotent, resumable, and read-only by default unless explicitly asked to mutate.

Files

File                   Where it runs      What it does
lib.sh                 sourced by all     logging, error trap, idempotent state file, Forgejo API helpers (honours FORGEJO_INSECURE=1)
bootstrap-local.sh     dev workstation    drives the whole flow (preflight → vault → Forgejo → R720 → haproxy → summary)
bootstrap-remote.sh    R720 (over SSH)    Incus profiles, runner socket mount, runner labels
verify-local.sh        dev workstation    read-only checks of local state
verify-remote.sh       R720               read-only checks of R720 state (run via verify-remote-ssh.sh)
verify-remote-ssh.sh   dev workstation    scp+ssh wrapper that runs verify-remote.sh on R720
enable-auto-deploy.sh  dev workstation    restores .forgejo/workflows/ if disabled, uncomments the push: trigger
reset-vault.sh         dev workstation    recovery from a vault password mismatch (destructive — re-prompts)
.env.example           template           copy to .env, fill in, gitignored
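As a rough sketch of how a lib.sh-style helper might honour FORGEJO_INSECURE=1 (the function name and flag set are illustrative, not lifted from lib.sh):

```shell
# Build curl flags for Forgejo API calls; add -k only when the operator
# has opted into the self-signed cert via FORGEJO_INSECURE=1.
forgejo_curl_flags() {
  flags="-fsS"
  [ "${FORGEJO_INSECURE:-0}" = "1" ] && flags="$flags -k"
  printf '%s\n' "$flags"
}

FORGEJO_INSECURE=1 forgejo_curl_flags   # → -fsS -k
```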

State file

Each host keeps a per-host state file with phase=DONE timestamp lines, so a re-run is a no-op for completed phases:

local :   <repo>/.git/talas-bootstrap/local.state
R720  :   /var/lib/talas/bootstrap.state

To force a phase re-run, delete its line:

sed -i '/^vault=/d' .git/talas-bootstrap/local.state
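The guard pattern behind the state file can be sketched like this (a demo against a throwaway temp file; the real scripts use the per-host paths above):

```shell
# Demo of the phase=DONE guard; STATE_FILE here is a temp-file stand-in.
STATE_FILE="$(mktemp)"
phase_done() { grep -q "^$1=DONE" "$STATE_FILE"; }
mark_done()  { printf '%s=DONE %s\n' "$1" "$(date -u +%FT%TZ)" >> "$STATE_FILE"; }

phase_done vault || echo "running vault phase"    # first run: phase executes
mark_done vault
phase_done vault && echo "vault phase skipped"    # re-run: no-op
```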

Inter-script communication

bootstrap-local.sh invokes bootstrap-remote.sh over SSH by concatenating lib.sh + bootstrap-remote.sh and piping the result into sudo -E bash -s on the R720. The remote script:

  • writes /var/log/talas-bootstrap.log on R720 (persistent)
  • emits >>>PHASE:<name>:<status><<< markers on stdout
  • the local script tees those to stderr so the operator sees remote progress in the same terminal as the local logs
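A minimal sketch of that marker protocol — the marker shape matches the text above; the emitting helper and the sed rewrite on the local side are illustrative:

```shell
# Remote side: emit a machine-readable phase marker on stdout.
emit_phase() { printf '>>>PHASE:%s:%s<<<\n' "$1" "$2"; }

# Local side: turn markers into operator-readable progress lines.
emit_phase profiles DONE \
  | sed -n 's/^>>>PHASE:\([^:]*\):\([^:]*\)<<<$/remote phase \1 -> \2/p'
# → remote phase profiles -> DONE
```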

Resumability: thanks to the state file, an SSH disconnect or partial failure leaves whatever work did complete marked DONE. Re-run bootstrap-local.sh and it picks up where it stopped.

Quickstart

cd /home/senke/git/talas/veza/scripts/bootstrap
cp .env.example .env
$EDITOR .env             # fill in FORGEJO_ADMIN_TOKEN at minimum
chmod +x *.sh

# Set up everything
./bootstrap-local.sh

# Or skip phases you've already done
PHASE=4 ./bootstrap-local.sh

# Verify any time
./verify-local.sh
ssh ansible@10.0.20.150 'sudo bash' < verify-remote.sh

What each phase needs

Phase         Needs
1. preflight  git, ansible, dig, ssh, jq locally; SSH to R720; DNS resolved (warning only if missing)
2. vault      nothing; will prompt for the vault password and edit vault.yml from the template
3. forgejo    FORGEJO_ADMIN_TOKEN env var or in .env
4. r720       FORGEJO_ADMIN_TOKEN (used to fetch the runner registration token); SSH to R720 with sudo
5. haproxy    public domains resolved in DNS + port 80 reachable from the Internet; vault decryptable by ansible
6. summary    nothing

Troubleshooting

  • Phase 1 SSH fails — verify R720_HOST + R720_USER in .env. If you use an SSH config alias (e.g. Host srv-102v in ~/.ssh/config), set R720_HOST=srv-102v and either leave R720_USER= empty (the alias's User= wins) or match the alias's user. Test manually: ssh ${R720_USER}@${R720_HOST} /bin/true.
  • Phase 2 cannot decrypt vault.yml — the password in .vault-pass doesn't match what was used to encrypt vault.yml.
    • If you remember the original password, edit .vault-pass (echo "<correct password>" > infra/ansible/.vault-pass; chmod 0400 …).
    • Otherwise : ./reset-vault.sh — destructive, re-prompts for everything.
  • Phase 3 Forgejo API unreachable — Forgejo on https://10.0.20.105:3000 serves a self-signed cert. Set FORGEJO_INSECURE=1 in .env. Once the edge HAProxy is up + LE has issued forgejo.talas.group, switch to that URL and clear FORGEJO_INSECURE.
  • Phase 3 repo not found — set FORGEJO_OWNER to the actual org/user owning the repo. Confirm with git remote -v (the path segment after host:port/).
  • Phase 4 SSH timeout / sudo prompt — passwordless sudo needed for the SSH user. Add to /etc/sudoers.d/talas-bootstrap :
    senke ALL=(ALL) NOPASSWD: /usr/bin/bash
    
    Or run the remote half manually :
    scp scripts/bootstrap/{lib.sh,bootstrap-remote.sh} srv-102v:/tmp/
    ssh srv-102v 'sudo FORGEJO_REGISTRATION_TOKEN=<token> bash /tmp/bootstrap-remote.sh'
    
  • Phase 5 dehydrated fails — port 80 must be reachable from the Internet for HTTP-01 (not blocked by the ISP, NAT-forwarded). Test from outside: curl http://veza.fr/.well-known/acme-challenge/test should hit HAProxy's letsencrypt_backend (it will 404, which is fine; what matters is reaching the R720).
  • .forgejo/workflows/ is missing, only workflows.disabled/ present — expected when the auto-trigger has been gated by renaming the dir. enable-auto-deploy.sh restores it.
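For the FORGEJO_OWNER check, the owner segment can be pulled out of a remote URL mechanically; the helper name and example URLs below are hypothetical:

```shell
# Print the path segment before the repo name from a git remote URL
# (handles both https://host/owner/repo.git and git@host:owner/repo.git).
owner_from_remote() { sed -n 's|.*[:/]\([^/]*\)/[^/]*$|\1|p'; }

echo 'https://forgejo.talas.group/veza-org/veza.git' | owner_from_remote   # → veza-org
echo 'git@forgejo.talas.group:veza-org/veza.git'     | owner_from_remote   # → veza-org
```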

After bootstrap

  • Trigger 1st deploy manually via Forgejo UI : Actions → Veza deploy → Run workflow.
  • Once green, run ./enable-auto-deploy.sh to re-enable push-trigger.
  • verify-local.sh + verify-remote.sh are safe to run any time.