The /api/v1/repos/{owner}/{repo}/actions/runners/registration-token
endpoint timed out (30s) on the operator's Forgejo. The cause is
unclear (Forgejo version, token scope, transient WG drop). Rather
than block all of phase 4 on a flaky endpoint, downgrade the
auto-fetch to "try briefly, fall back to a manual prompt":
forgejo_get_runner_token (lib.sh):
* Returns the token on stdout if successful, exit 0
* Returns empty + exit 1 on failure (no `die`)
* --max-time 10 instead of 30 — fail fast
* 2>/dev/null on the curl + jq so spurious errors don't reach
the user before our own warn message
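A minimal sketch of what a helper matching that contract could look like (the endpoint path comes from this doc; the FORGEJO_* variable names and jq field are assumptions, not the actual lib.sh code):

```shell
# Hypothetical sketch — not the real lib.sh implementation.
# Assumes FORGEJO_API_URL, FORGEJO_ADMIN_TOKEN, FORGEJO_OWNER, FORGEJO_REPO are set.
forgejo_get_runner_token() {
  local token
  # --max-time 10: fail fast; 2>/dev/null: keep curl/jq noise away from the user
  token=$(curl -sf --max-time 10 \
      -H "Authorization: token $FORGEJO_ADMIN_TOKEN" \
      "$FORGEJO_API_URL/api/v1/repos/$FORGEJO_OWNER/$FORGEJO_REPO/actions/runners/registration-token" \
      2>/dev/null | jq -r '.token // empty' 2>/dev/null)
  # empty + exit 1 on any failure — never die, caller decides what to do
  [ -n "$token" ] || return 1
  printf '%s\n' "$token"
}
```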
bootstrap-local.sh phase 4:
* if reg_token=$(forgejo_get_runner_token ...) → ok
* else → warn + prompt, printing the exact UI URL where a token
  can be generated manually:
  $FORGEJO_API_URL/$FORGEJO_OWNER/$FORGEJO_REPO/settings/actions/runners
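The phase-4 fallback can be sketched like this (the log/warn stubs and the token value are placeholders standing in for lib.sh; the stub simulates a successful fetch, so the prompt branch is skipped here):

```shell
# Stubs standing in for lib.sh helpers — assumptions for this sketch only
log()  { printf 'INFO: %s\n' "$*"; }
warn() { printf 'WARN: %s\n' "$*" >&2; }
forgejo_get_runner_token() { printf 'RTOK123\n'; }  # stub: simulate API success
FORGEJO_API_URL=https://forgejo.example FORGEJO_OWNER=o FORGEJO_REPO=r

if reg_token=$(forgejo_get_runner_token); then
  log "runner registration token fetched via API"
else
  # fallback: tell the operator exactly where to generate one, then prompt
  warn "auto-fetch failed; create a token manually at:"
  warn "  $FORGEJO_API_URL/$FORGEJO_OWNER/$FORGEJO_REPO/settings/actions/runners"
  read -r -p "Paste the registration token: " reg_token
fi
```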
bootstrap-r720.sh: symmetric change.
Operator workflow on failure:
1. Open the Forgejo UI URL printed by the warn
2. "Create new runner" → copy the registration token
3. Paste at the prompt — bootstrap continues
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/bootstrap/ — bootstrap the Veza deploy pipeline
Two parallel scripts (one per host) + four helpers + one shared lib. Each script is idempotent, resumable, read-only by default unless explicitly asked to mutate. No NOPASSWD sudo required.
The heavy lifting (Incus profiles, forgejo-runner config, HAProxy
edge, Let's Encrypt) is done by Ansible playbooks, not bash.
The shell scripts are thin orchestrators that handle the
chicken-and-egg part: create the Vault that Ansible needs, set
the Forgejo CI secrets, then call ansible-playbook.
Files
| File | Where it runs | What it does |
|---|---|---|
| lib.sh | sourced by all | logging, error trap, idempotent state file, Forgejo API helpers |
| bootstrap-local.sh | operator's laptop | drives Ansible over SSH; --ask-become-pass on the R720 |
| bootstrap-r720.sh | R720 directly (sudo) | drives Ansible locally (connection: local); no SSH, no sudo prompts |
| verify-local.sh | laptop | read-only checks of local + remote state |
| verify-r720.sh | R720 (sudo) | read-only checks of R720 state |
| enable-auto-deploy.sh | laptop | restores .forgejo/workflows/, uncomments the push: trigger |
| reset-vault.sh | laptop | recovery from a vault password mismatch (destructive) |
| .env.example | template | copy to .env, fill in, gitignored |
Two scripts, one Ansible
Both bootstrap-local.sh and bootstrap-r720.sh end up running
the same two playbooks:

- playbooks/bootstrap_runner.yml — Incus profiles + forgejo-runner Incus access + runner registration with the incus label
- playbooks/haproxy.yml — edge HAProxy container + dehydrated Let's Encrypt issuance for veza.fr / staging.veza.fr / talas.fr / forgejo.talas.group
The difference is the inventory:

- laptop → inventory/staging.yml (SSH to R720, --ask-become-pass)
- R720 → inventory/local.yml (connection: local, already root)
Pick whichever is convenient. The state files are independent (laptop
keeps state under .git/talas-bootstrap/, R720 under /var/lib/talas/),
so running both at different times doesn't redo or duplicate work.
State files
laptop : <repo>/.git/talas-bootstrap/local.state
R720 : /var/lib/talas/r720-bootstrap.state
phase=DONE timestamp per completed phase. Re-run skips DONE phases.
To force a phase re-run, delete its line:
sed -i '/^vault=/d' .git/talas-bootstrap/local.state
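The skip logic behind phase=DONE can be sketched as follows (the state path and helper names here are illustrative, not the actual lib.sh ones):

```shell
# Illustrative sketch of the phase=DONE mechanism — names and path are examples
STATE=/tmp/demo.state
: >"$STATE"   # fresh state file for the demo

phase_done() { grep -q "^$1=DONE" "$STATE"; }
mark_done()  { printf '%s=DONE %s\n' "$1" "$(date -u +%FT%TZ)" >>"$STATE"; }

run_phase() {
  # re-runs skip phases already recorded as DONE
  phase_done "$1" && { echo "skip $1"; return 0; }
  echo "run $1"
  mark_done "$1"
}

run_phase vault   # first invocation: executes and records vault=DONE
run_phase vault   # second invocation: skipped
```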
Quickstart — from the laptop
cd /home/senke/git/talas/veza/scripts/bootstrap
cp .env.example .env
vim .env   # at minimum: FORGEJO_ADMIN_TOKEN
chmod +x *.sh
# Set up everything end-to-end:
./bootstrap-local.sh
# Or skip phases you've already done:
PHASE=4 ./bootstrap-local.sh
# Verify any time (read-only):
./verify-local.sh
Quickstart — directly on the R720
ssh srv-102v
cd /path/to/veza/scripts/bootstrap
cp .env.example .env
vim .env # FORGEJO_ADMIN_TOKEN at minimum
sudo ./bootstrap-r720.sh
# Verify:
sudo ./verify-r720.sh
Sudo on the R720 — the design choice
The bash scripts do not require NOPASSWD sudo on the R720. Two reasons:
- Trust boundary — NOPASSWD turns any compromise of the operator's account into root on the host. Keeping the password requirement means an attacker also needs to phish/keylog the sudo password.
- Ansible's --ask-become-pass is fine for interactive runs. The operator types the password ONCE per bootstrap-local.sh invocation; Ansible holds it in memory and reuses it for every become: true task. No file written, no env var leaked.
pipelining = False in ansible.cfg is what makes interactive
--ask-become-pass reliable (the previous True setting raced sudo's
TTY-driven prompt).
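For reference, the setting would sit in a section like this (the [ssh_connection] placement is the conventional spot for pipelining; confirm against the actual infra/ansible/ansible.cfg):

```ini
# infra/ansible/ansible.cfg — keep pipelining off so sudo's
# TTY-driven password prompt doesn't race --ask-become-pass
[ssh_connection]
pipelining = False
```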
What each phase needs
| Phase | Needs |
|---|---|
| 1. preflight | git, ansible, dig, ssh, jq locally; SSH to R720 (laptop); DNS resolved (warning if missing) |
| 2. vault | nothing; auto-generates JWT + 11 random passwords, prompts for vault password |
| 3. forgejo | FORGEJO_ADMIN_TOKEN (.env or env) — scopes: write:repository, read:repository |
| 4. ansible | sudo password on R720 (interactive; not stored) |
| 5. summary | nothing |
Troubleshooting
- Phase 1 SSH fails — verify R720_HOST + R720_USER in .env. If using an SSH config alias, set R720_HOST=<alias> and leave R720_USER empty.
- Phase 2 cannot decrypt — ./reset-vault.sh (destructive, re-prompts for everything).
- Phase 3 Forgejo unreachable — set FORGEJO_INSECURE=1 for the self-signed cert on https://10.0.20.105:3000. Update to https://forgejo.talas.group once edge HAProxy + LE is up.
- Phase 3 token lacks scope — the token needs at minimum write:repository. write:admin lets the script auto-create the registry token; without it, you'll be prompted to paste one you create manually.
- Phase 4 "Timeout waiting for privilege escalation prompt" — set pipelining = False in infra/ansible/ansible.cfg. The current default is False; revert if it's been changed.
- Phase 4 dehydrated fails — port 80 must be reachable from the Internet (HTTP-01 challenge). Test from an external host: curl http://veza.fr/. If it doesn't reach the R720, configure port forwarding for 80 + 443 on your home router / ISP box.
- Phase 4 Incus network not found — group_vars defaults to net-veza. The script auto-detects it from forgejo's network on the R720; if your bridge has a different name, set veza_incus_network in group_vars/staging.yml (or inventory/local.yml for the R720 case).
After bootstrap
- Trigger the 1st deploy manually: Forgejo Actions UI → Veza deploy → Run workflow.
- Once green, run ./enable-auto-deploy.sh to restore the push: main + tag: v* triggers.
- verify-{local,r720}.sh are safe to run any time.