The R720 has 5 managed Incus bridges, organized by trust zone:
  net-ad        10.0.50.0/24   admin
  net-dmz       10.0.10.0/24   DMZ
  net-sandbox   10.0.30.0/24   sandbox
  net-veza      10.0.20.0/24   Veza (forgejo + 12 other containers)
  incusbr0      10.0.0.0/24    default
Veza belongs on `net-veza`. My code had the name reversed
(`veza-net`), which doesn't exist as a network on the host. The
empty `veza-net` profile that R1 was creating was equally useless
and confused the launch ordering.
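Quick check of which name actually exists on the host (plain Incus CLI; output paraphrased in the comments):

```sh
incus network show veza-net   # fails: no such network on the host
incus network show net-veza   # prints the managed bridge's config
```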
Changes:

* group_vars/staging.yml
  veza_incus_network: veza-staging-net → net-veza
  veza_incus_subnet: 10.0.21.0/24 → 10.0.20.0/24
  A comment block explains why staging+prod share net-veza in v1.0
  (WireGuard ingress + per-env prefix + per-env vault is the trust
  boundary; the per-env subnet split is a v1.1 hardening) and how to
  flip to a dedicated bridge later.
* group_vars/prod.yml
  veza_incus_network: veza-net → net-veza
* playbooks/haproxy.yml
  incus launch ... --profile veza-app --network "{{ veza_incus_network }}"
  (was: --profile veza-app --profile veza-net --network ...)
* playbooks/deploy_data.yml + deploy_app.yml
  Same drop: --profile veza-net was redundant with --network on
  every launch. Cleaner contract: the `veza-app` and `veza-data`
  profiles carry resource/security limits; `--network` controls
  which bridge (see the launch sketch after this list).
* scripts/bootstrap/bootstrap-remote.sh R1
  Stop creating the `veza-net` profile. Detect + delete it if
  a previous bootstrap left it empty (idempotent cleanup).

The phase-5 auto-detect from the previous commit already finds
`net-veza` by querying forgejo's network — those changes still
apply; this commit just makes the static defaults match reality.
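For reference, the launch shape after the cleanup (a sketch; the image and instance name are placeholders, not the playbooks' actual values):

```sh
incus launch images:debian/12 veza-app-1 \
  --profile veza-app \
  --network net-veza    # the bridge comes from --network, not from a profile
```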
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# scripts/bootstrap/

Two-host bootstrap of the Veza deploy pipeline. Each script is idempotent, resumable, and read-only by default unless explicitly asked to mutate.
## Files

| File | Where it runs | What it does |
|---|---|---|
| `lib.sh` | sourced by all | logging, error trap, idempotent state file, Forgejo API helpers (honours `FORGEJO_INSECURE=1`; see the sketch below) |
| `bootstrap-local.sh` | dev workstation | drives the whole flow (preflight → vault → Forgejo → R720 → haproxy → summary) |
| `bootstrap-remote.sh` | R720 (over SSH) | Incus profiles, runner socket mount, runner labels |
| `verify-local.sh` | dev workstation | read-only checks of local state |
| `verify-remote.sh` | R720 | read-only checks of R720 state (run via `verify-remote-ssh.sh`) |
| `verify-remote-ssh.sh` | dev workstation | scp+ssh wrapper that runs `verify-remote.sh` on the R720 |
| `enable-auto-deploy.sh` | dev workstation | restores `.forgejo/workflows/` if disabled, uncomments the `push:` trigger |
| `reset-vault.sh` | dev workstation | recovery from a vault password mismatch (destructive — re-prompts) |
| `.env.example` | template | copy to `.env`, fill in, gitignored |
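The Forgejo API helpers in `lib.sh` boil down to something like the following (a hypothetical sketch: `forgejo_api` and `FORGEJO_URL` are illustrative names, not `lib.sh`'s actual interface):

```sh
# Hypothetical helper; lib.sh's real function and variable names may differ.
forgejo_api() {
  local path=$1
  curl -fsS ${FORGEJO_INSECURE:+-k} \
       -H "Authorization: token ${FORGEJO_ADMIN_TOKEN}" \
       "${FORGEJO_URL}/api/v1${path}"
}

# e.g. forgejo_api /repos/<owner>/<repo> | jq .default_branch
```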
## State file

Each host keeps a per-host state file with `phase=DONE timestamp`
lines, so a re-run is a no-op for completed phases:

- local: `<repo>/.git/talas-bootstrap/local.state`
- R720: `/var/lib/talas/bootstrap.state`

To force a phase re-run, delete its line:

```sh
sed -i '/^vault=/d' .git/talas-bootstrap/local.state
```
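The skip logic this format enables looks roughly like this (a sketch: `phase_done` and `run_phase` are illustrative names, not `lib.sh`'s actual helpers):

```sh
# Illustrative only; the real state handling lives in lib.sh.
STATE_FILE=/var/lib/talas/bootstrap.state   # local.state on the workstation

phase_done() { grep -q "^$1=DONE" "$STATE_FILE" 2>/dev/null; }

run_phase() {
  local name=$1; shift
  phase_done "$name" && return 0            # completed phase: re-run is a no-op
  "$@" || return 1                          # run the phase's actual work
  echo "$name=DONE $(date -Is)" >> "$STATE_FILE"
}
```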
## Inter-script communication

`bootstrap-local.sh` invokes `bootstrap-remote.sh` over SSH by
concatenating `lib.sh` + `bootstrap-remote.sh` and piping the result
into `sudo -E bash -s` on the R720. The remote script:

- writes `/var/log/talas-bootstrap.log` on the R720 (persistent)
- emits `>>>PHASE:<name>:<status><<<` markers on stdout
- the local script tees those markers to stderr, so the operator sees remote progress in the same terminal as the local logs

Resumability: thanks to the state file, an SSH disconnect or partial
failure leaves whatever work completed marked DONE. Re-run
`bootstrap-local.sh` and it picks up where it stopped.
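Reduced to its essentials, the pattern is (a sketch; the real script adds env handling and error trapping):

```sh
cat lib.sh bootstrap-remote.sh \
  | ssh "${R720_USER}@${R720_HOST}" sudo -E bash -s \
  | while IFS= read -r line; do
      case $line in
        '>>>PHASE:'*) printf '%s\n' "$line" >&2 ;;  # surface phase markers on stderr
        *)            printf '%s\n' "$line" ;;      # pass everything else through
      esac
    done
```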
## Quickstart

```sh
cd /home/senke/git/talas/veza/scripts/bootstrap
cp .env.example .env
$EDITOR .env          # fill in FORGEJO_ADMIN_TOKEN at minimum
chmod +x *.sh

# Set up everything
./bootstrap-local.sh

# Or skip phases you've already done
PHASE=4 ./bootstrap-local.sh

# Verify any time
./verify-local.sh
ssh ansible@10.0.20.150 'sudo bash' < verify-remote.sh
```
## What each phase needs

| Phase | Needs |
|---|---|
| 1. preflight | `git`, `ansible`, `dig`, `ssh`, `jq` locally; SSH to the R720; DNS resolved (warning only if missing; see the sketch below the table) |
| 2. vault | nothing; prompts for the vault password and edits `vault.yml` from the template |
| 3. forgejo | `FORGEJO_ADMIN_TOKEN` env var or in `.env` |
| 4. r720 | `FORGEJO_ADMIN_TOKEN` (used to fetch the runner registration token); SSH to the R720 with sudo |
| 5. haproxy | public domains resolved in DNS + port 80 reachable from the Internet; ansible-decryptable vault |
| 6. summary | nothing |
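Phase 1's checks amount to roughly the following (a sketch of the idea, not the script's exact code):

```sh
# Local tools (hard failure if missing)
for tool in git ansible dig ssh jq; do
  command -v "$tool" >/dev/null || { echo "missing: $tool" >&2; exit 1; }
done
# SSH to the R720 (hard failure)
ssh -o BatchMode=yes "${R720_USER}@${R720_HOST}" /bin/true \
  || { echo "cannot SSH to R720" >&2; exit 1; }
# DNS (warning only, matching the table above)
[ -n "$(dig +short forgejo.talas.group)" ] || echo "warn: DNS not resolved yet" >&2
```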
## Troubleshooting

- Phase 1 SSH fails — verify `R720_HOST` + `R720_USER` in `.env`. If you use an SSH config alias (e.g. `Host srv-102v` in `~/.ssh/config`), set `R720_HOST=srv-102v` and either set `R720_USER=` (empty; the alias's `User=` wins) or match the alias's user. Test manually: `ssh ${R720_USER}@${R720_HOST} /bin/true`.
- Phase 2 `cannot decrypt vault.yml` — the password in `.vault-pass` doesn't match the one used to encrypt `vault.yml`.
  - If you remember the original password, edit `.vault-pass` (`echo "<correct password>" > infra/ansible/.vault-pass ; chmod 0400 …`).
  - Otherwise: `./reset-vault.sh` — destructive, re-prompts for everything.
- Phase 3 `Forgejo API unreachable` — Forgejo on `https://10.0.20.105:3000` serves a self-signed cert. Set `FORGEJO_INSECURE=1` in `.env`. Once the edge HAProxy is up and LE has issued `forgejo.talas.group`, switch to that URL and clear `FORGEJO_INSECURE`.
- Phase 3 `repo not found` — set `FORGEJO_OWNER` to the actual org/user owning the repo. Confirm with `git remote -v` (the path segment after `host:port/`).
- Phase 4 SSH timeout / sudo prompt — passwordless sudo is needed for the SSH user. Add to `/etc/sudoers.d/talas-bootstrap`:

  ```
  senke ALL=(ALL) NOPASSWD: /usr/bin/bash
  ```

  Or run the remote half manually:

  ```sh
  scp scripts/bootstrap/{lib.sh,bootstrap-remote.sh} srv-102v:/tmp/
  ssh srv-102v 'sudo FORGEJO_REGISTRATION_TOKEN=<token> bash /tmp/bootstrap-remote.sh'
  ```

- Phase 5 dehydrated fails — port 80 must be reachable from the Internet for HTTP-01 (not blocked by the ISP, NAT-forwarded). Test from outside: `curl http://veza.fr/.well-known/acme-challenge/test` should hit HAProxy's `letsencrypt_backend` (it will 404, which is fine; what matters is reaching the R720).
- `.forgejo/workflows/` is missing, only `workflows.disabled/` is present — expected when the auto-trigger has been gated by renaming the dir. `enable-auto-deploy.sh` restores it.
## After bootstrap

- Trigger the first deploy manually via the Forgejo UI: Actions → Veza deploy → Run workflow.
- Once green, run `./enable-auto-deploy.sh` to re-enable the push trigger.
- `verify-local.sh` + `verify-remote.sh` are safe to run any time.