After running the new bootstrap on a fresh machine, three issues
surfaced that block phases 1–3:
1. .forgejo/workflows/ may live under workflows.disabled/
The parallel session (5e1e2bd7) renamed the directory to stop the
bleeding rather than just commenting out the trigger.
verify-local.sh now reports both states correctly.
enable-auto-deploy.sh does `git mv workflows.disabled workflows`
first, then uncomments the trigger if needed.
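
The rename-then-uncomment flow can be sketched roughly as follows (the function name, file layout, and exact sed pattern are assumptions, not the real enable-auto-deploy.sh):

```bash
# Hypothetical sketch only -- the real enable-auto-deploy.sh may differ.
enable_auto_deploy() {
  local wf=".forgejo/workflows" disabled=".forgejo/workflows.disabled"
  # Restore the directory if the parallel session's rename is in place.
  if [ -d "$disabled" ] && [ ! -d "$wf" ]; then
    git mv "$disabled" "$wf"
  fi
  # Uncomment a gated "push:" trigger, preserving its indentation.
  local f
  for f in "$wf"/*.yml "$wf"/*.yaml; do
    [ -e "$f" ] || continue
    sed -i 's/^\( *\)# *push:/\1push:/' "$f"
  done
}
```

Run from the repo root; `git mv` keeps the rename tracked so the restore shows up as a single change.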
2. Forgejo on 10.0.20.105:3000 serves a self-signed cert
On first run, before the edge HAProxy + LE are up, the bootstrap
has to talk to Forgejo via the LAN IP. lib.sh's forgejo_api
helper now honours FORGEJO_INSECURE=1 (passes -k to curl).
verify-local.sh's API checks pick up the same flag.
.env.example documents the swap: FORGEJO_INSECURE=1 with
https://10.0.20.105:3000 first; flip to https://forgejo.talas.group
+ FORGEJO_INSECURE=0 once the edge HAProxy + LE cert are up.
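
The -k switch can be isolated in one place so every curl call stays consistent; a minimal sketch, assuming helper names (`curl_tls_opts`, `forgejo_api`) that need not match lib.sh's actual code:

```bash
# Assumed helper names; lib.sh's real implementation may differ.
curl_tls_opts() {
  # Emit --insecure (curl -k) only while Forgejo is reached via the
  # LAN IP with its self-signed cert.
  if [ "${FORGEJO_INSECURE:-0}" = "1" ]; then
    echo "--insecure"
  fi
}

forgejo_api() {
  # GET an API path, e.g.: forgejo_api /repos/<owner>/<repo>
  # shellcheck disable=SC2046  # word splitting of the flag is intended
  curl --silent --fail $(curl_tls_opts) \
    -H "Authorization: token ${FORGEJO_ADMIN_TOKEN:?}" \
    "${FORGEJO_URL:?}/api/v1$1"
}
```

Because both bootstrap-local.sh and verify-local.sh would source the same helper, flipping FORGEJO_INSECURE in .env changes every API call at once.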
3. SSH defaults wrong for the actual environment
.env.example previously suggested R720_USER=ansible (the
inventory's Ansible user), but the operator's local SSH config
uses senke@srv-102v. Updated defaults: R720_HOST=srv-102v,
R720_USER=senke. The operator can leave R720_USER blank if their
SSH alias already carries User=.
Plus two new helper scripts:
reset-vault.sh — recovery path for when the vault password in
.vault-pass doesn't match the one that encrypted vault.yml. Asks
for confirmation, then destructively removes vault.yml +
.vault-pass, clears the vault=DONE marker in local.state, and
points the operator at PHASE=2.
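
The destructive reset described above might look like this (the function shape and default paths are assumptions; the real reset-vault.sh may differ):

```bash
# Hypothetical sketch of reset-vault.sh's core steps.
reset_vault() {
  local vault="${1:-infra/ansible/vault.yml}"
  local pass="${2:-infra/ansible/.vault-pass}"
  local state="${3:-.git/talas-bootstrap/local.state}"
  local ans
  read -r ans                      # operator must type exactly "yes"
  if [ "$ans" != "yes" ]; then
    echo "aborted" >&2
    return 1
  fi
  rm -f "$vault" "$pass"           # drop the mismatched secrets
  if [ -f "$state" ]; then
    sed -i '/^vault=/d' "$state"   # clear the vault=DONE marker
  fi
  echo "vault reset; re-run with PHASE=2 ./bootstrap-local.sh"
}
```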
verify-remote-ssh.sh — wrapper that scp's lib.sh +
verify-remote.sh to the R720 and runs verify-remote.sh under
sudo. Removes the need to clone the repo on the R720.
bootstrap-local.sh's phase 2 vault-decrypt failure now hints at
reset-vault.sh.
README.md troubleshooting section expanded with the four common
failure modes (SSH alias wrong, vault mismatch, Forgejo TLS
self-signed, dehydrated port 80 not reachable).
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# `scripts/bootstrap/`

Two-host bootstrap of the Veza deploy pipeline. Each script is
idempotent, resumable, and read-only by default unless explicitly
asked to mutate.
## Files

| File | Where it runs | What it does |
|---|---|---|
| `lib.sh` | sourced by all | logging, error trap, idempotent state file, Forgejo API helpers (honours `FORGEJO_INSECURE=1`) |
| `bootstrap-local.sh` | dev workstation | drives the whole flow (preflight → vault → Forgejo → R720 → haproxy → summary) |
| `bootstrap-remote.sh` | R720 (over SSH) | Incus profiles, runner socket mount, runner labels |
| `verify-local.sh` | dev workstation | read-only checks of local state |
| `verify-remote.sh` | R720 | read-only checks of R720 state (run via `verify-remote-ssh.sh`) |
| `verify-remote-ssh.sh` | dev workstation | scp+ssh wrapper that runs `verify-remote.sh` on the R720 |
| `enable-auto-deploy.sh` | dev workstation | restores `.forgejo/workflows/` if disabled, uncomments the `push:` trigger |
| `reset-vault.sh` | dev workstation | recovery from a vault password mismatch (destructive — re-prompts) |
| `.env.example` | template | copy to `.env`, fill in, gitignored |
## State file

Each host keeps a per-host state file with `phase=DONE timestamp`
lines so a re-run is a no-op for completed phases:

```
local: <repo>/.git/talas-bootstrap/local.state
R720:  /var/lib/talas/bootstrap.state
```

To force a phase re-run, delete its line:

```bash
sed -i '/^vault=/d' .git/talas-bootstrap/local.state
```
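
The skip-if-done pattern above can be sketched like this (helper names are assumptions, not necessarily lib.sh's):

```bash
# Assumed helper names; a minimal version of the pattern above.
STATE_FILE="${STATE_FILE:-.git/talas-bootstrap/local.state}"

phase_done() {   # phase_done vault -> true if "vault=DONE ..." is recorded
  grep -q "^$1=DONE" "$STATE_FILE" 2>/dev/null
}

mark_done() {    # append "<phase>=DONE <UTC timestamp>"
  mkdir -p "$(dirname "$STATE_FILE")"
  echo "$1=DONE $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$STATE_FILE"
}

run_phase() {    # run_phase <name> <function>; no-op if already DONE
  local name="$1" fn="$2"
  if phase_done "$name"; then
    echo "skip: $name already DONE"
  else
    "$fn" && mark_done "$name"
  fi
}
```

A phase is only marked DONE after its function succeeds, which is what makes an interrupted run safe to repeat.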
## Inter-script communication

`bootstrap-local.sh` invokes `bootstrap-remote.sh` over SSH by
concatenating `lib.sh` + `bootstrap-remote.sh` and piping the result
into `sudo -E bash -s` on the R720. The remote script:

* writes `/var/log/talas-bootstrap.log` on the R720 (persistent)
* emits `>>>PHASE:<name>:<status><<<` markers on stdout
* the local script `tee`s those to stderr so the operator sees
  remote progress in the same terminal as the local logs

Resumability: the state file means an SSH disconnect or partial
failure leaves whatever work completed marked DONE. Re-run
`bootstrap-local.sh` and it picks up where it stopped.
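
The marker stream is easy to post-process locally; a sketch, with the parsing helper name assumed:

```bash
# Assumed helper name; extracts "<name> <status>" pairs from the
# remote stdout stream described above.
parse_phase_markers() {
  # ">>>PHASE:incus:DONE<<<"  ->  "incus DONE"
  sed -n 's/^>>>PHASE:\([^:]*\):\([^<]*\)<<<$/\1 \2/p'
}

# Invocation shape (not runnable without the R720):
#   cat lib.sh bootstrap-remote.sh \
#     | ssh "$R720_USER@$R720_HOST" sudo -E bash -s \
#     | tee /dev/stderr | parse_phase_markers
```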
## Quickstart

```bash
cd /home/senke/git/talas/veza/scripts/bootstrap
cp .env.example .env
$EDITOR .env   # fill in FORGEJO_ADMIN_TOKEN at minimum
chmod +x *.sh

# Set up everything
./bootstrap-local.sh

# Or skip phases you've already done
PHASE=4 ./bootstrap-local.sh

# Verify any time
./verify-local.sh
./verify-remote-ssh.sh
```
## What each phase needs

| Phase | Needs |
|---|---|
| 1. preflight | git, ansible, dig, ssh, jq locally; SSH to R720; DNS resolved (warning only if missing) |
| 2. vault | nothing; prompts for a vault password and edits `vault.yml` from the template |
| 3. forgejo | `FORGEJO_ADMIN_TOKEN` env var or in `.env` |
| 4. r720 | `FORGEJO_ADMIN_TOKEN` (used to fetch the runner registration token); SSH to R720 with sudo |
| 5. haproxy | public DNS names resolved + port 80 reachable from the Internet; decryptable ansible vault |
| 6. summary | nothing |
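
Phase 1's tool requirement reduces to a `command -v` loop; a minimal sketch (the function name is assumed, not the real preflight code):

```bash
# Assumed helper name; reports which required tools are absent.
check_tools() {
  local t missing=""
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || missing="$missing $t"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
}

# e.g. check_tools git ansible dig ssh jq || exit 1
```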
## Troubleshooting

- **Phase 1 SSH fails** — verify `R720_HOST` + `R720_USER` in `.env`.
  If you use an SSH config alias (e.g. `Host srv-102v` in
  `~/.ssh/config`), set `R720_HOST=srv-102v` and either leave
  `R720_USER=` empty (the alias's `User=` wins) or match the
  alias's user. Test manually: `ssh ${R720_USER}@${R720_HOST} /bin/true`.
- **Phase 2 `cannot decrypt vault.yml`** — the password in
  `.vault-pass` doesn't match the one used to encrypt `vault.yml`.
  - If you remember the original password, write it to `.vault-pass`
    (`echo "<correct password>" > infra/ansible/.vault-pass ; chmod 0400 …`).
  - Otherwise: `./reset-vault.sh` — destructive, re-prompts for
    everything.
- **Phase 3 `Forgejo API unreachable`** — Forgejo on
  `https://10.0.20.105:3000` serves a self-signed cert. Set
  `FORGEJO_INSECURE=1` in `.env`. Once the edge HAProxy is up and LE
  has issued `forgejo.talas.group`, switch to that URL and clear
  `FORGEJO_INSECURE`.
- **Phase 3 `repo not found`** — set `FORGEJO_OWNER` to the actual
  org/user that owns the repo. Confirm with `git remote -v` (the path
  segment after `host:port/`).
- **Phase 4 SSH timeout / sudo prompt** — the SSH user needs
  passwordless sudo. Add to `/etc/sudoers.d/talas-bootstrap`:

  ```
  senke ALL=(ALL) NOPASSWD: /usr/bin/bash
  ```

  Or run the remote half manually:

  ```
  scp scripts/bootstrap/{lib.sh,bootstrap-remote.sh} srv-102v:/tmp/
  ssh srv-102v 'sudo FORGEJO_REGISTRATION_TOKEN=<token> bash /tmp/bootstrap-remote.sh'
  ```

- **Phase 5 dehydrated fails** — port 80 must be reachable from the
  Internet for HTTP-01 (not blocked by the ISP, NAT-forwarded). Test
  from outside: `curl http://veza.fr/.well-known/acme-challenge/test`
  should hit HAProxy's `letsencrypt_backend` (a 404 is fine; what
  matters is reaching the R720).
- **`.forgejo/workflows/` is missing, only `workflows.disabled/` present** —
  expected when the auto-trigger has been gated by renaming the dir.
  `enable-auto-deploy.sh` restores it.
## After bootstrap

- Trigger the first deploy manually via the Forgejo UI: Actions → Veza deploy → Run workflow.
- Once green, run `./enable-auto-deploy.sh` to re-enable the push trigger.
- `verify-local.sh` + `verify-remote.sh` are safe to run any time.