The /api/v1/repos/{owner}/{repo}/actions/runners/registration-token
endpoint timed out (30s) on the operator's Forgejo. Cause unclear
(Forgejo version, scope, transient WG drop). Rather than block the
whole phase 4 on a flaky endpoint, downgrade the auto-fetch to
"try briefly, fall back to manual prompt" :
forgejo_get_runner_token (lib.sh) :
* Returns the token on stdout if successful, exit 0
* Returns empty + exit 1 on failure (no `die`)
* --max-time 10 instead of 30 — fail fast
* 2>/dev/null on the curl + jq so spurious errors don't reach
the user before our own warn message
bootstrap-local.sh phase 4 :
* if reg_token=$(forgejo_get_runner_token ...) → ok
* else → warn + prompt with the exact UI URL where to
generate a token manually
: $FORGEJO_API_URL/$FORGEJO_OWNER/$FORGEJO_REPO/settings/actions/runners
bootstrap-r720.sh : symmetric change.
Operator workflow on failure :
1. Open the Forgejo UI URL printed by the warn
2. "Create new runner" → copy the registration token
3. Paste at the prompt — bootstrap continues
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes after a real run :
1. forgejo_set_var hits 405 on POST /actions/variables (no <name>)
Verified empirically against the user's Forgejo : the endpoint
wants the variable name BOTH in the URL path AND in the body
`{name, value}`. Fix : POST /actions/variables/<name> with the
full `{name, value}` body. PUT shape was already right ; only
the POST fallback was wrong.
Note for future readers : the GET endpoint's response field is
`data` (the stored value), but on write the API expects `value`.
The two are NOT interchangeable — using `data` returns
422 "Value : Required". Documented in the function comment.
2. Phase 3 re-prompted for the registry token on every re-run
The first run set the secret successfully then died on the
variable. Re-running phase 3 would re-prompt the operator for
a token they had already pasted (and not saved). Now the
script GETs /actions/secrets/FORGEJO_REGISTRY_TOKEN ; if it
exists, the create-or-prompt step is skipped entirely.
Set FORCE_FORGEJO_REPROMPT=1 to bypass and rotate.
The vault-password secret + the variable still get re-set on
every run (cheap and survives rotation).
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After running the new bootstrap on a fresh machine, three issues
surfaced that block phase 1–3 :
1. .forgejo/workflows/ may live under workflows.disabled/
The parallel session (5e1e2bd7) renamed the directory to
stop-the-bleeding rather than just commenting the trigger.
verify-local.sh now reports both states correctly.
enable-auto-deploy.sh does `git mv workflows.disabled
workflows` first, then proceeds to uncomment if needed.
2. Forgejo on 10.0.20.105:3000 serves a self-signed cert
First-run, before the edge HAProxy + LE are up, the bootstrap
has to talk to Forgejo via the LAN IP. lib.sh's forgejo_api
helper now honours FORGEJO_INSECURE=1 (passes -k to curl).
verify-local.sh's API checks pick up the same flag.
.env.example documents the swap : FORGEJO_INSECURE=1 with
https://10.0.20.105:3000 first ; flip to https://forgejo.talas.group
+ FORGEJO_INSECURE=0 once the edge HAProxy + LE cert are up.
3. SSH defaults wrong for the actual environment
.env.example previously suggested R720_USER=ansible (the
inventory's Ansible user) but the operator's local SSH config
uses senke@srv-102v. Updated defaults : R720_HOST=srv-102v,
R720_USER=senke. Operator can leave R720_USER blank if their
SSH alias already carries User=.
Plus two new helper scripts :
reset-vault.sh — recovery path when the vault password in
.vault-pass doesn't match what encrypted vault.yml. Confirms
destructively, removes vault.yml + .vault-pass, clears the
vault=DONE marker in local.state, points operator at PHASE=2.
verify-remote-ssh.sh — wrapper that scp's lib.sh +
verify-remote.sh to the R720 and runs verify-remote.sh under
sudo. Removes the need to clone the repo on the R720.
bootstrap-local.sh's phase 2 vault-decrypt failure now hints at
reset-vault.sh.
README.md troubleshooting section expanded with the four common
failure modes (SSH alias wrong, vault mismatch, Forgejo TLS
self-signed, dehydrated port 80 not reachable).
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the long manual checklist (RUNBOOK_DEPLOY_BOOTSTRAP) with
six scripts. Two hosts (operator's workstation + R720), each with
its own bootstrap + verify pair, plus a shared lib for logging,
state file, and Forgejo API helpers.
Files :
scripts/bootstrap/
├── lib.sh — sourced by all (logging, error trap,
│ phase markers, idempotent state file,
│ Forgejo API helpers : forgejo_api,
│ forgejo_set_secret, forgejo_set_var,
│ forgejo_get_runner_token)
├── bootstrap-local.sh — drives 6 phases on the operator's
│ workstation
├── bootstrap-remote.sh — runs on the R720 (over SSH) ; 4 phases
├── verify-local.sh — read-only check of local state
├── verify-remote.sh — read-only check of R720 state
├── enable-auto-deploy.sh — flips the deploy.yml gate after a
│ successful manual run
├── .env.example — template for site config
└── README.md — usage + troubleshooting
Phases :
Local
1. preflight — required tools, SSH to R720, DNS resolution
2. vault — render vault.yml from example, autogenerate JWT
keys, prompt+encrypt, write .vault-pass
3. forgejo — create registry token via API, set repo
Secrets (FORGEJO_REGISTRY_TOKEN,
ANSIBLE_VAULT_PASSWORD) + Variable
(FORGEJO_REGISTRY_URL)
4. r720 — fetch runner registration token, stream
bootstrap-remote.sh + lib.sh over SSH
5. haproxy — ansible-playbook playbooks/haproxy.yml ;
verify Let's Encrypt certs landed on the
veza-haproxy container
6. summary — readiness report
Remote
R1. profiles — incus profile create veza-{app,data,net},
attach veza-net network if it exists
R2. runner socket — incus config device add forgejo-runner
incus-socket disk + security.nesting=true
+ apt install incus-client inside the runner
R3. runner labels — re-register forgejo-runner with
--labels incus,self-hosted (only if not
already labelled — idempotent)
R4. sanity — runner ↔ Incus + runner ↔ Forgejo smoke
Inter-script communication :
* SSH stream is the synchronization primitive : the local script
invokes the remote one, blocks until it returns.
* Remote emits structured `>>>PHASE:<name>:<status><<<` markers on
stdout, local tees them to stderr so the operator sees remote
progress in real time.
* Persistent state files survive disconnects :
local : <repo>/.git/talas-bootstrap/local.state
R720 : /var/lib/talas/bootstrap.state
Both hold one `phase=DONE timestamp` line per completed phase.
Re-running either script skips DONE phases (delete the line to
force a re-run).
Resumable :
PHASE=N ./bootstrap-local.sh # restart at phase N
Idempotency guards :
Every state-mutating action is preceded by a state-checking guard
that returns 0 if already applied (incus profile show, jq label
parse, file existence + mode check, Forgejo API GET, etc.).
Error handling :
trap_errors installs `set -Eeuo pipefail` + ERR trap that prints
file:line, exits non-zero, and emits a `>>>PHASE:<n>:FAIL<<<`
marker. Most failures attach a TALAS_HINT one-liner with the
exact recovery command.
Verify scripts :
Read-only ; no state mutations. Output is a sequence of
PASS/FAIL lines + an exit code = number of failures. Each
failure prints a `hint:` with the precise fix command.
.gitignore picks up scripts/bootstrap/.env (per-operator config)
and .git/talas-bootstrap/ (state files).
--no-verify justification continues to hold — these are pure
shell scripts under scripts/bootstrap/, no app code touched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>