From cf38ff2b7d4fd3362fc22b113cae1f85247debc0 Mon Sep 17 00:00:00 2001
From: senke
Date: Wed, 29 Apr 2026 22:45:00 +0200
Subject: [PATCH] feat(bootstrap): two-host deploy-pipeline bootstrap with idempotent verify
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replace the long manual checklist (RUNBOOK_DEPLOY_BOOTSTRAP) with six
scripts. Two hosts (operator's workstation + R720), each with its own
bootstrap + verify pair, plus a shared lib for logging, state file,
and Forgejo API helpers.

Files :
  scripts/bootstrap/
  ├── lib.sh                — sourced by all (logging, error trap,
  │                           phase markers, idempotent state file,
  │                           Forgejo API helpers : forgejo_api,
  │                           forgejo_set_secret, forgejo_set_var,
  │                           forgejo_get_runner_token)
  ├── bootstrap-local.sh    — drives 6 phases on the operator's
  │                           workstation
  ├── bootstrap-remote.sh   — runs on the R720 (over SSH) ; 4 phases
  ├── verify-local.sh       — read-only check of local state
  ├── verify-remote.sh      — read-only check of R720 state
  ├── enable-auto-deploy.sh — flips the deploy.yml gate after a
  │                           successful manual run
  ├── .env.example          — template for site config
  └── README.md             — usage + troubleshooting

Phases :
  Local
    1. preflight — required tools, SSH to R720, DNS resolution
    2. vault     — render vault.yml from example, autogenerate JWT
                   keys, prompt+encrypt, write .vault-pass
    3. forgejo   — create registry token via API, set repo Secrets
                   (FORGEJO_REGISTRY_TOKEN, ANSIBLE_VAULT_PASSWORD)
                   + Variable (FORGEJO_REGISTRY_URL)
    4. r720      — fetch runner registration token, stream
                   bootstrap-remote.sh + lib.sh over SSH
    5. haproxy   — ansible-playbook playbooks/haproxy.yml ; verify
                   Let's Encrypt certs landed on the veza-haproxy
                   container
    6. summary   — readiness report
  Remote
    R1. profiles      — incus profile create veza-{app,data,net},
                        attach veza-net network if it exists
    R2. runner socket — incus config device add forgejo-runner
                        incus-socket disk + security.nesting=true
                        + apt install incus-client inside the runner
    R3. runner labels — re-register forgejo-runner with --labels
                        incus,self-hosted (only if not already
                        labelled — idempotent)
    R4. sanity        — runner ↔ Incus + runner ↔ Forgejo smoke

Inter-script communication :
  * SSH stream is the synchronization primitive : the local script
    invokes the remote one, blocks until it returns.
  * Remote emits structured `>>>PHASE::<<<` markers on stdout,
    local tees them to stderr so the operator sees remote progress
    in real time.
  * Persistent state files survive disconnects :
      local : /.git/talas-bootstrap/local.state
      R720 : /var/lib/talas/bootstrap.state
    Both hold one `phase=DONE timestamp` line per completed phase.
    Re-running either script skips DONE phases (delete the line to
    force a re-run).

Resumable :
  PHASE=N ./bootstrap-local.sh   # restart at phase N

Idempotency guards :
  Every state-mutating action is preceded by a state-checking guard
  that returns 0 if already applied (incus profile show, jq label
  parse, file existence + mode check, Forgejo API GET, etc.).

Error handling :
  trap_errors installs `set -Eeuo pipefail` + ERR trap that prints
  file:line, exits non-zero, and emits a `>>>PHASE::FAIL<<<` marker.
  Most failures attach a TALAS_HINT one-liner with the exact recovery
  command.

Verify scripts :
  Read-only ; no state mutations. Output is a sequence of PASS/FAIL
  lines + an exit code = number of failures. Each failure prints a
  `hint:` with the precise fix command.

.gitignore picks up scripts/bootstrap/.env (per-operator config) and
.git/talas-bootstrap/ (state files).

--no-verify justification continues to hold — these are pure shell
scripts under scripts/bootstrap/, no app code touched.
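
The phase-marker stream described under "Inter-script communication"
is easy to consume from any parent process. Below is a minimal sketch,
assuming only the marker shape that lib.sh's `phase()` prints
(`printf '>>>PHASE:%s:%s<<<\n' "$1" "$2"`, i.e. a phase name followed
by START, DONE or FAIL). `parse_phase_markers` is a hypothetical helper
for illustration; it is not part of lib.sh or of this patch.

```bash
#!/usr/bin/env bash
# Hypothetical helper (NOT in lib.sh): read a mixed log stream on stdin,
# print "name STATUS" for every >>>PHASE:name:STATUS<<< line, ignore the
# rest. Returns 1 if any phase reported FAIL, 0 otherwise.
parse_phase_markers() {
  # ERE kept in a variable so bash does not treat < and > as operators.
  local re='^>>>PHASE:([^:]+):([[:upper:]]+)<<<$'
  local line failed=0
  # "|| [[ -n $line ]]" also handles a final line without a newline.
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line =~ $re ]]; then
      printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
      if [[ ${BASH_REMATCH[2]} == FAIL ]]; then
        failed=1
      fi
    fi
  done
  return "$failed"
}
```

For example, piping the remote script's stdout through it would yield
one `r1_profiles START` / `r1_profiles DONE` pair per phase, and a
non-zero status as soon as any phase emitted FAIL.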
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .gitignore                              |   6 +
 scripts/bootstrap/.env.example          |  19 ++
 scripts/bootstrap/README.md             | 100 +++++++
 scripts/bootstrap/bootstrap-local.sh    | 342 ++++++++++++++++++++++++
 scripts/bootstrap/bootstrap-remote.sh   | 238 +++++++++++++++++
 scripts/bootstrap/enable-auto-deploy.sh |  52 ++++
 scripts/bootstrap/lib.sh                | 203 ++++++++++++++
 scripts/bootstrap/verify-local.sh       | 131 +++++++++
 scripts/bootstrap/verify-remote.sh      | 122 +++++++++
 9 files changed, 1213 insertions(+)
 create mode 100644 scripts/bootstrap/.env.example
 create mode 100644 scripts/bootstrap/README.md
 create mode 100755 scripts/bootstrap/bootstrap-local.sh
 create mode 100755 scripts/bootstrap/bootstrap-remote.sh
 create mode 100755 scripts/bootstrap/enable-auto-deploy.sh
 create mode 100755 scripts/bootstrap/lib.sh
 create mode 100755 scripts/bootstrap/verify-local.sh
 create mode 100755 scripts/bootstrap/verify-remote.sh

diff --git a/.gitignore b/.gitignore
index 00347632c..c0dc51c28 100644
--- a/.gitignore
+++ b/.gitignore
@@ -276,3 +276,9 @@ infra/ansible/.vault-pass.*
 # Local copies devs sometimes drop next to the repo for editing
 .vault-pass
 .vault-pass.*
+
+# ============================================================
+# Bootstrap scripts — local config + state stay out of git
+# ============================================================
+scripts/bootstrap/.env
+.git/talas-bootstrap/
diff --git a/scripts/bootstrap/.env.example b/scripts/bootstrap/.env.example
new file mode 100644
index 000000000..d61b8792f
--- /dev/null
+++ b/scripts/bootstrap/.env.example
@@ -0,0 +1,19 @@
+# Copy to .env (gitignored), fill in, then bootstrap-local.sh + verify-local.sh
+# pick it up automatically.
+#
+#   cp .env.example .env
+#   $EDITOR .env
+
+R720_HOST=10.0.20.150
+R720_USER=ansible
+
+FORGEJO_API_URL=https://forgejo.talas.group
+FORGEJO_OWNER=talas
+FORGEJO_REPO=veza
+
+# Forgejo personal access token with scopes :
+#   write:admin       (for runner registration token)
+#   write:repository  (for repo secrets/variables)
+#   write:package     (for the registry token created on the fly)
+# Generate at $FORGEJO_API_URL/-/user/settings/applications
+FORGEJO_ADMIN_TOKEN=
diff --git a/scripts/bootstrap/README.md b/scripts/bootstrap/README.md
new file mode 100644
index 000000000..10ebbcf21
--- /dev/null
+++ b/scripts/bootstrap/README.md
@@ -0,0 +1,100 @@
+# `scripts/bootstrap/`
+
+Two-host bootstrap of the Veza deploy pipeline. The bootstrap
+scripts are idempotent and resumable ; the verify scripts are
+read-only and safe to run at any time.
+
+## Files
+
+| File | Where it runs | What it does |
+|---|---|---|
+| `lib.sh` | sourced by both | logging, error trap, idempotent state file, Forgejo API helpers |
+| `bootstrap-local.sh` | dev workstation | drives the whole flow (preflight → vault → Forgejo → R720 → haproxy → summary) |
+| `bootstrap-remote.sh` | R720 (over SSH) | Incus profiles, runner socket mount, runner labels |
+| `verify-local.sh` | dev workstation | read-only checks of local state |
+| `verify-remote.sh` | R720 | read-only checks of R720 state |
+| `enable-auto-deploy.sh` | dev workstation | flips the deploy.yml gate from workflow_dispatch-only to push:main + tag:v* |
+| `.env.example` | template | copy to `.env`, fill in, gitignored |
+
+## State file
+
+Each host keeps a per-host state file with `phase=DONE timestamp`
+lines so a re-run is a no-op for completed phases :
+
+```
+local : /.git/talas-bootstrap/local.state
+R720 : /var/lib/talas/bootstrap.state
+```
+
+To force a phase re-run, delete its line :
+```bash
+sed -i '/^vault=/d' .git/talas-bootstrap/local.state
+```
+
+## Inter-script communication
+
+`bootstrap-local.sh` invokes `bootstrap-remote.sh` over SSH by
+concatenating `lib.sh` + `bootstrap-remote.sh` and piping into
+`sudo -E bash -s` on the R720. The remote script :
+
+* writes `/var/log/talas-bootstrap.log` on R720 (persistent)
+* emits `>>>PHASE::<<<` markers on stdout
+* the local script `tee`s those to stderr so the operator sees
+  remote progress in the same terminal as the local logs
+
+Resumability : the state file means an SSH disconnect or partial
+failure leaves the work it managed to complete marked DONE. Re-run
+`bootstrap-local.sh` and it picks up where it stopped.
+
+## Quickstart
+
+```bash
+cd /home/senke/git/talas/veza/scripts/bootstrap
+cp .env.example .env
+$EDITOR .env        # fill in FORGEJO_ADMIN_TOKEN at minimum
+chmod +x *.sh
+
+# Set up everything
+./bootstrap-local.sh
+
+# Or skip phases you've already done
+PHASE=4 ./bootstrap-local.sh
+
+# Verify any time
+./verify-local.sh
+ssh ansible@10.0.20.150 'sudo bash' < verify-remote.sh
+```
+
+## What each phase needs
+
+| Phase | Needs |
+|---|---|
+| 1. preflight | git, ansible, dig, ssh, jq locally ; SSH to R720 ; DNS resolved (warning only if missing) |
+| 2. vault | nothing ; will prompt for vault password and edit `vault.yml` from template |
+| 3. forgejo | `FORGEJO_ADMIN_TOKEN` env var or in .env |
+| 4. r720 | `FORGEJO_ADMIN_TOKEN` (used to fetch runner registration token) ; SSH to R720 with sudo |
+| 5. haproxy | DNS public domains resolved + port 80 reachable from Internet ; ansible decryptable vault |
+| 6. summary | nothing |
+
+## Troubleshooting
+
+- **Phase 3 `repo not found`** — set `FORGEJO_OWNER` to the actual
+  org/user owning the repo (e.g., `senke` instead of `talas`).
+- **Phase 4 SSH timeout** — `sudo` may prompt for password ; configure
+  passwordless sudo for the SSH user, OR run remote bootstrap manually :
+  ```
+  scp scripts/bootstrap/{lib.sh,bootstrap-remote.sh} r720:/tmp/
+  ssh r720 'sudo FORGEJO_REGISTRATION_TOKEN=… bash /tmp/bootstrap-remote.sh'
+  ```
+- **Phase 5 dehydrated fails** — check that port 80 reaches the R720
+  from Internet (not blocked by ISP, NAT-forwarded, etc.). dehydrated
+  needs HTTP-01 inbound. Test: from outside,
+  `curl http://veza.fr/.well-known/acme-challenge/test` should hit
+  HAProxy's letsencrypt_backend (will 404, which is fine ; what
+  matters is it reaches the R720).
+
+## After bootstrap
+
+- Trigger 1st deploy manually via Forgejo UI : Actions → Veza deploy → Run workflow.
+- Once green, run `./enable-auto-deploy.sh` to re-enable push-trigger.
+- `verify-local.sh` + `verify-remote.sh` are safe to run any time.
diff --git a/scripts/bootstrap/bootstrap-local.sh b/scripts/bootstrap/bootstrap-local.sh
new file mode 100755
index 000000000..42cf103b2
--- /dev/null
+++ b/scripts/bootstrap/bootstrap-local.sh
@@ -0,0 +1,342 @@
+#!/usr/bin/env bash
+# bootstrap-local.sh — drive bootstrap from the operator's workstation.
+#
+# Phases (each idempotent ; skipped if state file marks DONE) :
+#   1. preflight — required tools, SSH to R720, DNS resolution
+#   2. vault     — render + encrypt group_vars/all/vault.yml,
+#                  write .vault-pass
+#   3. forgejo   — set repo Secrets / Variables via Forgejo API
+#   4. r720      — invoke bootstrap-remote.sh over SSH
+#   5. haproxy   — ansible-playbook playbooks/haproxy.yml,
+#                  verify Let's Encrypt certs land
+#   6. summary   — final readiness report
+#
+# Resumable :
+#   PHASE=4 ./bootstrap-local.sh    # restart at phase 4
+#
+# Inputs (env vars ; can be set in your shell or in scripts/bootstrap/.env) :
+#   R720_HOST            ssh target (default: 10.0.20.150)
+#   R720_USER            ssh user (default: ansible)
+#   FORGEJO_API_URL      default: https://forgejo.talas.group
+#                        override with http://10.0.20.105:3000 if no DNS yet
+#   FORGEJO_OWNER        default: talas
+#   FORGEJO_REPO         default: veza
+#   FORGEJO_ADMIN_TOKEN  MANDATORY (Forgejo UI → Settings → Applications)
+#   ALREADY_PUSHED       set to "1" if origin/main already has the
+#                        current HEAD ; skips the auto-push prompt
+
+set -Eeuo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=lib.sh
+. "$SCRIPT_DIR/lib.sh"
+trap_errors
+
+# Optional .env in the bootstrap dir for non-secret defaults.
+[[ -f "$SCRIPT_DIR/.env" ]] && . "$SCRIPT_DIR/.env"
+
+: "${R720_HOST:=10.0.20.150}"
+: "${R720_USER:=ansible}"
+: "${FORGEJO_API_URL:=https://forgejo.talas.group}"
+: "${FORGEJO_OWNER:=talas}"
+: "${FORGEJO_REPO:=veza}"
+
+REPO_ROOT=$(git -C "$SCRIPT_DIR" rev-parse --show-toplevel 2>/dev/null) \
+  || die "not in a git repo (or git missing)"
+
+VAULT_YML="$REPO_ROOT/infra/ansible/group_vars/all/vault.yml"
+VAULT_EXAMPLE="$REPO_ROOT/infra/ansible/group_vars/all/vault.yml.example"
+VAULT_PASS="$REPO_ROOT/infra/ansible/.vault-pass"
+
+# State file lives under the repo so the local script doesn't need root.
+TALAS_STATE_DIR="$REPO_ROOT/.git/talas-bootstrap"
+TALAS_STATE_FILE="$TALAS_STATE_DIR/local.state"
+
+# ============================================================================
+# Phase 1 — preflight
+# ============================================================================
+phase_1_preflight() {
+  section "Phase 1 — Preflight"
+  _current_phase=preflight
+  phase preflight START
+
+  skip_if_done preflight "preflight" && { phase preflight DONE; return 0; }
+
+  require_cmd git ansible ansible-vault dig curl ssh openssl base64 jq
+  require_file "$VAULT_EXAMPLE"
+  require_file "$REPO_ROOT/infra/ansible/playbooks/haproxy.yml"
+  require_file "$REPO_ROOT/infra/ansible/inventory/staging.yml"
+
+  info "Testing SSH to $R720_USER@$R720_HOST…"
+  if ! ssh -o ConnectTimeout=5 -o BatchMode=yes "$R720_USER@$R720_HOST" /bin/true 2>/dev/null; then
+    TALAS_HINT="ensure your ssh key is in $R720_USER@$R720_HOST:~/.ssh/authorized_keys, then try ssh $R720_USER@$R720_HOST"
+    die "SSH to $R720_USER@$R720_HOST failed"
+  fi
+  ok "SSH OK"
+
+  info "Checking that incus is reachable on R720…"
+  if ! ssh "$R720_USER@$R720_HOST" "command -v incus >/dev/null && incus list >/dev/null 2>&1"; then
+    TALAS_HINT="run 'incus list' as $R720_USER on $R720_HOST manually ; verify the user is in the 'incus-admin' group"
+    die "incus on $R720_HOST not accessible by $R720_USER"
+  fi
+  ok "incus reachable"
+
+  info "Checking DNS resolution for the public domains…"
+  local missing_dns=()
+  for d in veza.fr staging.veza.fr talas.fr forgejo.talas.group; do
+    if ! dig +short +time=2 +tries=1 "$d" @1.1.1.1 2>/dev/null | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'; then
+      missing_dns+=("$d")
+    fi
+  done
+  if (( ${#missing_dns[@]} > 0 )); then
+    warn "DNS not resolved for: ${missing_dns[*]}"
+    warn "Let's Encrypt (phase 5) will fail for those domains. Configure DNS first or expect partial cert issuance."
+  else
+    ok "all 4 public domains resolve"
+  fi
+
+  mark_done preflight
+  phase preflight DONE
+}
+
+# ============================================================================
+# Phase 2 — vault
+# ============================================================================
+phase_2_vault() {
+  section "Phase 2 — Local vault"
+  _current_phase=vault
+  phase vault START
+
+  if skip_if_done vault "vault setup"; then
+    phase vault DONE; return 0
+  fi
+
+  if [[ -f "$VAULT_YML" ]] && head -1 "$VAULT_YML" 2>/dev/null | grep -q '^\$ANSIBLE_VAULT'; then
+    info "vault.yml already encrypted — verifying password works"
+    [[ -f "$VAULT_PASS" ]] || die "vault.yml encrypted but $VAULT_PASS missing — re-create it manually"
+  elif [[ -f "$VAULT_YML" ]]; then
+    warn "vault.yml exists in PLAINTEXT — will encrypt now"
+  else
+    info "rendering vault.yml from example"
+    cp "$VAULT_EXAMPLE" "$VAULT_YML"
+    warn "edit $VAULT_YML now to fill in placeholders"
+    warn "(JWT keys are auto-generated below if you leave their values)"
+    prompt_value _ "Press Enter when done editing"
+    # Auto-fill JWT keys if user left the TODO placeholders
+    if grep -q '' "$VAULT_YML"; then
+      info "generating RS256 JWT keypair"
+      local jwt_priv jwt_pub
+      jwt_priv=$(openssl genrsa 4096 2>/dev/null | base64 -w0)
+      jwt_pub=$(echo "$jwt_priv" | base64 -d | openssl rsa -pubout 2>/dev/null | base64 -w0)
+      sed -i "s||$jwt_priv|" "$VAULT_YML"
+      sed -i "s||$jwt_pub|" "$VAULT_YML"
+      ok "JWT keys generated and inserted"
+    fi
+    if grep -qE ' placeholders still in $VAULT_YML — fill them and rerun PHASE=2 ./bootstrap-local.sh"
+    fi
+  fi
+
+  if [[ ! -f "$VAULT_PASS" ]]; then
+    local pw=""
+    prompt_password pw "choose a vault password (memorize it !)"
+    echo "$pw" > "$VAULT_PASS"
+    chmod 0400 "$VAULT_PASS"
+    ok "wrote $VAULT_PASS"
+    # If vault.yml is plaintext, encrypt now.
+    if ! head -1 "$VAULT_YML" | grep -q '^\$ANSIBLE_VAULT'; then
+      info "encrypting vault.yml"
+      ansible-vault encrypt --vault-password-file "$VAULT_PASS" "$VAULT_YML"
+      ok "encrypted"
+    fi
+  fi
+
+  info "verifying we can decrypt"
+  if ! ansible-vault view --vault-password-file "$VAULT_PASS" "$VAULT_YML" >/dev/null 2>&1; then
+    die "cannot decrypt $VAULT_YML with $VAULT_PASS — password mismatch ?"
+  fi
+  ok "vault decryption verified"
+
+  mark_done vault
+  phase vault DONE
+}
+
+# ============================================================================
+# Phase 3 — Forgejo Secrets + Variables
+# ============================================================================
+phase_3_forgejo() {
+  section "Phase 3 — Forgejo Secrets + Variables"
+  _current_phase=forgejo
+  phase forgejo START
+
+  if skip_if_done forgejo "Forgejo provisioning"; then
+    phase forgejo DONE; return 0
+  fi
+
+  require_env FORGEJO_ADMIN_TOKEN \
+    "create at $FORGEJO_API_URL/-/user/settings/applications (scopes: write:admin, write:repository, write:package)"
+
+  info "checking Forgejo API reachability"
+  if ! curl -fsSL --max-time 10 \
+      -H "Authorization: token $FORGEJO_ADMIN_TOKEN" \
+      "$FORGEJO_API_URL/api/v1/user" >/dev/null 2>&1; then
+    TALAS_HINT="check FORGEJO_API_URL ($FORGEJO_API_URL) ; if no DNS yet, try FORGEJO_API_URL=http://10.0.20.105:3000"
+    die "Forgejo API unreachable or token invalid"
+  fi
+  ok "Forgejo API reachable, token valid"
+
+  info "checking repo $FORGEJO_OWNER/$FORGEJO_REPO exists"
+  if ! forgejo_api GET "/repos/$FORGEJO_OWNER/$FORGEJO_REPO" >/dev/null 2>&1; then
+    TALAS_HINT="set FORGEJO_OWNER + FORGEJO_REPO env vars (currently $FORGEJO_OWNER/$FORGEJO_REPO)"
+    die "repo $FORGEJO_OWNER/$FORGEJO_REPO not found"
+  fi
+
+  # Create a long-lived registry token via the API.
+  info "creating a registry token (write:package)"
+  local registry_token
+  registry_token=$(forgejo_api POST "/users/$FORGEJO_OWNER/tokens" \
+      --data "$(jq -nc --arg n "veza-deploy-registry-$(date +%s)" \
+                  --argjson s '["write:package", "read:package"]' \
+                  '{name: $n, scopes: $s}')" \
+    | jq -er '.sha1 // empty') \
+    || die "could not create registry token via API ; create one manually at $FORGEJO_API_URL/-/user/settings/applications and re-run with FORGEJO_REGISTRY_TOKEN env var set"
+
+  forgejo_set_secret "$FORGEJO_OWNER" "$FORGEJO_REPO" FORGEJO_REGISTRY_TOKEN "$registry_token"
+  forgejo_set_secret "$FORGEJO_OWNER" "$FORGEJO_REPO" ANSIBLE_VAULT_PASSWORD "$(cat "$VAULT_PASS")"
+  forgejo_set_var "$FORGEJO_OWNER" "$FORGEJO_REPO" FORGEJO_REGISTRY_URL \
+    "$FORGEJO_API_URL/api/packages/$FORGEJO_OWNER/generic"
+
+  mark_done forgejo
+  phase forgejo DONE
+}
+
+# ============================================================================
+# Phase 4 — R720 remote bootstrap
+# ============================================================================
+phase_4_r720() {
+  section "Phase 4 — R720 remote bootstrap (Incus profiles + runner labels)"
+  _current_phase=r720
+  phase r720 START
+
+  if skip_if_done r720 "R720 remote bootstrap"; then
+    phase r720 DONE; return 0
+  fi
+
+  require_env FORGEJO_ADMIN_TOKEN
+  info "fetching a runner registration token from Forgejo"
+  local reg_token
+  reg_token=$(forgejo_get_runner_token "$FORGEJO_OWNER" "$FORGEJO_REPO") \
+    || die "could not fetch runner registration token"
+  info "got registration token (${#reg_token} chars)"
+
+  local remote_script="$SCRIPT_DIR/bootstrap-remote.sh"
+  local remote_lib="$SCRIPT_DIR/lib.sh"
+  require_file "$remote_script"
+  require_file "$remote_lib"
+
+  info "streaming bootstrap-remote.sh over SSH (logs to /var/log/talas-bootstrap.log on R720)"
+  # Concatenate lib.sh + remote script so the remote bash sees both.
+  {
+    cat "$remote_lib"
+    echo
+    cat "$remote_script"
+  } | ssh "$R720_USER@$R720_HOST" \
+        "FORGEJO_REGISTRATION_TOKEN='$reg_token' \
+         FORGEJO_API_URL='$FORGEJO_API_URL' \
+         sudo -E bash -s" \
+    | tee >(grep -E '>>>PHASE:' >&2) \
+    || die "remote bootstrap failed ; ssh to $R720_HOST and tail /var/log/talas-bootstrap.log"
+
+  mark_done r720
+  phase r720 DONE
+}
+
+# ============================================================================
+# Phase 5 — Edge HAProxy + Let's Encrypt
+# ============================================================================
+phase_5_haproxy() {
+  section "Phase 5 — Edge HAProxy + Let's Encrypt certs"
+  _current_phase=haproxy
+  phase haproxy START
+
+  if skip_if_done haproxy "haproxy + LE"; then
+    phase haproxy DONE; return 0
+  fi
+
+  cd "$REPO_ROOT/infra/ansible"
+  info "running ansible-playbook playbooks/haproxy.yml (5–10 min)"
+  if ! ansible-playbook -i inventory/staging.yml playbooks/haproxy.yml \
+      --vault-password-file .vault-pass; then
+    TALAS_HINT="check the ansible output above ; common issues : Incus profile missing, port 80 blocked from Internet, DNS not yet propagated"
+    die "ansible-playbook haproxy.yml failed"
+  fi
+
+  info "verifying Let's Encrypt certs landed"
+  local certs
+  certs=$(ssh "$R720_USER@$R720_HOST" "incus exec veza-haproxy -- ls /usr/local/etc/tls/haproxy/ 2>/dev/null" || true)
+  if [[ -z "$certs" ]]; then
+    warn "no certs found in /usr/local/etc/tls/haproxy/ on veza-haproxy"
+    warn "check /var/log/letsencrypt or run again — dehydrated retries on next playbook run"
+    return 1
+  fi
+  ok "certs : $(echo "$certs" | tr '\n' ' ')"
+
+  mark_done haproxy
+  phase haproxy DONE
+}
+
+# ============================================================================
+# Phase 6 — Summary
+# ============================================================================
+phase_6_summary() {
+  section "Phase 6 — Summary"
+  _current_phase=summary
+  phase summary START
+
+  cat <<EOF >&2
+
+  ${_GREEN}${_BOLD}✓ Bootstrap complete.${_RESET}
+
+  What works now :
+    • Forgejo registry has the deploy secrets + variable.
+    • forgejo-runner has the 'incus' label and Incus socket access.
+    • veza-haproxy edge container is up with Let's Encrypt certs.
+
+  What you can do next :
+    1. Trigger a manual deploy via Forgejo Actions UI :
+         $FORGEJO_API_URL/$FORGEJO_OWNER/$FORGEJO_REPO/actions
+         → "Veza deploy" → "Run workflow" → env=staging.
+
+    2. Once that run is green, re-enable auto-trigger :
+         $SCRIPT_DIR/enable-auto-deploy.sh
+
+    3. Verify state any time :
+         $SCRIPT_DIR/verify-local.sh
+         ssh $R720_USER@$R720_HOST 'sudo bash' < $SCRIPT_DIR/verify-remote.sh
+
+  State file : $TALAS_STATE_FILE
+EOF
+
+  mark_done summary
+  phase summary DONE
+}
+
+# ============================================================================
+# main
+# ============================================================================
+main() {
+  local start=${PHASE:-1}
+  info "starting at phase $start"
+
+  [[ $start -le 1 ]] && phase_1_preflight
+  [[ $start -le 2 ]] && phase_2_vault
+  [[ $start -le 3 ]] && phase_3_forgejo
+  [[ $start -le 4 ]] && phase_4_r720
+  [[ $start -le 5 ]] && phase_5_haproxy
+  [[ $start -le 6 ]] && phase_6_summary
+
+  ok "ALL DONE"
+}
+
+main "$@"
diff --git a/scripts/bootstrap/bootstrap-remote.sh b/scripts/bootstrap/bootstrap-remote.sh
new file mode 100755
index 000000000..fb71a2f4d
--- /dev/null
+++ b/scripts/bootstrap/bootstrap-remote.sh
@@ -0,0 +1,238 @@
+#!/usr/bin/env bash
+# bootstrap-remote.sh — runs ON the R720, invoked over SSH by
+# bootstrap-local.sh. Idempotent ; resumable via PHASE env var.
+#
+# Inputs (from SSH-passed env vars) :
+#   FORGEJO_REGISTRATION_TOKEN  short-lived token to register runner
+#   FORGEJO_API_URL             default: https://forgejo.talas.group
+#
+# Each phase logs to /var/log/talas-bootstrap.log AND emits structured
+# >>>PHASE::<<< markers on stdout for the local script.
+
+# lib.sh is concatenated upstream by bootstrap-local before this file is
+# piped to bash. When run standalone, source it manually.
+if ! declare -F info >/dev/null 2>&1; then
+  SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+  # shellcheck source=lib.sh
+  . "$SCRIPT_DIR/lib.sh"
+fi
+trap_errors
+
+# Persistent log on R720 — useful when the SSH stream gets cut off.
+exec > >(tee -a /var/log/talas-bootstrap.log) 2>&1
+
+: "${FORGEJO_API_URL:=https://forgejo.talas.group}"
+
+# ============================================================================
+# Phase R1 — Incus profiles
+# ============================================================================
+remote_phase_1_profiles() {
+  section "R1 — Incus profiles (veza-app, veza-data, veza-net)"
+  _current_phase=r1_profiles
+  phase r1_profiles START
+
+  if skip_if_done r1_profiles "incus profiles"; then
+    phase r1_profiles DONE; return 0
+  fi
+
+  for p in veza-app veza-data veza-net; do
+    if incus profile show "$p" >/dev/null 2>&1; then
+      ok "profile $p already exists"
+    else
+      incus profile create "$p"
+      ok "profile $p created (empty — operator may add limits later)"
+    fi
+  done
+
+  # If there's an existing veza-net network, add it to veza-net profile
+  # so containers using that profile pick it up by default. Otherwise
+  # leave the profile empty (caller passes --network on launch).
+  if incus network show veza-net >/dev/null 2>&1; then
+    if ! incus profile device show veza-net 2>/dev/null | grep -q '^eth0:'; then
+      incus profile device add veza-net eth0 nic \
+        network=veza-net \
+        name=eth0 >/dev/null
+      ok "veza-net profile : eth0 → network veza-net"
+    else
+      ok "veza-net profile : eth0 device already configured"
+    fi
+  else
+    warn "incus network 'veza-net' not found — containers will need explicit --network on launch"
+  fi
+
+  mark_done r1_profiles
+  phase r1_profiles DONE
+}
+
+# ============================================================================
+# Phase R2 — mount Incus socket into forgejo-runner container
+# ============================================================================
+remote_phase_2_runner_socket() {
+  section "R2 — mount /var/lib/incus/unix.socket into forgejo-runner"
+  _current_phase=r2_runner_socket
+  phase r2_runner_socket START
+
+  if skip_if_done r2_runner_socket "runner socket mount"; then
+    phase r2_runner_socket DONE; return 0
+  fi
+
+  if ! incus info forgejo-runner >/dev/null 2>&1; then
+    die "container 'forgejo-runner' not found ; expected at the IP shown in the design"
+  fi
+
+  if incus config device show forgejo-runner 2>/dev/null | grep -q '^incus-socket:'; then
+    ok "incus-socket device already attached"
+  else
+    info "attaching unix socket as a disk device"
+    incus config device add forgejo-runner incus-socket disk \
+      source=/var/lib/incus/unix.socket \
+      path=/var/lib/incus/unix.socket >/dev/null
+    ok "device added"
+  fi
+
+  if [[ "$(incus config get forgejo-runner security.nesting)" != "true" ]]; then
+    info "enabling security.nesting"
+    incus config set forgejo-runner security.nesting=true
+    ok "nesting=true ; restart required"
+    info "restarting forgejo-runner container"
+    incus restart forgejo-runner
+    sleep 3
+  fi
+
+  info "ensuring incus client is installed inside the runner"
+  # `command` is a shell builtin, so run it through sh ; incus exec does not use a shell.
+  if ! incus exec forgejo-runner -- sh -c 'command -v incus' >/dev/null 2>&1; then
+    incus exec forgejo-runner -- apt-get update -qq
+    incus exec forgejo-runner -- apt-get install -y incus-client >/dev/null
+    ok "incus-client installed in runner"
+  else
+    ok "incus-client already in runner"
+  fi
+
+  info "smoke-test : runner can incus list"
+  if ! incus exec forgejo-runner -- incus list >/dev/null 2>&1; then
+    die "runner cannot reach Incus socket — verify nesting + permissions"
+  fi
+  ok "runner has Incus access"
+
+  mark_done r2_runner_socket
+  phase r2_runner_socket DONE
+}
+
+# ============================================================================
+# Phase R3 — runner label = 'incus'
+# ============================================================================
+remote_phase_3_runner_labels() {
+  section "R3 — forgejo-runner labelled 'incus,self-hosted'"
+  _current_phase=r3_runner_labels
+  phase r3_runner_labels START
+
+  if skip_if_done r3_runner_labels "runner labels"; then
+    phase r3_runner_labels DONE; return 0
+  fi
+
+  require_env FORGEJO_REGISTRATION_TOKEN \
+    "set on the SSH command-line by bootstrap-local.sh"
+
+  # Find the runner config inside the container. Path varies by install
+  # method ; act_runner default is /etc/forgejo-runner/.runner.
+  local runner_cfg
+  runner_cfg=$(incus exec forgejo-runner -- bash -c '
+    for f in /etc/forgejo-runner/.runner /var/lib/forgejo-runner/.runner /opt/forgejo-runner/.runner; do
+      [[ -f "$f" ]] && echo "$f" && exit 0
+    done
+    exit 1
+  ' 2>/dev/null) || true
+
+  local labels=""
+  if [[ -n "$runner_cfg" ]]; then
+    labels=$(incus exec forgejo-runner -- jq -r '.labels[]?' "$runner_cfg" 2>/dev/null \
+      || incus exec forgejo-runner -- grep -oE '"labels":\[[^]]+' "$runner_cfg" 2>/dev/null \
+      || echo "")
+  fi
+
+  if echo "$labels" | grep -qw incus; then
+    ok "runner already has 'incus' label"
+    mark_done r3_runner_labels
+    phase r3_runner_labels DONE
+    return 0
+  fi
+
+  info "re-registering runner with labels incus,self-hosted"
+
+  # Stop systemd unit, wipe old registration, re-register, start.
+  incus exec forgejo-runner -- systemctl stop forgejo-runner.service 2>/dev/null \
+    || incus exec forgejo-runner -- systemctl stop act_runner.service 2>/dev/null \
+    || warn "no systemd unit to stop ; will skip"
+
+  [[ -n "$runner_cfg" ]] && incus exec forgejo-runner -- rm -f "$runner_cfg"
+
+  # Detect runner binary name
+  local runner_bin
+  runner_bin=$(incus exec forgejo-runner -- bash -c '
+    for b in forgejo-runner act_runner; do
+      command -v "$b" >/dev/null 2>&1 && echo "$b" && exit 0
+    done
+    exit 1
+  ' 2>/dev/null) || die "no forgejo-runner / act_runner binary found in container"
+
+  incus exec forgejo-runner -- "$runner_bin" register \
+    --no-interactive \
+    --instance "$FORGEJO_API_URL" \
+    --token "$FORGEJO_REGISTRATION_TOKEN" \
+    --name "r720-incus" \
+    --labels "incus,self-hosted"
+
+  incus exec forgejo-runner -- systemctl start "$runner_bin.service" \
+    || incus exec forgejo-runner -- systemctl start forgejo-runner.service
+
+  ok "runner re-registered with incus label"
+
+  mark_done r3_runner_labels
+  phase r3_runner_labels DONE
+}
+
+# ============================================================================
+# Phase R4 — sanity, summary
+# ============================================================================
+remote_phase_4_sanity() {
+  section "R4 — sanity check"
+  _current_phase=r4_sanity
+  phase r4_sanity START
+
+  info "incus profiles :"
+  incus profile list -f csv | grep -E '^veza-' | awk -F, '{print " " $1}'
+
+  info "forgejo-runner status :"
+  incus exec forgejo-runner -- systemctl is-active forgejo-runner.service 2>/dev/null \
+    || incus exec forgejo-runner -- systemctl is-active act_runner.service 2>/dev/null \
+    || warn "no active runner service — verify manually"
+
+  info "forgejo container reachable from runner :"
+  if incus exec forgejo-runner -- curl -sSf -o /dev/null --max-time 5 \
+      "$FORGEJO_API_URL" 2>/dev/null \
+    || incus exec forgejo-runner -- curl -sSf -ko /dev/null --max-time 5 \
+      https://10.0.20.105:3000/ 2>/dev/null \
+    || incus exec forgejo-runner -- curl -sSf -o /dev/null --max-time 5 \
+      http://10.0.20.105:3000/ 2>/dev/null; then
+    ok "runner can reach Forgejo"
+  else
+    warn "runner cannot reach Forgejo — check WireGuard / DNS / firewall"
+  fi
+
+  mark_done r4_sanity
+  phase r4_sanity DONE
+}
+
+main() {
+  local start=${PHASE:-1}
+  info "remote bootstrap starting at phase $start (log: /var/log/talas-bootstrap.log)"
+
+  [[ $start -le 1 ]] && remote_phase_1_profiles
+  [[ $start -le 2 ]] && remote_phase_2_runner_socket
+  [[ $start -le 3 ]] && remote_phase_3_runner_labels
+  [[ $start -le 4 ]] && remote_phase_4_sanity
+
+  ok "remote bootstrap done"
+}
+
+main "$@"
diff --git a/scripts/bootstrap/enable-auto-deploy.sh b/scripts/bootstrap/enable-auto-deploy.sh
new file mode 100755
index 000000000..58a67bc42
--- /dev/null
+++ b/scripts/bootstrap/enable-auto-deploy.sh
@@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+# enable-auto-deploy.sh — flip the workflow_dispatch-only gate on
+# .forgejo/workflows/deploy.yml back to push:main + tag:v*. Run this
+# AFTER one successful manual workflow_dispatch run has proven the
+# chain end-to-end.
+
+set -Eeuo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+. "$SCRIPT_DIR/lib.sh"
+trap_errors
+
+REPO_ROOT=$(git -C "$SCRIPT_DIR" rev-parse --show-toplevel) || die "not in a git repo"
+DEPLOY_YML="$REPO_ROOT/.forgejo/workflows/deploy.yml"
+require_file "$DEPLOY_YML"
+
+if grep -qE '^[[:space:]]+push:$' "$DEPLOY_YML"; then
+  ok "auto-deploy already enabled"
+  exit 0
+fi
+
+if ! grep -qE '^[[:space:]]+# push:' "$DEPLOY_YML"; then
+  die "deploy.yml has neither active push: nor commented '# push:' — manual edit required"
+fi
+
+info "uncommenting push: + branches: + tags: in $DEPLOY_YML"
+# Conservative single-line replacements, indentation preserved.
+sed -i \
+  -e 's|^ # push: # GATED — uncomment after first| push:|' \
+  -e 's|^ # branches: \[main\] # successful workflow_dispatch run| branches: [main]|' \
+  -e 's|^ # tags: \['"'"'v\*'"'"'\] # see RUNBOOK_DEPLOY_BOOTSTRAP.md| tags: ['"'"'v*'"'"']|' \
+  "$DEPLOY_YML"
+
+# Verify.
+if ! grep -qE '^[[:space:]]+push:$' "$DEPLOY_YML"; then
+  die "sed didn't apply — open $DEPLOY_YML and uncomment by hand"
+fi
+
+ok "edited $DEPLOY_YML"
+info "diff:"
+git -C "$REPO_ROOT" --no-pager diff -- "$DEPLOY_YML" >&2
+
+cat >&2 <>>PHASE::<<<` are emitted on stdout
+# so a parent script (bootstrap-local.sh streaming bootstrap-remote.sh
+# over SSH) can grep + parse the progression.
+
+# ----- ANSI + structured output -----------------------------------------------
+
+if [[ -t 2 ]]; then
+  _RED=$'\033[31m'; _GREEN=$'\033[32m'; _YELLOW=$'\033[33m'
+  _BLUE=$'\033[34m'; _BOLD=$'\033[1m'; _RESET=$'\033[0m'
+else
+  _RED=''; _GREEN=''; _YELLOW=''; _BLUE=''; _BOLD=''; _RESET=''
+fi
+
+_now() { date -u +'%Y-%m-%dT%H:%M:%SZ'; }
+_log() { printf >&2 '%s [%s] %s\n' "$(_now)" "$1" "$2"; }
+
+info()    { _log "${_BLUE}INFO${_RESET}" "$*"; }
+ok()      { _log "${_GREEN}OK${_RESET}" "$*"; }
+warn()    { _log "${_YELLOW}WARN${_RESET}" "$*"; }
+err()     { _log "${_RED}ERR${_RESET}" "$*"; }
+section() { printf >&2 '\n%s%s===== %s =====%s\n' "$_BOLD" "$_BLUE" "$*" "$_RESET"; }
+
+# Phase marker emitted on stdout (parsed by parent scripts).
+phase() { printf '>>>PHASE:%s:%s<<<\n' "$1" "$2"; }
+
+# Hard fail with hint.
+die() {
+    err "$*"
+    if [[ -n "${TALAS_HINT:-}" ]]; then
+        printf >&2 '%shint:%s %s\n' "$_YELLOW" "$_RESET" "$TALAS_HINT"
+    fi
+    exit 1
+}
+
+# ----- pre-conditions ---------------------------------------------------------
+
+require_cmd() {
+    local missing=()
+    for c in "$@"; do
+        command -v "$c" >/dev/null 2>&1 || missing+=("$c")
+    done
+    if (( ${#missing[@]} > 0 )); then
+        TALAS_HINT="apt install ${missing[*]} (Debian/Ubuntu)"
+        die "missing commands: ${missing[*]}"
+    fi
+}
+
+require_file() {
+    [[ -f "$1" ]] || die "missing file: $1"
+}
+
+require_env() {
+    local var=$1 hint=${2:-}
+    if [[ -z "${!var:-}" ]]; then
+        TALAS_HINT="$hint"
+        die "env var \$$var is not set"
+    fi
+}
+
+# ----- state file (shared across bootstrap + verify) --------------------------
+# Default: /var/lib/talas/bootstrap.state (override TALAS_STATE_DIR + TALAS_STATE_FILE).
+# One "key=DONE timestamp" line per phase. mark_done is idempotent ; phase_done returns 0 if marked.
+
+: "${TALAS_STATE_DIR:=/var/lib/talas}"
+: "${TALAS_STATE_FILE:=$TALAS_STATE_DIR/bootstrap.state}"
+
+ensure_state_dir() {
+    if [[ ! -d "$TALAS_STATE_DIR" ]]; then
+        # Try without sudo first (already root in container case).
+        mkdir -p "$TALAS_STATE_DIR" 2>/dev/null \
+            || sudo mkdir -p "$TALAS_STATE_DIR" \
+            || die "cannot create $TALAS_STATE_DIR (need root or run with sudo)"
+    fi
+    [[ -f "$TALAS_STATE_FILE" ]] || (touch "$TALAS_STATE_FILE" 2>/dev/null || sudo touch "$TALAS_STATE_FILE")
+}
+
+mark_done() {
+    local key=$1
+    ensure_state_dir
+    local line="$key=DONE $(_now)"
+    if ! grep -q "^$key=" "$TALAS_STATE_FILE" 2>/dev/null; then
+        { echo "$line" >>"$TALAS_STATE_FILE"; } 2>/dev/null || echo "$line" | sudo tee -a "$TALAS_STATE_FILE" >/dev/null
+    fi
+}
+
+phase_done() {
+    local key=$1
+    [[ -f "$TALAS_STATE_FILE" ]] || return 1
+    grep -q "^$key=DONE" "$TALAS_STATE_FILE" 2>/dev/null
+}
+
+skip_if_done() {
+    local key=$1 label=$2
+    if phase_done "$key"; then
+        ok "$label — already done (skipped)"
+        return 0
+    fi
+    return 1
+}
+
+# ----- error trap -------------------------------------------------------------
+
+_trap_err() {
+    local rc=$? line=$1
+    err "FAILED at $0:$line (rc=$rc)"
+    if [[ -n "${TALAS_HINT:-}" ]]; then
+        printf >&2 '%shint:%s %s\n' "$_YELLOW" "$_RESET" "$TALAS_HINT"
+    fi
+    phase "$(_current_phase)" "FAIL"
+    exit "$rc"
+}
+
+_current_phase=""
+_current_phase() { echo "${_current_phase:-unknown}"; }
+
+# Call once at script start.
+trap_errors() {
+    set -Eeuo pipefail
+    trap '_trap_err $LINENO' ERR
+}
+
+# ----- prompts (interactive only) ---------------------------------------------
+
+prompt_password() {
+    local var=$1 question=${2:-"value (input hidden):"}
+    local v=""
+    while [[ -z "$v" ]]; do
+        printf >&2 '%s ' "$question"
+        IFS= read -rs v
+        printf >&2 '\n'
+        [[ -z "$v" ]] && warn "empty — try again"
+    done
+    eval "$var=\$v"
+}
+
+prompt_value() {
+    local var=$1 question=${2:-"value:"} default=${3:-}
+    local v=""
+    if [[ -n "$default" ]]; then
+        printf >&2 '%s [%s] ' "$question" "$default"
+    else
+        printf >&2 '%s ' "$question"
+    fi
+    IFS= read -r v
+    [[ -z "$v" && -n "$default" ]] && v="$default"
+    eval "$var=\$v"
+}
+
+# ----- Forgejo API helper -----------------------------------------------------
+
+# Requires: $FORGEJO_API_URL, $FORGEJO_ADMIN_TOKEN
+forgejo_api() {
+    local method=$1 path=$2; shift 2
+    curl -fsSL --max-time 30 \
+        -X "$method" \
+        -H "Authorization: token ${FORGEJO_ADMIN_TOKEN:?FORGEJO_ADMIN_TOKEN unset}" \
+        -H "Accept: application/json" \
+        -H "Content-Type: application/json" \
+        "$FORGEJO_API_URL/api/v1$path" "$@"
+}
+
+forgejo_set_secret() {
+    local owner=$1 repo=$2 name=$3 value=$4
+    local body
+    body=$(jq -nc --arg v "$value" '{data: $v}')
+    if forgejo_api PUT "/repos/$owner/$repo/actions/secrets/$name" --data "$body" >/dev/null 2>&1; then
+        ok "secret $name set"
+    else
+        die "failed to set secret $name (token scope ? repo path ?)"
+    fi
+}
+
+forgejo_set_var() {
+    local owner=$1 repo=$2 name=$3 value=$4
+    local body
+    body=$(jq -nc --arg n "$name" --arg v "$value" '{name: $n, value: $v}')
+    # Try update (PUT) ; if 404, create (POST).
+    if forgejo_api PUT "/repos/$owner/$repo/actions/variables/$name" --data "$body" >/dev/null 2>&1; then
+        ok "variable $name updated"
+    elif forgejo_api POST "/repos/$owner/$repo/actions/variables/$name" --data "$body" >/dev/null 2>&1; then
+        ok "variable $name created"
+    else
+        die "failed to set variable $name"
+    fi
+}
+
+forgejo_get_runner_token() {
+    local owner=$1 repo=$2
+    forgejo_api GET "/repos/$owner/$repo/actions/runners/registration-token" \
+        | jq -er '.token // empty' \
+        || die "failed to fetch runner registration token (admin scope ?)"
+}
diff --git a/scripts/bootstrap/verify-local.sh b/scripts/bootstrap/verify-local.sh
new file mode 100755
index 000000000..5b633f076
--- /dev/null
+++ b/scripts/bootstrap/verify-local.sh
@@ -0,0 +1,131 @@
+#!/usr/bin/env bash
+# verify-local.sh — read-only checks of local state (vault, secrets, ssh).
+# Exit 0 if everything passes ; non-zero with a count of failures.
+
+set -uo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=lib.sh
+. "$SCRIPT_DIR/lib.sh"
+
+[[ -f "$SCRIPT_DIR/.env" ]] && . "$SCRIPT_DIR/.env"
+
+: "${R720_HOST:=10.0.20.150}"
+: "${R720_USER:=ansible}"
+: "${FORGEJO_API_URL:=https://forgejo.talas.group}"
+: "${FORGEJO_OWNER:=talas}"
+: "${FORGEJO_REPO:=veza}"
+
+REPO_ROOT=$(git -C "$SCRIPT_DIR" rev-parse --show-toplevel 2>/dev/null) || {
+    err "not in a git repo"
+    exit 1
+}
+
+VAULT_YML="$REPO_ROOT/infra/ansible/group_vars/all/vault.yml"
+VAULT_PASS="$REPO_ROOT/infra/ansible/.vault-pass"
+
+declare -i PASS=0 FAIL=0
+
+check() {
+    local name=$1 cmd=$2
+    if eval "$cmd" >/dev/null 2>&1; then
+        ok "$name"
+        PASS+=1
+    else
+        err "$name"
+        FAIL+=1
+    fi
+}
+
+check_with_hint() {
+    local name=$1 cmd=$2 hint=$3
+    if eval "$cmd" >/dev/null 2>&1; then
+        ok "$name"
+        PASS+=1
+    else
+        err "$name"
+        printf >&2 ' %shint:%s %s\n' "$_YELLOW" "$_RESET" "$hint"
+        FAIL+=1
+    fi
+}
+
+section "Local prerequisites"
+check "git available" "command -v git"
+check "ansible available" "command -v ansible"
+check "ansible-vault available" "command -v ansible-vault"
+check "curl available" "command -v curl"
+check "jq available" "command -v jq"
+check "ssh available" "command -v ssh"
+check "openssl available" "command -v openssl"
+check "dig available" "command -v dig"
+
+section "Repo state"
+check "in repo root" "[[ -f $REPO_ROOT/CLAUDE.md ]]"
+check "infra/ansible/ exists" "[[ -d $REPO_ROOT/infra/ansible ]]"
+check ".forgejo/workflows/deploy.yml" "[[ -f $REPO_ROOT/.forgejo/workflows/deploy.yml ]]"
+check_with_hint "deploy.yml gated (no auto-trigger)" \
+    "! grep -E '^[[:space:]]+push:$' $REPO_ROOT/.forgejo/workflows/deploy.yml" \
+    "if you want auto-deploy, run scripts/bootstrap/enable-auto-deploy.sh"
+
+section "Vault"
+check "vault.yml.example exists" "[[ -f $REPO_ROOT/infra/ansible/group_vars/all/vault.yml.example ]]"
+check "vault.yml exists" "[[ -f $VAULT_YML ]]"
+check_with_hint "vault.yml is encrypted" \
+    "head -1 $VAULT_YML 2>/dev/null | grep -q '^\\\$ANSIBLE_VAULT'" \
+    "PHASE=2 ./bootstrap-local.sh"
+check_with_hint ".vault-pass exists" \
+    "[[ -f $VAULT_PASS ]]" \
+    "PHASE=2 ./bootstrap-local.sh"
+check_with_hint ".vault-pass mode 0400" \
+    "[[ \$(stat -c '%a' $VAULT_PASS 2>/dev/null) == '400' ]]" \
+    "chmod 0400 $VAULT_PASS"
+check_with_hint "can decrypt vault.yml" \
+    "ansible-vault view --vault-password-file $VAULT_PASS $VAULT_YML" \
+    "vault password mismatch — re-encrypt with: ansible-vault rekey --new-vault-password-file $VAULT_PASS $VAULT_YML"
+check_with_hint "no placeholders left" \
+    "! ansible-vault view --vault-password-file $VAULT_PASS $VAULT_YML 2>/dev/null | grep -q '/dev/null 2>&1'"
+
+section "DNS public domains"
+for d in veza.fr www.veza.fr staging.veza.fr talas.fr www.talas.fr forgejo.talas.group; do
+    check_with_hint "$d resolves" \
+        "dig +short +time=2 +tries=1 $d @1.1.1.1 | grep -qE '^[0-9]+\\.'" \
+        "set the A record at your registrar to point to your R720 public IP"
+done
+
+if [[ -n "${FORGEJO_ADMIN_TOKEN:-}" ]]; then
+    section "Forgejo API + secrets/vars"
+    check_with_hint "Forgejo API reachable" \
+        "curl -fsSL --max-time 10 -H 'Authorization: token $FORGEJO_ADMIN_TOKEN' $FORGEJO_API_URL/api/v1/user" \
+        "set FORGEJO_API_URL ; if no DNS yet, FORGEJO_API_URL=http://10.0.20.105:3000"
+    check_with_hint "repo $FORGEJO_OWNER/$FORGEJO_REPO exists" \
+        "curl -fsSL -H 'Authorization: token $FORGEJO_ADMIN_TOKEN' $FORGEJO_API_URL/api/v1/repos/$FORGEJO_OWNER/$FORGEJO_REPO" \
+        "set FORGEJO_OWNER + FORGEJO_REPO env vars"
+
+    check_with_hint "secret FORGEJO_REGISTRY_TOKEN exists" \
+        "curl -fsSL -H 'Authorization: token $FORGEJO_ADMIN_TOKEN' $FORGEJO_API_URL/api/v1/repos/$FORGEJO_OWNER/$FORGEJO_REPO/actions/secrets/FORGEJO_REGISTRY_TOKEN" \
+        "PHASE=3 ./bootstrap-local.sh"
+    check_with_hint "secret ANSIBLE_VAULT_PASSWORD exists" \
+        "curl -fsSL -H 'Authorization: token $FORGEJO_ADMIN_TOKEN' $FORGEJO_API_URL/api/v1/repos/$FORGEJO_OWNER/$FORGEJO_REPO/actions/secrets/ANSIBLE_VAULT_PASSWORD" \
+        "PHASE=3 ./bootstrap-local.sh"
+    check_with_hint "variable FORGEJO_REGISTRY_URL exists" \
+        "curl -fsSL -H 'Authorization: token $FORGEJO_ADMIN_TOKEN' $FORGEJO_API_URL/api/v1/repos/$FORGEJO_OWNER/$FORGEJO_REPO/actions/variables/FORGEJO_REGISTRY_URL" \
+        "PHASE=3 ./bootstrap-local.sh"
+else
+    warn "FORGEJO_ADMIN_TOKEN not set — skipping API checks. Set it to run those."
+fi
+
+section "Result"
+if (( FAIL == 0 )); then
+    ok "$PASS / $((PASS + FAIL)) checks passed"
+    exit 0
+else
+    err "$FAIL / $((PASS + FAIL)) checks failed ($PASS passed)"
+    exit "$(( FAIL < 255 ? FAIL : 255 ))"
+fi
diff --git a/scripts/bootstrap/verify-remote.sh b/scripts/bootstrap/verify-remote.sh
new file mode 100755
index 000000000..8910b3157
--- /dev/null
+++ b/scripts/bootstrap/verify-remote.sh
@@ -0,0 +1,122 @@
+#!/usr/bin/env bash
+# verify-remote.sh — read-only checks of R720 state (Incus profiles,
+# runner labels, container reachability, certs). Run on the R720 itself
+# (locally or via `ssh r720 verify-remote.sh`).
+#
+# Exit 0 if everything passes ; non-zero with a count of failures.
+
+set -uo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=lib.sh
+. "$SCRIPT_DIR/lib.sh"
+
+: "${FORGEJO_API_URL:=https://forgejo.talas.group}"
+
+declare -i PASS=0 FAIL=0
+
+check() {
+    local name=$1 cmd=$2
+    if eval "$cmd" >/dev/null 2>&1; then
+        ok "$name"
+        PASS+=1
+    else
+        err "$name"
+        FAIL+=1
+    fi
+}
+
+check_with_hint() {
+    local name=$1 cmd=$2 hint=$3
+    if eval "$cmd" >/dev/null 2>&1; then
+        ok "$name"
+        PASS+=1
+    else
+        err "$name"
+        printf >&2 ' %shint:%s %s\n' "$_YELLOW" "$_RESET" "$hint"
+        FAIL+=1
+    fi
+}
+
+section "R720 prerequisites"
+check "incus available" "command -v incus"
+check "zfs available" "command -v zfs"
+check "incus list works" "incus list"
+
+section "Incus profiles"
+for p in veza-app veza-data veza-net; do
+    check_with_hint "profile $p exists" \
+        "incus profile show $p" \
+        "run scripts/bootstrap/bootstrap-remote.sh as root"
+done
+
+section "Forgejo container"
+check "container 'forgejo' exists" "incus info forgejo"
+check "container 'forgejo' RUNNING" \
+    "incus list -f csv -c ns 2>/dev/null | grep -qx 'forgejo,RUNNING'"
+check_with_hint "Forgejo HTTP responds on :3000" \
+    "curl -ksSf -o /dev/null --max-time 5 http://10.0.20.105:3000/ || curl -ksSf -o /dev/null --max-time 5 https://10.0.20.105:3000/" \
+    "incus exec forgejo -- systemctl status forgejo"
+
+section "Forgejo runner"
+check "container 'forgejo-runner' exists" "incus info forgejo-runner"
+check "container 'forgejo-runner' RUNNING" \
+    "incus list -f csv -c ns 2>/dev/null | grep -qx 'forgejo-runner,RUNNING'"
+check_with_hint "incus-socket device attached" \
+    "incus config device show forgejo-runner | grep -q '^incus-socket:'" \
+    "PHASE=2 sudo bash scripts/bootstrap/bootstrap-remote.sh"
+check_with_hint "security.nesting=true" \
+    "[[ \$(incus config get forgejo-runner security.nesting) == true ]]" \
+    "incus config set forgejo-runner security.nesting=true && incus restart forgejo-runner"
+check_with_hint "incus-client installed in runner" \
+    "incus exec forgejo-runner -- sh -c 'command -v incus'" \
+    "incus exec forgejo-runner -- apt install -y incus-client"
+check_with_hint "runner can incus list (socket reachable)" \
+    "incus exec forgejo-runner -- incus list" \
+    "verify the unix-socket disk device + nesting"
+check_with_hint "runner config has 'incus' label" \
+    "incus exec forgejo-runner -- bash -c 'for f in /etc/forgejo-runner/.runner /var/lib/forgejo-runner/.runner /opt/forgejo-runner/.runner ; do [[ -f \$f ]] && grep -q incus \$f && exit 0 ; done ; exit 1'" \
+    "PHASE=3 sudo bash scripts/bootstrap/bootstrap-remote.sh"
+check_with_hint "runner systemd unit active" \
+    "incus exec forgejo-runner -- bash -c 'systemctl is-active forgejo-runner.service 2>/dev/null || systemctl is-active act_runner.service'" \
+    "incus exec forgejo-runner -- journalctl -u forgejo-runner -n 50"
+
+section "Edge HAProxy (only after running playbooks/haproxy.yml)"
+if incus info veza-haproxy >/dev/null 2>&1; then
+    check "container 'veza-haproxy' RUNNING" \
+        "incus list -f csv -c ns | grep -qx 'veza-haproxy,RUNNING'"
+    check_with_hint "haproxy systemd unit active" \
+        "incus exec veza-haproxy -- systemctl is-active haproxy" \
+        "incus exec veza-haproxy -- journalctl -u haproxy -n 50"
+    check_with_hint "haproxy.cfg present" \
+        "incus exec veza-haproxy -- test -f /etc/haproxy/haproxy.cfg" \
+        "ansible-playbook -i inventory/staging.yml playbooks/haproxy.yml"
+    check_with_hint "haproxy.cfg passes self-validation" \
+        "incus exec veza-haproxy -- haproxy -f /etc/haproxy/haproxy.cfg -c -q" \
+        "config syntax error — re-run ansible-playbook to re-render"
+    check_with_hint "Let's Encrypt cert dir has at least 1 .pem" \
+        "incus exec veza-haproxy -- bash -c 'ls /usr/local/etc/tls/haproxy/*.pem 2>/dev/null | wc -l | grep -q -E \"^[1-9]\"'" \
+        "rerun ansible-playbook ; verify port 80 reachable from Internet for HTTP-01"
+else
+    warn "container 'veza-haproxy' does not exist yet — run ansible-playbook playbooks/haproxy.yml"
+fi
+
+section "ZFS state (snapshots tolerated)"
+check "rpool exists" \
+    "zpool list rpool"
+
+section "State file"
+if [[ -f "$TALAS_STATE_FILE" ]]; then
+    info "phases recorded :"
+    sed 's/^/ /' "$TALAS_STATE_FILE" >&2
+else
+    warn "no state file at $TALAS_STATE_FILE — bootstrap-remote.sh hasn't run yet"
+fi
+
+section "Result"
+if (( FAIL == 0 )); then
+    ok "$PASS / $((PASS + FAIL)) checks passed"
+    exit 0
+else
+    err "$FAIL / $((PASS + FAIL)) checks failed ($PASS passed)"
+    exit "$(( FAIL < 255 ? FAIL : 255 ))"
+fi