veza/scripts/bootstrap/verify-r720.sh
senke 3b33791660 refactor(bootstrap): everything via Ansible — no NOPASSWD, no SSH plumbing
Rearchitecture after operator pushback: the previous design did
too much in bash (SSH-streamed script chunks, a manual sudo dance,
a NOPASSWD requirement). Ansible is the right tool. The shell
scripts are now thin orchestrators that handle the chicken-and-egg
of vault + Forgejo CI provisioning, then call ansible-playbook.

Key principles:
  1. NO NOPASSWD sudo on the R720. --ask-become-pass prompts
     interactively; the password is held in Ansible memory only
     for the run.
  2. Two parallel scripts — one per host, fully self-contained.
  3. Both run the SAME Ansible playbooks (bootstrap_runner.yml +
     haproxy.yml). The only difference is the inventory.
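The shared-invocation principle can be sketched as a single helper both scripts could use (a sketch only; `build_play_cmd` and the behavior around EUID are illustrative, not the actual script contents):

```shell
#!/usr/bin/env bash
# Sketch of the shared ansible-playbook invocation; only the inventory
# differs between bootstrap-local.sh and bootstrap-r720.sh.
# build_play_cmd is a hypothetical helper, not actual script code.
set -euo pipefail

build_play_cmd() {
  local inventory=$1
  local cmd=(ansible-playbook -i "$inventory"
             playbooks/bootstrap_runner.yml playbooks/haproxy.yml)
  # Only a non-root run needs the interactive sudo prompt.
  if [[ ${EUID:-$(id -u)} -ne 0 ]]; then
    cmd+=(--ask-become-pass)
  fi
  printf '%s\n' "${cmd[*]}"
}

build_play_cmd inventory/local.yml
```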

Files (new + replaced):

  ansible.cfg
    pipelining=True → False. Required for --ask-become-pass to
    work reliably; the previous setting raced sudo's password
    prompt and timed out after 12s.
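As I understand the change, the relevant stanza looks roughly like this (a sketch; surrounding options omitted):

```ini
[ssh_connection]
# Pipelining reuses one SSH channel per task, but it conflicts with
# the interactive sudo prompt that --ask-become-pass relies on.
pipelining = False
```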

  playbooks/bootstrap_runner.yml (new)
    The Incus-host-side bootstrap, ported from the old
    scripts/bootstrap/bootstrap-remote.sh. Three plays:
      Phase 1: ensure veza-app + veza-data profiles exist;
               drop legacy empty veza-net profile.
      Phase 2: forgejo-runner gets /var/lib/incus/unix.socket
               attached as a disk device, security.nesting=true,
               /usr/bin/incus pushed in as /usr/local/bin/incus,
               smoke-tested.
      Phase 3: forgejo-runner registered with the `incus,self-hosted`
               label (idempotent — skips if already labelled).
    Each task uses Ansible idioms (`incus_profile`, `incus_command`
    where they exist, `command:` with `failed_when` and explicit
    state-checking elsewhere). no_log on the registration token.
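As a rough illustration of the no_log + idempotency idiom described above (the variables, paths, and skip mechanism here are assumptions, not the actual playbook):

```yaml
# Illustrative only — not the actual playbook contents.
- name: Register runner with Forgejo (skipped when already registered)
  command: >
    forgejo-runner register --no-interactive
    --instance {{ forgejo_url }}
    --token {{ runner_token }}
    --labels incus,self-hosted
  args:
    creates: /var/lib/forgejo-runner/.runner
  no_log: true   # never print the registration token
```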

  inventory/local.yml (new)
    Inventory for `bootstrap-r720.sh` — connection: local instead
    of SSH+become. Same group structure as staging.yml; container
    groups use the community.general.incus connection plugin (the
    local incus binary, no remote).
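A minimal sketch of what this inventory might contain, assuming the group layout described above (host and group names are illustrative):

```yaml
# Illustrative sketch — not the committed inventory/local.yml.
all:
  children:
    incus_host:
      hosts:
        r720:
          ansible_connection: local
    forgejo_runner:
      hosts:
        forgejo-runner:
          ansible_connection: community.general.incus
```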

  inventory/{staging,prod}.yml (modified)
    Added `forgejo_runner` group (target of bootstrap_runner.yml
    phase 3, reached via community.general.incus from the host).

  scripts/bootstrap/bootstrap-local.sh (rewritten)
    Five phases: preflight, vault, forgejo, ansible, summary.
    Phase 4 calls a single `ansible-playbook` with both
    bootstrap_runner.yml + haproxy.yml in sequence.
    --ask-become-pass: Ansible prompts ONCE for sudo, holds the
    password in memory, and reuses it for every become: true task.

  scripts/bootstrap/bootstrap-r720.sh (new)
    Symmetric to bootstrap-local.sh but runs as root on the R720.
    No SSH preflight, no --ask-become-pass (already root).
    Same Ansible playbooks, inventory/local.yml.

  scripts/bootstrap/verify-r720.sh (new — replaces verify-remote)
    Read-only checks of R720 state. Run as root locally on the R720.

  scripts/bootstrap/verify-local.sh (modified)
    Cross-host SSH check now follows the env-var-driven SSH_TARGET
    pattern (R720_USER may be empty if the ssh_config alias already
    sets User=).
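That pattern can be sketched as follows (`build_ssh_target` is a hypothetical helper; R720_USER and R720_HOST come from the description above):

```shell
#!/usr/bin/env bash
# Sketch: build the SSH target, tolerating an empty R720_USER
# (the ssh_config alias may already carry User=). Hypothetical helper,
# not the actual verify-local.sh code.
set -euo pipefail

build_ssh_target() {
  local user="${R720_USER:-}" host="${R720_HOST:?R720_HOST must be set}"
  if [[ -n "$user" ]]; then
    printf '%s@%s\n' "$user" "$host"
  else
    printf '%s\n' "$host"
  fi
}

R720_HOST=r720 build_ssh_target               # → r720
R720_USER=ops R720_HOST=r720 build_ssh_target # → ops@r720
```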

  scripts/bootstrap/{bootstrap-remote.sh, verify-remote.sh,
  verify-remote-ssh.sh} (DELETED)
    Replaced by playbooks/bootstrap_runner.yml + verify-r720.sh.

  README.md (rewritten)
    Documents the parallel-script architecture, the
    no-NOPASSWD-sudo design choice (--ask-become-pass), each
    phase's needs, and a refreshed troubleshooting list.

State files unchanged in shape :
  laptop : .git/talas-bootstrap/local.state
  R720   : /var/lib/talas/r720-bootstrap.state
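The shape these files share can be illustrated with a small sketch (a guess at the one-line-per-phase pattern; the function names are hypothetical, not lib.sh code):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a one-line-per-completed-phase state file,
# letting a rerun skip finished phases.
set -euo pipefail

STATE_FILE="$(mktemp)"

phase_done()   { grep -qx "$1" "$STATE_FILE" 2>/dev/null; }
record_phase() { phase_done "$1" || echo "$1" >> "$STATE_FILE"; }

record_phase preflight
record_phase preflight   # idempotent: recorded once
record_phase ansible
cat "$STATE_FILE"        # → preflight, then ansible
```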

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:12:26 +02:00

#!/usr/bin/env bash
# verify-r720.sh — read-only checks on the R720 itself.
#
# Run as root:
#   sudo bash scripts/bootstrap/verify-r720.sh
#
# Symmetric to verify-local.sh — exit code = number of failures.
set -uo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
. "$SCRIPT_DIR/lib.sh"
[[ $EUID -ne 0 ]] && warn "running without root — some checks may fail (incus list, ZFS)"
declare -i PASS=0 FAIL=0
check() {
  local name=$1 cmd=$2
  if eval "$cmd" >/dev/null 2>&1; then ok "$name"; PASS+=1; else err "$name"; FAIL+=1; fi
}
check_with_hint() {
  local name=$1 cmd=$2 hint=$3
  if eval "$cmd" >/dev/null 2>&1; then ok "$name"; PASS+=1
  else err "$name"; printf >&2 ' %shint:%s %s\n' "$_YELLOW" "$_RESET" "$hint"; FAIL+=1
  fi
}
section "Host prerequisites"
check "incus binary" "command -v incus"
check "ansible binary" "command -v ansible"
check "zfs binary" "command -v zfs"
check "incus daemon reachable" "incus list"
section "Incus profiles"
check_with_hint "profile veza-app" "incus profile show veza-app" \
  "rerun bootstrap-r720.sh phase 4"
check_with_hint "profile veza-data" "incus profile show veza-data" \
  "rerun bootstrap-r720.sh phase 4"
section "Incus networks"
check_with_hint "net-veza network exists" "incus network show net-veza" \
  "incus network create net-veza ipv4.address=10.0.20.1/24 ipv4.nat=true"
section "Forgejo"
check "forgejo container exists" "incus info forgejo"
check "forgejo container RUNNING" "incus list forgejo -f csv -c s 2>/dev/null | grep -q RUNNING"
check "forgejo HTTP responds" "curl -ksSf -o /dev/null --max-time 5 https://10.0.20.105:3000/api/v1/version || curl -sSf -o /dev/null --max-time 5 http://10.0.20.105:3000/api/v1/version"
section "forgejo-runner"
check "runner container exists" "incus info forgejo-runner"
check "runner container RUNNING" "incus list forgejo-runner -f csv -c s 2>/dev/null | grep -q RUNNING"
check_with_hint "incus-socket device attached" \
  "incus config device show forgejo-runner | grep -q '^incus-socket:'" \
  "rerun bootstrap-r720.sh phase 4"
check_with_hint "security.nesting=true" \
  "[[ \$(incus config get forgejo-runner security.nesting) == true ]]" \
  "incus config set forgejo-runner security.nesting=true && incus restart forgejo-runner"
check_with_hint "incus binary in runner" \
  "incus exec forgejo-runner -- test -x /usr/local/bin/incus" \
  "rerun bootstrap-r720.sh phase 4"
check_with_hint "runner has 'incus' label" \
  "incus exec forgejo-runner -- bash -c 'for f in /etc/forgejo-runner/.runner /var/lib/forgejo-runner/.runner /opt/forgejo-runner/.runner; do [[ -f \$f ]] && grep -q incus \$f && exit 0; done; exit 1'" \
  "rerun bootstrap-r720.sh phase 4 (will re-register)"
check_with_hint "runner systemd unit active" \
  "incus exec forgejo-runner -- bash -c 'systemctl is-active forgejo-runner.service 2>/dev/null || systemctl is-active act_runner.service'" \
  "incus exec forgejo-runner -- journalctl -u forgejo-runner -n 50"
section "Edge HAProxy (post-haproxy.yml run)"
if incus info veza-haproxy >/dev/null 2>&1; then
  check "veza-haproxy RUNNING" "incus list veza-haproxy -f csv -c s | grep -q RUNNING"
  check_with_hint "haproxy systemd unit active" \
    "incus exec veza-haproxy -- systemctl is-active haproxy" \
    "incus exec veza-haproxy -- journalctl -u haproxy -n 50"
  check_with_hint "haproxy.cfg validates" \
    "incus exec veza-haproxy -- haproxy -f /etc/haproxy/haproxy.cfg -c -q" \
    "rerun playbooks/haproxy.yml — config syntax error"
  check_with_hint "Let's Encrypt cert dir has at least 1 .pem" \
    "incus exec veza-haproxy -- bash -c 'ls /usr/local/etc/tls/haproxy/*.pem 2>/dev/null | grep -q .'" \
    "verify port 80 reachable from the Internet; rerun playbooks/haproxy.yml"
else
  warn "veza-haproxy doesn't exist yet — run bootstrap-r720.sh phase 4"
fi
section "ZFS"
check "rpool exists" "zpool list rpool"
section "State file"
if [[ -f "$TALAS_STATE_FILE" ]]; then
  info "phases recorded:"
  sed 's/^/ /' "$TALAS_STATE_FILE"
else
  warn "no state file at $TALAS_STATE_FILE — bootstrap-r720.sh hasn't run yet"
fi
section "Result"
if (( FAIL == 0 )); then
  ok "$PASS / $((PASS + FAIL)) checks passed"
  exit 0
else
  err "$FAIL failed out of $((PASS + FAIL)) ($PASS passed)"
  exit 1
fi