senke/veza - Talas Project: Beyond coding. We Forge.

senke/veza

Author SHA1 Message Date


senke

947630e38f

fix(ansible): point community.general.incus connection at the R720 remote

The connection plugin defaulted to remote=`local` and looked for
containers in the operator's local Incus, which doesn't have them.
Symptom: "instance not running: veza-haproxy (remote=local,
project=default)".

The operator already has an incus remote configured that points at
the R720 (in this case named `srv-102v`). The plugin honors
`ansible_incus_remote` as an override for the default; setting it on
every container group (haproxy, forgejo_runner, veza_app_*,
veza_data_*) routes container-side tasks through that remote.

Default value : `srv-102v` (what this operator uses). Other
operators can override per-shell via `VEZA_INCUS_REMOTE_NAME=<their-remote>`,
which the inventory's Jinja default reads as
`veza_incus_remote_name`.
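
The fallback described above can be sketched in shell. This is only an
illustration of the override-or-default rule, not the inventory's Jinja
expression; `resolve_incus_remote` is a hypothetical helper name and
`my-r720` is a made-up remote name.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors in shell the fallback the commit message describes
# for the inventory. resolve_incus_remote is a hypothetical helper name.

resolve_incus_remote() {
  # Use the operator's override when it is set and non-empty,
  # otherwise fall back to the default remote name from the commit.
  printf '%s\n' "${VEZA_INCUS_REMOTE_NAME:-srv-102v}"
}

# Per-shell override (assumed invocation; "my-r720" is illustrative):
#   VEZA_INCUS_REMOTE_NAME=my-r720 ansible-playbook playbooks/haproxy.yml
```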

.env.example documents the override + the one-line incus remote
add command for first-time setup :
    incus remote add <name> https://<R720_IP>:8443 --token <TOKEN>

inventory/local.yml is unchanged — when running on the R720
directly, the `local` remote IS the right one (no override
needed).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 15:42:44 +02:00

senke

e004e18738

fix(bootstrap): handle workflows.disabled/ + self-signed Forgejo + better .env defaults

After running the new bootstrap on a fresh machine, three issues
surfaced that block phases 1–3:

1. .forgejo/workflows/ may live under workflows.disabled/
   The parallel session (5e1e2bd7) renamed the directory as a
   stop-the-bleeding measure rather than just commenting out the
   trigger.
   verify-local.sh now reports both states correctly.
   enable-auto-deploy.sh does `git mv workflows.disabled
   workflows` first, then proceeds to uncomment if needed.

2. Forgejo on 10.0.20.105:3000 serves a self-signed cert
   On the first run, before the edge HAProxy + LE are up, the
   bootstrap has to talk to Forgejo via the LAN IP. lib.sh's
   forgejo_api helper now honours FORGEJO_INSECURE=1 (passes -k to
   curl). verify-local.sh's API checks pick up the same flag.
   .env.example documents the switch: start with FORGEJO_INSECURE=1
   and https://10.0.20.105:3000, then flip to
   https://forgejo.talas.group + FORGEJO_INSECURE=0 once the edge
   HAProxy + LE cert are up.

3. SSH defaults wrong for the actual environment
   .env.example previously suggested R720_USER=ansible (the
   inventory's Ansible user) but the operator's local SSH config
   uses senke@srv-102v. Updated defaults : R720_HOST=srv-102v,
   R720_USER=senke. Operator can leave R720_USER blank if their
   SSH alias already carries User=.
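
A minimal sketch of how the FORGEJO_INSECURE switch from item 2 could be
wired into a curl wrapper. The real forgejo_api in lib.sh is not shown in
this log, so the body below is an assumption: FORGEJO_URL, FORGEJO_TOKEN
and CURL_BIN are illustrative variable names, and CURL_BIN exists only so
the wrapper can be dry-run with `echo`.

```shell
#!/usr/bin/env bash
# Sketch only: not the lib.sh implementation. FORGEJO_URL, FORGEJO_TOKEN
# and CURL_BIN are assumed names introduced for this example.

# forgejo_api <api-path> [extra curl args...]
forgejo_api() {
  local path="$1"
  shift
  local base="${FORGEJO_URL:?FORGEJO_URL is required}"
  local args=(--silent --show-error --fail)
  # FORGEJO_INSECURE=1 adds -k so curl accepts a self-signed certificate.
  if [ "${FORGEJO_INSECURE:-0}" = "1" ]; then
    args+=(-k)
  fi
  args+=(-H "Authorization: token ${FORGEJO_TOKEN:-}")
  # Normalize slashes so base and path join with exactly one "/".
  "${CURL_BIN:-curl}" "${args[@]}" "$@" "${base%/}/api/v1/${path#/}"
}
```

Running it with `CURL_BIN=echo` prints the argument list instead of
making a request, which is a cheap way to confirm that -k appears only
while FORGEJO_INSECURE=1.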

Plus two new helper scripts :

  reset-vault.sh: recovery path for when the vault password in
  .vault-pass doesn't match the one that encrypted vault.yml. Asks
  for confirmation before the destructive step, then removes
  vault.yml + .vault-pass, clears the vault=DONE marker in
  local.state, and points the operator at PHASE=2.

  verify-remote-ssh.sh: wrapper that copies lib.sh +
  verify-remote.sh to the R720 with scp and runs verify-remote.sh
  under sudo. Removes the need to clone the repo on the R720.
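
The marker-clearing step of reset-vault.sh could look roughly like the
sketch below. The script itself is not shown in this log;
`clear_phase_marker` is a hypothetical helper, and the state-file line
format is assumed to be `<phase>=DONE <timestamp>`.

```shell
#!/usr/bin/env bash
# Sketch only: clear_phase_marker is a hypothetical helper name.

# clear_phase_marker <state-file> <phase>
# Drops the "<phase>=DONE ..." line and keeps every other line.
clear_phase_marker() {
  local state_file="$1" phase="$2" tmp
  [ -f "$state_file" ] || return 0   # nothing recorded yet
  tmp="$(mktemp)"
  # grep -v exits 1 when no line survives; that is not an error here.
  grep -v "^${phase}=DONE" "$state_file" > "$tmp" || true
  mv "$tmp" "$state_file"
}
```

After `clear_phase_marker .git/talas-bootstrap/local.state vault`, a
`PHASE=2 ./bootstrap-local.sh` run would no longer see the vault phase
as completed.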

bootstrap-local.sh's phase 2 vault-decrypt failure now hints at
reset-vault.sh.

README.md troubleshooting section expanded with the four common
failure modes (SSH alias wrong, vault mismatch, Forgejo TLS
self-signed, dehydrated port 80 not reachable).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 23:01:05 +02:00

senke

cf38ff2b7d

feat(bootstrap): two-host deploy-pipeline bootstrap with idempotent verify

Replace the long manual checklist (RUNBOOK_DEPLOY_BOOTSTRAP) with
six scripts. Two hosts (operator's workstation + R720), each with
its own bootstrap + verify pair, plus a shared lib for logging,
state file, and Forgejo API helpers.

Files :
  scripts/bootstrap/
   ├── lib.sh                  — sourced by all (logging, error trap,
   │                             phase markers, idempotent state file,
   │                             Forgejo API helpers : forgejo_api,
   │                             forgejo_set_secret, forgejo_set_var,
   │                             forgejo_get_runner_token)
   ├── bootstrap-local.sh      — drives 6 phases on the operator's
   │                             workstation
   ├── bootstrap-remote.sh     — runs on the R720 (over SSH) ; 4 phases
   ├── verify-local.sh         — read-only check of local state
   ├── verify-remote.sh        — read-only check of R720 state
   ├── enable-auto-deploy.sh   — flips the deploy.yml gate after a
   │                             successful manual run
   ├── .env.example            — template for site config
   └── README.md               — usage + troubleshooting

Phases :
  Local
   1. preflight       — required tools, SSH to R720, DNS resolution
   2. vault           — render vault.yml from example, autogenerate JWT
                        keys, prompt+encrypt, write .vault-pass
   3. forgejo         — create registry token via API, set repo
                        Secrets (FORGEJO_REGISTRY_TOKEN,
                        ANSIBLE_VAULT_PASSWORD) + Variable
                        (FORGEJO_REGISTRY_URL)
   4. r720            — fetch runner registration token, stream
                        bootstrap-remote.sh + lib.sh over SSH
   5. haproxy         — ansible-playbook playbooks/haproxy.yml ;
                        verify Let's Encrypt certs landed on the
                        veza-haproxy container
   6. summary         — readiness report
  Remote
   R1. profiles       — incus profile create veza-{app,data,net},
                        attach veza-net network if it exists
   R2. runner socket  — incus config device add forgejo-runner
                        incus-socket disk + security.nesting=true
                        + apt install incus-client inside the runner
   R3. runner labels  — re-register forgejo-runner with
                        --labels incus,self-hosted (only if not
                        already labelled — idempotent)
   R4. sanity         — runner ↔ Incus + runner ↔ Forgejo smoke

Inter-script communication :
  * The SSH stream is the synchronization primitive: the local
    script invokes the remote one and blocks until it returns.
  * The remote script emits structured `>>>PHASE:<name>:<status><<<`
    markers on stdout; the local script tees them to stderr so the
    operator sees remote progress in real time.
  * Persistent state files survive disconnects :
      local : <repo>/.git/talas-bootstrap/local.state
      R720  : /var/lib/talas/bootstrap.state
    Both hold one `phase=DONE timestamp` line per completed phase.
    Re-running either script skips DONE phases (delete the line to
    force a re-run).
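
The marker protocol above can be sketched as a pair of functions, one
for each end of the SSH stream. Both function names are hypothetical;
only the `>>>PHASE:<name>:<status><<<` format and the FAIL status come
from this commit message.

```shell
#!/usr/bin/env bash
# Sketch only: emit_phase_marker and relay_phase_markers are
# hypothetical names, not functions known to exist in lib.sh.

# Remote side: print one marker line on stdout.
emit_phase_marker() {
  printf '>>>PHASE:%s:%s<<<\n' "$1" "$2"
}

# Local side: read the remote stream on stdin, pass every line through
# on stdout, and report each marker on stderr as "phase <name>: <status>".
relay_phase_markers() {
  local line body prefix='>>>PHASE:' suffix='<<<'
  while IFS= read -r line; do
    printf '%s\n' "$line"
    case "$line" in
      "$prefix"*"$suffix")
        body="${line#"$prefix"}"
        body="${body%"$suffix"}"
        printf 'phase %s: %s\n' "${body%%:*}" "${body#*:}" >&2
        ;;
    esac
  done
}
```

Piping the SSH session's stdout through `relay_phase_markers` would keep
the full remote output intact while surfacing phase progress separately.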

Resumable :
  PHASE=N ./bootstrap-local.sh    # restart at phase N
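
A rough sketch of a resumable phase loop, assuming each phase `foo` is
implemented by a function `phase_foo`. The helper names, the timestamp
format, and the choice to skip DONE phases even when PHASE=N is given
are assumptions; the actual bootstrap-local.sh is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and timestamp format are assumptions.
STATE_FILE="${STATE_FILE:-.git/talas-bootstrap/local.state}"

# True when the state file holds a "<phase>=DONE ..." line.
phase_done() {
  [ -f "$STATE_FILE" ] && grep -q "^$1=DONE" "$STATE_FILE"
}

# Append a "<phase>=DONE <timestamp>" line to the state file.
mark_done() {
  mkdir -p "$(dirname "$STATE_FILE")"
  printf '%s=DONE %s\n' "$1" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$STATE_FILE"
}

# run_phases <name>...
# Runs phases in order, starting at number PHASE (default 1) and
# skipping any phase already recorded as DONE.
run_phases() {
  local start="${PHASE:-1}" n=0 name
  for name in "$@"; do
    n=$((n + 1))
    [ "$n" -ge "$start" ] || continue
    if phase_done "$name"; then
      echo "skip $name (already DONE)"
      continue
    fi
    "phase_$name"
    mark_done "$name"
  done
}
```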

Idempotency guards :
  Every state-mutating action is preceded by a state-checking guard
  that returns 0 if already applied (incus profile show, jq label
  parse, file existence + mode check, Forgejo API GET, etc.).
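
The guard-then-apply pattern could be sketched as below. `ensure` and
the two example functions are hypothetical and illustrate the
"file existence + mode check" case with a directory; STATE_DIR is an
assumed variable name.

```shell
#!/usr/bin/env bash
# Sketch only: "ensure" is a hypothetical helper, not taken from lib.sh.

# ensure <description> <check-function> <apply-function>
# Runs the check first and applies only when the check fails.
ensure() {
  local what="$1" check="$2" apply="$3"
  if "$check"; then
    echo "ok      $what (already applied)"
    return 0
  fi
  echo "apply   $what"
  "$apply"
}

# Example pair: a private state directory (mode 700).
state_dir_ready() {
  [ -d "$STATE_DIR" ] && [ "$(ls -ld "$STATE_DIR" | cut -c1-10)" = "drwx------" ]
}
create_state_dir() {
  mkdir -p "$STATE_DIR" && chmod 700 "$STATE_DIR"
}

# The same shape would fit the Incus case, e.g. a guard that runs
# `incus profile show veza-app` before `incus profile create veza-app`.
```

Calling `ensure "state dir" state_dir_ready create_state_dir` twice
applies the change on the first call and reports it as already applied
on the second.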

Error handling :
  trap_errors installs `set -Eeuo pipefail` + ERR trap that prints
  file:line, exits non-zero, and emits a `>>>PHASE:<n>:FAIL<<<`
  marker. Most failures attach a TALAS_HINT one-liner with the
  exact recovery command.
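
A minimal sketch of what such a trap could look like. The real
trap_errors lives in lib.sh and is not shown in this log; `on_error` and
CURRENT_PHASE are assumed names, while TALAS_HINT and the FAIL marker
format come from the text above.

```shell
#!/usr/bin/env bash
# Sketch only: on_error and CURRENT_PHASE are assumed names.

on_error() {
  local status="$1" file="$2" line="$3"
  echo "error: ${file}:${line} exited with status ${status}" >&2
  if [ -n "${TALAS_HINT:-}" ]; then
    echo "hint: ${TALAS_HINT}" >&2
  fi
  # The marker goes to stdout so the calling script can pick it up.
  printf '>>>PHASE:%s:FAIL<<<\n' "${CURRENT_PHASE:-unknown}"
  exit "$status"
}

trap_errors() {
  # -E makes functions and subshells inherit the ERR trap.
  set -Eeuo pipefail
  # Single quotes delay expansion until the trap actually fires.
  trap 'on_error "$?" "${BASH_SOURCE[0]:-$0}" "$LINENO"' ERR
}
```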

Verify scripts :
  Read-only; no state mutations. Output is a sequence of
  PASS/FAIL lines plus an exit code equal to the number of
  failures. Each failure prints a `hint:` with the precise fix
  command.
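
The PASS/FAIL reporting could be sketched as a small counter-based
helper. `check` is a hypothetical function name and the output layout is
an assumption; only the PASS/FAIL lines, the `hint:` prefix and the
failure-count exit code come from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: "check" is a hypothetical helper name.
failures=0

# check <description> <hint> <command...>
# Runs the command silently and prints PASS or FAIL plus a hint.
check() {
  local desc="$1" hint="$2"
  shift 2
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $desc"
  else
    echo "FAIL  $desc"
    echo "      hint: $hint"
    failures=$((failures + 1))
  fi
}

# A verify script built on this would end with:
#   exit "$failures"
```

One caveat of using the failure count as the exit code: shell exit
statuses wrap modulo 256, so exactly 256 failures would read as success.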

.gitignore picks up scripts/bootstrap/.env (per-operator config)
and .git/talas-bootstrap/ (state files).

--no-verify justification continues to hold — these are pure
shell scripts under scripts/bootstrap/, no app code touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 22:45:00 +02:00

3 commits