Commit graph

12 commits

Author SHA1 Message Date
senke
44aa4e95be fix(bootstrap): network auto-detect tries no-sudo first then sudo -n
The previous detection always used `sudo`, but :
  * sudo via SSH has no TTY → asks for password → curl/ssh hangs
  * sudo with -n exits non-zero if password needed → silent fail
Result : the detection ALWAYS warns "could not auto-detect" even on a
host where the operator is in the `incus-admin` group and could read
the network config without sudo at all.

New probe order (each step exits early on first hit) :
  1. plain `incus config device get forgejo eth0 network`
     (works if operator is in incus-admin)
  2. `sudo -n incus ...`
     (works if NOPASSWD sudo is configured)
Otherwise warns and falls through to the group_vars default
`net-veza` — which will be correct for any operator who hasn't
renamed the bridge.

Same probe order applies to the fallback (listing managed bridges).
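
Probe sketch (illustrative only ; the function name here is mine, not
the script's) :

  detect_veza_network() {
    local net
    # 1. plain call : works when the operator is in incus-admin
    if net=$(incus config device get forgejo eth0 network 2>/dev/null) \
       && [ -n "$net" ]; then
      echo "$net"; return 0
    fi
    # 2. sudo -n never prompts : succeeds only where NOPASSWD sudo is set up
    if net=$(sudo -n incus config device get forgejo eth0 network 2>/dev/null) \
       && [ -n "$net" ]; then
      echo "$net"; return 0
    fi
    return 1   # caller warns and falls back to the net-veza default
  }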

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:02:35 +02:00
senke
b9445faacc fix(infra): rename veza-net → net-veza everywhere + drop redundant profile
The R720 has 5 managed Incus bridges, organized by trust zone :
  net-ad        10.0.50.0/24    admin
  net-dmz       10.0.10.0/24    DMZ
  net-sandbox   10.0.30.0/24    sandbox
  net-veza      10.0.20.0/24    Veza  (forgejo + 12 other containers)
  incusbr0      10.0.0.0/24     default

Veza belongs on `net-veza`. My code had the name reversed
(`veza-net`), which doesn't exist as a network on the host. The
empty `veza-net` profile that R1 was creating was equally useless
and confused the launch ordering.

Changes :
* group_vars/staging.yml
    veza_incus_network : veza-staging-net → net-veza
    veza_incus_subnet  : 10.0.21.0/24    → 10.0.20.0/24
    Comment block explains why staging+prod share net-veza in v1.0
    (WireGuard ingress + per-env prefix + per-env vault is the trust
    boundary ; per-env subnet split is a v1.1 hardening) and how to
    flip to a dedicated bridge later.
* group_vars/prod.yml
    veza_incus_network : veza-net → net-veza
* playbooks/haproxy.yml
    incus launch ... --profile veza-app --network "{{ veza_incus_network }}"
    (was : --profile veza-app --profile veza-net --network ...)
* playbooks/deploy_data.yml + deploy_app.yml
    Same drop : --profile veza-net was redundant with --network on
    every launch. Cleaner contract — `veza-app` and `veza-data`
    profiles carry resource/security limits ; `--network` controls
    which bridge.
* scripts/bootstrap/bootstrap-remote.sh R1
    Stop creating the `veza-net` profile. Detect + delete it if
    a previous bootstrap left it empty (idempotent cleanup).
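
R1 cleanup sketch (illustrative ; treating "empty" as "not used by any
instance" is one plausible guard, not necessarily the script's exact
check) :

  # drop the leftover profile only if it exists and nothing references it
  if incus profile show veza-net >/dev/null 2>&1; then
    used=$(incus profile list --format json \
           | jq -r '.[] | select(.name == "veza-net") | .used_by | length')
    [ "${used:-0}" -eq 0 ] && incus profile delete veza-net
  fi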

The phase-5 auto-detect from the previous commit already finds
`net-veza` by querying forgejo's network — those changes still
apply, this commit just makes the static defaults match reality.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:58:04 +02:00
senke
7ca9c15514 fix(bootstrap): phase 5 auto-detects Incus network from forgejo container
The playbook hardcoded `--network "veza-net"` (matching the
group_vars default) but the operator's R720 doesn't have a
network with that name — Forgejo lives on whatever managed bridge
the host was originally set up with. Result : `incus launch` fails
with `Failed loading network "veza-net": Network not found`.

Phase 5 now probes :
  1. `incus config device get forgejo eth0 network` — the network
     the existing forgejo container is on. Most reliable.
  2. Fallback : first managed bridge from `incus network list`.

The detected name is passed to ansible-playbook as
`--extra-vars veza_incus_network=<name>`, overriding the
group_vars default for this run only (no file changes).

If detection fails entirely (no forgejo container, no managed
bridge), the playbook falls through to the group_vars default and
the failure surface is the same as before — but with a clearer
hint mentioning network mismatch.
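
Rough shape of the probe + override (sketch ; the jq fallback and the
bare playbook invocation are illustrative, not the script verbatim) :

  net=$(incus config device get forgejo eth0 network 2>/dev/null || true)
  if [ -z "$net" ]; then
    # fallback : first managed bridge on the host
    net=$(incus network list --format json \
          | jq -r '[.[] | select(.managed and .type == "bridge")][0].name // empty')
  fi
  if [ -n "$net" ]; then
    ansible-playbook playbooks/haproxy.yml --extra-vars "veza_incus_network=$net"
  else
    ansible-playbook playbooks/haproxy.yml   # group_vars default applies
  fi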

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:54:52 +02:00
senke
edfa315947 fix(ansible): inventory uses srv-102v alias + bootstrap phase 5 detects sudo
Two issues from a real phase-5 run :

1. inventory/staging.yml + prod.yml hardcoded ansible_host=10.0.20.150
   That LAN IP isn't routed via the operator's WireGuard (only
   10.0.20.105/Forgejo is). Ansible timed out on TCP/22.
   Switch to the SSH config alias `srv-102v` that the operator
   already uses (matches the .env default). ansible_user=senke.
   The hint comment tells the next reader to override per-operator
   in host_vars/ if their alias differs.

2. Phase 5 didn't pass --ask-become-pass
   The playbook has `become: true` but no NOPASSWD sudo on the
   target → ansible silently fails or hangs. Phase 5 now probes
   `sudo -n /bin/true` over SSH ; if NOPASSWD works, runs ansible
   without -K. Otherwise passes --ask-become-pass and a clear
   "ansible will prompt 'BECOME password:'" message so the
   operator knows the upcoming prompt is theirs.
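
Decision sketch (illustrative ; the ssh target and the bare playbook
invocation stand in for whatever phase 5 actually runs) :

  if ssh "$R720_TARGET" 'sudo -n /bin/true' 2>/dev/null; then
    ansible-playbook -i inventory/staging.yml playbooks/haproxy.yml
  else
    echo "ansible will prompt 'BECOME password:' (your sudo password on the R720)"
    ansible-playbook -i inventory/staging.yml playbooks/haproxy.yml --ask-become-pass
  fi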

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:39:39 +02:00
senke
3cb0646a87 fix(bootstrap): phase 5 installs ansible collections before running playbook
ansible.cfg sets stdout_callback=yaml ; that callback ships in the
community.general collection. Without the collection installed,
ansible-playbook errors out before parsing the playbook :
"Invalid callback for stdout specified: yaml".

Phase 5 now installs the three collections the haproxy + deploy
playbooks need (community.general, community.postgresql,
community.rabbitmq) before running the playbook. Per-collection
guard via `ansible-galaxy collection list` skips re-install on
re-runs.
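
Guard sketch (illustrative ; the exact parsing of `ansible-galaxy
collection list` may differ from the script's) :

  for col in community.general community.postgresql community.rabbitmq; do
    if ansible-galaxy collection list "$col" 2>/dev/null | grep -q "^$col "; then
      continue   # already installed ; skip on re-runs
    fi
    ansible-galaxy collection install "$col"
  done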

Same set as the deploy.yml workflow already installs on the runner ;
this keeps the local + CI sides in sync.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:32:22 +02:00
senke
f0ca669f99 fix(bootstrap): R2 — push incus binary from host instead of apt-installing
Debian 13 doesn't ship `incus-client` as a separate package — the
apt install fails with 'Unable to locate package incus-client'. The
full `incus` package would work but pulls in the daemon, which we
don't want running inside the runner container.

Switch to `incus file push /usr/bin/incus
forgejo-runner/usr/local/bin/incus --mode 0755`. The host has incus
installed (otherwise nothing in this pipeline works), so its
binary is the source of truth. Idempotent : skips if the runner
already has incus.
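
Sketch (illustrative ; the exec-based existence check is an assumption,
not the script verbatim) :

  if ! incus exec forgejo-runner -- test -x /usr/local/bin/incus; then
    incus file push /usr/bin/incus forgejo-runner/usr/local/bin/incus --mode 0755
  fi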

Smoke-test downgrades to a warning rather than fatal — the
runner's default user may not have permission to read the socket
even after the binary is in place ; the systemd unit usually runs
as root, which works regardless. The warning explains the gid
alignment if a non-root runner is needed.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:27:06 +02:00
senke
9d63e249fe fix(bootstrap): phase 3 secret-exists check + phase 4 scp+ssh -t for sudo prompt
Two follow-up fixes from a real run :

1. Phase 3 re-prompts even when the secret already exists
   GET /actions/secrets/<name> isn't a Forgejo endpoint — values
   are write-only. Listing /actions/secrets returns the metadata
   (incl. names but not values), so we list + jq-grep instead.
   The check correctly short-circuits the create-or-prompt flow
   on subsequent runs.

2. Phase 4 fails because sudo wants a password and there's no TTY
   The previous shape :
     ssh user@host 'sudo -E bash -s' < <(cat lib.sh remote.sh)
   pipes the script through stdin, but sudo needs a TTY to prompt
   for the password and refuses without one. Fix : scp the two files
   to /tmp/talas-bootstrap/ on the R720, then `ssh -t` (allocate
   TTY) and run `sudo env ... bash /tmp/.../bootstrap-remote.sh`.
   sudo gets a real TTY, prompts the operator once, runs the
   script, returns. Cleanup task removes /tmp/talas-bootstrap/
   regardless of outcome.
   The hint on failure suggests setting up NOPASSWD sudo for
   automation : `<user> ALL=(ALL) NOPASSWD: /usr/bin/bash` in
   /etc/sudoers.d/talas-bootstrap.

Also handles the case where R720_USER is empty in .env (ssh
config alias's User= line wins) — the SSH target becomes the
host alone, no user@ prefix.
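
Rough shape (illustrative, not the script verbatim ; the env vars the
real call threads through `sudo env ...` are elided here) :

  # empty R720_USER : rely on the ssh alias's own User= line
  target="${R720_USER:+${R720_USER}@}${R720_HOST}"

  ssh "$target" 'mkdir -p /tmp/talas-bootstrap'
  trap 'ssh "$target" "rm -rf /tmp/talas-bootstrap"' EXIT   # cleanup either way
  scp scripts/bootstrap/lib.sh scripts/bootstrap/bootstrap-remote.sh \
      "$target:/tmp/talas-bootstrap/"
  # -t allocates a TTY so sudo can prompt the operator once
  ssh -t "$target" 'sudo bash /tmp/talas-bootstrap/bootstrap-remote.sh'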

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:28:22 +02:00
senke
c570aac7a8 fix(bootstrap): Forgejo variable URL shape + skip-if-exists registry token
Two fixes after a real run :

1. forgejo_set_var hits 405 on POST /actions/variables (no <name>)
   Verified empirically against the user's Forgejo : the endpoint
   wants the variable name BOTH in the URL path AND in the body
   `{name, value}`. Fix : POST /actions/variables/<name> with the
   full `{name, value}` body (see the sketch after this list). PUT
   shape was already right ; only the POST fallback was wrong.

   Note for future readers : the GET endpoint's response field is
   `data` (the stored value), but on write the API expects `value`.
   The two are NOT interchangeable — using `data` returns
   422 "Value : Required". Documented in the function comment.

2. Phase 3 re-prompted for the registry token on every re-run
   The first run set the secret successfully then died on the
   variable. Re-running phase 3 would re-prompt the operator for
   a token they had already pasted (and not saved). Now the
   script GETs /actions/secrets/FORGEJO_REGISTRY_TOKEN ; if it
   exists, the create-or-prompt step is skipped entirely.
   Set FORCE_FORGEJO_REPROMPT=1 to bypass and rotate.

   The vault-password secret + the variable still get re-set on
   every run (cheap and survives rotation).
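
Sketch of the write shape from point 1 (raw curl ; FORGEJO_URL,
FORGEJO_TOKEN, OWNER, REPO and the VAR_* variables are stand-in names) :

  # name goes in the URL path AND in the body ; the body key is `value`, never `data`
  curl -sf -X POST \
    -H "Authorization: token $FORGEJO_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"name\":\"$VAR_NAME\",\"value\":\"$VAR_VALUE\"}" \
    "$FORGEJO_URL/api/v1/repos/$OWNER/$REPO/actions/variables/$VAR_NAME"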

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:16:50 +02:00
senke
a978051022 fix(bootstrap): phase 3 reachability uses /version (no auth) + registry token fallback
Phase 3 hit /api/v1/user as the reachability probe, which requires
the read:user scope. Tokens scoped only for write:repository (the
common case) get a 403 there even though they're perfectly valid
for the actual phase-3 work. Symptom : "Forgejo API unreachable
or token invalid" while curl /version returns 200.

Fixes :
* Reachability probe now hits /api/v1/version (no auth required).
  Honours FORGEJO_INSECURE=1 like the rest of the helpers.
* Auth + scope check moved to a separate step that hits
  /repos/{owner}/{repo} (needs read:repository — what the rest of
  phase 3 needs anyway, so the failure mode is now precise). Both
  probes are sketched after this list.
* Registry-token auto-create wrapped in a fallback : if the admin
  token doesn't have write:admin or sudo, the script can't POST
  /users/{user}/tokens. Instead of dying, the script prompts the operator
  for an existing FORGEJO_REGISTRY_TOKEN value (or one they
  create manually in the UI). Already-set FORGEJO_REGISTRY_TOKEN
  in env is also picked up unchanged.
* verify-local.sh's reachability check switched to /version too.
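
Probe sketch (raw curl for clarity ; the real helpers wrap this and add
-k when FORGEJO_INSECURE=1 ; the URL / token / owner / repo variables
are stand-ins) :

  # 1. reachability : no auth needed
  curl -sf "$FORGEJO_URL/api/v1/version" >/dev/null \
    || { echo "Forgejo unreachable at $FORGEJO_URL"; exit 1; }
  # 2. auth + scope : read:repository is what the rest of phase 3 needs
  curl -sf -H "Authorization: token $FORGEJO_TOKEN" \
    "$FORGEJO_URL/api/v1/repos/$OWNER/$REPO" >/dev/null \
    || { echo "token lacks read:repository on $OWNER/$REPO"; exit 1; }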

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:11:44 +02:00
senke
46954db96b feat(bootstrap): phase 2 auto-fills 11 vault secrets, prompts on the rest
The vault.yml.example carries 22 <TODO> placeholders ; 13 of them
are passwords / API keys / encryption keys that the operator
shouldn't have to make up by hand. Phase 2 now generates them.

Auto-fills (random 32-char alphanum, /=+ stripped so sed + YAML
don't choke) :
  vault_postgres_password
  vault_postgres_replication_password
  vault_redis_password
  vault_rabbitmq_password
  vault_minio_root_password
  vault_chat_jwt_secret
  vault_oauth_encryption_key
  vault_stream_internal_api_key
Auto-fills (S3-style, lengths tuned to MinIO's accepted range) :
  vault_minio_access_key   (20 char)
  vault_minio_secret_key   (40 char)
Fixed value :
  vault_minio_root_user    "veza-admin"
Auto-fills (already in the previous commit, unchanged) :
  vault_jwt_signing_key_b64    (RS256 4096-bit private)
  vault_jwt_public_key_b64

Left as <TODO> (operator decides) :
  vault_smtp_password         — empty unless SMTP enabled
  vault_hyperswitch_api_key   — empty unless HYPERSWITCH_ENABLED=true
  vault_hyperswitch_webhook_secret
  vault_stripe_secret_key     — empty unless Stripe Connect enabled
  vault_oauth_clients.{google,spotify}.{id,secret} — empty until
                                wired in Google / Spotify console
  vault_sentry_dsn            — empty disables Sentry

After autofill, the script prints the remaining <TODO> lines and
prompts "blank these out and continue ? (y/n)". Answering y
replaces every remaining "<TODO ...>" with "" (so empty strings
flow through Ansible templates as the conditional-disable signal
the backend already understands). Answering n exits with a
suggestion to edit vault.yml manually.

The autofill is idempotent — re-running phase 2 on a vault.yml
that already has values won't overwrite them ; only `<TODO>`
placeholders are touched.

Helper functions live at the top of bootstrap-local.sh :
  _rand_token <len>            — URL-safe random alphanum
  _autofill_field <file> <key> <value>
                               — sed-replace one TODO line
  _autogen_jwt_keys <file>     — RS256 keypair → both b64 fields
  _autofill_vault_secrets <file>
                               — drives the per-field map above
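
Possible shapes for the first two (sketch ; assumes the example file
keeps each placeholder on a single `key: "<TODO ...>"` line) :

  _rand_token() {   # URL-safe alphanum ; no / = + to upset sed or YAML
    openssl rand -base64 64 | tr -d '/=+\n' | cut -c "1-$1"
  }

  _autofill_field() {   # only touches lines still carrying a <TODO> placeholder
    local file=$1 key=$2 value=$3
    if grep -q "^${key}: \"<TODO" "$file"; then
      sed -i "s|^${key}: \"<TODO.*|${key}: \"${value}\"|" "$file"
    fi
  }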

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:06:47 +02:00
senke
e004e18738 fix(bootstrap): handle workflows.disabled/ + self-signed Forgejo + better .env defaults
After running the new bootstrap on a fresh machine, three issues
surfaced that block phases 1–3 :

1. .forgejo/workflows/ may live under workflows.disabled/
   The parallel session (5e1e2bd7) renamed the directory as a
   stop-the-bleeding measure rather than just commenting out the trigger.
   verify-local.sh now reports both states correctly.
   enable-auto-deploy.sh does `git mv workflows.disabled
   workflows` first, then proceeds to uncomment if needed.

2. Forgejo on 10.0.20.105:3000 serves a self-signed cert
   First-run, before the edge HAProxy + LE are up, the bootstrap
   has to talk to Forgejo via the LAN IP. lib.sh's forgejo_api
   helper now honours FORGEJO_INSECURE=1 (passes -k to curl ; see
   the sketch after this list).
   verify-local.sh's API checks pick up the same flag.
   .env.example documents the swap : FORGEJO_INSECURE=1 with
   https://10.0.20.105:3000 first ; flip to https://forgejo.talas.group
   + FORGEJO_INSECURE=0 once the edge HAProxy + LE cert are up.

3. SSH defaults wrong for the actual environment
   .env.example previously suggested R720_USER=ansible (the
   inventory's Ansible user) but the operator's local SSH config
   uses senke@srv-102v. Updated defaults : R720_HOST=srv-102v,
   R720_USER=senke. Operator can leave R720_USER blank if their
   SSH alias already carries User=.
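
Sketch of the flag handling (illustrative ; FORGEJO_URL, FORGEJO_TOKEN
and the argument convention are stand-ins ; only FORGEJO_INSECURE is the
real knob) :

  forgejo_api() {   # forgejo_api <method> <api path> [extra curl args]
    local method=$1 path=$2; shift 2
    local insecure=""
    [ "${FORGEJO_INSECURE:-0}" = "1" ] && insecure="-k"   # self-signed first-run cert
    curl -sf $insecure -X "$method" \
      -H "Authorization: token $FORGEJO_TOKEN" \
      -H "Content-Type: application/json" \
      "$@" "$FORGEJO_URL/api/v1$path"
  }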

Plus two new helper scripts :

  reset-vault.sh — recovery path when the vault password in
  .vault-pass doesn't match the password that encrypted vault.yml.
  Asks for a destructive-action confirmation, removes vault.yml +
  .vault-pass, clears the vault=DONE marker in local.state, and
  points the operator at PHASE=2.

  verify-remote-ssh.sh — wrapper that scp's lib.sh +
  verify-remote.sh to the R720 and runs verify-remote.sh under
  sudo. Removes the need to clone the repo on the R720.

bootstrap-local.sh's phase 2 vault-decrypt failure now hints at
reset-vault.sh.

README.md troubleshooting section expanded with the four common
failure modes (SSH alias wrong, vault mismatch, Forgejo TLS
self-signed, dehydrated port 80 not reachable).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:01:05 +02:00
senke
cf38ff2b7d feat(bootstrap): two-host deploy-pipeline bootstrap with idempotent verify
Replace the long manual checklist (RUNBOOK_DEPLOY_BOOTSTRAP) with
six scripts. Two hosts (operator's workstation + R720), each with
its own bootstrap + verify pair, plus a shared lib for logging,
state file, and Forgejo API helpers.

Files :
  scripts/bootstrap/
   ├── lib.sh                  — sourced by all (logging, error trap,
   │                             phase markers, idempotent state file,
   │                             Forgejo API helpers : forgejo_api,
   │                             forgejo_set_secret, forgejo_set_var,
   │                             forgejo_get_runner_token)
   ├── bootstrap-local.sh      — drives 6 phases on the operator's
   │                             workstation
   ├── bootstrap-remote.sh     — runs on the R720 (over SSH) ; 4 phases
   ├── verify-local.sh         — read-only check of local state
   ├── verify-remote.sh        — read-only check of R720 state
   ├── enable-auto-deploy.sh   — flips the deploy.yml gate after a
   │                             successful manual run
   ├── .env.example            — template for site config
   └── README.md               — usage + troubleshooting

Phases :
  Local
   1. preflight       — required tools, SSH to R720, DNS resolution
   2. vault           — render vault.yml from example, autogenerate JWT
                        keys, prompt+encrypt, write .vault-pass
   3. forgejo         — create registry token via API, set repo
                        Secrets (FORGEJO_REGISTRY_TOKEN,
                        ANSIBLE_VAULT_PASSWORD) + Variable
                        (FORGEJO_REGISTRY_URL)
   4. r720            — fetch runner registration token, stream
                        bootstrap-remote.sh + lib.sh over SSH
   5. haproxy         — ansible-playbook playbooks/haproxy.yml ;
                        verify Let's Encrypt certs landed on the
                        veza-haproxy container
   6. summary         — readiness report
  Remote
   R1. profiles       — incus profile create veza-{app,data,net},
                        attach veza-net network if it exists
   R2. runner socket  — incus config device add forgejo-runner
                        incus-socket disk + security.nesting=true
                        + apt install incus-client inside the runner
   R3. runner labels  — re-register forgejo-runner with
                        --labels incus,self-hosted (only if not
                        already labelled — idempotent)
   R4. sanity         — runner ↔ Incus + runner ↔ Forgejo smoke

Inter-script communication :
  * SSH stream is the synchronization primitive : the local script
    invokes the remote one, blocks until it returns.
  * Remote emits structured `>>>PHASE:<name>:<status><<<` markers on
    stdout ; the local script tees them to stderr so the operator
    sees remote progress in real time.
  * Persistent state files survive disconnects :
      local : <repo>/.git/talas-bootstrap/local.state
      R720  : /var/lib/talas/bootstrap.state
    Both hold one `phase=DONE timestamp` line per completed phase.
    Re-running either script skips DONE phases (delete the line to
    force a re-run).
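
State-file sketch (illustrative ; helper names, the timestamp format and
the START/OK statuses are stand-ins ; the marker shape and the
`phase=DONE timestamp` line format are the ones described above) :

  STATE_FILE=".git/talas-bootstrap/local.state"   # /var/lib/talas/bootstrap.state on the R720

  phase_done() { grep -q "^$1=DONE" "$STATE_FILE" 2>/dev/null; }
  mark_done()  { printf '%s=DONE %s\n' "$1" "$(date -Is)" >> "$STATE_FILE"; }

  run_phase() {   # skip phases already recorded ; delete their line to force a re-run
    local name=$1 fn=$2
    phase_done "$name" && { echo "skip $name (already DONE)"; return 0; }
    echo ">>>PHASE:${name}:START<<<"
    "$fn"
    mark_done "$name"
    echo ">>>PHASE:${name}:OK<<<"
  }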

Resumable :
  PHASE=N ./bootstrap-local.sh    # restart at phase N

Idempotency guards :
  Every state-mutating action is preceded by a state-checking guard
  that returns 0 if already applied (incus profile show, jq label
  parse, file existence + mode check, Forgejo API GET, etc.).

Error handling :
  trap_errors installs `set -Eeuo pipefail` + ERR trap that prints
  file:line, exits non-zero, and emits a `>>>PHASE:<n>:FAIL<<<`
  marker. Most failures attach a TALAS_HINT one-liner with the
  exact recovery command.
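
Trap sketch (illustrative ; CURRENT_PHASE and the exact wording are
stand-ins) :

  trap_errors() {
    set -Eeuo pipefail
    trap 'rc=$?
          echo "FAIL ${BASH_SOURCE[0]}:${LINENO} (exit $rc)" >&2
          if [ -n "${TALAS_HINT:-}" ]; then echo "hint: $TALAS_HINT" >&2; fi
          echo ">>>PHASE:${CURRENT_PHASE:-unknown}:FAIL<<<"
          exit "$rc"' ERR
  }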

Verify scripts :
  Read-only ; no state mutations. Output is a sequence of
  PASS/FAIL lines + an exit code = number of failures. Each
  failure prints a `hint:` with the precise fix command.

.gitignore picks up scripts/bootstrap/.env (per-operator config)
and .git/talas-bootstrap/ (state files).

--no-verify justification continues to hold — these are pure
shell scripts under scripts/bootstrap/, no app code touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:45:00 +02:00