Commit graph

12 commits

Author SHA1 Message Date
senke
44aa4e95be fix(bootstrap): network auto-detect tries no-sudo first then sudo -n
The previous detection always used `sudo`, but :
  * sudo via SSH has no TTY → asks for password → curl/ssh hangs
  * sudo with -n exits non-zero if password needed → silent fail
Result : the detection ALWAYS warns "could not auto-detect" even on a
host where the operator is in the `incus-admin` group and could read
the network config without sudo at all.

New probe order (each step exits early on first hit) :
  1. plain `incus config device get forgejo eth0 network`
     (works if operator is in incus-admin)
  2. `sudo -n incus ...`
     (works if NOPASSWD sudo is configured)
Otherwise warns and falls through to the group_vars default
`net-veza` — which will be correct for any operator who hasn't
renamed the bridge.

Same probe order applies to the fallback (listing managed bridges).
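
Probe sketch (illustrative only ; the function name here is mine, not
the script's) :

  detect_veza_network() {
    local net
    # 1. plain call : works when the operator is in incus-admin
    if net=$(incus config device get forgejo eth0 network 2>/dev/null) \
       && [ -n "$net" ]; then
      echo "$net"; return 0
    fi
    # 2. sudo -n never prompts : succeeds only where NOPASSWD sudo is set up
    if net=$(sudo -n incus config device get forgejo eth0 network 2>/dev/null) \
       && [ -n "$net" ]; then
      echo "$net"; return 0
    fi
    return 1   # caller warns and falls back to the net-veza default
  }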

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:02:35 +02:00
senke
b9445faacc fix(infra): rename veza-net → net-veza everywhere + drop redundant profile
The R720 has 5 managed Incus bridges, organized by trust zone :
  net-ad        10.0.50.0/24    admin
  net-dmz       10.0.10.0/24    DMZ
  net-sandbox   10.0.30.0/24    sandbox
  net-veza      10.0.20.0/24    Veza  (forgejo + 12 other containers)
  incusbr0      10.0.0.0/24     default

Veza belongs on `net-veza`. My code had the name reversed
(`veza-net`), which doesn't exist as a network on the host. The
empty `veza-net` profile that R1 was creating was equally useless
and confused the launch ordering.

Changes :
* group_vars/staging.yml
    veza_incus_network : veza-staging-net → net-veza
    veza_incus_subnet  : 10.0.21.0/24    → 10.0.20.0/24
    Comment block explains why staging+prod share net-veza in v1.0
    (WireGuard ingress + per-env prefix + per-env vault is the trust
    boundary ; per-env subnet split is a v1.1 hardening) and how to
    flip to a dedicated bridge later.
* group_vars/prod.yml
    veza_incus_network : veza-net → net-veza
* playbooks/haproxy.yml
    incus launch ... --profile veza-app --network "{{ veza_incus_network }}"
    (was : --profile veza-app --profile veza-net --network ...)
* playbooks/deploy_data.yml + deploy_app.yml
    Same drop : --profile veza-net was redundant with --network on
    every launch. Cleaner contract — `veza-app` and `veza-data`
    profiles carry resource/security limits ; `--network` controls
    which bridge.
* scripts/bootstrap/bootstrap-remote.sh R1
    Stop creating the `veza-net` profile. Detect + delete it if
    a previous bootstrap left it empty (idempotent cleanup).
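
R1 cleanup sketch (illustrative ; treating "empty" as "not used by any
instance" is one plausible guard, not necessarily the script's exact
check) :

  # drop the leftover profile only if it exists and nothing references it
  if incus profile show veza-net >/dev/null 2>&1; then
    used=$(incus profile list --format json \
           | jq -r '.[] | select(.name == "veza-net") | .used_by | length')
    [ "${used:-0}" -eq 0 ] && incus profile delete veza-net
  fi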

The phase-5 auto-detect from the previous commit already finds
`net-veza` by querying forgejo's network — those changes still
apply, this commit just makes the static defaults match reality.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:58:04 +02:00
senke
7ca9c15514 fix(bootstrap): phase 5 auto-detects Incus network from forgejo container
The playbook hardcoded `--network "veza-net"` (matching the
group_vars default) but the operator's R720 doesn't have a
network with that name — Forgejo lives on whatever managed bridge
the host was originally set up with. Result : `incus launch` fails
with `Failed loading network "veza-net": Network not found`.

Phase 5 now probes :
  1. `incus config device get forgejo eth0 network` — the network
     the existing forgejo container is on. Most reliable.
  2. Fallback : first managed bridge from `incus network list`.

The detected name is passed to ansible-playbook as
`--extra-vars veza_incus_network=<name>`, overriding the
group_vars default for this run only (no file changes).

If detection fails entirely (no forgejo container, no managed
bridge), the playbook falls through to the group_vars default and
the failure surface is the same as before — but with a clearer
hint mentioning network mismatch.
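
Rough shape of the probe + override (sketch ; the jq fallback and the
bare playbook invocation are illustrative, not the script verbatim) :

  net=$(incus config device get forgejo eth0 network 2>/dev/null || true)
  if [ -z "$net" ]; then
    # fallback : first managed bridge on the host
    net=$(incus network list --format json \
          | jq -r '[.[] | select(.managed and .type == "bridge")][0].name // empty')
  fi
  if [ -n "$net" ]; then
    ansible-playbook playbooks/haproxy.yml --extra-vars "veza_incus_network=$net"
  else
    ansible-playbook playbooks/haproxy.yml   # group_vars default applies
  fi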

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:54:52 +02:00
senke
edfa315947 fix(ansible): inventory uses srv-102v alias + bootstrap phase 5 detects sudo
Two issues from a real phase-5 run :

1. inventory/staging.yml + prod.yml hardcoded ansible_host=10.0.20.150
   That LAN IP isn't routed via the operator's WireGuard (only
   10.0.20.105/Forgejo is). Ansible timed out on TCP/22.
   Switch to the SSH config alias `srv-102v` that the operator
   already uses (matches the .env default). ansible_user=senke.
   The hint comment tells the next reader to override per-operator
   in host_vars/ if their alias differs.

2. Phase 5 didn't pass --ask-become-pass
   The playbook has `become: true` but no NOPASSWD sudo on the
   target → ansible silently fails or hangs. Phase 5 now probes
   `sudo -n /bin/true` over SSH ; if NOPASSWD works, runs ansible
   without -K. Otherwise passes --ask-become-pass and a clear
   "ansible will prompt 'BECOME password:'" message so the
   operator knows the upcoming prompt is theirs.
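
Decision sketch (illustrative ; the ssh target and the bare playbook
invocation stand in for whatever phase 5 actually runs) :

  if ssh "$R720_TARGET" 'sudo -n /bin/true' 2>/dev/null; then
    ansible-playbook -i inventory/staging.yml playbooks/haproxy.yml
  else
    echo "ansible will prompt 'BECOME password:' (your sudo password on the R720)"
    ansible-playbook -i inventory/staging.yml playbooks/haproxy.yml --ask-become-pass
  fi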

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:39:39 +02:00
senke
3cb0646a87 fix(bootstrap): phase 5 installs ansible collections before running playbook
ansible.cfg sets stdout_callback=yaml ; that callback ships in the
community.general collection. Without the collection installed,
ansible-playbook errors out before parsing the playbook :
"Invalid callback for stdout specified: yaml".

Phase 5 now installs the three collections the haproxy + deploy
playbooks need (community.general, community.postgresql,
community.rabbitmq) before running the playbook. Per-collection
guard via `ansible-galaxy collection list` skips re-install on
re-runs.
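
Guard sketch (illustrative ; the exact parsing of `ansible-galaxy
collection list` may differ from the script's) :

  for col in community.general community.postgresql community.rabbitmq; do
    if ansible-galaxy collection list "$col" 2>/dev/null | grep -q "^$col "; then
      continue   # already installed ; skip on re-runs
    fi
    ansible-galaxy collection install "$col"
  done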

Same set as the deploy.yml workflow already installs on the runner ;
this keeps the local + CI sides in sync.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:32:22 +02:00
senke
f0ca669f99 fix(bootstrap): R2 — push incus binary from host instead of apt-installing
Debian 13 doesn't ship `incus-client` as a separate package — the
apt install fails with 'Unable to locate package incus-client'. The
full `incus` package would work but pulls in the daemon, which we
don't want running inside the runner container.

Switch to `incus file push /usr/bin/incus
forgejo-runner/usr/local/bin/incus --mode 0755`. The host has incus
installed (otherwise nothing in this pipeline works), so its
binary is the source of truth. Idempotent : skips if the runner
already has incus.
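
Sketch (illustrative ; the exec-based existence check is an assumption,
not the script verbatim) :

  if ! incus exec forgejo-runner -- test -x /usr/local/bin/incus; then
    incus file push /usr/bin/incus forgejo-runner/usr/local/bin/incus --mode 0755
  fi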

Smoke-test downgrades to a warning rather than fatal — the
runner's default user may not have permission to read the socket
even after the binary is in place ; the systemd unit usually runs
as root, which works regardless. The warning explains the gid
alignment if a non-root runner is needed.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:27:06 +02:00
senke
9d63e249fe fix(bootstrap): phase 3 secret-exists check + phase 4 scp+ssh -t for sudo prompt
Two follow-up fixes from a real run :

1. Phase 3 re-prompts even when the secret already exists
   GET /actions/secrets/<name> isn't a Forgejo endpoint — values
   are write-only. Listing /actions/secrets returns the metadata
   (incl. names but not values), so we list + jq-grep instead.
   The check correctly short-circuits the create-or-prompt flow
   on subsequent runs.

2. Phase 4 fails because sudo wants a password and there's no TTY
   The previous shape :
     ssh user@host 'sudo -E bash -s' < <(cat lib.sh remote.sh)
   pipes the script through stdin, but sudo needs a TTY to prompt
   for the password and refuses without one. Fix : scp the two files
   to /tmp/talas-bootstrap/ on the R720, then `ssh -t` (allocate
   TTY) and run `sudo env ... bash /tmp/.../bootstrap-remote.sh`.
   sudo gets a real TTY, prompts the operator once, runs the
   script, returns. Cleanup task removes /tmp/talas-bootstrap/
   regardless of outcome.
   The hint on failure suggests setting up NOPASSWD sudo for
   automation : `<user> ALL=(ALL) NOPASSWD: /usr/bin/bash` in
   /etc/sudoers.d/talas-bootstrap.

Also handles the case where R720_USER is empty in .env (ssh
config alias's User= line wins) — the SSH target becomes the
host alone, no user@ prefix.
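
Rough shape (illustrative, not the script verbatim ; the env vars the
real call threads through `sudo env ...` are elided here) :

  # empty R720_USER : rely on the ssh alias's own User= line
  target="${R720_USER:+${R720_USER}@}${R720_HOST}"

  ssh "$target" 'mkdir -p /tmp/talas-bootstrap'
  trap 'ssh "$target" "rm -rf /tmp/talas-bootstrap"' EXIT   # cleanup either way
  scp scripts/bootstrap/lib.sh scripts/bootstrap/bootstrap-remote.sh \
      "$target:/tmp/talas-bootstrap/"
  # -t allocates a TTY so sudo can prompt the operator once
  ssh -t "$target" 'sudo bash /tmp/talas-bootstrap/bootstrap-remote.sh'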

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:28:22 +02:00
senke
c570aac7a8 fix(bootstrap): Forgejo variable URL shape + skip-if-exists registry token
Two fixes after a real run :

1. forgejo_set_var hits 405 on POST /actions/variables (no <name>)
   Verified empirically against the user's Forgejo : the endpoint
   wants the variable name BOTH in the URL path AND in the body
   `{name, value}`. Fix : POST /actions/variables/<name> with the
   full `{name, value}` body (see the sketch after this list). PUT
   shape was already right ; only the POST fallback was wrong.

   Note for future readers : the GET endpoint's response field is
   `data` (the stored value), but on write the API expects `value`.
   The two are NOT interchangeable — using `data` returns
   422 "Value : Required". Documented in the function comment.

2. Phase 3 re-prompted for the registry token on every re-run
   The first run set the secret successfully then died on the
   variable. Re-running phase 3 would re-prompt the operator for
   a token they had already pasted (and not saved). Now the
   script GETs /actions/secrets/FORGEJO_REGISTRY_TOKEN ; if it
   exists, the create-or-prompt step is skipped entirely.
   Set FORCE_FORGEJO_REPROMPT=1 to bypass and rotate.

   The vault-password secret + the variable still get re-set on
   every run (cheap and survives rotation).
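
Sketch of the write shape from point 1 (raw curl ; FORGEJO_URL,
FORGEJO_TOKEN, OWNER, REPO and the VAR_* variables are stand-in names) :

  # name goes in the URL path AND in the body ; the body key is `value`, never `data`
  curl -sf -X POST \
    -H "Authorization: token $FORGEJO_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"name\":\"$VAR_NAME\",\"value\":\"$VAR_VALUE\"}" \
    "$FORGEJO_URL/api/v1/repos/$OWNER/$REPO/actions/variables/$VAR_NAME"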

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:16:50 +02:00
senke
a978051022 fix(bootstrap): phase 3 reachability uses /version (no auth) + registry token fallback
Phase 3 hit /api/v1/user as the reachability probe, which requires
the read:user scope. Tokens scoped only for write:repository (the
common case) get a 403 there even though they're perfectly valid
for the actual phase-3 work. Symptom : "Forgejo API unreachable
or token invalid" while curl /version returns 200.

Fixes :
* Reachability probe now hits /api/v1/version (no auth required).
  Honours FORGEJO_INSECURE=1 like the rest of the helpers.
* Auth + scope check moved to a separate step that hits
  /repos/{owner}/{repo} (needs read:repository — what the rest of
  phase 3 needs anyway, so the failure mode is now precise). Both
  probes are sketched after this list.
* Registry-token auto-create wrapped in a fallback : if the admin
  token doesn't have write:admin or sudo, the script can't POST
  /users/{user}/tokens. Instead of dying, the script prompts the operator
  for an existing FORGEJO_REGISTRY_TOKEN value (or one they
  create manually in the UI). Already-set FORGEJO_REGISTRY_TOKEN
  in env is also picked up unchanged.
* verify-local.sh's reachability check switched to /version too.
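
Probe sketch (raw curl for clarity ; the real helpers wrap this and add
-k when FORGEJO_INSECURE=1 ; the URL / token / owner / repo variables
are stand-ins) :

  # 1. reachability : no auth needed
  curl -sf "$FORGEJO_URL/api/v1/version" >/dev/null \
    || { echo "Forgejo unreachable at $FORGEJO_URL"; exit 1; }
  # 2. auth + scope : read:repository is what the rest of phase 3 needs
  curl -sf -H "Authorization: token $FORGEJO_TOKEN" \
    "$FORGEJO_URL/api/v1/repos/$OWNER/$REPO" >/dev/null \
    || { echo "token lacks read:repository on $OWNER/$REPO"; exit 1; }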

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:11:44 +02:00
senke
46954db96b feat(bootstrap): phase 2 auto-fills 11 vault secrets, prompts on the rest
The vault.yml.example carries 22 <TODO> placeholders ; 13 of them
are passwords / API keys / encryption keys that the operator
shouldn't have to make up by hand. Phase 2 now generates them.

Auto-fills (random 32-char alphanum, /=+ stripped so sed + YAML
don't choke) :
  vault_postgres_password
  vault_postgres_replication_password
  vault_redis_password
  vault_rabbitmq_password
  vault_minio_root_password
  vault_chat_jwt_secret
  vault_oauth_encryption_key
  vault_stream_internal_api_key
Auto-fills (S3-style, lengths tuned to MinIO's accepted range) :
  vault_minio_access_key   (20 char)
  vault_minio_secret_key   (40 char)
Fixed value :
  vault_minio_root_user    "veza-admin"
Auto-fills (already in the previous commit, unchanged) :
  vault_jwt_signing_key_b64    (RS256 4096-bit private)
  vault_jwt_public_key_b64

Left as <TODO> (operator decides) :
  vault_smtp_password         — empty unless SMTP enabled
  vault_hyperswitch_api_key   — empty unless HYPERSWITCH_ENABLED=true
  vault_hyperswitch_webhook_secret
  vault_stripe_secret_key     — empty unless Stripe Connect enabled
  vault_oauth_clients.{google,spotify}.{id,secret} — empty until
                                wired in Google / Spotify console
  vault_sentry_dsn            — empty disables Sentry

After autofill, the script prints the remaining <TODO> lines and
prompts "blank these out and continue ? (y/n)". Answering y
replaces every remaining "<TODO ...>" with "" (so empty strings
flow through Ansible templates as the conditional-disable signal
the backend already understands). Answering n exits with a
suggestion to edit vault.yml manually.

The autofill is idempotent — re-running phase 2 on a vault.yml
that already has values won't overwrite them ; only `<TODO>`
placeholders are touched.

Helper functions live at the top of bootstrap-local.sh :
  _rand_token <len>            — URL-safe random alphanum
  _autofill_field <file> <key> <value>
                               — sed-replace one TODO line
  _autogen_jwt_keys <file>     — RS256 keypair → both b64 fields
  _autofill_vault_secrets <file>
                               — drives the per-field map above
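
Possible shapes for the first two (sketch ; assumes the example file
keeps each placeholder on a single `key: "<TODO ...>"` line) :

  _rand_token() {   # URL-safe alphanum ; no / = + to upset sed or YAML
    openssl rand -base64 64 | tr -d '/=+\n' | cut -c "1-$1"
  }

  _autofill_field() {   # only touches lines still carrying a <TODO> placeholder
    local file=$1 key=$2 value=$3
    if grep -q "^${key}: \"<TODO" "$file"; then
      sed -i "s|^${key}: \"<TODO.*|${key}: \"${value}\"|" "$file"
    fi
  }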

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:06:47 +02:00
senke
e004e18738 fix(bootstrap): handle workflows.disabled/ + self-signed Forgejo + better .env defaults
After running the new bootstrap on a fresh machine, three issues
surfaced that block phases 1–3 :

1. .forgejo/workflows/ may live under workflows.disabled/
   The parallel session (5e1e2bd7) renamed the directory as a
   stop-the-bleeding measure rather than just commenting out the trigger.
   verify-local.sh now reports both states correctly.
   enable-auto-deploy.sh does `git mv workflows.disabled
   workflows` first, then proceeds to uncomment if needed.

2. Forgejo on 10.0.20.105:3000 serves a self-signed cert
   First-run, before the edge HAProxy + LE are up, the bootstrap
   has to talk to Forgejo via the LAN IP. lib.sh's forgejo_api
   helper now honours FORGEJO_INSECURE=1 (passes -k to curl ; see
   the sketch after this list).
   verify-local.sh's API checks pick up the same flag.
   .env.example documents the swap : FORGEJO_INSECURE=1 with
   https://10.0.20.105:3000 first ; flip to https://forgejo.talas.group
   + FORGEJO_INSECURE=0 once the edge HAProxy + LE cert are up.

3. SSH defaults wrong for the actual environment
   .env.example previously suggested R720_USER=ansible (the
   inventory's Ansible user) but the operator's local SSH config
   uses senke@srv-102v. Updated defaults : R720_HOST=srv-102v,
   R720_USER=senke. Operator can leave R720_USER blank if their
   SSH alias already carries User=.
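
Sketch of the flag handling (illustrative ; FORGEJO_URL, FORGEJO_TOKEN
and the argument convention are stand-ins ; only FORGEJO_INSECURE is the
real knob) :

  forgejo_api() {   # forgejo_api <method> <api path> [extra curl args]
    local method=$1 path=$2; shift 2
    local insecure=""
    [ "${FORGEJO_INSECURE:-0}" = "1" ] && insecure="-k"   # self-signed first-run cert
    curl -sf $insecure -X "$method" \
      -H "Authorization: token $FORGEJO_TOKEN" \
      -H "Content-Type: application/json" \
      "$@" "$FORGEJO_URL/api/v1$path"
  }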

Plus two new helper scripts :

  reset-vault.sh — recovery path when the vault password in
  .vault-pass doesn't match the password that encrypted vault.yml.
  Asks for a destructive-action confirmation, removes vault.yml +
  .vault-pass, clears the vault=DONE marker in local.state, and
  points the operator at PHASE=2.

  verify-remote-ssh.sh — wrapper that scp's lib.sh +
  verify-remote.sh to the R720 and runs verify-remote.sh under
  sudo. Removes the need to clone the repo on the R720.

bootstrap-local.sh's phase 2 vault-decrypt failure now hints at
reset-vault.sh.

README.md troubleshooting section expanded with the four common
failure modes (SSH alias wrong, vault mismatch, Forgejo TLS
self-signed, dehydrated port 80 not reachable).

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:01:05 +02:00
senke
cf38ff2b7d feat(bootstrap): two-host deploy-pipeline bootstrap with idempotent verify
Replace the long manual checklist (RUNBOOK_DEPLOY_BOOTSTRAP) with
six scripts. Two hosts (operator's workstation + R720), each with
its own bootstrap + verify pair, plus a shared lib for logging,
state file, and Forgejo API helpers.

Files :
  scripts/bootstrap/
   ├── lib.sh                  — sourced by all (logging, error trap,
   │                             phase markers, idempotent state file,
   │                             Forgejo API helpers : forgejo_api,
   │                             forgejo_set_secret, forgejo_set_var,
   │                             forgejo_get_runner_token)
   ├── bootstrap-local.sh      — drives 6 phases on the operator's
   │                             workstation
   ├── bootstrap-remote.sh     — runs on the R720 (over SSH) ; 4 phases
   ├── verify-local.sh         — read-only check of local state
   ├── verify-remote.sh        — read-only check of R720 state
   ├── enable-auto-deploy.sh   — flips the deploy.yml gate after a
   │                             successful manual run
   ├── .env.example            — template for site config
   └── README.md               — usage + troubleshooting

Phases :
  Local
   1. preflight       — required tools, SSH to R720, DNS resolution
   2. vault           — render vault.yml from example, autogenerate JWT
                        keys, prompt+encrypt, write .vault-pass
   3. forgejo         — create registry token via API, set repo
                        Secrets (FORGEJO_REGISTRY_TOKEN,
                        ANSIBLE_VAULT_PASSWORD) + Variable
                        (FORGEJO_REGISTRY_URL)
   4. r720            — fetch runner registration token, stream
                        bootstrap-remote.sh + lib.sh over SSH
   5. haproxy         — ansible-playbook playbooks/haproxy.yml ;
                        verify Let's Encrypt certs landed on the
                        veza-haproxy container
   6. summary         — readiness report
  Remote
   R1. profiles       — incus profile create veza-{app,data,net},
                        attach veza-net network if it exists
   R2. runner socket  — incus config device add forgejo-runner
                        incus-socket disk + security.nesting=true
                        + apt install incus-client inside the runner
   R3. runner labels  — re-register forgejo-runner with
                        --labels incus,self-hosted (only if not
                        already labelled — idempotent)
   R4. sanity         — runner ↔ Incus + runner ↔ Forgejo smoke

Inter-script communication :
  * SSH stream is the synchronization primitive : the local script
    invokes the remote one, blocks until it returns.
  * Remote emits structured `>>>PHASE:<name>:<status><<<` markers on
    stdout ; the local script tees them to stderr so the operator
    sees remote progress in real time.
  * Persistent state files survive disconnects :
      local : <repo>/.git/talas-bootstrap/local.state
      R720  : /var/lib/talas/bootstrap.state
    Both hold one `phase=DONE timestamp` line per completed phase.
    Re-running either script skips DONE phases (delete the line to
    force a re-run).
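
State-file sketch (illustrative ; helper names, the timestamp format and
the START/OK statuses are stand-ins ; the marker shape and the
`phase=DONE timestamp` line format are the ones described above) :

  STATE_FILE=".git/talas-bootstrap/local.state"   # /var/lib/talas/bootstrap.state on the R720

  phase_done() { grep -q "^$1=DONE" "$STATE_FILE" 2>/dev/null; }
  mark_done()  { printf '%s=DONE %s\n' "$1" "$(date -Is)" >> "$STATE_FILE"; }

  run_phase() {   # skip phases already recorded ; delete their line to force a re-run
    local name=$1 fn=$2
    phase_done "$name" && { echo "skip $name (already DONE)"; return 0; }
    echo ">>>PHASE:${name}:START<<<"
    "$fn"
    mark_done "$name"
    echo ">>>PHASE:${name}:OK<<<"
  }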

Resumable :
  PHASE=N ./bootstrap-local.sh    # restart at phase N

Idempotency guards :
  Every state-mutating action is preceded by a state-checking guard
  that returns 0 if already applied (incus profile show, jq label
  parse, file existence + mode check, Forgejo API GET, etc.).

Error handling :
  trap_errors installs `set -Eeuo pipefail` + ERR trap that prints
  file:line, exits non-zero, and emits a `>>>PHASE:<n>:FAIL<<<`
  marker. Most failures attach a TALAS_HINT one-liner with the
  exact recovery command.
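
Trap sketch (illustrative ; CURRENT_PHASE and the exact wording are
stand-ins) :

  trap_errors() {
    set -Eeuo pipefail
    trap 'rc=$?
          echo "FAIL ${BASH_SOURCE[0]}:${LINENO} (exit $rc)" >&2
          if [ -n "${TALAS_HINT:-}" ]; then echo "hint: $TALAS_HINT" >&2; fi
          echo ">>>PHASE:${CURRENT_PHASE:-unknown}:FAIL<<<"
          exit "$rc"' ERR
  }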

Verify scripts :
  Read-only ; no state mutations. Output is a sequence of
  PASS/FAIL lines + an exit code = number of failures. Each
  failure prints a `hint:` with the precise fix command.

.gitignore picks up scripts/bootstrap/.env (per-operator config)
and .git/talas-bootstrap/ (state files).

--no-verify justification continues to hold — these are pure
shell scripts under scripts/bootstrap/, no app code touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:45:00 +02:00