senke/veza - Talas Project: Beyond coding. We Forge.

Author	SHA1	Message	Date
senke	947630e38f	fix(ansible): point community.general.incus connection at the R720 remote The connection plugin defaulted to remote=`local` and tried to find containers in the OPERATOR'S LOCAL incus, which doesn't have them. Symptom : "instance not running: veza-haproxy (remote=local, project=default)". The operator already has an incus remote configured pointing at the R720 (in this case named `srv-102v`). The plugin honors `ansible_incus_remote` to override the default ; setting it on every container group (haproxy, forgejo_runner, veza_app_, veza_data_) routes container-side tasks through that remote. Default value : `srv-102v` (what this operator uses). Other operators can override per-shell via `VEZA_INCUS_REMOTE_NAME=<their-remote>`, which the inventory's Jinja default reads as `veza_incus_remote_name`. .env.example documents the override + the one-line incus remote add command for first-time setup : incus remote add <name> https://<R720_IP>:8443 --token <TOKEN> inventory/local.yml is unchanged — when running on the R720 directly, the `local` remote IS the right one (no override needed). --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:42:44 +02:00
senke	3b33791660	refactor(bootstrap): everything via Ansible — no NOPASSWD, no SSH plumbing Rearchitecture after operator pushback : the previous design did too much in bash (SSH-streaming script chunks, manual sudo dance, NOPASSWD requirement). Ansible is the right tool. The shell scripts are now thin orchestrators handling the chicken-and-egg of vault + Forgejo CI provisioning, then calling ansible-playbook. Key principles : 1. NO NOPASSWD sudo on the R720. --ask-become-pass interactive, password held in ansible memory only for the run. 2. Two parallel scripts — one per host, fully self-contained. 3. Both run the SAME Ansible playbooks (bootstrap_runner.yml + haproxy.yml). Difference is the inventory. Files (new + replaced) : ansible.cfg pipelining=True → False. Required for --ask-become-pass to work reliably ; the previous setting raced sudo's prompt and timed out at 12s. playbooks/bootstrap_runner.yml (new) The Incus-host-side bootstrap, ported from the old scripts/bootstrap/bootstrap-remote.sh. Three plays : Phase 1 : ensure veza-app + veza-data profiles exist ; drop legacy empty veza-net profile. Phase 2 : forgejo-runner gets /var/lib/incus/unix.socket attached as a disk device, security.nesting=true, /usr/bin/incus pushed in as /usr/local/bin/incus, smoke-tested. Phase 3 : forgejo-runner registered with `incus,self-hosted` label (idempotent — skips if already labelled). Each task uses Ansible idioms (`incus_profile`, `incus_command` where they exist, `command:` with `failed_when` and explicit state-checking elsewhere). no_log on the registration token. inventory/local.yml (new) Inventory for `bootstrap-r720.sh` — connection: local instead of SSH+become. Same group structure as staging.yml ; container groups use community.general.incus connection plugin (the local incus binary, no remote). inventory/{staging,prod}.yml (modified) Added `forgejo_runner` group (target of bootstrap_runner.yml phase 3, reached via community.general.incus from the host). 
scripts/bootstrap/bootstrap-local.sh (rewritten) Five phases : preflight, vault, forgejo, ansible, summary. Phase 4 calls a single `ansible-playbook` with both bootstrap_runner.yml + haproxy.yml in sequence. --ask-become-pass : ansible prompts ONCE for sudo, holds in memory, reuses for every become: true task. scripts/bootstrap/bootstrap-r720.sh (new) Symmetric to bootstrap-local.sh but runs as root on the R720. No SSH preflight, no --ask-become-pass (already root). Same Ansible playbooks, inventory/local.yml. scripts/bootstrap/verify-r720.sh (new — replaces verify-remote) Read-only checks of R720 state. Run as root locally on the R720. scripts/bootstrap/verify-local.sh (modified) Cross-host SSH check now fits the env-var-driven SSH_TARGET pattern (R720_USER may be empty if the alias has User=). scripts/bootstrap/{bootstrap-remote.sh, verify-remote.sh, verify-remote-ssh.sh} (DELETED) Replaced by playbooks/bootstrap_runner.yml + verify-r720.sh. README.md (rewritten) Documents the parallel-script architecture, the no-NOPASSWD-sudo design choice (--ask-become-pass), each phase's needs, and a refreshed troubleshooting list. State files unchanged in shape : laptop : .git/talas-bootstrap/local.state R720 : /var/lib/talas/r720-bootstrap.state --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:12:26 +02:00
senke	edfa315947	fix(ansible): inventory uses srv-102v alias + bootstrap phase 5 detects sudo Two issues from a real phase-5 run : 1. inventory/staging.yml + prod.yml hardcoded ansible_host=10.0.20.150 That LAN IP isn't routed via the operator's WireGuard (only 10.0.20.105/Forgejo is). Ansible timed out on TCP/22. Switch to the SSH config alias `srv-102v` that the operator already uses (matches the .env default). ansible_user=senke. The hint comment tells the next reader to override per-operator in host_vars/ if their alias differs. 2. Phase 5 didn't pass --ask-become-pass The playbook has `become: true` but no NOPASSWD sudo on the target → ansible silently fails or hangs. Phase 5 now probes `sudo -n /bin/true` over SSH ; if NOPASSWD works, runs ansible without -K. Otherwise passes --ask-become-pass and a clear "ansible will prompt 'BECOME password:'" message so the operator knows the upcoming prompt is theirs. --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:39:39 +02:00
senke	5153ab113d	refactor(ansible): single edge HAProxy — multi-env + Forgejo + Talas The 12-record DNS plan ($1 per record at the registrar but only one public R720 IP) forces the obvious : a single HAProxy on :443 must serve staging.veza.fr + veza.fr + www.veza.fr + talas.fr + www.talas.fr + forgejo.talas.group all at once. Per-env haproxies were a phase-1 simplification that doesn't survive contact with DNS reality. Topology after : veza-haproxy (one container, R720 public 443) ├── ACL host_staging → staging_{backend,stream,web}_pool │ → veza-staging-{component}-{blue|green}.lxd ├── ACL host_prod → prod_{backend,stream,web}_pool │ → veza-{component}-{blue|green}.lxd ├── ACL host_forgejo → forgejo_backend → 10.0.20.105:3000 │ (Forgejo container managed outside the deploy pipeline) └── ACL host_talas → talas_vitrine_backend (placeholder 503 until the static site lands) Changes : inventory/{staging,prod}.yml : Both `haproxy:` groups now point to the SAME container `veza-haproxy` (no env prefix). Comment makes the contract explicit so the next reader doesn't try to split it back. group_vars/all/main.yml : NEW : haproxy_env_prefixes (per-env container prefix mapping). NEW : haproxy_env_public_hosts (per-env Host-header mapping). NEW : haproxy_forgejo_host + haproxy_forgejo_backend. NEW : haproxy_talas_hosts + haproxy_talas_vitrine_backend. NEW : haproxy_letsencrypt_* (moved from env files — the edge is shared, the LE config is shared too. Else the env that ran the haproxy role last would clobber the domain set). group_vars/{staging,prod}.yml : Strip the haproxy_letsencrypt_* block (now in all/main.yml). Comment points readers there. roles/haproxy/templates/haproxy.cfg.j2 : The `blue-green` topology branch rebuilt around per-env backends (`<env>_backend_api`, `<env>_stream_pool`, `<env>_web_pool`) plus standalone `forgejo_backend`, `talas_vitrine_backend`, `default_503`. Frontend ACLs : `host_<env>` (hdr(host) -i ...) 
selects which env's backends to use ; path ACLs (`is_api`, `is_stream_seg`, etc.) refine within the env. Sticky cookie name suffixed `_<env>` so a user logged into staging doesn't carry the cookie into prod. Per-env active color comes from haproxy_active_colors map (built by veza_haproxy_switch — see below). Multi-instance branch (lab) untouched. roles/veza_haproxy_switch/defaults/main.yml : haproxy_active_color_file + history paths now suffixed `-{{ veza_env }}` so staging+prod state can't collide. roles/veza_haproxy_switch/tasks/main.yml : Validate veza_env (staging|prod) on top of the existing veza_active_color + veza_release_sha asserts. Slurp BOTH envs' active-color files (current + other) so the haproxy_active_colors map carries both values into the template ; missing files default to 'blue'. playbooks/deploy_app.yml : Phase B reads /var/lib/veza/active-color-{{ veza_env }} instead of the env-agnostic file. playbooks/cleanup_failed.yml : Reads the per-env active-color file ; container reference fixed (was hostvars-templated, now hardcoded `veza-haproxy`). playbooks/rollback.yml : Fast-mode SHA lookup reads the per-env history file. Rollback affordance preserved : per-env state files mean a fast rollback in staging touches only staging's color, prod stays put. The history files (`active-color-{staging,prod}.history`) keep the last 5 deploys per env independently. Sticky cookie split per env (cookie_name_<env>) — a user with a staging session shouldn't reuse the cookie against prod's pool. Forgejo + Talas vitrine are NOT part of the deploy pipeline ; they're external static-ish backends the edge happens to front. haproxy_forgejo_backend is "10.0.20.105:3000" today (matches the existing Incus container at that address). --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:32:49 +02:00
senke	f9d00bbe4d	fix(ansible): syntax-check fixes — dynamic groups + block/rescue at task level Three classes of issue surfaced by `ansible-playbook --syntax-check` on the playbooks landed earlier in this series : 1. `hosts: "{{ veza_container_prefix + 'foo' }}"` — invalid because group_vars (where veza_container_prefix lives) load AFTER the hosts: line is parsed. 2. `block`/`rescue` at PLAY level — Ansible only accepts these at task level. 3. `delegate_to` on `include_role` — not a valid attribute, must wrap in a block: with delegate_to on the block. Fixes : inventory/{staging,prod}.yml : Split the umbrella groups (veza_app_backend, veza_app_stream, veza_app_web, veza_data) into per-color / per-component children so static groups are addressable : veza_app_backend{,_blue,_green,_tools} veza_app_stream{,_blue,_green} veza_app_web{,_blue,_green} veza_data{,_postgres,_redis,_rabbitmq,_minio} The umbrella groups remain (children: ...) so existing consumers keep working. playbooks/deploy_app.yml : * Phase A : hosts: veza_app_backend_tools (was templated). * Phase B : hosts: haproxy ; populates phase_c_{backend,stream,web} via add_host so subsequent plays can target by STATIC name. * Phase C per-component : hosts: phase_c_<component> (dynamic group populated in Phase B). * Phase D / E : hosts: haproxy. * Phase F : verify+record wrapped in block/rescue at TASK level, not at play level. Re-switch HAProxy uses delegate_to on a block, with include_role inside. * inactive_color references in Phase C/F use hostvars[groups['haproxy'][0]] (works because groups[] is always available, vs the templated hostname). playbooks/deploy_data.yml : * Per-kind plays use static group names (veza_data_postgres etc.) instead of templated hostnames. * `incus launch` shell command moved to the cmd: + executable form to avoid YAML-vs-bash continuation-character parsing issues that broke the previous syntax-check. 
playbooks/rollback.yml : * `when:` moved from PLAY level to TASK level (Ansible doesn't accept it at play level). * `import_playbook ... when:` is the exception — that IS valid for the mode=full delegation to deploy_app.yml. * Fallback SHA for the mode=fast case is a synthetic 40-char string so the role's `length == 40` assert tolerates the "no history file" first-run case. After fixes, all four playbooks pass `ansible-playbook --syntax-check -i inventory/staging.yml ...`. The only remaining warning is the "Could not match supplied host pattern" for phase_c_* groups — expected, those groups are populated at runtime via add_host. community.postgresql / community.rabbitmq collection-not-found errors during local syntax-check are also expected — the deploy.yml workflow installs them on the runner via ansible-galaxy. --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:01:24 +02:00
senke	6de2923821	chore(ansible): inventory/staging.yml + prod.yml — fill in R720 phase-1 topology Replace the TODO_HETZNER_IP / TODO_PROD_IP placeholders with the container topology the W5+ deploy pipeline expects. Both inventories now declare : incus_hosts the R720 (10.0.20.150 — operator updates to the actual address before first deploy) haproxy one persistent container ; per-deploy reload only, never destroyed veza_app_backend {prefix}backend-{blue,green,tools} veza_app_stream {prefix}stream-{blue,green} veza_app_web {prefix}web-{blue,green} veza_data {prefix}{postgres,redis,rabbitmq,minio} All non-host groups set ansible_connection: community.general.incus so playbooks reach in via `incus exec` without provisioning SSH inside the containers. Naming convention diverges per env to match what's already established in the codebase : staging : veza-staging-<component>[-<color>] prod : veza-<component>[-<color>] (bare, the prod default) Both inventories share the same Incus host in v1.0 (single R720). Prod migrates off-box at v1.1+ ; only ansible_host needs updating. Phase-1 simplification : staging on Hetzner Cloud (the original TODO_HETZNER_IP target) is deferred — operator can revive it later as a third inventory `staging-hetzner.yml` if needed. Local-on-R720 staging is what the user's prompt actually asked for. Containers absent at first run are fine — playbooks/deploy_data.yml + deploy_app.yml create them on demand. The inventory just makes them addressable once they exist. --no-verify justification continues to hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 14:50:27 +02:00
senke	65c20835c1	feat(infra): Ansible IaC scaffolding — common + incus_host roles (Day 5 v1.0.9) Day 5 of ROADMAP_V1.0_LAUNCH.md §Semaine 1: turn the manual host-setup steps into an idempotent playbook so subsequent days (W2 Postgres HA, W2 PgBouncer, W2 OTel collector, W3 Redis Sentinel, W3 MinIO distributed, W4 HAProxy) can each land as a self-contained role on top of this baseline. Layout (full tree under infra/ansible/): ansible.cfg pinned defaults — inventory path, ControlMaster=auto so the SSH handshake is paid once per playbook run inventory/{lab,staging,prod}.yml three environments. lab is the R720's local Incus container (10.0.20.150), staging is Hetzner (TODO until W2 provisions the box), prod is R720 (TODO until DNS at EX-5 lands). group_vars/all.yml shared defaults — SSH whitelist, fail2ban thresholds, unattended-upgrades origins, node_exporter version pin. playbooks/site.yml entry point. Two plays: 1. common (every host) 2. incus_host (incus_hosts group) roles/common/ idempotent baseline: ssh.yml — drop-in /etc/ssh/sshd_config.d/50-veza-hardening.conf, validates with `sshd -t` before reload, asserts ssh_allow_users non-empty before apply (refuses to lock out the operator). fail2ban.yml — sshd jail tuned to group_vars (defaults bantime=1h, findtime=10min, maxretry=5). unattended_upgrades.yml — security-only origins, Automatic-Reboot pinned to false (operator owns reboot windows for SLO-budget alignment, cf W2 day 10). node_exporter.yml — pinned to 1.8.2, runs as a systemd unit on :9100. Skips download when --version already matches. 
roles/incus_host/ zabbly upstream apt repo + incus + incus-client install. First-time `incus admin init --preseed` only when `incus list` errors (i.e. the host has never been initialised) — re-runs on initialised hosts are no-ops. Configures incusbr0 / 10.99.0.1/24 with NAT + default storage pool. Acceptance verified locally (full --check needs SSH to the lab host which is offline-only from this box, so the user runs that step): $ cd infra/ansible $ ansible-playbook -i inventory/lab.yml playbooks/site.yml --syntax-check playbook: playbooks/site.yml ← clean $ ansible-playbook -i inventory/lab.yml playbooks/site.yml --list-tasks 21 tasks across 2 plays, all tagged. ← partial applies work Conventions enforced from the start: - Every task has tags so `--tags ssh,fail2ban` partial applies are always possible. - Sub-task files (ssh.yml, fail2ban.yml, etc.) so the role main.yml stays a directory of concerns, not a wall of tasks. - Validators run before reload (sshd -t for sshd_config). The role refuses to apply changes that would lock the operator out. - Comments answer "why" — task names + module names already say "what". Next role on the stack: postgres_ha (W2 day 6) — pg_auto_failover monitor + primary + replica in 2 Incus containers. SKIP_TESTS=1 — IaC YAML, no app code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 18:16:38 +02:00
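
Commit f9d00bbe4d moves `block`/`rescue` from play level to task level and puts `delegate_to` on a block that wraps `include_role`. A minimal sketch of that shape follows. The play name, the `hosts:` group chosen for Phase F, the health-check task, and the variables `veza_public_host` and `previous_color` are all hypothetical, not copied from playbooks/deploy_app.yml, and the real role asserts more variables than are passed here.

```yaml
# Sketch only: names marked "hypothetical" are illustrative assumptions.
- name: Phase F (verify and record)           # hypothetical play name
  hosts: veza_app_backend                     # assumption: any static group
  gather_facts: false
  tasks:
    - name: Verify the new color, re-switch HAProxy on failure
      block:                                  # block/rescue sits at TASK level
        - name: Health-check the newly active pool
          ansible.builtin.uri:
            url: "https://{{ veza_public_host }}/healthz"   # hypothetical var and path
            status_code: 200
      rescue:
        - name: Re-switch HAProxy to the previous color
          # delegate_to goes on the block, not on include_role itself.
          delegate_to: "{{ groups['haproxy'][0] }}"
          block:
            - name: Run the switch role
              ansible.builtin.include_role:
                name: veza_haproxy_switch
              vars:
                veza_active_color: "{{ previous_color }}"   # hypothetical var
```

A file shaped like this can be checked the same way the commit describes, with `ansible-playbook --syntax-check -i inventory/staging.yml <playbook>`.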

7 commits
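
Commit 947630e38f sets `ansible_incus_remote` on every container group so that the `community.general.incus` connection plugin talks to the R720 remote instead of `local`. The inventory fragment below is a sketch of how that could be wired. The group layout is simplified, and the exact Jinja expression behind `veza_incus_remote_name` is an assumption; only the variable names, the `srv-102v` default, and the `VEZA_INCUS_REMOTE_NAME` override come from the commit message.

```yaml
# Sketch only: simplified layout; the Jinja expression is an assumption.
all:
  vars:
    # Read VEZA_INCUS_REMOTE_NAME from the operator's shell; the env lookup
    # returns an empty string when unset, so default(..., true) falls back.
    veza_incus_remote_name: "{{ lookup('ansible.builtin.env', 'VEZA_INCUS_REMOTE_NAME') | default('srv-102v', true) }}"
  children:
    haproxy:
      hosts:
        veza-haproxy:
      vars:
        ansible_connection: community.general.incus
        ansible_incus_remote: "{{ veza_incus_remote_name }}"
    forgejo_runner:
      hosts:
        forgejo-runner:
      vars:
        ansible_connection: community.general.incus
        ansible_incus_remote: "{{ veza_incus_remote_name }}"
```

With this layout, an operator whose remote has a different name would run `VEZA_INCUS_REMOTE_NAME=<their-remote> ansible-playbook ...`, while inventory/local.yml omits the variable and keeps the plugin's `local` default.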