3 commits
commit 3b33791660
refactor(bootstrap): everything via Ansible — no NOPASSWD, no SSH plumbing
Re-architecture after operator pushback: the previous design did
too much in bash (SSH-streaming script chunks, a manual sudo dance,
a NOPASSWD requirement). Ansible is the right tool. The shell
scripts are now thin orchestrators that handle the chicken-and-egg
of vault + Forgejo CI provisioning, then call ansible-playbook.
Key principles:
1. NO NOPASSWD sudo on the R720. --ask-become-pass is interactive;
the password is held in Ansible memory only for the run.
2. Two parallel scripts — one per host, fully self-contained.
3. Both run the SAME Ansible playbooks (bootstrap_runner.yml +
haproxy.yml); only the inventory differs.
Files (new + replaced):
ansible.cfg
pipelining=True → False. Required for --ask-become-pass to
work reliably; the previous setting raced sudo's prompt and
timed out at 12s.
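For concreteness, the pipelining change is a one-line toggle; a sketch of the relevant stanza (the surrounding comments are illustrative, not the actual file):

```ini
[ssh_connection]
# Pipelining reuses the SSH channel that sudo needs for its password
# prompt, so with --ask-become-pass the prompt races the module
# payload and hits the become timeout. Keep it off.
pipelining = False
```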
playbooks/bootstrap_runner.yml (new)
The Incus-host-side bootstrap, ported from the old
scripts/bootstrap/bootstrap-remote.sh. Three plays:
Phase 1: ensure veza-app + veza-data profiles exist;
drop the legacy empty veza-net profile.
Phase 2: forgejo-runner gets /var/lib/incus/unix.socket
attached as a disk device, security.nesting=true,
/usr/bin/incus pushed in as /usr/local/bin/incus,
smoke-tested.
Phase 3: forgejo-runner registered with the `incus,self-hosted`
label (idempotent — skips if already labelled).
Each task uses Ansible idioms (`incus_profile`, `incus_command`
where they exist, `command:` with `failed_when` and explicit
state-checking elsewhere). no_log on the registration token.
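Phase 2's device attach could be sketched as Ansible tasks along these lines (an illustration, not the actual playbook; the device name `incus-socket` and the exact idempotence check are assumptions):

```yaml
# Sketch: give the forgejo-runner container access to the host Incus API.
- name: List devices already attached to forgejo-runner
  command: incus config device show forgejo-runner
  register: runner_devices
  changed_when: false

- name: Attach the host Incus socket as a disk device (idempotent)
  command: >
    incus config device add forgejo-runner incus-socket disk
    source=/var/lib/incus/unix.socket
    path=/var/lib/incus/unix.socket
  when: "'incus-socket' not in runner_devices.stdout"

- name: Allow nested containers inside the runner
  command: incus config set forgejo-runner security.nesting=true
```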
inventory/local.yml (new)
Inventory for `bootstrap-r720.sh` — connection: local instead
of SSH+become. Same group structure as staging.yml;
container groups use community.general.incus connection
plugin (the local incus binary, no remote).
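In YAML, such an inventory might look like this (host and group names beyond what the commit states are assumptions):

```yaml
# Sketch of inventory/local.yml: the Incus host is this machine;
# containers are reached via the community.general.incus connection plugin.
all:
  children:
    incus_hosts:
      hosts:
        r720:
          ansible_connection: local
    forgejo_runner:
      hosts:
        forgejo-runner:
          ansible_connection: community.general.incus
```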
inventory/{staging,prod}.yml (modified)
Added `forgejo_runner` group (target of bootstrap_runner.yml
phase 3, reached via community.general.incus from the host).
scripts/bootstrap/bootstrap-local.sh (rewritten)
Five phases: preflight, vault, forgejo, ansible, summary.
Phase 4 calls a single `ansible-playbook` with both
bootstrap_runner.yml + haproxy.yml in sequence.
--ask-become-pass: Ansible prompts ONCE for sudo, holds the
password in memory, and reuses it for every become: true task.
scripts/bootstrap/bootstrap-r720.sh (new)
Symmetric to bootstrap-local.sh but runs as root on the R720.
No SSH preflight, no --ask-become-pass (already root).
Same Ansible playbooks, inventory/local.yml.
scripts/bootstrap/verify-r720.sh (new — replaces verify-remote)
Read-only checks of R720 state. Run as root locally on the R720.
scripts/bootstrap/verify-local.sh (modified)
Cross-host SSH check now fits the env-var-driven SSH_TARGET
pattern (R720_USER may be empty if the alias has User=).
scripts/bootstrap/{bootstrap-remote.sh, verify-remote.sh,
verify-remote-ssh.sh} (DELETED)
Replaced by playbooks/bootstrap_runner.yml + verify-r720.sh.
README.md (rewritten)
Documents the parallel-script architecture, the
no-NOPASSWD-sudo design choice (--ask-become-pass), each
phase's needs, and a refreshed troubleshooting list.
State files unchanged in shape:
laptop: .git/talas-bootstrap/local.state
R720: /var/lib/talas/r720-bootstrap.state
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit e16b749d7f
fix(ansible): drop removed community.general.yaml callback
community.general 12.0.0 removed the `yaml` stdout callback. The
in-tree replacement is the `default` callback plus
`result_format=yaml` (ansible-core ≥ 2.13). Without that swap,
ansible-playbook errors out on startup:
ERROR! [DEPRECATED]: community.general.yaml has been removed.
ansible.cfg:
stdout_callback = yaml ── removed
stdout_callback = default ── added
result_format = yaml ── added
Same human-readable output, no behaviour change.
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 65c20835c1
feat(infra): Ansible IaC scaffolding — common + incus_host roles (Day 5 v1.0.9)
CI status: some checks failed
- Veza CI / Frontend (Web) (push): cancelled
- E2E Playwright / e2e (full) (push): cancelled
- Veza CI / Notify on failure (push): blocked by required conditions
- Veza CI / Rust (Stream Server) (push): successful in 3m27s
- Security Scan / Secret Scanning (gitleaks) (push): successful in 52s
- Veza CI / Backend (Go) (push): successful in 5m32s
Day 5 of ROADMAP_V1.0_LAUNCH.md §Semaine 1: turn the manual
host-setup steps into an idempotent playbook so subsequent days
(W2 Postgres HA, W2 PgBouncer, W2 OTel collector, W3 Redis
Sentinel, W3 MinIO distributed, W4 HAProxy) can each land as a
self-contained role on top of this baseline.
Layout (full tree under infra/ansible/):
ansible.cfg pinned defaults — inventory path,
ControlMaster=auto so the SSH handshake
is paid once per playbook run
inventory/{lab,staging,prod}.yml
three environments. lab is the R720's
local Incus container (10.0.20.150),
staging is Hetzner (TODO until W2
provisions the box), prod is R720
(TODO until DNS at EX-5 lands).
group_vars/all.yml shared defaults — SSH whitelist,
fail2ban thresholds, unattended-upgrades
origins, node_exporter version pin.
playbooks/site.yml entry point. Two plays:
1. common (every host)
2. incus_host (incus_hosts group)
roles/common/ idempotent baseline:
ssh.yml — drop-in
/etc/ssh/sshd_config.d/50-veza-
hardening.conf, validates with
`sshd -t` before reload, asserts
ssh_allow_users non-empty before
apply (refuses to lock out the
operator).
fail2ban.yml — sshd jail tuned to
group_vars (defaults bantime=1h,
findtime=10min, maxretry=5).
unattended_upgrades.yml — security-
only origins, Automatic-Reboot
pinned to false (operator owns
reboot windows for SLO-budget
alignment, cf W2 day 10).
node_exporter.yml — pinned to
1.8.2, runs as a systemd unit
on :9100. Skips download when
--version already matches.
roles/incus_host/ zabbly upstream apt repo + incus +
incus-client install. First-time
`incus admin init --preseed` only when
`incus list` errors (i.e. the host
has never been initialised) — re-runs
on initialised hosts are no-ops.
Configures incusbr0 / 10.99.0.1/24
with NAT + default storage pool.
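The ssh.yml lock-out guard described above reduces to an assert followed by a validated template; a sketch (the handler name and template filename are assumptions):

```yaml
# Sketch: refuse to apply an SSH config that would lock the operator out.
- name: Assert the SSH whitelist is non-empty
  ansible.builtin.assert:
    that: ssh_allow_users | length > 0
    fail_msg: "ssh_allow_users is empty; refusing to apply AllowUsers"

# validate runs `sshd -t` against the rendered file before it replaces
# the live drop-in; a syntax error aborts the task instead of breaking
# sshd on the next reload.
- name: Install the hardening drop-in
  ansible.builtin.template:
    src: 50-veza-hardening.conf.j2
    dest: /etc/ssh/sshd_config.d/50-veza-hardening.conf
    validate: /usr/sbin/sshd -t -f %s
  notify: reload sshd
```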
Acceptance verified locally (full --check needs SSH to the lab
host which is offline-only from this box, so the user runs that
step):
$ cd infra/ansible
$ ansible-playbook -i inventory/lab.yml playbooks/site.yml --syntax-check
playbook: playbooks/site.yml ← clean
$ ansible-playbook -i inventory/lab.yml playbooks/site.yml --list-tasks
21 tasks across 2 plays, all tagged. ← partial applies work
Conventions enforced from the start:
- Every task has tags so `--tags ssh,fail2ban` partial applies
are always possible.
- Sub-task files (ssh.yml, fail2ban.yml, etc.) so the role
main.yml stays a directory of concerns, not a wall of tasks.
- Validators run before reload (sshd -t for sshd_config). The
role refuses to apply changes that would lock the operator out.
- Comments answer "why" — task names + module names already
say "what".
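The first two conventions can be seen in how a role's main.yml might import its sub-task files (an illustrative sketch; `import_tasks` propagates each tag to every task in the imported file):

```yaml
# Sketch: roles/common/tasks/main.yml as a directory of concerns.
- ansible.builtin.import_tasks: ssh.yml
  tags: [ssh]
- ansible.builtin.import_tasks: fail2ban.yml
  tags: [fail2ban]
- ansible.builtin.import_tasks: unattended_upgrades.yml
  tags: [unattended_upgrades]
- ansible.builtin.import_tasks: node_exporter.yml
  tags: [node_exporter]
```

With that shape, `ansible-playbook playbooks/site.yml --tags ssh,fail2ban` touches only those two concerns.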
Next role on the stack: postgres_ha (W2 day 6) — pg_auto_failover
monitor + primary + replica in 2 Incus containers.
SKIP_TESTS=1 — IaC YAML, no app code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>