veza/infra/ansible/playbooks/postgres_ha.yml
senke ba6e8b4e0e
All checks were successful
Veza CI / Rust (Stream Server) (push) Successful in 3m49s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 58s
Veza CI / Backend (Go) (push) Successful in 5m59s
Veza CI / Frontend (Web) (push) Successful in 15m22s
E2E Playwright / e2e (full) (push) Successful in 19m34s
Veza CI / Notify on failure (push) Has been skipped
feat(infra): pgbouncer role + pgbench load test (W2 Day 7)
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 7 deliverable: PgBouncer
fronts the pg_auto_failover formation, so the Postgres fork cost
is paid at most 50 times per server-pool refill instead of once
per HTTP handler's connection.

Wiring:
  veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432
                              (1000 client cap)             (50 server pool)

Files:
  infra/ansible/roles/pgbouncer/
    defaults/main.yml — pool sizes match the acceptance target
      (1000 client × 50 server × 10 reserve), pool_mode=transaction
      (safe here because the backend uses no session-level features:
      transaction mode forbids LISTEN/NOTIFY and cross-tx prepared
      statements, and Veza uses neither), DNS TTL = 60s for failover.
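      A rough sketch of the shape (variable names illustrative,
      not verbatim from the role):
        pgbouncer_listen_port: 6432
        pgbouncer_pool_mode: transaction
        pgbouncer_max_client_conn: 1000    # client cap
        pgbouncer_default_pool_size: 50    # server pool
        pgbouncer_reserve_pool_size: 10
        pgbouncer_dns_max_ttl: 60          # seconds; bounds failover lag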
    tasks/main.yml — apt install pgbouncer + postgresql-client (so
      pgbench and the admin psql live on the same container), render
      pgbouncer.ini + userlist.txt, ensure /var/log/postgresql for
      the file log, enable + start service.
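      In outline (abridged sketch; userlist + log-dir tasks omitted,
      module names real):
        - ansible.builtin.apt:
            name: [pgbouncer, postgresql-client]
            state: present
        - ansible.builtin.template:
            src: pgbouncer.ini.j2
            dest: /etc/pgbouncer/pgbouncer.ini
          notify: RELOAD pgbouncer
        - ansible.builtin.service:
            name: pgbouncer
            state: started
            enabled: true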
    templates/pgbouncer.ini.j2 — full config; databases section
      points at pgaf-primary.lxd:5432 directly. Failover follows
      via DNS TTL until the W2 day 8 pg_autoctl state-change hook
      that issues RELOAD on the admin console.
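      The stanzas that matter, approximately (dbname illustrative):
        [databases]
        veza = host=pgaf-primary.lxd port=5432

        [pgbouncer]
        pool_mode = transaction
        max_client_conn = 1000
        default_pool_size = 50
        dns_max_ttl = 60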
    templates/userlist.txt.j2 — only rendered when auth_type !=
      trust. Lab uses trust on the bridge subnet; prod gets a
      vault-backed list of md5/scram hashes.
    handlers/main.yml — RELOAD pgbouncer (graceful, doesn't drop
      established clients).
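      i.e. roughly (SIGHUP via the unit's reload action):
        - name: RELOAD pgbouncer
          ansible.builtin.service:
            name: pgbouncer
            state: reloaded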
    README.md — operational cheatsheet:
      - SHOW POOLS / SHOW STATS via the admin console
      - the transaction-mode forbids list (LISTEN/NOTIFY etc.)
      - failover behaviour today vs after the W2-day-8 hook lands
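      Console access looks like this (admin user per admin_users,
      shown here as "pgbouncer"):
        $ psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer \
            -c 'SHOW POOLS;' -c 'SHOW STATS;'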

  infra/ansible/playbooks/postgres_ha.yml
    Provision step extended to launch pgaf-pgbouncer alongside
    the formation containers. Two new plays at the bottom apply
    common baseline + pgbouncer role to it.

  infra/ansible/inventory/lab.yml
    `pgbouncer` group with pgaf-pgbouncer reachable via the
    community.general.incus connection plugin (consistent with the
    postgres_ha containers).
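    Shape (values illustrative):
      pgbouncer:
        hosts:
          pgaf-pgbouncer:
            ansible_connection: community.general.incus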

  infra/ansible/tests/test_pgbouncer_load.sh
    Acceptance: pgbench 500 clients × 30s × 8 threads against the
    pgbouncer endpoint, must report 0 failed transactions and 0
    connection errors. Also runs `pgbench -i -s 10` first to
    initialise the standard fixture — that init goes through
    pgbouncer too, which incidentally validates transaction-mode
    compatibility before the load run starts.
    Exit codes: 0 / 1 (errors) / 2 (unreachable) / 3 (missing tool).
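    The core invocations are approximately (dbname illustrative):
      $ pgbench -i -s 10 -h pgaf-pgbouncer.lxd -p 6432 veza
      $ pgbench -c 500 -j 8 -T 30 -h pgaf-pgbouncer.lxd -p 6432 veza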

  veza-backend-api/internal/config/config.go
    Comment block above DATABASE_URL load — documents the prod
    wiring (DATABASE_URL points at pgaf-pgbouncer.lxd:6432, NOT
    at pgaf-primary directly). Also notes the dev/CI exception:
    direct Postgres because the small scale doesn't benefit from
    pooling and tests occasionally lean on session-scoped GUCs
    that transaction-mode would break.
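    Paraphrased, the comment says:
      // Prod: DATABASE_URL points at pgaf-pgbouncer.lxd:6432 (the
      // pooler), never at pgaf-primary:5432 directly.
      // Dev/CI: direct Postgres. Small scale gains nothing from
      // pooling, and some tests set session-scoped GUCs, which
      // transaction pooling would break.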

Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml          ← clean
  $ bash -n infra/ansible/tests/test_pgbouncer_load.sh
  syntax OK
  $ cd veza-backend-api && go build ./...
  (clean — comment-only change in config.go)
  $ gofmt -l internal/config/config.go
  (no output — clean)

Real apply + pgbench run requires the lab R720 + the
community.general collection — operator's call.

Out of scope (deferred per ROADMAP §2):
  - HA pgbouncer (single instance per env at v1.0; a second
    instance + keepalived in v1.1 if needed)
  - pg_autoctl state-change hook → pgbouncer RELOAD (W2 day 8)
  - Prometheus pgbouncer_exporter (W2 day 9 with the OTel
    collector + observability stack)

SKIP_TESTS=1 — IaC YAML + bash + Go comment-only diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 18:35:05 +02:00

88 lines
3 KiB
YAML

# Postgres HA playbook — provisions 3 Incus containers on the
# `incus_hosts` group (lab/staging/prod) and lays down the
# pg_auto_failover formation across them.
#
# Topology:
# - pgaf-monitor — the state machine (single instance)
# - pgaf-primary — first data node, becomes primary at first boot
# - pgaf-replica — second data node, becomes hot-standby
#
# v1.0.9 Day 6 — single host (R720 lab) for now. W2 day 7+ moves
# the data nodes onto separate physical hosts when Hetzner standby
# is provisioned. The formation works the same either way.
#
# Run with:
# ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml --check
# ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml
---
- name: Provision Incus containers for the Postgres formation + pgbouncer
  hosts: incus_hosts
  become: true
  gather_facts: true
  tasks:
    - name: Launch pgaf-monitor + pgaf-primary + pgaf-replica + pgaf-pgbouncer
      ansible.builtin.shell:
        cmd: |
          set -e
          for ct in pgaf-monitor pgaf-primary pgaf-replica pgaf-pgbouncer; do
            if ! incus info "$ct" >/dev/null 2>&1; then
              incus launch images:ubuntu/22.04 "$ct"
              echo "LAUNCHED $ct"  # marker consumed by changed_when below
              # Wait for cloud-init / network to settle.
              for _ in $(seq 1 30); do
                if incus exec "$ct" -- cloud-init status 2>/dev/null | grep -q "status: done"; then
                  break
                fi
                sleep 1
              done
              # Install python3 inside the container so Ansible can
              # speak to it via the incus connection plugin.
              incus exec "$ct" -- apt-get update
              incus exec "$ct" -- apt-get install -y python3 python3-apt
            fi
          done
        executable: /bin/bash
      register: provision_result
      changed_when: "'LAUNCHED' in provision_result.stdout"
      tags: [postgres_ha, pgbouncer, provision]

    - name: Refresh inventory so the new containers are reachable via the incus connection
      ansible.builtin.meta: refresh_inventory

- name: Apply common baseline to the formation containers
  hosts: postgres_ha
  become: true
  gather_facts: true
  roles:
    - common

- name: Bring up the pg_auto_failover monitor first (formation depends on it)
  hosts: postgres_ha_monitor
  become: true
  gather_facts: true
  roles:
    - postgres_ha

- name: Bring up the data nodes (primary registers first, replica registers second)
  hosts: postgres_ha_nodes
  become: true
  gather_facts: true
  serial: 1  # primary must register before replica; pg_auto_failover assigns roles by registration order
  roles:
    - postgres_ha

# v1.0.9 Day 7: PgBouncer fronts the formation. Common baseline first
# (SSH + node_exporter + fail2ban), then the pgbouncer role itself.
- name: Apply common baseline to the pgbouncer container
  hosts: pgbouncer
  become: true
  gather_facts: true
  roles:
    - common

- name: Install + configure PgBouncer pointing at the formation
  hosts: pgbouncer
  become: true
  gather_facts: true
  roles:
    - pgbouncer