All checks were successful
Veza CI / Rust (Stream Server) (push) Successful in 3m49s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 58s
Veza CI / Backend (Go) (push) Successful in 5m59s
Veza CI / Frontend (Web) (push) Successful in 15m22s
E2E Playwright / e2e (full) (push) Successful in 19m34s
Veza CI / Notify on failure (push) Has been skipped
ROADMAP_V1.0_LAUNCH.md §Semaine 2, day 7 deliverable: PgBouncer now
fronts the pg_auto_failover formation, so the backend pays the
postgres-fork cost 50 times per server-pool refresh instead of once
per HTTP handler.
Wiring:
veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432
                            (1000 client cap)              (50 server pool)
Files:
infra/ansible/roles/pgbouncer/
defaults/main.yml — pool sizes match the acceptance target
(1000 clients, 50 server connections, 10 reserve); pool_mode=transaction
is safe for this backend because transaction pooling only forbids
session-level features (LISTEN/NOTIFY, cross-transaction prepared
statements) and Veza uses none of them. DNS TTL = 60s so a failover
is followed promptly (sketch below).
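
A minimal sketch of those defaults under assumed variable names (the
real defaults/main.yml may name them differently):

  # roles/pgbouncer/defaults/main.yml (sketch; variable names are assumptions)
  pgbouncer_listen_port: 6432
  pgbouncer_pool_mode: transaction
  pgbouncer_max_client_conn: 1000      # client cap from the acceptance target
  pgbouncer_default_pool_size: 50      # server connections held against the primary
  pgbouncer_reserve_pool_size: 10
  pgbouncer_dns_max_ttl: 60            # seconds; lets clients follow a failover via DNS
  pgbouncer_backend_host: pgaf-primary.lxd
  pgbouncer_backend_port: 5432
  pgbouncer_auth_type: trust           # lab default; prod overrides with md5/scram
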
tasks/main.yml — apt install pgbouncer + postgresql-client (so
pgbench and the admin psql live on the same container), render
pgbouncer.ini + userlist.txt, ensure /var/log/postgresql exists for
the file log, enable + start the service. Sketched below.
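
Roughly what that sequence looks like, sketched with the stock
apt/template/file/service modules (task names, file modes, and the
auth-type gate are assumptions, not the actual tasks/main.yml):

  # roles/pgbouncer/tasks/main.yml (sketch)
  - name: Install pgbouncer and the Postgres client tools
    ansible.builtin.apt:
      name: [pgbouncer, postgresql-client]
      state: present
      update_cache: true

  - name: Render pgbouncer.ini
    ansible.builtin.template:
      src: pgbouncer.ini.j2
      dest: /etc/pgbouncer/pgbouncer.ini
      mode: "0640"
    notify: RELOAD pgbouncer

  - name: Render userlist.txt (only when not using trust auth)
    ansible.builtin.template:
      src: userlist.txt.j2
      dest: /etc/pgbouncer/userlist.txt
      mode: "0600"
    when: pgbouncer_auth_type != 'trust'
    notify: RELOAD pgbouncer

  - name: Ensure the log directory exists
    ansible.builtin.file:
      path: /var/log/postgresql
      state: directory
      mode: "0755"

  - name: Enable and start pgbouncer
    ansible.builtin.service:
      name: pgbouncer
      state: started
      enabled: true
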
templates/pgbouncer.ini.j2 — full config; the databases section
points at pgaf-primary.lxd:5432 directly. Failover is followed via
the DNS TTL until the W2 day 8 pg_autoctl state-change hook lands,
which will issue RELOAD on the admin console.
templates/userlist.txt.j2 — only rendered when auth_type !=
trust. Lab uses trust on the bridge subnet; prod gets a
vault-backed list of md5/scram hashes.
handlers/main.yml — RELOAD pgbouncer (graceful, doesn't drop
established clients).
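
The handler body is an assumption, but the usual shape is the service
module's reload state, which sends SIGHUP and keeps established
clients connected:

  # roles/pgbouncer/handlers/main.yml (sketch)
  - name: RELOAD pgbouncer
    ansible.builtin.service:
      name: pgbouncer
      state: reloaded   # re-reads pgbouncer.ini without dropping live client connections
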
README.md — operational cheatsheet:
- SHOW POOLS / SHOW STATS via the admin console
- the transaction-mode forbidden-features list (LISTEN/NOTIFY etc.)
- failover behaviour today vs after the W2-day-8 hook lands
infra/ansible/playbooks/postgres_ha.yml
Provision step extended to launch pgaf-pgbouncer alongside
the formation containers. Two new plays at the bottom apply
common baseline + pgbouncer role to it.
infra/ansible/inventory/lab.yml
`pgbouncer` group with pgaf-pgbouncer reachable via the
community.general.incus connection plugin (consistent with the
postgres_ha containers).
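
The group addition looks roughly like this (a sketch; no host vars
are claimed beyond the connection plugin itself):

  # inventory/lab.yml excerpt (sketch)
  pgbouncer:
    hosts:
      pgaf-pgbouncer:
        ansible_connection: community.general.incus
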
infra/ansible/tests/test_pgbouncer_load.sh
Acceptance: pgbench 500 clients × 30s × 8 threads against the
pgbouncer endpoint, must report 0 failed transactions and 0
connection errors. Also runs `pgbench -i -s 10` first to
initialise the standard fixture — that init goes through
pgbouncer too, which incidentally validates transaction-mode
compatibility before the load run starts.
Exit codes: 0 / 1 (errors) / 2 (unreachable) / 3 (missing tool).
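
The script itself is plain bash; for reference, the same acceptance
run re-expressed as Ansible tasks (a sketch only; the database and
user names are assumptions):

  # Acceptance run as Ansible tasks (sketch; db and user are assumptions)
  - name: Initialise the standard pgbench fixture through pgbouncer
    ansible.builtin.command:
      cmd: pgbench -i -s 10 -h pgaf-pgbouncer.lxd -p 6432 -U postgres postgres

  - name: Run the load test of 500 clients x 8 threads for 30 seconds
    ansible.builtin.command:
      cmd: pgbench -c 500 -j 8 -T 30 -h pgaf-pgbouncer.lxd -p 6432 -U postgres postgres
    changed_when: false   # a non-zero pgbench exit code fails the task
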
veza-backend-api/internal/config/config.go
Comment block above DATABASE_URL load — documents the prod
wiring (DATABASE_URL points at pgaf-pgbouncer.lxd:6432, NOT
at pgaf-primary directly). Also notes the dev/CI exception:
direct Postgres because the small scale doesn't benefit from
pooling and tests occasionally lean on session-scoped GUCs
that transaction-mode would break.
Acceptance verified locally:
$ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
--syntax-check
playbook: playbooks/postgres_ha.yml ← clean
$ bash -n infra/ansible/tests/test_pgbouncer_load.sh
syntax OK
$ cd veza-backend-api && go build ./...
(clean — comment-only change in config.go)
$ gofmt -l internal/config/config.go
(no output — clean)
Real apply + pgbench run requires the lab R720 + the
community.general collection — operator's call.
Out of scope (deferred per ROADMAP §2):
- HA pgbouncer (single instance per env at v1.0; a second instance
plus keepalived in v1.1 if needed)
- pg_autoctl state-change hook → pgbouncer RELOAD (W2 day 8)
- Prometheus pgbouncer_exporter (W2 day 9 with the OTel
collector + observability stack)
SKIP_TESTS=1 — IaC YAML + bash + Go comment-only diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
infra/ansible/playbooks/postgres_ha.yml
# Postgres HA playbook — provisions 4 Incus containers on the
# `incus_hosts` group (lab/staging/prod), lays down the
# pg_auto_failover formation across them, and fronts it with PgBouncer.
#
# Topology:
#   - pgaf-monitor   — the state machine (single instance)
#   - pgaf-primary   — first data node, becomes primary at first boot
#   - pgaf-replica   — second data node, becomes hot-standby
#   - pgaf-pgbouncer — connection pooler in front of the primary (port 6432)
#
# v1.0.9 Day 6 — single host (R720 lab) for now. W2 day 7+ moves
# the data nodes onto separate physical hosts when Hetzner standby
# is provisioned. The formation works the same either way.
#
# Run with:
#   ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml --check
#   ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml
---
- name: Provision Incus containers for the Postgres formation + pgbouncer
  hosts: incus_hosts
  become: true
  gather_facts: true
  tasks:
    - name: Launch pgaf-monitor + pgaf-primary + pgaf-replica + pgaf-pgbouncer
      ansible.builtin.shell:
        cmd: |
          set -e
          for ct in pgaf-monitor pgaf-primary pgaf-replica pgaf-pgbouncer; do
            if ! incus info "$ct" >/dev/null 2>&1; then
              # Marker line so the changed_when below can detect a real launch.
              echo "incus launch $ct"
              incus launch images:ubuntu/22.04 "$ct"
              # Wait for cloud-init / network to settle.
              for _ in $(seq 1 30); do
                if incus exec "$ct" -- cloud-init status 2>/dev/null | grep -q "status: done"; then
                  break
                fi
                sleep 1
              done
              # Install python3 inside the container so Ansible can
              # speak to it via the incus connection plugin.
              incus exec "$ct" -- apt-get update
              incus exec "$ct" -- apt-get install -y python3 python3-apt
            fi
          done
      args:
        executable: /bin/bash
      register: provision_result
      changed_when: "'incus launch' in provision_result.stdout"
      tags: [postgres_ha, pgbouncer, provision]

    - name: Refresh inventory so the new containers are reachable via the incus connection
      ansible.builtin.meta: refresh_inventory

- name: Apply common baseline to the formation containers
  hosts: postgres_ha
  become: true
  gather_facts: true
  roles:
    - common

- name: Bring up the pg_auto_failover monitor first (formation depends on it)
  hosts: postgres_ha_monitor
  become: true
  gather_facts: true
  roles:
    - postgres_ha

- name: Bring up the data nodes (primary registers first, replica registers second)
  hosts: postgres_ha_nodes
  become: true
  gather_facts: true
  serial: 1  # primary must register before replica — pg_auto_failover assigns roles by registration order
  roles:
    - postgres_ha

# v1.0.9 Day 7: PgBouncer fronts the formation. Common baseline first
# (SSH + node_exporter + fail2ban), then the pgbouncer role itself.
- name: Apply common baseline to the pgbouncer container
  hosts: pgbouncer
  become: true
  gather_facts: true
  roles:
    - common

- name: Install + configure PgBouncer pointing at the formation
  hosts: pgbouncer
  become: true
  gather_facts: true
  roles:
    - pgbouncer