ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 7 deliverable: PgBouncer
fronts the pg_auto_failover formation, so the backend pays the
postgres-fork cost 50 times per pool refresh instead of once per
HTTP handler.
Wiring:
veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432
(1000 client cap) (50 server pool)
Files:
infra/ansible/roles/pgbouncer/
defaults/main.yml — pool sizes match the acceptance target
(1000 client × 50 server × 10 reserve); pool_mode=transaction is
the only safe mode given the backend's usage: transaction mode
forbids LISTEN/NOTIFY and cross-tx prepared statements, and Veza
uses neither. DNS TTL = 60s for failover.
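The rendered config can be pictured as follows — a hedged sketch using the defaults above; the real template lives in templates/pgbouncer.ini.j2 and its exact key layout may differ:

```shell
# Sketch of the config the role renders (values from the defaults above;
# the real pgbouncer.ini.j2 may lay out keys differently).
cat > /tmp/pgbouncer-sketch.ini <<'EOF'
[databases]
veza = host=pgaf-primary.lxd port=5432 dbname=veza

[pgbouncer]
listen_addr = *
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 10
reserve_pool_timeout = 5
server_idle_timeout = 600
query_wait_timeout = 120
EOF
grep -q 'pool_mode = transaction' /tmp/pgbouncer-sketch.ini && echo "transaction mode set"
```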
tasks/main.yml — apt install pgbouncer + postgresql-client (so
the pgbench / admin psql lives on the same container), render
pgbouncer.ini + userlist.txt, ensure /var/log/postgresql for
the file log, enable + start service.
templates/pgbouncer.ini.j2 — full config; databases section
points at pgaf-primary.lxd:5432 directly. Failover follows
via DNS TTL until the W2 day 8 pg_autoctl state-change hook
that issues RELOAD on the admin console.
templates/userlist.txt.j2 — only rendered when auth_type !=
trust. Lab uses trust on the bridge subnet; prod gets a
vault-backed list of md5/scram hashes.
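For the md5 case, a PgBouncer userlist entry is "md5" plus the hex md5 of password concatenated with username — a sketch with an invented user and password:

```shell
# How a userlist.txt md5 entry is built (user and password invented here);
# PgBouncer's md5 format is "md5" + md5hex(password || username).
user=veza
pass=s3cret
hash="md5$(printf '%s%s' "$pass" "$user" | md5sum | cut -d' ' -f1)"
printf '"%s" "%s"\n' "$user" "$hash"
```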
handlers/main.yml — RELOAD pgbouncer (graceful, doesn't drop
established clients).
README.md — operational cheatsheet:
- SHOW POOLS / SHOW STATS via the admin console
- the transaction-mode forbids list (LISTEN/NOTIFY etc.)
- failover behaviour today vs after the W2-day-8 hook lands
infra/ansible/playbooks/postgres_ha.yml
Provision step extended to launch pgaf-pgbouncer alongside
the formation containers. Two new plays at the bottom apply
common baseline + pgbouncer role to it.
infra/ansible/inventory/lab.yml
`pgbouncer` group with pgaf-pgbouncer reachable via the
community.general.incus connection plugin (consistent with the
postgres_ha containers).
infra/ansible/tests/test_pgbouncer_load.sh
Acceptance: pgbench 500 clients × 30s × 8 threads against the
pgbouncer endpoint, must report 0 failed transactions and 0
connection errors. Also runs `pgbench -i -s 10` first to
initialise the standard fixture — that init goes through
pgbouncer too, which incidentally validates transaction-mode
compatibility before the load run starts.
Exit codes: 0 / 1 (errors) / 2 (unreachable) / 3 (missing tool).
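The zero-failures check amounts to parsing pgbench's summary — a sketch against a hard-coded summary line in the format recent pgbench versions print; the real script's parsing may differ:

```shell
# Sketch of the pass/fail parse (summary line hard-coded here; the real
# script reads it from live pgbench output).
summary='number of failed transactions: 0 (0.000%)'
failed=$(printf '%s\n' "$summary" | awk -F'[: ]+' '/failed transactions/ {print $5}')
if [ "$failed" = 0 ]; then
  echo "PASS: 0 failed transactions"
else
  echo "FAIL: $failed failed transactions"
fi
```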
veza-backend-api/internal/config/config.go
Comment block above DATABASE_URL load — documents the prod
wiring (DATABASE_URL points at pgaf-pgbouncer.lxd:6432, NOT
at pgaf-primary directly). Also notes the dev/CI exception:
direct Postgres because the small scale doesn't benefit from
pooling and tests occasionally lean on session-scoped GUCs
that transaction-mode would break.
Acceptance verified locally:
$ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
--syntax-check
playbook: playbooks/postgres_ha.yml ← clean
$ bash -n infra/ansible/tests/test_pgbouncer_load.sh
syntax OK
$ cd veza-backend-api && go build ./...
(clean — comment-only change in config.go)
$ gofmt -l internal/config/config.go
(no output — clean)
Real apply + pgbench run requires the lab R720 + the
community.general collection — operator's call.
Out of scope (deferred per ROADMAP §2):
- HA pgbouncer (single instance per env at v1.0; double
instance + keepalived in v1.1 if needed)
- pg_autoctl state-change hook → pgbouncer RELOAD (W2 day 8)
- Prometheus pgbouncer_exporter (W2 day 9 with the OTel
collector + observability stack)
SKIP_TESTS=1 — IaC YAML + bash + Go comment-only diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pgbouncer role — connection pool in front of pg_auto_failover
Sits between the Veza backend and the pg_auto_failover primary. Pools 1000 client connections down to 50 server connections in transaction mode — the backend pays the ~1ms Postgres fork overhead 50 times per pool refresh, not once per HTTP handler.
Wiring
veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432
(1000 client conn cap) (single backend per pool slot)
Backend DATABASE_URL in prod:
postgresql://veza:PASSWORD@pgaf-pgbouncer.lxd:6432/veza?sslmode=prefer
NOT the formation URI directly — that bypasses the pool and re-creates the connection-storm problem.
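A quick guard for that rule — a sketch that classifies a DATABASE_URL by port (6432 is the PgBouncer listener above; anything else bypasses the pool):

```shell
# Classify a DATABASE_URL by port; 6432 is the PgBouncer listener above.
DATABASE_URL='postgresql://veza:PASSWORD@pgaf-pgbouncer.lxd:6432/veza?sslmode=prefer'
case "$DATABASE_URL" in
  *:6432/*) echo "pooled (via pgbouncer)" ;;
  *)        echo "direct (bypasses the pool)" ;;
esac
```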
Pool sizing — v1.0.9 Day 7 baseline
| knob | default | rationale |
|---|---|---|
| `max_client_conn` | 1000 | acceptance target: 500 concurrent + headroom |
| `default_pool_size` | 50 | matches Postgres max_connections=100 with margin |
| `min_pool_size` | 10 | warm pool — first request after idle is fast |
| `reserve_pool_size` | 10 | overflow when default_pool_size is exhausted |
| `reserve_pool_timeout` | 5s | wait this long before opening a reserve slot |
| `server_idle_timeout` | 600s | aggressive idle reclaim — saves Postgres backend RAM |
| `query_wait_timeout` | 120s | hard ceiling: a request waiting longer fails fast |
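These numbers imply the queueing behaviour under the acceptance load — a back-of-envelope check:

```shell
# At peak: 500 acceptance clients vs 50 + 10 server-side slots (table above).
default_pool_size=50
reserve_pool_size=10
clients=500
server_slots=$((default_pool_size + reserve_pool_size))
queued=$((clients - server_slots))
echo "$server_slots server connections serve $clients clients; up to $queued wait in the queue"
```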
Transaction mode — what it forbids
Transaction mode means a connection is returned to the pool after each transaction. Things that break:
- Cross-transaction session state — `SET` (without `SET LOCAL`), session GUCs persisted across queries
- Cross-transaction prepared statements (`PREPARE` outside a tx)
- `LISTEN`/`NOTIFY` (the listener returns to the pool, the notification goes nowhere)

Things that work fine — and the Veza backend stays inside this set:

- `SET LOCAL` inside a tx (scoped to the tx)
- Advisory locks scoped to a tx (`pg_advisory_xact_lock`)
- Plain SELECT / INSERT / UPDATE / DELETE / DDL within a tx
If a future feature needs LISTEN/NOTIFY (e.g. a real-time invalidation channel), connect that worker straight to Postgres bypassing PgBouncer — separate connection budget, no pool conflict.
Failover behaviour (current scope)
Day 7 ships pgbouncer pointed at pgaf-primary.lxd directly. After a pg_autoctl perform failover, pgbouncer's pool is stranded on the demoted node until the DNS TTL (60s) expires or the role re-renders the config and reloads — whichever comes first. RTO stays under 60s for the backend even in this state: most pool slots fail over via DNS within the existing TTL window.
W2 day 8 (or v1.1) hardens this with a pg_autoctl state-change hook that issues RELOAD on the pgbouncer admin console as soon as the formation elects a new primary — sub-second pool migration.
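That hook could take roughly this shape — a sketch with invented names; `reload_pool` is stubbed here, where the real hook would issue `RELOAD` on the admin console via psql:

```shell
# Hypothetical state-change hook (names invented). pg_autoctl can run a
# command on formation state changes; the real reload_pool would be:
#   psql -h pgaf-pgbouncer.lxd -p 6432 -U postgres pgbouncer -c 'RELOAD;'
reload_pool() { echo "RELOAD sent to pgbouncer"; }   # stub for illustration

on_state_change() {
  new_role=$1
  if [ "$new_role" = primary ]; then
    reload_pool                       # new primary elected: repoint the pool
  else
    echo "no-op for role $new_role"   # demotions need no reload here
  fi
}

on_state_change primary
on_state_change secondary
```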
Admin / observability
```shell
# From any container on the bridge:
psql -h pgaf-pgbouncer.lxd -p 6432 -U postgres pgbouncer
```

Then, on the admin console:

```sql
SHOW POOLS;   -- per-(database,user) pool stats
SHOW CLIENTS; -- active client connections
SHOW SERVERS; -- pool→postgres connections
SHOW STATS;   -- request counts, total/avg query duration
RELOAD;       -- re-read pgbouncer.ini without dropping clients
```
Acceptance test
```shell
bash infra/ansible/tests/test_pgbouncer_load.sh
```
Spawns pgbench from the host with 500 concurrent clients × 30s and asserts zero connection errors. Pool size = 50 + reserve 10 forces 60 server-side connections to serve 500 client-side, exercising the queue.