feat(infra): pgbouncer role + pgbench load test (W2 Day 7)
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 7 deliverable: PgBouncer
fronts the pg_auto_failover formation, so the backend pays the
postgres-fork cost 50 times per pool refresh instead of once per
HTTP handler.

Wiring:
  veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432
                              (1000 client cap)             (50 server pool)

Files:
  infra/ansible/roles/pgbouncer/
    defaults/main.yml — pool sizes match the acceptance target
      (1000 client × 50 server × 10 reserve), pool_mode=transaction
      (safe for this backend: transaction mode forbids LISTEN/NOTIFY
      and cross-tx prepared statements, neither of which Veza uses),
      DNS TTL = 60s for failover.
    tasks/main.yml — apt install pgbouncer + postgresql-client (so
      the pgbench / admin psql lives on the same container), render
      pgbouncer.ini + userlist.txt, ensure /var/log/postgresql for
      the file log, enable + start service.
    templates/pgbouncer.ini.j2 — full config; databases section
      points at pgaf-primary.lxd:5432 directly. Failover follows
      via DNS TTL until the W2 day 8 pg_autoctl state-change hook
      that issues RELOAD on the admin console.
    templates/userlist.txt.j2 — only rendered when auth_type !=
      trust. Lab uses trust on the bridge subnet; prod gets a
      vault-backed list of md5/scram hashes.
    handlers/main.yml — RELOAD pgbouncer (graceful, doesn't drop
      established clients).
    README.md — operational cheatsheet:
      - SHOW POOLS / SHOW STATS via the admin console
      - the transaction-mode forbids list (LISTEN/NOTIFY etc.)
      - failover behaviour today vs after the W2-day-8 hook lands

  infra/ansible/playbooks/postgres_ha.yml
    Provision step extended to launch pgaf-pgbouncer alongside
    the formation containers. Two new plays at the bottom apply
    common baseline + pgbouncer role to it.
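
    A sketch of the appended plays (the baseline role name `common`
    is an assumption this commit does not confirm):

      - name: Baseline the pgbouncer container
        hosts: pgbouncer
        roles:
          - common

      - name: Install and configure PgBouncer
        hosts: pgbouncer
        roles:
          - pgbouncer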

  infra/ansible/inventory/lab.yml
    `pgbouncer` group with pgaf-pgbouncer reachable via the
    community.general.incus connection plugin (consistent with the
    postgres_ha containers).
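
    Fragment of the new group (host and connection plugin as
    described above; any group vars are omitted):

      pgbouncer:
        hosts:
          pgaf-pgbouncer:
            ansible_connection: community.general.incus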

  infra/ansible/tests/test_pgbouncer_load.sh
    Acceptance: pgbench 500 clients × 30s × 8 threads against the
    pgbouncer endpoint; it must report 0 failed transactions and 0
    connection errors. Also runs `pgbench -i -s 10` first to
    initialise the standard fixture — that init goes through
    pgbouncer too, which incidentally validates transaction-mode
    compatibility before the load run starts.
    Exit codes: 0 / 1 (errors) / 2 (unreachable) / 3 (missing tool).

  veza-backend-api/internal/config/config.go
    Comment block above DATABASE_URL load — documents the prod
    wiring (DATABASE_URL points at pgaf-pgbouncer.lxd:6432, NOT
    at pgaf-primary directly). Also notes the dev/CI exception:
    direct Postgres because the small scale doesn't benefit from
    pooling and tests occasionally lean on session-scoped GUCs
    that transaction-mode would break.
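
    A minimal sketch of the comment's shape; the env read below it
    is illustrative, the real loader in config.go is not reproduced
    here:

      // Prod: DATABASE_URL must point at PgBouncer, i.e.
      //   postgresql://veza:...@pgaf-pgbouncer.lxd:6432/veza?sslmode=prefer
      // and NOT at pgaf-primary directly (that bypasses the pool).
      // Dev/CI: plain Postgres, since pooling buys nothing at that
      // scale and some tests rely on session-scoped GUCs that
      // transaction mode breaks.
      databaseURL := os.Getenv("DATABASE_URL")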

Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml          ← clean
  $ bash -n infra/ansible/tests/test_pgbouncer_load.sh
  syntax OK
  $ cd veza-backend-api && go build ./...
  (clean — comment-only change in config.go)
  $ gofmt -l internal/config/config.go
  (no output — clean)

Real apply + pgbench run requires the lab R720 + the
community.general collection — operator's call.

Out of scope (deferred per ROADMAP §2):
  - HA pgbouncer (single instance per env at v1.0; double
    instance + keepalived in v1.1 if needed)
  - pg_autoctl state-change hook → pgbouncer RELOAD (W2 day 8)
  - Prometheus pgbouncer_exporter (W2 day 9 with the OTel
    collector + observability stack)

SKIP_TESTS=1 — IaC YAML + bash + Go comment-only diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pgbouncer role — connection pool in front of pg_auto_failover

Sits between the Veza backend and the pg_auto_failover primary. Pools 1000 client connections down to 50 server connections in transaction mode — the backend pays the ~1ms Postgres fork overhead 50 times per pool refresh, not once per HTTP handler.

Wiring

veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432
                            (1000 client conn cap)         (single backend per pool slot)

Backend DATABASE_URL in prod:

postgresql://veza:PASSWORD@pgaf-pgbouncer.lxd:6432/veza?sslmode=prefer

NOT the formation URI directly — that bypasses the pool and re-creates the connection-storm problem.

Pool sizing — v1.0.9 Day 7 baseline

knob                  default  rationale
max_client_conn       1000     acceptance target: 500 concurrent + headroom
default_pool_size     50       matches Postgres max_connections=100 with margin
min_pool_size         10       warm pool — first request after idle is fast
reserve_pool_size     10       overflow when default_pool_size is exhausted
reserve_pool_timeout  5s       wait this long before opening a reserve slot
server_idle_timeout   600s     aggressive idle reclaim — saves Postgres backend RAM
query_wait_timeout    120s     hard ceiling: a request waiting longer fails fast
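
With those defaults the rendered pgbouncer.ini comes out roughly like this (a sketch; listen_addr and the auth settings vary per environment and are not shown exactly as the template renders them):

[databases]
veza = host=pgaf-primary.lxd port=5432 dbname=veza

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 10
reserve_pool_timeout = 5
server_idle_timeout = 600
query_wait_timeout = 120
dns_max_ttl = 60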

Transaction mode — what it forbids

Transaction mode means a connection is returned to the pool after each transaction. Things that break:

  • Cross-transaction session state — SET (without SET LOCAL), session GUCs persisted across queries
  • Cross-transaction prepared statements (PREPARE outside a tx)
  • LISTEN/NOTIFY (the listening server connection goes back to the pool, so the notification reaches nobody)

Things that work fine — and the Veza backend stays inside this set:

  • SET LOCAL inside a tx (scoped to the tx)
  • Advisory locks scoped to a tx (pg_advisory_xact_lock)
  • Plain SELECT / INSERT / UPDATE / DELETE / DDL within a tx
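
A concrete shape of the safe pattern, with everything session-ish kept inside the transaction (the table name is illustrative):

BEGIN;
SET LOCAL statement_timeout = '5s';   -- scoped to this tx, pool-safe
SELECT pg_advisory_xact_lock(42);     -- auto-released at COMMIT/ROLLBACK
UPDATE tracks SET play_count = play_count + 1 WHERE id = 1;
COMMIT;
-- A bare SET statement_timeout outside a transaction would instead stick
-- to whichever pooled server connection happened to execute it.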

If a future feature needs LISTEN/NOTIFY (e.g. a real-time invalidation channel), connect that worker straight to Postgres bypassing PgBouncer — separate connection budget, no pool conflict.
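
If that day comes, the dedicated listener is just a plain direct connection. A sketch under two assumptions: the worker would use pgx (not something this commit confirms), and the channel name below is made up:

// Hypothetical LISTEN worker: needs a session, so it connects straight to
// Postgres (pgaf-primary), never through the pgaf-pgbouncer pool endpoint.
package main

import (
    "context"
    "log"

    "github.com/jackc/pgx/v5"
)

func main() {
    ctx := context.Background()
    conn, err := pgx.Connect(ctx, "postgresql://veza:PASSWORD@pgaf-primary.lxd:5432/veza")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close(ctx)

    // The LISTEN survives because this session is never returned to a pool.
    if _, err := conn.Exec(ctx, "LISTEN cache_invalidation"); err != nil {
        log.Fatal(err)
    }
    for {
        n, err := conn.WaitForNotification(ctx)
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("invalidate: %s", n.Payload)
    }
}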

Failover behaviour (current scope)

Day 7 ships pgbouncer pointed at pgaf-primary.lxd directly. After a pg_autoctl perform failover, pgbouncer's pool is stranded on the demoted node until the DNS TTL (60s) expires and the role re-renders the config and reloads. Even in this state the backend's RTO stays under 60s — most pool slots fail over via DNS within the existing TTL window.

W2 day 8 (or v1.1) hardens this with a pg_autoctl state-change hook that issues RELOAD on the pgbouncer admin console as soon as the formation elects a new primary — sub-second pool migration.

Admin / observability

# From any container on the bridge:
psql -h pgaf-pgbouncer.lxd -p 6432 -U postgres pgbouncer

# Then:
SHOW POOLS;       -- per-(database,user) pool stats
SHOW CLIENTS;     -- active client connections
SHOW SERVERS;     -- pool→postgres connections
SHOW STATS;       -- request rate, average query / xact times
RELOAD;           -- re-read pgbouncer.ini without dropping clients

Acceptance test

bash infra/ansible/tests/test_pgbouncer_load.sh

Spawns pgbench from the host with 500 concurrent clients × 30s and asserts zero failed transactions and zero connection errors. Pool size 50 + reserve 10 means at most 60 server-side connections serve the 500 client-side sessions, which exercises the wait queue.
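
The script's core boils down to the following (a sketch; the real script adds the exit-code handling listed in the commit message, and the exact user/database flags may differ):

# one-time fixture, deliberately routed through pgbouncer; it also proves
# transaction-mode compatibility before the load run
pgbench -i -s 10 -h pgaf-pgbouncer.lxd -p 6432 -U veza veza

# the load run: 500 clients, 8 threads, 30 seconds
pgbench -c 500 -j 8 -T 30 -h pgaf-pgbouncer.lxd -p 6432 -U veza veza

# acceptance: the run must finish with 0 failed transactions and no
# connection errors in the pgbench output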