veza/infra/ansible/tests/test_pgbouncer_load.sh
senke ba6e8b4e0e
feat(infra): pgbouncer role + pgbench load test (W2 Day 7)
ROADMAP_V1.0_LAUNCH.md §Semaine 2 day 7 deliverable: PgBouncer
fronts the pg_auto_failover formation, so the backend pays the
postgres-fork cost 50 times per pool refresh instead of once per
HTTP handler.

Wiring:
  veza-backend-api ──libpq──▶ pgaf-pgbouncer:6432 ──libpq──▶ pgaf-primary:5432
                              (1000 client cap)             (50 server pool)

Files:
  infra/ansible/roles/pgbouncer/
    defaults/main.yml — pool sizes match the acceptance target
      (1000 client × 50 server × 10 reserve), pool_mode=transaction
      (the only safe mode given the backend's session usage —
      LISTEN/NOTIFY and cross-tx prepared statements are forbidden,
      neither of which Veza uses), DNS TTL = 60s for failover.
    tasks/main.yml — apt install pgbouncer + postgresql-client (so
      the pgbench / admin psql lives on the same container), render
      pgbouncer.ini + userlist.txt, ensure /var/log/postgresql for
      the file log, enable + start service.
    templates/pgbouncer.ini.j2 — full config; databases section
      points at pgaf-primary.lxd:5432 directly. Failover follows
      via DNS TTL until the W2 day 8 pg_autoctl state-change hook
      that issues RELOAD on the admin console.
    templates/userlist.txt.j2 — only rendered when auth_type !=
      trust. Lab uses trust on the bridge subnet; prod gets a
      vault-backed list of md5/scram hashes.
    handlers/main.yml — RELOAD pgbouncer (graceful, doesn't drop
      established clients).
    README.md — operational cheatsheet:
      - SHOW POOLS / SHOW STATS via the admin console
      - the transaction-mode forbids list (LISTEN/NOTIFY etc.)
      - failover behaviour today vs after the W2-day-8 hook lands
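
The SHOW POOLS / SHOW STATS entries in the cheatsheet could be driven
by a small wrapper like the one below. This is a minimal sketch, not
the README's actual content: the function name pgb_admin and the
PGBOUNCER_ADMIN_USER variable are hypothetical, and it assumes the
admin user is listed in admin_users or stats_users in pgbouncer.ini.
The only fixed part is the special database name "pgbouncer", which
selects the admin console.

```shell
#!/usr/bin/env bash
# Sketch: run one command on the PgBouncer admin console via psql.
# Assumption: the admin user name defaults to "pgbouncer" (hypothetical;
# it must match admin_users or stats_users in pgbouncer.ini).

# pgb_admin COMMAND: send COMMAND to the admin console.
# -X skips ~/.psqlrc so local settings cannot alter the output.
pgb_admin() {
  psql -X \
    -h "${PGBOUNCER_HOST:-pgaf-pgbouncer.lxd}" \
    -p "${PGBOUNCER_PORT:-6432}" \
    -U "${PGBOUNCER_ADMIN_USER:-pgbouncer}" \
    -d pgbouncer \
    -c "$1"
}
```

Typical calls would be pgb_admin 'SHOW POOLS;' to see client and
server connection counts per pool, pgb_admin 'SHOW STATS;' for
per-database traffic counters, and pgb_admin 'RELOAD;' to re-read
the configuration without dropping established clients.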

  infra/ansible/playbooks/postgres_ha.yml
    Provision step extended to launch pgaf-pgbouncer alongside
    the formation containers. Two new plays at the bottom apply
    common baseline + pgbouncer role to it.

  infra/ansible/inventory/lab.yml
    `pgbouncer` group with pgaf-pgbouncer reachable via the
    community.general.incus connection plugin (consistent with the
    postgres_ha containers).

  infra/ansible/tests/test_pgbouncer_load.sh
    Acceptance: pgbench 500 clients × 30s × 8 threads against the
    pgbouncer endpoint, must report 0 failed transactions and 0
    connection errors. Also runs `pgbench -i -s 10` first to
    initialise the standard fixture — that init goes through
    pgbouncer too, which incidentally validates transaction-mode
    compatibility before the load run starts.
    Exit codes: 0 / 1 (errors) / 2 (unreachable) / 3 (missing tool).
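
The acceptance check above boils down to two counters extracted from
the pgbench output. The sketch below isolates that parsing so it can
be tried on canned text; the function names count_failed_tx and
count_conn_errors are hypothetical and do not appear in the script.
It assumes the "number of failed transactions" summary line is
present, which to my knowledge is only printed by pgbench 15 and
later, so a missing line is treated as 0.

```shell
#!/usr/bin/env bash
# Sketch: derive the two acceptance counters from pgbench output on stdin.

# count_failed_tx: print N from "number of failed transactions: N (X.XX%)".
# Prints 0 when the line is absent (older pgbench releases omit it).
count_failed_tx() {
  awk '/number of failed transactions:/ { n = $5 } END { print n + 0 }'
}

# count_conn_errors: print how many lines mention a refused, reset, or
# timed-out connection. grep -c exits 1 on zero matches, hence "|| true".
count_conn_errors() {
  grep -ciE 'connection (refused|reset|timeout)' || true
}
```

A run passes only when both counters are 0, for example:
pgbench ... 2>&1 | count_failed_tx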

  veza-backend-api/internal/config/config.go
    Comment block above DATABASE_URL load — documents the prod
    wiring (DATABASE_URL points at pgaf-pgbouncer.lxd:6432, NOT
    at pgaf-primary directly). Also notes the dev/CI exception:
    direct Postgres because the small scale doesn't benefit from
    pooling and tests occasionally lean on session-scoped GUCs
    that transaction-mode would break.
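
To illustrate why session-scoped GUCs break under transaction
pooling: a plain SET applies to whichever server connection happens
to serve that statement, and the next transaction may land on a
different one. A setting that must hold for a unit of work can be
scoped to the transaction with SET LOCAL instead. The sketch below
only builds such a script; the function name tx_scoped is
hypothetical, and nothing here is taken from the Veza test suite.

```shell
#!/usr/bin/env bash
# Sketch: emit a single-transaction SQL script that applies a setting
# with SET LOCAL, so the setting ends when the transaction ends and
# never depends on which pooled server connection is used next.

# tx_scoped NAME VALUE QUERY: print BEGIN / SET LOCAL / QUERY / COMMIT.
# VALUE is inserted verbatim, so quote string values yourself ('5s').
tx_scoped() {
  printf 'BEGIN;\nSET LOCAL %s = %s;\n%s\nCOMMIT;\n' "$1" "$2" "$3"
}
```

The generated script can be piped to psql on the PgBouncer port, for
example: tx_scoped statement_timeout "'5s'" 'SELECT 1;' | psql -X -h pgaf-pgbouncer.lxd -p 6432 -U veza veza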

Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml          ← clean
  $ bash -n infra/ansible/tests/test_pgbouncer_load.sh
  syntax OK
  $ cd veza-backend-api && go build ./...
  (clean — comment-only change in config.go)
  $ gofmt -l internal/config/config.go
  (no output — clean)

Real apply + pgbench run requires the lab R720 + the
community.general collection — operator's call.

Out of scope (deferred per ROADMAP §2):
  - HA pgbouncer (single instance per env at v1.0; double
    instance + keepalived in v1.1 if needed)
  - pg_autoctl state-change hook → pgbouncer RELOAD (W2 day 8)
  - Prometheus pgbouncer_exporter (W2 day 9 with the OTel
    collector + observability stack)

SKIP_TESTS=1 — IaC YAML + bash + Go comment-only diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 18:35:05 +02:00


#!/usr/bin/env bash
# test_pgbouncer_load.sh — exercise PgBouncer with 500 concurrent
# clients × 30s, fail unless every connection lands and stays
# under the query_wait_timeout ceiling.
#
# v1.0.9 Day 7 acceptance for ROADMAP_V1.0_LAUNCH.md §Semaine 2:
# "pgbench 500 clients × 30s sans erreur de connexion"
# (pgbench, 500 clients × 30s, with no connection errors).
#
# Usage:
# bash infra/ansible/tests/test_pgbouncer_load.sh
#
# Env overrides:
# PGBOUNCER_HOST default: pgaf-pgbouncer.lxd
# PGBOUNCER_PORT default: 6432
# PGBOUNCER_DB default: veza
# PGBOUNCER_USER default: veza
# PGBENCH_CLIENTS default: 500
# PGBENCH_DURATION default: 30
# PGBENCH_THREADS default: 8
#
# Exit codes:
# 0 — pgbench completed clean (no connection errors, no aborts)
# 1 — pgbench reported errors during the run
# 2 — pgbouncer not reachable
# 3 — required tool missing on host
set -euo pipefail
PGBOUNCER_HOST=${PGBOUNCER_HOST:-pgaf-pgbouncer.lxd}
PGBOUNCER_PORT=${PGBOUNCER_PORT:-6432}
PGBOUNCER_DB=${PGBOUNCER_DB:-veza}
PGBOUNCER_USER=${PGBOUNCER_USER:-veza}
PGBENCH_CLIENTS=${PGBENCH_CLIENTS:-500}
PGBENCH_DURATION=${PGBENCH_DURATION:-30}
PGBENCH_THREADS=${PGBENCH_THREADS:-8}
log() { printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*" >&2; }
# fail MESSAGE [EXIT_CODE]: log only the message ($1), not the exit code.
fail() { log "FAIL: $1"; exit "${2:-1}"; }
require() { command -v "$1" >/dev/null 2>&1 || fail "missing tool: $1" 3; }
require pgbench
require psql
require awk
# 0. Reachability — PgBouncer alive on listen_addr:listen_port.
log "step 0: probing pgbouncer at ${PGBOUNCER_HOST}:${PGBOUNCER_PORT}"
if ! psql "host=${PGBOUNCER_HOST} port=${PGBOUNCER_PORT} dbname=${PGBOUNCER_DB} user=${PGBOUNCER_USER} connect_timeout=5" -c 'select 1' >/dev/null 2>&1; then
  fail "pgbouncer not reachable (or app db ${PGBOUNCER_DB} not provisioned). Check the pgbouncer service + the formation primary." 2
fi
# 1. pgbench fixture — initialise the standard pgbench tables ONCE
# before the load run. The init connects through pgbouncer too,
# which incidentally checks transaction-mode compatibility.
log "step 1: initialising pgbench fixture (scale=10)"
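# The database name is passed positionally: pgbench releases before
# PostgreSQL 17 treat -d as --debug rather than --dbname.
if ! pgbench -h "${PGBOUNCER_HOST}" -p "${PGBOUNCER_PORT}" -U "${PGBOUNCER_USER}" -i -s 10 --no-vacuum "${PGBOUNCER_DB}" 2>&1 | tail -20 >&2; then
  fail "pgbench -i failed: check pgbouncer auth / pool_mode" 1
fi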
# 2. Load run.
log "step 2: pgbench ${PGBENCH_CLIENTS} clients × ${PGBENCH_DURATION}s × ${PGBENCH_THREADS} threads"
# The database name is passed positionally: pgbench releases before
# PostgreSQL 17 treat -d as --debug. The exit status is captured so
# that set -e cannot abort the script before the output is printed.
rc=0
out=$(pgbench \
  -h "${PGBOUNCER_HOST}" \
  -p "${PGBOUNCER_PORT}" \
  -U "${PGBOUNCER_USER}" \
  -c "${PGBENCH_CLIENTS}" \
  -j "${PGBENCH_THREADS}" \
  -T "${PGBENCH_DURATION}" \
  --no-vacuum \
  -P 5 \
  -r \
  "${PGBOUNCER_DB}" 2>&1) || rc=$?
echo "$out" | sed 's/^/ /' >&2
# pgbench reports "number of failed transactions: N (X.XX%)"; anything
# > 0 fails the test. Also catch outright "connection refused" errors
# from the runner output, and any non-zero pgbench exit status.
failed_tx=$(echo "$out" | awk '/number of failed transactions:/ { print $5; exit }' | tr -d ',()')
failed_tx=${failed_tx:-0}
conn_errors=$(echo "$out" | grep -ciE 'connection (refused|reset|timeout)' || true)
log "verdict: rc=${rc} failed_tx=${failed_tx} conn_errors=${conn_errors}"
if [ "${rc}" -ne 0 ] || [ "${failed_tx}" != "0" ] || [ "${conn_errors}" -gt 0 ]; then
  fail "pgbench surfaced errors: pool sizing, query_wait_timeout, or upstream is the bottleneck"
fi
log "PASS: pgbench ${PGBENCH_CLIENTS} clients × ${PGBENCH_DURATION}s clean"
exit 0