veza/docs/runbooks/db-failover.md

# Runbook — Postgres failover (pg_auto_failover)

Alerts: `PostgresPrimaryUnreachable`, `PostgresReplicationLagHigh` · also reached from api-availability-slo-burn.md and api-latency-slo-burn.md. Owner: infra on-call.

## Topology recap

```
┌─────────────────┐
│  pgaf-monitor   │  ← state machine; assigns primary/standby roles
└────────┬────────┘
         │ pg_auto_failover protocol
         │
   ┌─────┴─────┐
   │           │
┌──▼────┐  ┌───▼────┐
│ pgaf- │  │ pgaf-  │
│primary│  │replica │
└───────┘  └───────┘
```

PgBouncer (pgaf-pgbouncer, port 6432) sits in front of whichever node is currently primary. The backend reads `DATABASE_URL` from its environment, and that URL already points at the bouncer.
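
For reference, a `DATABASE_URL` of this shape keeps every backend connection behind the bouncer; the user, password, database name, and sslmode below are placeholders, only the host and port come from this setup:

```bash
# Hypothetical example; only pgaf-pgbouncer.lxd:6432 is the real routing detail.
export DATABASE_URL="postgres://app_user:REDACTED@pgaf-pgbouncer.lxd:6432/app_db?sslmode=disable"
```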

What "failover" looks like

- Primary disappears (crash, host reboot, manual `incus stop`).
- The monitor notices within `pgaf_health_check_interval` (~10s).
- After `pgaf_failover_timeout` (60s), the monitor promotes the replica to primary.
- PgBouncer is reconfigured by the monitor's notify hook; new connections go to the new primary.

Expected RTO is ~60 seconds. RPO ≈ 0 if synchronous replication was caught up; up to one transaction if replication was async.
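
To watch the promotion in real time (and confirm the ~60 s expectation), something like the following can be run from any node; the watch interval is arbitrary and the `--count` flag for `pg_autoctl show events` is an assumption to check against the installed version:

```bash
# Poll the formation every 2s; roles move through intermediate states before
# settling back on one primary and one secondary.
watch -n 2 'sudo -u postgres pg_autoctl show state'

# Recent orchestration decisions as seen by the monitor.
sudo -u postgres pg_autoctl show events --count 20
```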

## Diagnose state

```bash
# From any node:
sudo -u postgres pg_autoctl show state

# Look for one node with state="primary" and one with state="secondary".
# If both are "wait_for_primary", the formation is wedged.

# Connection-level test (does the bouncer route to a live primary?):
psql "$DATABASE_URL" -c "SELECT now(), pg_is_in_recovery();"
# pg_is_in_recovery = false ⇒ you're hitting the primary
```
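
During an incident it helps to keep that connection-level test running in a loop; a minimal sketch (the 5 s interval is arbitrary):

```bash
# Probe the bouncer every 5s; prints a timestamp and the recovery flag while a
# primary is reachable, and a warning line while it is not.
while true; do
  psql "$DATABASE_URL" -XtAc "SELECT now(), pg_is_in_recovery();" \
    || echo "$(date -Is) no primary reachable via the bouncer"
  sleep 5
done
```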

## Common failure modes

### A. Monitor is up, primary is down, replica didn't get promoted

Either `pgaf_failover_timeout` hasn't elapsed yet (wait 60s) or the replica is too far behind to be promoted safely.

```bash
# On the replica:
sudo -u postgres pg_autoctl show state
# Check the LSN distance; if it is > 1 MB the monitor will refuse to promote.
```

If the monitor refused, a manual promotion is possible (only if you accept potential data loss):

```bash
sudo -u postgres pg_autoctl perform failover --formation default --group 0
```
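
After a manual promotion, confirm that the formation settled and that sessions through the bouncer land on a writable node; this only reuses commands already shown above:

```bash
# The surviving node should now be reported as primary.
sudo -u postgres pg_autoctl show state

# Through the bouncer: pg_is_in_recovery() must be false.
psql "$DATABASE_URL" -c "SELECT pg_is_in_recovery();"
```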

### B. Monitor itself is down

The data nodes keep serving their last-known role until the monitor returns. Reads keep working from the standby. No automatic failover happens without the monitor — start it before doing anything else.

```bash
sudo systemctl start pg_autoctl@monitor
sudo journalctl -u pg_autoctl@monitor -n 200 --no-pager
```
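
Once the unit is back, check that the monitor actually accepts connections before expecting any orchestration; the hostname pgaf-monitor.lxd and port 5432 are assumptions based on the node naming above:

```bash
# Monitor reachable? (hostname and port are assumed; adjust to your monitor node)
pg_isready -h pgaf-monitor.lxd -p 5432

# The monitor should again report state for both data nodes.
sudo -u postgres pg_autoctl show state
```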

### C. Both data nodes are down (catastrophe)

Restore from pgBackRest. See the dr-drill runbook in docs/archive/ (or the pgbackrest role README) for the manual procedure. Estimated RTO is ~30 min with a full + diff backup already on MinIO.
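
Before committing to a restore, confirm that a usable backup set actually exists on MinIO; the stanza name below is a placeholder, the real one lives in the pgbackrest role:

```bash
# List available full/diff/incr backups and their timestamps (stanza name is hypothetical).
sudo -u postgres pgbackrest --stanza=veza info
```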

## Connection routing

PgBouncer holds the routing decision, so during a failover:

```bash
# Confirm which Postgres backend is currently behind the bouncer:
psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer -c "SHOW SERVERS;"
```

If the bouncer is still pointing at the dead primary:

```bash
# Reload the bouncer config (the pg_auto_failover monitor's
# `host_change_hook.sh` should have done this automatically; if not,
# something is broken):
sudo systemctl reload pgbouncer
```
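
After the reload, verify that the bouncer's server connections now point at the new primary and that a session through it is writable; both commands were already used earlier in this runbook:

```bash
# SHOW SERVERS should now list the new primary's address.
psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer -c "SHOW SERVERS;"

# A session through the bouncer must land on a non-recovery (writable) node.
psql "$DATABASE_URL" -c "SELECT pg_is_in_recovery();"
```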

## Backend behavior during failover

The backend's GORM connection pool drops dead connections lazily. Expect a few hundred 5xx responses during the 30-60s window; this trips `APIAvailabilitySLOFastBurn`. The alert clears once the pool refills.

## After recovery

1. Re-add the failed node as standby (a hedged sketch of typical flags follows this list):
   `sudo -u postgres pg_autoctl create postgres ...`
2. Wait for `pg_autoctl show state` to show two healthy nodes.
3. Run the next dr-drill cycle to validate backups against the new primary.
4. Postmortem if downtime > 5 min.
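
A sketch of what step 1 usually looks like; every concrete value (PGDATA path, Postgres version, auth method, monitor URI, systemd unit name) is an assumption to replace with the real ones from the pgaf role:

```bash
# Run on the recovered node. All paths and URIs below are placeholders.
sudo -u postgres pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/16/main \
  --hostname "$(hostname -f)" \
  --auth scram-sha-256 \
  --ssl-self-signed \
  --monitor "postgres://autoctl_node@pgaf-monitor.lxd/pg_auto_failover"

# Start the node's keeper service so it registers as a standby
# (unit instance name assumed; this runbook only shows pg_autoctl@monitor).
sudo systemctl start pg_autoctl@postgres
```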