Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology):
* SLO_API_AVAILABILITY: 99.5% success on read (GET) endpoints
* SLO_API_LATENCY: 99% of writes with p95 < 500ms
* SLO_PAYMENT_SUCCESS: 99.5% of POST /api/v1/orders -> 2xx
Each SLO has two alerts (expression shape sketched below):
* <name>SLOFastBurn — page-grade, 2% of budget burned in 1h (1h+5m windows)
* <name>SLOSlowBurn — ticket-grade, 5% of budget burned in 6h (6h+30m windows)
- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack
  + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists first moves under 5 min + common causes.
Acceptance (Day 10): promtool check rules passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
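For reference, the fast-burn condition follows the workbook's two-window form. A minimal sketch of the availability variant as an ad-hoc query, assuming a generic `http_requests_total` metric and a local Prometheus (the real rules in `slo.yml` use their own recorded series):

```bash
# Ad-hoc check of the fast-burn condition. 14.4 is the burn rate that
# spends 2% of a 30-day budget in 1h (0.02 * 720h / 1h); metric and
# label names here are assumptions, not the recorded series from slo.yml.
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=
  (sum(rate(http_requests_total{method="GET",code=~"5.."}[1h]))
     / sum(rate(http_requests_total{method="GET"}[1h]))) > (14.4 * 0.005)
and
  (sum(rate(http_requests_total{method="GET",code=~"5.."}[5m]))
     / sum(rate(http_requests_total{method="GET"}[5m]))) > (14.4 * 0.005)'
```

The slow-burn variant is the same shape with 6h/30m windows and a factor of 6 (0.05 × 720h / 6h).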
# Runbook — Postgres failover (`pg_auto_failover`)

> **Alerts**: `PostgresPrimaryUnreachable`, `PostgresReplicationLagHigh` · also reached from `api-availability-slo-burn.md` and `api-latency-slo-burn.md`.

> **Owner**: infra on-call.

## Topology recap

```
          ┌──────────────┐
          │ pgaf-monitor │ ← state machine; assigns primary/standby roles
          └──────┬───────┘
                 │ pg_auto_failover protocol
                 │
          ┌──────┴──────┐
          │             │
    ┌─────▼─────┐ ┌─────▼─────┐
    │   pgaf-   │ │   pgaf-   │
    │  primary  │ │  replica  │
    └───────────┘ └───────────┘
```

PgBouncer (`pgaf-pgbouncer`, port 6432) sits in front of whichever node is currently primary. The backend reads `DATABASE_URL` from its environment, and that URL already points at the bouncer.

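A quick sanity check that the app really is routed through the bouncer; the hostname pattern below is an assumption taken from this topology, so verify it against the real env file:

```bash
# Hypothetical check: DATABASE_URL should name the bouncer, not a data node.
echo "$DATABASE_URL" | grep -q 'pgaf-pgbouncer.*:6432' \
  && echo "OK: routed via PgBouncer" \
  || echo "WARNING: backend may be bypassing the bouncer"
```
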
## What "failover" looks like

- Primary disappears (crash, host reboot, manual `incus stop`).
- The monitor notices within `pgaf_health_check_interval` (~10s).
- After `pgaf_failover_timeout` (60s), the monitor promotes the replica to primary.
- PgBouncer is reconfigured by the monitor's notify hook; new connections go to the new primary.

**Expected RTO is ~60 seconds.** RPO ≈ 0 if synchronous replication was caught up; up to one transaction if async.

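A minimal sketch for observing the failover window from a shell during a drill (nothing here is deployment-specific beyond `DATABASE_URL`):

```bash
# Poll through the bouncer once a second; the stretch of failures
# approximates the real RTO as seen by clients.
while true; do
  ts=$(date +%T)
  if out=$(psql "$DATABASE_URL" -XtAc "SELECT pg_is_in_recovery();" 2>/dev/null); then
    echo "$ts up (in_recovery=$out)"
  else
    echo "$ts DOWN"
  fi
  sleep 1
done
```
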

## Diagnose state

```bash
# From any node:
sudo -u postgres pg_autoctl show state

# Look for one node with state="primary" and one with state="secondary".
# If both are "wait_for_primary", the formation is wedged.

# Connection-level test (does the bouncer route to a live primary?):
psql "$DATABASE_URL" -c "SELECT now(), pg_is_in_recovery();"
# pg_is_in_recovery = false ⇒ you're hitting the primary
```

## Common failure modes

### A. Monitor is up, primary is down, replica didn't get promoted

Either `pgaf_failover_timeout` hasn't elapsed yet (wait 60s) **or** the replica is too far behind to be promoted safely.

```bash
# On the replica:
sudo -u postgres pg_autoctl show state
# Check the LSN distance — if it's > 1MB, the monitor will refuse to promote.
```

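With the primary down, the replica's received WAL position is the best available lag signal. A hedged spot check, assuming direct psql access on the replica:

```bash
# Bytes between what the replica has received and what it has replayed
# (0 or NULL means everything received has been applied).
sudo -u postgres psql -XtAc \
  "SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn());"
```
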

If the monitor refused, you can force a manual promotion (only if you accept potential data loss):

```bash
sudo -u postgres pg_autoctl perform failover --formation default --group 0
```


### B. Monitor itself is down

The data nodes keep serving their last-known roles until the monitor returns. Reads keep working from the standby. **No automatic failover happens** without the monitor, so start it before doing anything else:

```bash
sudo systemctl start pg_autoctl@monitor
sudo journalctl -u pg_autoctl@monitor -n 200 --no-pager
```

### C. Both data nodes are down (catastrophe)
|
|
|
|
Restore from pgBackRest. See the dr-drill runbook in `docs/archive/` (or the `pgbackrest` role README) for the manual procedure. **Estimated RTO ~30 min** with a full+diff already on MinIO.
|
|
|
|
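Before kicking off a restore, confirm what the repository actually holds. A minimal look, assuming the stanza is named `main` (an assumption; check the `pgbackrest` role defaults):

```bash
# List backup sets with their types (full/diff/incr) and timestamps.
# The stanza name below is a placeholder; copy the real one from the role.
sudo -u postgres pgbackrest --stanza=main info
```
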
## Connection routing

PgBouncer holds the routing decision, so during a failover:

```bash
# Confirm which Postgres backend is currently behind the bouncer:
psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer -c "SHOW SERVERS;"
```


If the bouncer is still pointing at the dead primary:

```bash
# Reload the bouncer config (the pg_auto_failover monitor's
# `host_change_hook.sh` should have done this automatically — if not,
# something is broken):
sudo systemctl reload pgbouncer
```

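If a reload still doesn't move traffic, inspect what the hook actually wrote. The config path below is the common packaged default and is an assumption about this deployment:

```bash
# Which host does the bouncer's [databases] section point at right now?
grep -n 'host=' /etc/pgbouncer/pgbouncer.ini
```
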

## Backend behavior during failover

The backend's GORM connection pool drops dead connections lazily. Expect a few hundred 5xx responses during the 30-60s window; this trips `APIAvailabilitySLOFastBurn`. The alert clears once the pool refills.

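During the window you can confirm the page is the expected one rather than something new; Alertmanager's v2 API lists currently firing alerts (the host below is a placeholder):

```bash
# Is APIAvailabilitySLOFastBurn the only thing firing?
curl -s 'http://localhost:9093/api/v2/alerts?active=true' \
  | jq -r '.[].labels.alertname' | sort | uniq -c
```
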
## After recovery

1. Re-add the failed node as standby (a hypothetical expanded form of the command is sketched after this list):

```bash
sudo -u postgres pg_autoctl create postgres ...
```

2. Wait for `pg_autoctl show state` to show two healthy nodes.
3. Run the next dr-drill cycle to validate backups against the new primary.
4. Postmortem if downtime exceeded 5 min.

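Every value below is a placeholder to be replaced from the `pgaf` role/inventory; this shows the general shape only:

```bash
# Placeholders throughout; do not paste as-is.
sudo -u postgres pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/14/main \
  --hostname pgaf-replica.lxd \
  --monitor 'postgres://autoctl_node@pgaf-monitor.lxd:5432/pg_auto_failover'
```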