veza/docs/runbooks/db-failover.md

# Runbook — Postgres failover (`pg_auto_failover`)

> **Alerts**: `PostgresPrimaryUnreachable`, `PostgresReplicationLagHigh` · also reached from `api-availability-slo-burn.md` and `api-latency-slo-burn.md`.
> **Owner**: infra on-call.

## Topology recap

```
┌─────────────────┐
│  pgaf-monitor   │  ← state machine; assigns primary/standby roles
└────────┬────────┘
         │ pg_auto_failover protocol
         │
   ┌─────┴─────┐
   │           │
┌──▼────┐  ┌───▼────┐
│ pgaf- │  │ pgaf-  │
│primary│  │replica │
└───────┘  └────────┘
```

PgBouncer (`pgaf-pgbouncer`, port 6432) sits in front of whichever node is currently primary. The backend reads `DATABASE_URL` from its environment, and that URL already points at the bouncer.

## What "failover" looks like

- The primary disappears (crash, host reboot, manual `incus stop`).
- The monitor notices within `pgaf_health_check_interval` (~10s).
- After `pgaf_failover_timeout` (60s), the monitor promotes the replica to primary.
- The monitor's notify hook reconfigures PgBouncer; new connections go to the new primary.

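**Expected RTO is ~60 seconds.** RPO ≈ 0 if synchronous replication was caught up. With asynchronous replication, any transactions committed on the old primary but not yet received by the replica are lost, so RPO is bounded by the replication lag at the moment of failure.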
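
To check the RTO figure during a drill, one option is to poll the bouncer until it answers from a node that is not in recovery and report how long that took. The sketch below is a minimal illustration, not part of the existing tooling: the function name `time_until_primary` and the timeout and interval values are hypothetical, and it assumes the probe command prints a bare `f` when it reaches a primary (which is what `psql -At` prints for `pg_is_in_recovery()` on a primary).

```bash
# time_until_primary TIMEOUT INTERVAL CMD [ARGS...]
# Re-runs CMD every INTERVAL seconds until it prints "f" (not in recovery).
# Prints the elapsed seconds on success; returns 1 once TIMEOUT seconds pass.
# Hypothetical helper, shown for illustration only.
time_until_primary() {
  timeout=$1
  interval=$2
  shift 2
  start=$(date +%s)
  while :; do
    # A failed connection prints nothing, so it counts as "not primary yet".
    if [ "$("$@" 2>/dev/null)" = "f" ]; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Example call during a drill (values are illustrative):
#   time_until_primary 120 2 psql "$DATABASE_URL" -Atc "SELECT pg_is_in_recovery()"
```

Start it right after stopping the primary; the number it prints approximates the client-visible outage, which can be compared with the ~60 second expectation.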

## Diagnose state

```bash
# From any node:
sudo -u postgres pg_autoctl show state

# Look for one node with state="primary" and one with state="secondary".
# If both are "wait_for_primary" the formation is wedged.

# Connection-level test (does the bouncer route to a live primary?):
psql "$DATABASE_URL" -c "SELECT now(), pg_is_in_recovery();"
# pg_is_in_recovery = false ⇒ you're hitting the primary
```
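
The "one primary, one secondary" check can also be scripted. The sketch below is a rough illustration with a hypothetical function name, `formation_healthy`. It assumes the plain-text table printed by `pg_autoctl show state` is pipe-separated and that its last two columns are the reported and assigned states; that layout may differ between pg_auto_failover versions, so verify it against real output before relying on it.

```bash
# formation_healthy: reads `pg_autoctl show state` output on stdin.
# Exits 0 only if exactly one node is reported AND assigned "primary"
# and at least one node is reported AND assigned "secondary".
# Assumes the last two pipe-separated columns are the two state columns.
formation_healthy() {
  awk -F'|' '
    NF >= 2 {
      reported = $(NF - 1)
      assigned = $NF
      gsub(/[ \t]/, "", reported)
      gsub(/[ \t]/, "", assigned)
      if (reported == "primary" && assigned == "primary") primaries++
      if (reported == "secondary" && assigned == "secondary") secondaries++
    }
    END { exit !(primaries == 1 && secondaries >= 1) }
  '
}

# Example call:
#   sudo -u postgres pg_autoctl show state | formation_healthy && echo "formation OK"
```

The header and separator rows never match either state name, so they are ignored without special handling.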

## Common failure modes

### A. Monitor is up, primary is down, replica didn't get promoted

Either `pgaf_failover_timeout` hasn't elapsed yet (wait 60s) **or** the replica is too far behind to be safe.

```bash
# On the replica:
sudo -u postgres pg_autoctl show state
# Check the LSN distance: if it's > 1MB, the monitor will refuse to promote.
```
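
LSNs are printed as two hexadecimal halves (`high/low`), so the distance is not obvious at a glance. The following sketch converts two LSNs to byte offsets and compares their difference with the 1MB figure quoted above. The function names are hypothetical, and the threshold default simply mirrors this runbook's number rather than reading the monitor's actual setting.

```bash
# lsn_to_bytes LSN: convert an LSN such as 0/3000110 to a byte offset.
# The part before the slash is the high 32 bits, the part after is the low 32.
lsn_to_bytes() {
  hi=${1%%/*}
  lo=${1##*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# lsn_distance NEWER OLDER: byte distance between two LSNs.
lsn_distance() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# lsn_lag_too_high PRIMARY_LSN REPLICA_LSN [THRESHOLD_BYTES]
# Succeeds when the replica is more than THRESHOLD_BYTES behind
# (default 1048576, the 1MB figure used in this runbook).
lsn_lag_too_high() {
  [ "$(lsn_distance "$1" "$2")" -gt "${3:-1048576}" ]
}

# Example call:
#   lsn_lag_too_high 0/3200000 0/3000000 && echo "replica too far behind"
```

When a live server is reachable, the built-in SQL function `pg_wal_lsn_diff()` computes the same difference.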

If the monitor refused, promote manually (only if you accept potential data loss):

```bash
sudo -u postgres pg_autoctl perform failover --formation default --group 0
```

### B. Monitor itself is down

The data nodes keep serving their last-known role until the monitor returns. Reads keep working from the standby. **No automatic failover happens** without the monitor — start it before doing anything else.

```bash
sudo systemctl start pg_autoctl@monitor
sudo journalctl -u pg_autoctl@monitor -n 200 --no-pager
```

### C. Both data nodes are down (catastrophe)

Restore from pgBackRest. See the dr-drill runbook in `docs/archive/` (or the `pgbackrest` role README) for the manual procedure. **Estimated RTO ~30 min** with a full+diff already on MinIO.

## Connection routing

PgBouncer holds the routing decision, so check it first during a failover:

```bash
# Confirm which Postgres backend is currently behind the bouncer:
psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer -c "SHOW SERVERS;"
```

If the bouncer is still pointing at the dead primary:

```bash
# Reload the bouncer config. The pg_auto_failover monitor's
# `host_change_hook.sh` should have done this automatically; if it did not,
# something is broken.
sudo systemctl reload pgbouncer
```

## Backend behavior during failover

The backend's GORM connection pool drops dead connections lazily, so requests keep landing on stale connections until each one fails and is replaced. Expect a few hundred 5xx during the 30-60s window; this trips `APIAvailabilitySLOFastBurn`. The alert clears once the pool refills.

## After recovery

1. Re-add the failed node as standby:
   ```bash
   sudo -u postgres pg_autoctl create postgres ...
   ```
2. Wait for `pg_autoctl show state` to show two healthy nodes.
3. Run the next dr-drill cycle to validate backups against the new primary.
4. Write a postmortem if downtime exceeded 5 min.