veza/docs/runbooks/db-failover.md

# Runbook — Postgres failover (pg_auto_failover)

Alerts: `PostgresPrimaryUnreachable`, `PostgresReplicationLagHigh` · also reached from api-availability-slo-burn.md and api-latency-slo-burn.md. Owner: infra on-call.

## Topology recap

```
┌─────────────────┐
│  pgaf-monitor   │  ← state machine; assigns primary/standby roles
└────────┬────────┘
         │ pg_auto_failover protocol
         │
   ┌─────┴─────┐
   │           │
┌──▼────┐  ┌───▼────┐
│ pgaf- │  │ pgaf-  │
│primary│  │replica │
└───────┘  └───────┘
```

PgBouncer (pgaf-pgbouncer, port 6432) sits in front of whichever node is currently primary. The backend reads `DATABASE_URL` from its environment, and that URL already points at the bouncer.
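
For reference, a `DATABASE_URL` of this shape keeps every backend connection behind the bouncer; the user, password, database name, and sslmode below are placeholders, only the host and port come from this setup:

```bash
# Hypothetical example; only pgaf-pgbouncer.lxd:6432 is the real routing detail.
export DATABASE_URL="postgres://app_user:REDACTED@pgaf-pgbouncer.lxd:6432/app_db?sslmode=disable"
```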

What "failover" looks like

- Primary disappears (crash, host reboot, manual `incus stop`).
- The monitor notices within `pgaf_health_check_interval` (~10s).
- After `pgaf_failover_timeout` (60s), the monitor promotes the replica to primary.
- PgBouncer is reconfigured by the monitor's notify hook; new connections go to the new primary.

Expected RTO is ~60 seconds. RPO ≈ 0 if synchronous replication was caught up; up to one transaction if replication was async.
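
To watch the promotion in real time (and confirm the ~60 s expectation), something like the following can be run from any node; the watch interval is arbitrary and the `--count` flag for `pg_autoctl show events` is an assumption to check against the installed version:

```bash
# Poll the formation every 2s; roles move through intermediate states before
# settling back on one primary and one secondary.
watch -n 2 'sudo -u postgres pg_autoctl show state'

# Recent orchestration decisions as seen by the monitor.
sudo -u postgres pg_autoctl show events --count 20
```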

## Diagnose state

```bash
# From any node:
sudo -u postgres pg_autoctl show state

# Look for one node with state="primary" and one with state="secondary".
# If both are "wait_for_primary", the formation is wedged.

# Connection-level test (does the bouncer route to a live primary?):
psql "$DATABASE_URL" -c "SELECT now(), pg_is_in_recovery();"
# pg_is_in_recovery = false ⇒ you're hitting the primary
```
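
During an incident it helps to keep that connection-level test running in a loop; a minimal sketch (the 5 s interval is arbitrary):

```bash
# Probe the bouncer every 5s; prints a timestamp and the recovery flag while a
# primary is reachable, and a warning line while it is not.
while true; do
  psql "$DATABASE_URL" -XtAc "SELECT now(), pg_is_in_recovery();" \
    || echo "$(date -Is) no primary reachable via the bouncer"
  sleep 5
done
```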

## Common failure modes

### A. Monitor is up, primary is down, replica didn't get promoted

Either `pgaf_failover_timeout` hasn't elapsed yet (wait 60s) or the replica is too far behind to be promoted safely.

```bash
# On the replica:
sudo -u postgres pg_autoctl show state
# Check the LSN distance; if it is > 1 MB the monitor will refuse to promote.
```

If the monitor refused, a manual promotion is possible (only if you accept potential data loss):

```bash
sudo -u postgres pg_autoctl perform failover --formation default --group 0
```
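
After a manual promotion, confirm that the formation settled and that sessions through the bouncer land on a writable node; this only reuses commands already shown above:

```bash
# The surviving node should now be reported as primary.
sudo -u postgres pg_autoctl show state

# Through the bouncer: pg_is_in_recovery() must be false.
psql "$DATABASE_URL" -c "SELECT pg_is_in_recovery();"
```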

### B. Monitor itself is down

The data nodes keep serving their last-known role until the monitor returns. Reads keep working from the standby. No automatic failover happens without the monitor — start it before doing anything else.

```bash
sudo systemctl start pg_autoctl@monitor
sudo journalctl -u pg_autoctl@monitor -n 200 --no-pager
```
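
Once the unit is back, check that the monitor actually accepts connections before expecting any orchestration; the hostname pgaf-monitor.lxd and port 5432 are assumptions based on the node naming above:

```bash
# Monitor reachable? (hostname and port are assumed; adjust to your monitor node)
pg_isready -h pgaf-monitor.lxd -p 5432

# The monitor should again report state for both data nodes.
sudo -u postgres pg_autoctl show state
```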

### C. Both data nodes are down (catastrophe)

Restore from pgBackRest. See the dr-drill runbook in docs/archive/ (or the pgbackrest role README) for the manual procedure. Estimated RTO is ~30 min with a full + diff backup already on MinIO.
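
Before committing to a restore, confirm that a usable backup set actually exists on MinIO; the stanza name below is a placeholder, the real one lives in the pgbackrest role:

```bash
# List available full/diff/incr backups and their timestamps (stanza name is hypothetical).
sudo -u postgres pgbackrest --stanza=veza info
```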

## Connection routing

PgBouncer holds the routing decision, so during a failover:

```bash
# Confirm which Postgres backend is currently behind the bouncer:
psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer -c "SHOW SERVERS;"
```

If the bouncer is still pointing at the dead primary:

```bash
# Reload the bouncer config (the pg_auto_failover monitor's
# `host_change_hook.sh` should have done this automatically; if not,
# something is broken):
sudo systemctl reload pgbouncer
```
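
After the reload, verify that the bouncer's server connections now point at the new primary and that a session through it is writable; both commands were already used earlier in this runbook:

```bash
# SHOW SERVERS should now list the new primary's address.
psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer -c "SHOW SERVERS;"

# A session through the bouncer must land on a non-recovery (writable) node.
psql "$DATABASE_URL" -c "SELECT pg_is_in_recovery();"
```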

## Backend behavior during failover

The backend's GORM connection pool drops dead connections lazily. Expect a few hundred 5xx responses during the 30-60s window; this trips `APIAvailabilitySLOFastBurn`. The alert clears once the pool refills.

## After recovery

1. Re-add the failed node as standby (a hedged sketch of typical flags follows this list):
   `sudo -u postgres pg_autoctl create postgres ...`
2. Wait for `pg_autoctl show state` to show two healthy nodes.
3. Run the next dr-drill cycle to validate backups against the new primary.
4. Postmortem if downtime > 5 min.
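
A sketch of what step 1 usually looks like; every concrete value (PGDATA path, Postgres version, auth method, monitor URI, systemd unit name) is an assumption to replace with the real ones from the pgaf role:

```bash
# Run on the recovered node. All paths and URIs below are placeholders.
sudo -u postgres pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/16/main \
  --hostname "$(hostname -f)" \
  --auth scram-sha-256 \
  --ssl-self-signed \
  --monitor "postgres://autoctl_node@pgaf-monitor.lxd/pg_auto_failover"

# Start the node's keeper service so it registers as a standby
# (unit instance name assumed; this runbook only shows pg_autoctl@monitor).
sudo systemctl start pg_autoctl@postgres
```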