Database Failover Runbook

This runbook describes the procedure for failing over from a primary PostgreSQL database to a standby replica.

Prerequisites

  • Standby replica configured and synchronized
  • Access to Kubernetes cluster
  • Database credentials in Vault/Secrets
  • Monitoring alerts configured

Detection

Automatic Detection

Monitoring alerts will trigger when:

  • Primary database is unreachable
  • Replication lag exceeds threshold
  • Health checks fail

Manual Detection

# Check primary database status
kubectl exec -it postgres-primary -n veza-production -- pg_isready

# Check replication status from the standby's side
# (pg_stat_replication is only populated on the primary)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"
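
When the primary check is scripted rather than read by eye, the exit code of pg_isready says more than its text output. A minimal sketch, assuming kubectl exec passes the remote exit code through; the function name describe_pg_isready is hypothetical:

```shell
# Translate a pg_isready exit code into a short status message.
# The codes are the ones documented for pg_isready.
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, still starting up)" ;;
    2) echo "no response" ;;
    3) echo "no attempt made (invalid parameters)" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Example (needs cluster access, so it is left commented out):
# kubectl exec postgres-primary -n veza-production -- pg_isready
# describe_pg_isready "$?"
```

A status of "no response" is the case that usually justifies moving on to the failover procedure; "rejecting connections" may only mean the primary is restarting.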

Failover Procedure

Step 1: Verify Standby Status

# Check standby is synchronized
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Verify replication lag
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"

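Expected: lag_seconds is below 60. This value also grows while the primary is idle or unreachable, so read it together with the LSN comparison: matching receive and replay LSNs mean the standby has applied everything it received.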
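
The two checks above can be turned into a go/no-go decision. The sketch below is one possible way to do that, not part of the existing tooling: the helper names (lsn_to_bytes, replay_gap_bytes, lag_within_threshold) are hypothetical, and the inputs are the values printed by the queries in this step.

```shell
# Convert a PostgreSQL LSN such as "0/3000148" to an absolute byte position.
# An LSN is two hexadecimal numbers: the high and low 32 bits of a 64-bit offset.
lsn_to_bytes() (
  hi=${1%%/*}
  lo=${1##*/}
  echo $(( (0x$hi << 32) + 0x$lo ))
)

# Print how many bytes were received ($1) but not yet replayed ($2).
replay_gap_bytes() (
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
)

# Succeed when lag_seconds ($1, may be fractional) is below the threshold ($2).
# An empty value (NULL from PostgreSQL) is treated as a failure.
lag_within_threshold() {
  [ -n "$1" ] || return 1
  awk -v lag="$1" -v max="$2" 'BEGIN { exit !(lag + 0 < max + 0) }'
}
```

For example, `replay_gap_bytes 0/3000148 0/3000060` reports the unreplayed WAL in bytes, and `lag_within_threshold "$lag_seconds" 60` mirrors the 60-second expectation. A gap of zero bytes means the standby has replayed everything it received, regardless of what the timestamp-based lag shows.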

Step 2: Promote Standby to Primary

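# Promote standby (pg_promote() requires PostgreSQL 12 or later; it avoids
# pg_ctl, which needs the data directory and refuses to run as root)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_promote();"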

# Verify promotion
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"

Expected: Returns f (false), meaning the instance is no longer in recovery mode
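
Promotion is not always instantaneous, so the verification query may need to be repeated. A minimal polling sketch, assuming psql is invoked with -tA so that it prints only t or f; wait_for_promotion and standby_in_recovery are hypothetical helper names:

```shell
# Poll until the given command prints "f" (pg_is_in_recovery() = false)
# or the attempt budget runs out.
# $1: maximum attempts, $2: seconds between attempts, rest: command to run.
wait_for_promotion() (
  attempts=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    state=$("$@" 2>/dev/null)
    if [ "$state" = "f" ]; then
      echo "promoted after $i check(s)"
      exit 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "still in recovery after $attempts check(s)" >&2
  exit 1
)

# Hypothetical probe: -tA makes psql print the bare value "t" or "f".
standby_in_recovery() {
  kubectl exec postgres-standby -n veza-production -- \
    psql -U postgres -tAc "SELECT pg_is_in_recovery();"
}
```

A call such as `wait_for_promotion 12 5 standby_in_recovery` would check every 5 seconds for up to a minute. Do not continue to the next step if it exits non-zero.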

Step 3: Update Service Endpoint

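# Update postgres service to select the pod labeled role=primary
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary"}}}'

# Make sure only the promoted standby carries that label
# (assumes the pods are distinguished by a "role" label)
kubectl label pod postgres-primary -n veza-production role-
kubectl label pod postgres-standby -n veza-production role=primary --overwrite

# Or list the standby pod's labels and patch the service selector to match them
kubectl get pod postgres-standby -n veza-production -o jsonpath='{.metadata.labels}' | \
  jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")'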
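
Before restarting the applications, it helps to confirm that the service now resolves to the promoted pod. The following is a sketch under the assumption that the cluster still exposes the Endpoints API for the postgres service; ip_in_list and service_targets_pod are hypothetical helper names:

```shell
# Succeed when the IP in $1 appears in the space-separated list in $2.
ip_in_list() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Compare a Service's endpoint IPs with a pod's IP.
# $1: service name, $2: pod name, $3: namespace.
service_targets_pod() (
  pod_ip=$(kubectl get pod "$2" -n "$3" -o jsonpath='{.status.podIP}')
  endpoint_ips=$(kubectl get endpoints "$1" -n "$3" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  [ -n "$pod_ip" ] && ip_in_list "$pod_ip" "$endpoint_ips"
)
```

For this runbook the call would be `service_targets_pod postgres postgres-standby veza-production`. A non-zero exit means the selector and the pod labels still disagree.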

Step 4: Restart Application Pods

# Restart to pick up new database connection
kubectl rollout restart deployment/veza-backend-api -n veza-production
kubectl rollout restart deployment/veza-chat-server -n veza-production

# Verify pods are healthy
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-chat-server -n veza-production

Step 5: Verify Application Health

# Check recent application logs (add -f to keep following)
kubectl logs deployment/veza-backend-api -n veza-production --tail=100

# Test database connectivity
# (single quotes so that $DATABASE_URL expands inside the pod, not locally)
kubectl exec -it deployment/veza-backend-api -n veza-production -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1;"'

# Check health endpoint
curl https://api.veza.com/health
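
Right after a rollout the health endpoint may fail for a few seconds while pods warm up, so a single curl can give a false negative. A small retry wrapper is one way to handle that; the function name retry and the attempt counts are illustrative assumptions:

```shell
# Run a command until it succeeds, up to $1 attempts,
# sleeping $2 seconds between tries.
retry() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      echo "gave up after $i attempt(s): $*" >&2
      exit 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
)

# Example (needs network access, so it is left commented out):
# retry 10 5 curl -fsS https://api.veza.com/health
```

The -f flag makes curl exit non-zero on an HTTP error status, which is what lets the wrapper tell a failing health check from a passing one.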

Step 6: Set Up New Standby

# Once primary is recovered, set up new standby
# Follow PostgreSQL replication setup guide

Rollback Procedure

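If the failover was triggered in error, or the original primary recovers before the promoted standby has accepted writes, traffic can be routed back to it. Once the promoted standby has accepted writes, the two databases have diverged: reverting discards those writes, and the old primary should instead be rewound or rebuilt as a standby.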

# Stop applications
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# Revert service endpoint
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'

# Restart applications
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
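
The rollback above restores a fixed count of 3 replicas. If the deployment was running at a different scale, recording the count before scaling down avoids guessing. A sketch with hypothetical helper names (current_replicas, scale_down_remembering):

```shell
# Print the current replica count of a deployment.
# $1: deployment name, $2: namespace.
current_replicas() {
  kubectl get deployment "$1" -n "$2" -o jsonpath='{.spec.replicas}'
}

# Scale a deployment to zero and print the previous replica count
# so the caller can restore it later.
scale_down_remembering() (
  previous=$(current_replicas "$1" "$2") || exit 1
  kubectl scale deployment "$1" --replicas=0 -n "$2" >/dev/null || exit 1
  echo "$previous"
)

# Example (needs cluster access, so it is left commented out):
# saved=$(scale_down_remembering veza-backend-api veza-production)
# ...revert the service endpoint...
# kubectl scale deployment veza-backend-api --replicas="$saved" -n veza-production
```

The same pattern applies to veza-chat-server if it should also be stopped during the rollback.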

Verification Checklist

  • Standby promoted successfully
  • Service endpoint updated
  • Application pods restarted
  • Database connectivity verified
  • Application health checks passing
  • No data loss detected
  • Monitoring alerts cleared
  • Documentation updated

Troubleshooting

Standby Not Synchronized

# Check replication status on the standby
# (pg_stat_replication is only populated on the primary)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"

# If replication is broken, rebuild standby
# (See PostgreSQL replication setup guide)

Application Cannot Connect

# Verify service selector
kubectl get service postgres -n veza-production -o yaml

# Check pod labels
kubectl get pod postgres-standby -n veza-production --show-labels

# Verify network connectivity
kubectl run test-connection --rm -it --image=postgres:15-alpine \
  --restart=Never \
  -- psql -h postgres.veza-production.svc.cluster.local -U veza_user -d veza_db

Post-Failover Tasks

  1. Investigate Root Cause

    • Review primary database logs
    • Check system resources
    • Identify failure reason
  2. Set Up New Standby

    • Configure replication from new primary
    • Verify synchronization
    • Update monitoring
  3. Document Incident

    • Record the failover timeline and actions taken
    • Note any issues encountered
    • Update runbook if needed
  4. Notify Stakeholders

    • Send incident report
    • Update status page
    • Schedule post-mortem
