senke/veza

senke 83dfdcd642 [INFRA-010] infra: Set up disaster recovery plan

2025-12-25 21:40:31 +01:00

Database Failover Runbook

This runbook describes the procedure for failing over from a primary PostgreSQL database to a standby replica.

Prerequisites

Standby replica configured and synchronized
Access to Kubernetes cluster
Database credentials in Vault/Secrets
Monitoring alerts configured

Detection

Automatic Detection

Monitoring alerts will trigger when:

Primary database is unreachable
Replication lag exceeds threshold
Health checks fail

Manual Detection

# Check primary database status
kubectl exec -it postgres-primary -n veza-production -- pg_isready

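# Check replication status from the standby side
# (pg_stat_replication is only populated on the sending server, so query the WAL receiver here)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"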
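
The result of `pg_isready` is easiest to act on through its exit status. The helper below is a small sketch that maps the documented `pg_isready` exit codes (0 to 3) to a readable state; the function name `describe_pg_isready` is hypothetical, and the usage comment assumes `kubectl exec` passes the remote exit status through.

```shell
# Map a pg_isready exit status to a human-readable state.
# Codes 0-3 are the ones documented for pg_isready.
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections" ;;
    2) echo "no response" ;;
    3) echo "no attempt made" ;;
    *) echo "unknown status $1" ;;
  esac
}

# Usage (assumption: the exit status of pg_isready is returned by kubectl exec):
#   kubectl exec postgres-primary -n veza-production -- pg_isready
#   describe_pg_isready $?
```

A status of 1 usually means the server is still starting up, which is a different situation from 2 (unreachable) and may not justify a failover.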

Failover Procedure

Step 1: Verify Standby Status

# Check standby is synchronized
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Verify replication lag
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"

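Expected: Lag should be < 60 seconds. The value measures the time since the last replayed transaction, so it also grows while the primary is idle or down.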
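
The two checks in this step can be turned into pass/fail decisions. The sketch below assumes the values were captured with `psql -At` (bare output, no headers); the helper names `lsn_to_bytes`, `lsn_gap`, and `lag_ok` are hypothetical. It converts the two LSNs to byte positions to show how much received WAL is still waiting to be replayed, and compares the lag against the 60-second threshold.

```shell
# Convert a PostgreSQL LSN such as 0/3000060 to an absolute byte position.
# An LSN is two hexadecimal numbers: high 32 bits, a slash, low 32 bits.
lsn_to_bytes() {
  local hi="${1%%/*}" lo="${1##*/}"
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Bytes received (first argument) but not yet replayed (second argument).
lsn_gap() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Succeed only when the lag in seconds (may be fractional) is below the
# threshold (default 60). An empty value, which psql prints for NULL,
# is treated as a failure.
lag_ok() {
  [ -n "$1" ] || return 1
  awk -v lag="$1" -v max="${2:-60}" 'BEGIN { exit !(lag + 0 < max + 0) }'
}

# Usage:
#   lsn_gap 0/3000060 0/3000000    # prints 96
#   lag_ok 12.5 60 && echo "lag within threshold"
```

A gap of 0 means everything the standby has received is already replayed, which is the state to aim for before promoting.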

Step 2: Promote Standby to Primary

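# Promote standby (pg_ctl must run as the postgres OS user and needs PGDATA set or -D <data directory>)
kubectl exec -it postgres-standby -n veza-production -- \
  pg_ctl promote
# On PostgreSQL 12+, running "SELECT pg_promote();" through psql is an alternative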

# Verify promotion
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"

Expected: Returns f (false), meaning the server is no longer in recovery mode
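
Promotion is not always instantaneous, so the verification query may need to be repeated. The following sketch polls until `pg_is_in_recovery()` reports `f`; the function name `wait_for_promotion` and the default attempt count and delay are assumptions, not values from this runbook.

```shell
# Poll the standby until pg_is_in_recovery() reports f, or give up.
# Arguments: max attempts (default 30), seconds between attempts (default 2).
wait_for_promotion() {
  local attempts="${1:-30}" delay="${2:-2}" i=1 state
  while [ "$i" -le "$attempts" ]; do
    # -At prints the bare value (t or f) without headers or alignment
    state=$(kubectl exec postgres-standby -n veza-production -- \
      psql -U postgres -At -c "SELECT pg_is_in_recovery();" 2>/dev/null)
    if [ "$state" = "f" ]; then
      echo "promoted after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still in recovery after $attempts attempts" >&2
  return 1
}

# Usage:
#   wait_for_promotion 30 2 || echo "do not continue to Step 3"
```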

Step 3: Update Service Endpoint

# Point the postgres service at the new primary.
# This selector only matches pods labeled role=primary, so the promoted pod must carry that label.
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary"}}}'

# List the standby pod's labels as key=value pairs to confirm which selector will match it
kubectl get pod postgres-standby -n veza-production -o jsonpath='{.metadata.labels}' | \
  jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")'
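
Writing the patch JSON by hand is error-prone during an incident. As a sketch, the hypothetical helper `selector_patch` below builds the same patch document from `key=value` label pairs, such as those printed by the label listing above. It does no JSON escaping, which is acceptable only because Kubernetes label keys and values are restricted to a safe character set.

```shell
# Build a service selector patch from key=value label pairs.
selector_patch() {
  local pairs="" kv
  for kv in "$@"; do
    # Append "key":"value", comma-separated after the first pair
    pairs="${pairs}${pairs:+,}\"${kv%%=*}\":\"${kv#*=}\""
  done
  printf '{"spec":{"selector":{%s}}}\n' "$pairs"
}

# Usage:
#   selector_patch role=primary
#   kubectl patch service postgres -n veza-production -p "$(selector_patch role=primary)"
#
# Assumption: if the promoted pod does not yet carry the label, it could be added with
#   kubectl label pod postgres-standby role=primary --overwrite -n veza-production
```

Because `kubectl patch` merges the selector map, keys that are not named in the patch stay in place, so check the resulting selector with `kubectl get service postgres -n veza-production -o yaml`.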

Step 4: Restart Application Pods

# Restart to pick up new database connection
kubectl rollout restart deployment/veza-backend-api -n veza-production
kubectl rollout restart deployment/veza-chat-server -n veza-production

# Verify pods are healthy
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-chat-server -n veza-production
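
Restarting and waiting can be combined so that a stalled rollout stops the procedure instead of going unnoticed. This is a minimal sketch; the function name `restart_and_wait` and the 5-minute timeout are assumptions.

```shell
# Restart each named deployment and wait for its rollout to finish.
# Stops at the first deployment that fails to restart or to become ready.
restart_and_wait() {
  local ns="veza-production" d
  for d in "$@"; do
    kubectl rollout restart "deployment/$d" -n "$ns" || return 1
    kubectl rollout status "deployment/$d" -n "$ns" --timeout=5m || {
      echo "rollout of $d did not complete" >&2
      return 1
    }
  done
}

# Usage:
#   restart_and_wait veza-backend-api veza-chat-server
```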

Step 5: Verify Application Health

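# Check recent application logs (add -f to follow)
kubectl logs deployment/veza-backend-api -n veza-production --tail=100

# Test database connectivity (single quotes so $DATABASE_URL expands inside the pod, not locally;
# assumes psql is installed in the application image)
kubectl exec -it deployment/veza-backend-api -n veza-production -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1;"'

# Check health endpoint (-f returns a non-zero exit status on HTTP errors)
curl -fsS https://api.veza.com/health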
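
Right after a restart the health endpoint may fail for a short while, so a single request can give a misleading result. The sketch below retries the request a few times; the function name `wait_healthy` and the default retry count and delay are assumptions.

```shell
# Retry a health URL until curl succeeds, or give up.
# Arguments: URL, max attempts (default 10), seconds between attempts (default 3).
wait_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "unhealthy after $attempts attempts: $url" >&2
  return 1
}

# Usage:
#   wait_healthy https://api.veza.com/health 10 3
```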

Step 6: Set Up New Standby

# Once primary is recovered, set up new standby
# Follow PostgreSQL replication setup guide

Rollback Procedure

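If the failover was triggered in error or the original primary has recovered, revert as follows. Writes accepted by the promoted standby are not present on the old primary, so check for them before switching back: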

# Stop applications
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# Revert service endpoint
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'

# Restart applications
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
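
The rollback above hard-codes a replica count of 3. As a sketch under the assumption that the live count may differ, the hypothetical helpers `save_replicas` and `restore_replicas` record the current counts before scaling down and restore them afterwards.

```shell
# Print "name=replicas" for each deployment so the counts can be restored later.
save_replicas() {
  local d
  for d in "$@"; do
    printf '%s=%s\n' "$d" \
      "$(kubectl get deployment "$d" -n veza-production -o jsonpath='{.spec.replicas}')"
  done
}

# Read "name=replicas" lines on stdin and scale each deployment back.
restore_replicas() {
  local line
  while IFS= read -r line; do
    # </dev/null keeps kubectl from consuming the remaining input lines
    kubectl scale deployment "${line%%=*}" --replicas="${line#*=}" \
      -n veza-production </dev/null
  done
}

# Usage:
#   save_replicas veza-backend-api veza-chat-server > /tmp/replicas.txt
#   ... scale down, revert the service ...
#   restore_replicas < /tmp/replicas.txt
```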

Verification Checklist

Standby promoted successfully
Service endpoint updated
Application pods restarted
Database connectivity verified
Application health checks passing
No data loss detected
Monitoring alerts cleared
Documentation updated

Troubleshooting

Standby Not Synchronized

# Check the WAL receiver on the standby (pg_stat_replication is only populated on the sending server)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"

# If replication is broken, rebuild standby
# (See PostgreSQL replication setup guide)

Application Cannot Connect

# Verify service selector
kubectl get service postgres -n veza-production -o yaml

# Check pod labels
kubectl get pod postgres-standby -n veza-production --show-labels

# Verify network connectivity
kubectl run test-connection --rm -it --image=postgres:15-alpine \
  --restart=Never \
  -- psql -h postgres.veza-production.svc.cluster.local -U veza_user -d veza_db
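
A selector that matches no pod leaves the service without endpoints, which looks like a connection failure from the application side. The sketch below checks for that directly; the function name `service_has_endpoints` is hypothetical.

```shell
# Succeed when the service has at least one ready endpoint address.
service_has_endpoints() {
  local ips
  ips=$(kubectl get endpoints "$1" -n veza-production \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  if [ -n "$ips" ]; then
    echo "endpoints: $ips"
  else
    echo "no ready endpoints for service $1" >&2
    return 1
  fi
}

# Usage:
#   service_has_endpoints postgres || echo "selector and pod labels do not match"
```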

Post-Failover Tasks

Investigate Root Cause
- Review primary database logs
- Check system resources
- Identify failure reason
Set Up New Standby
- Configure replication from new primary
- Verify synchronization
- Update monitoring
Document Incident
- Record the failover timeline and the actions taken
- Note any issues encountered
- Update runbook if needed
Notify Stakeholders
- Send incident report
- Update status page
- Schedule post-mortem
