# Database Failover Runbook

This runbook describes the procedure for failing over from a primary PostgreSQL database to a standby replica.

## Prerequisites

- Standby replica configured and synchronized
- Access to Kubernetes cluster
- Database credentials in Vault/Secrets
- Monitoring alerts configured

## Detection

### Automatic Detection

Monitoring alerts trigger when:

- Primary database is unreachable
- Replication lag exceeds threshold
- Health checks fail

### Manual Detection

```bash
# Check primary database status
kubectl exec -it postgres-primary -n veza-production -- pg_isready

# Check replication status from the standby
# (pg_stat_wal_receiver is the standby-side view; pg_stat_replication is populated on the primary)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"
```

## Failover Procedure

### Step 1: Verify Standby Status

```bash
# Check standby is synchronized (matching LSNs mean all received WAL has been replayed)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Verify replication lag
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"
```

**Expected**: Lag is under 60 seconds

### Step 2: Promote Standby to Primary

```bash
# Promote standby (pg_ctl needs PGDATA set and must run as the postgres OS user)
kubectl exec -it postgres-standby -n veza-production -- \
  pg_ctl promote

# Verify promotion
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"
```

**Expected**: Returns `false` (no longer in recovery mode)

### Step 3: Update Service Endpoint

```bash
# Update postgres service to point to new primary
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary"}}}'

# If the standby pod does not carry the role=primary label, list its labels
# and use a selector that matches it
kubectl get pod postgres-standby -n veza-production -o jsonpath='{.metadata.labels}' | \
  jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")'
```

### Step 4: Restart Application Pods

```bash
# Restart to pick up new database connection
# (backend-api handles chat since v0.502 merge; no separate chat-server deployment)
kubectl rollout restart deployment/veza-backend-api -n veza-production

# Verify pods are healthy
kubectl rollout status deployment/veza-backend-api -n veza-production
```

### Step 5: Verify Application Health

```bash
# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production

# Test database connectivity
# (sh -c makes $DATABASE_URL expand inside the pod, not in your local shell)
kubectl exec -it deployment/veza-backend-api -n veza-production -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1;"'

# Check health endpoint
curl https://api.veza.com/health
```

### Step 6: Set Up New Standby

```bash
# Once primary is recovered, set up new standby
# Follow PostgreSQL replication setup guide
```

## Rollback Procedure

If the failover was triggered in error or the primary recovers:

```bash
# Stop applications
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# Revert service endpoint
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'

# Restart applications
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```

## Verification Checklist

- [ ] Standby promoted successfully
- [ ] Service endpoint updated
- [ ] Application pods restarted
- [ ] Database connectivity verified
- [ ] Application health checks passing
- [ ] No data loss detected
- [ ] Monitoring alerts cleared
- [ ] Documentation updated

## Troubleshooting

### Standby Not Synchronized

```bash
# Check replication status from the standby
# (pg_stat_replication is only populated on the primary)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"

# If replication is broken, rebuild standby
# (See PostgreSQL replication setup guide)
```

### Application Cannot Connect

```bash
# Verify service selector
kubectl get service postgres -n veza-production -o yaml

# Check pod labels
kubectl get pod postgres-standby -n veza-production --show-labels

# Verify network connectivity
kubectl run test-connection --rm -it --image=postgres:15-alpine \
  --restart=Never \
  -- psql -h postgres.veza-production.svc.cluster.local -U veza_user -d veza_db
```

## Post-Failover Tasks

1. **Investigate Root Cause**
   - Review primary database logs
   - Check system resources
   - Identify failure reason

2. **Set Up New Standby**
   - Configure replication from new primary
   - Verify synchronization
   - Update monitoring

3. **Document Incident**
   - Document failover procedure
   - Note any issues encountered
   - Update runbook if needed

4. **Notify Stakeholders**
   - Send incident report
   - Update status page
   - Schedule post-mortem

## References

- [PostgreSQL Replication Documentation](https://www.postgresql.org/docs/current/high-availability.html)
- [Kubernetes Service Documentation](https://kubernetes.io/docs/concepts/services-networking/service/)
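
## Appendix: Scripted Lag Gate (Sketch)

The Step 1 lag check can be turned into a go/no-go gate before promotion. The following is a minimal sketch, not part of the existing tooling: the function names `lag_within_threshold`, `standby_lag_seconds`, and `promotion_gate` are hypothetical, and the 60-second default simply mirrors the expected value in Step 1. It reuses the pod name, namespace, and query from this runbook.

```bash
#!/bin/sh
# Sketch of a pre-promotion gate. All function names here are hypothetical
# helpers for illustration; the threshold default (60) mirrors Step 1.

# Exit 0 if lag (seconds, may be fractional) is numeric and below the threshold.
# Exit 2 if lag is empty or non-numeric (e.g. NULL when nothing was replayed yet).
lag_within_threshold() {
  awk -v l="$1" -v t="${2:-60}" 'BEGIN {
    if (l !~ /^[0-9]+([.][0-9]+)?$/) exit 2
    exit !(l < t)
  }'
}

# Query the standby for its replay lag in seconds (same query as Step 1).
# -A and -t make psql print only the bare value.
standby_lag_seconds() {
  kubectl exec postgres-standby -n veza-production -- \
    psql -U postgres -At -c \
    "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));"
}

# Print a go/no-go decision for a given lag value; return 1 on no-go.
promotion_gate() {
  if lag_within_threshold "$1" "${2:-60}"; then
    echo "OK to promote (lag ${1}s)"
  else
    echo "DO NOT promote (lag '${1}')"
    return 1
  fi
}
```

A possible invocation is `promotion_gate "$(standby_lag_seconds)" && echo "proceed to Step 2"`. Note that this lag value is measured against the wall clock, so it also grows when the primary has been idle or down for a while; treat a no-go as a prompt to compare the LSNs from Step 1 rather than as a final verdict.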