# Cluster Failover Runbook

This runbook describes the procedure for failing over to a disaster recovery (DR) region when the primary cluster is completely unavailable.

## Prerequisites

- DR cluster provisioned and ready
- Backups available in the DR region
- DNS access for failover
- Access to both primary and DR clusters
- Disaster declared and approved

## Pre-Failover Checklist

- [ ] Disaster declared and documented
- [ ] Stakeholders notified
- [ ] DR cluster resources verified
- [ ] Latest backups available in DR
- [ ] DNS access confirmed
- [ ] Team assembled and ready

## Failover Procedure

### Step 1: Verify DR Cluster Status

```bash
# Switch the kubectl context to the DR cluster
kubectl config use-context veza-dr-cluster

# Verify the cluster is healthy
kubectl cluster-info
kubectl get nodes

# Verify the namespaces exist
kubectl get namespaces | grep veza
```

### Step 2: Restore Secrets

```bash
# Restore secrets from Vault or from a backup file

# Option A: From Vault (each field is read directly with `vault kv get -field=...`)
kubectl create secret generic veza-secrets \
  --from-literal=database-url="$(vault kv get -field=database-url secret/veza/production)" \
  --from-literal=jwt-secret="$(vault kv get -field=jwt-secret secret/veza/production)" \
  --from-literal=redis-url="$(vault kv get -field=redis-url secret/veza/production)" \
  -n veza-production \
  --dry-run=client -o yaml | kubectl apply -f -

# Option B: From a backup file
kubectl create secret generic veza-secrets \
  --from-env-file=secrets-backup.env \
  -n veza-production
```

### Step 3: Restore Database

```bash
# 1. Deploy PostgreSQL in the DR cluster
kubectl apply -f k8s/database/postgres-deployment.yaml -n veza-production

# 2. Wait for PostgreSQL to be ready
kubectl wait --for=condition=ready pod \
  -l app=postgres \
  -n veza-production \
  --timeout=300s

# 3.
# Restore from the latest backup

# Get the backup from S3 (or other backup storage)
aws s3 cp s3://veza-backups/postgres/latest.dump /tmp/backup.dump

# Restore the database
# Note: the hostPath volume below refers to /tmp on the *node*, so the dump
# must be present on the node where this pod is scheduled. Also confirm the
# --env values survive the --overrides merge in the final pod spec.
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=..." \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "postgres-restore",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/backup.dump --clean --if-exists"],
      "volumeMounts": [{
        "name": "backup",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup",
      "hostPath": {
        "path": "/tmp"
      }
    }]
  }
}' \
  -n veza-production

# 4. Verify the database restore (replace postgres-pod with the actual pod name)
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"
```

### Step 4: Deploy Applications

```bash
# Deploy the backend API (includes chat since the v0.502 merge)
kubectl apply -f k8s/backend-api/deployment.yaml -n veza-production
kubectl apply -f k8s/backend-api/service.yaml -n veza-production

# Deploy the frontend
kubectl apply -f k8s/frontend/deployment.yaml -n veza-production
kubectl apply -f k8s/frontend/service.yaml -n veza-production

# Deploy the stream server
kubectl apply -f k8s/stream-server/deployment.yaml -n veza-production
kubectl apply -f k8s/stream-server/service.yaml -n veza-production

# Wait for the deployments
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-frontend -n veza-production
kubectl rollout status deployment/veza-stream-server -n veza-production
```

### Step 5: Configure Ingress

```bash
# Deploy the ingress
kubectl apply -f k8s/ingress.yaml -n veza-production

# Verify the ingress
kubectl get ingress -n veza-production
```

### Step 6: Update DNS

```bash
# Get the DR cluster ingress IP
DR_INGRESS_IP=$(kubectl get ingress veza-ingress -n veza-production -o \
jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Update DNS records

# Option A: Using AWS Route53
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.veza.com",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{"Value": "'$DR_INGRESS_IP'"}]
      }
    }]
  }'

# Option B: Using Cloudflare
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
  -H "Authorization: Bearer $CLOUDFLARE_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"content":"'$DR_INGRESS_IP'"}'

# Wait for DNS propagation
dig api.veza.com +short
```

### Step 7: Verify Services

```bash
# Check that all pods are running
kubectl get pods -n veza-production

# Test the health endpoints
curl https://api.veza.com/health
curl https://app.veza.com/health

# Run smoke tests
# (Use your application's test suite)

# Check the application logs
kubectl logs -f deployment/veza-backend-api -n veza-production
```

### Step 8: Restore Redis (if needed)

```bash
# Deploy Redis
kubectl apply -f k8s/redis/deployment.yaml -n veza-production

# Restore the Redis backup if one is available
kubectl cp redis-backup.rdb redis-pod:/data/dump.rdb -n veza-production
kubectl delete pod redis-pod -n veza-production  # Restart so Redis loads the dump
```

## Verification Checklist

- [ ] DR cluster is healthy
- [ ] Secrets restored
- [ ] Database restored and verified
- [ ] All applications deployed
- [ ] Ingress configured
- [ ] DNS updated
- [ ] Health checks passing
- [ ] Smoke tests passing
- [ ] Users can access the platform
- [ ] Monitoring configured

## Post-Failover Tasks

### Immediate (First Hour)

1. **Monitor the Platform**
   - Watch application logs
   - Monitor error rates
   - Check performance metrics

2. **Notify Stakeholders**
   - Send a status update
   - Update the status page
   - Communicate the expected timeline

### Short Term (First Day)

1. **Investigate the Primary Cluster**
   - Assess the damage
   - Identify the root cause
   - Estimate the recovery time

2.
   **Optimize the DR Cluster**
   - Scale resources if needed
   - Optimize configurations
   - Monitor performance

### Long Term (Recovery Phase)

1. **Restore the Primary Cluster**
   - Fix the issues in the primary cluster
   - Restore from backups
   - Verify functionality

2. **Plan Failback**
   - Schedule a maintenance window
   - Prepare the failback procedure
   - Test the failback process

## Failback Procedure

Once the primary cluster is restored:

```bash
# 1. Sync data from DR back to the primary
#    (Use database replication or restore from a DR backup)

# 2. Verify the primary cluster
kubectl config use-context veza-primary-cluster
kubectl get pods -n veza-production

# 3. Update DNS back to the primary
#    (Reverse of Step 6 in the failover procedure)

# 4. Monitor both clusters during the transition

# 5. Once verified, scale down the DR cluster
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production --context=veza-dr-cluster
```

## Troubleshooting

### Database Restore Fails

```bash
# Check the backup file's integrity
pg_restore --list /backups/latest.dump

# Try restoring specific tables
pg_restore -h postgres-service -U veza_user -d veza_db \
  -t users -t tracks /backups/latest.dump

# Check the PostgreSQL logs
kubectl logs postgres-pod -n veza-production
```

### Applications Not Starting

```bash
# Check pod status
kubectl describe pod -n veza-production

# Check the logs
kubectl logs deployment/veza-backend-api -n veza-production

# Verify the secrets
kubectl get secret veza-secrets -n veza-production -o yaml

# Check resource constraints
kubectl top nodes
```

### DNS Not Propagating

```bash
# Check the DNS records
dig api.veza.com +short
nslookup api.veza.com

# Verify the ingress IP
kubectl get ingress veza-ingress -n veza-production

# Check the DNS provider's status
# (AWS Route53, Cloudflare, etc.)
```

## References

- [Database Restore Runbook](./data-restore.md)
- [Kubernetes Multi-Cluster Setup](https://kubernetes.io/docs/setup/)
- [DNS Management Best Practices](../README.md)
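## Appendix: Smoke-Test Helper

The endpoint checks in Step 7 and the Verification Checklist can be folded into a small script that retries each health endpoint until it responds. This is a minimal sketch, assuming `curl` is available on the operator's machine; the retry count and 5-second sleep are illustrative assumptions, and only the two health URLs come from the runbook.

```shell
#!/bin/sh
# Post-failover smoke test (sketch). Endpoints are from Step 7 of this
# runbook; the retry/sleep defaults are assumed, not policy.

# Return success when an HTTP status code counts as healthy (2xx or 3xx).
is_healthy() {
  case "$1" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}

# Poll a URL until it returns a healthy status or retries are exhausted.
check_endpoint() {
  url="$1"
  retries="${2:-5}"
  i=0
  while [ "$i" -lt "$retries" ]; do
    # -w '%{http_code}' prints only the status code; body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" 2>/dev/null)
    if is_healthy "$code"; then
      echo "OK   $url ($code)"
      return 0
    fi
    i=$((i + 1))
    sleep 5
  done
  echo "FAIL $url (last status: ${code:-none})"
  return 1
}

# Run against the real endpoints once DNS has been updated (Step 6):
# check_endpoint https://api.veza.com/health
# check_endpoint https://app.veza.com/health
```

A non-zero exit from `check_endpoint` is the cue to fall back to the pod-status and log checks in Step 7 and the Troubleshooting section.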