# Cluster Failover Runbook
This runbook describes the procedure for failing over to a disaster recovery (DR) region when the primary cluster is completely unavailable.
## Prerequisites
- DR cluster provisioned and ready
- Backups available in DR region
- DNS access for failover
- Access to both primary and DR clusters
- Disaster declared and approved
## Pre-Failover Checklist
- [ ] Disaster declared and documented
- [ ] Stakeholders notified
- [ ] DR cluster resources verified
- [ ] Latest backups available in DR
- [ ] DNS access confirmed
- [ ] Team assembled and ready
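A quick pre-flight sketch that automates the mechanical checks above (it assumes the `veza-dr-cluster` context and the `s3://veza-backups` bucket used later in this runbook):
```bash
#!/usr/bin/env bash
# Pre-flight checks (sketch): fail fast on the first problem.
set -euo pipefail

# DR cluster reachable?
kubectl --context=veza-dr-cluster cluster-info > /dev/null \
  && echo "OK: DR cluster reachable"

# Latest PostgreSQL backup present in the backup bucket?
aws s3 ls s3://veza-backups/postgres/latest.dump > /dev/null \
  && echo "OK: latest backup found"

# DNS tooling available for the failover step?
command -v dig > /dev/null && echo "OK: dig available"
```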
## Failover Procedure
### Step 1: Verify DR Cluster Status
```bash
# Switch kubectl context to DR cluster
kubectl config use-context veza-dr-cluster
# Verify cluster is healthy
kubectl cluster-info
kubectl get nodes
# Verify namespaces exist
kubectl get namespaces | grep veza
```
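If the DR cluster is still scaling up, a blocking wait (a sketch) avoids proceeding against NotReady nodes:
```bash
# Block until every node reports Ready, or fail after 5 minutes.
kubectl wait --for=condition=Ready node --all --timeout=300s
```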
### Step 2: Restore Secrets
```bash
# Restore secrets from Vault or from a backup file
# Option A: From Vault (each field is fetched via command substitution;
# no pipe from vault is needed)
kubectl create secret generic veza-secrets \
--from-literal=database-url="$(vault kv get -field=database-url secret/veza/production)" \
--from-literal=jwt-secret="$(vault kv get -field=jwt-secret secret/veza/production)" \
--from-literal=redis-url="$(vault kv get -field=redis-url secret/veza/production)" \
-n veza-production \
--dry-run=client -o yaml | kubectl apply -f -
# Option B: From backup file
kubectl create secret generic veza-secrets \
--from-env-file=secrets-backup.env \
-n veza-production
```
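Whichever option is used, confirm the secret landed with the expected keys before deploying anything (a sketch that prints key names only, not values):
```bash
# List the keys in veza-secrets without exposing their values.
kubectl get secret veza-secrets -n veza-production \
  -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
```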
### Step 3: Restore Database
```bash
# 1. Deploy PostgreSQL in DR cluster
kubectl apply -f k8s/database/postgres-deployment.yaml -n veza-production
# 2. Wait for PostgreSQL to be ready
kubectl wait --for=condition=ready pod \
-l app=postgres \
-n veza-production \
--timeout=300s
# 3. Restore from latest backup
# Fetch the backup from S3. Note: this lands on the machine running the
# command; the hostPath volume below reads /tmp on the *node* the pod is
# scheduled to, so copy the dump onto that node first (or use the
# kubectl cp alternative shown after this block).
aws s3 cp s3://veza-backups/postgres/latest.dump /tmp/latest.dump
# Restore database. Env vars must live inside --overrides: overriding
# spec.containers discards anything set via --env.
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
--restart=Never \
--overrides='
{
"spec": {
"containers": [{
"name": "postgres-restore",
"image": "postgres:15-alpine",
"command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/latest.dump --clean --if-exists"],
"env": [
{"name": "PGPASSWORD", "value": "..."},
{"name": "POSTGRES_HOST", "value": "postgres-service"},
{"name": "POSTGRES_USER", "value": "veza_user"},
{"name": "POSTGRES_DB", "value": "veza_db"}
],
"volumeMounts": [{
"name": "backup",
"mountPath": "/backups"
}]
}],
"volumes": [{
"name": "backup",
"hostPath": {
"path": "/tmp"
}
}]
}
}' \
-n veza-production
# 4. Verify database restore (resolve the pod by label instead of a
# hardcoded name)
kubectl exec -it -n veza-production \
"$(kubectl get pod -l app=postgres -n veza-production -o name | head -n 1)" -- \
psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"
```
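If staging the dump on a node's /tmp is awkward, an alternative sketch stages it with `kubectl cp` through a throwaway helper pod (the `restore-helper` name is illustrative):
```bash
# Start a helper pod, copy the dump into it, restore, then clean up.
kubectl run restore-helper --image=postgres:15-alpine \
  --restart=Never -n veza-production -- sleep 3600
kubectl wait --for=condition=ready pod/restore-helper \
  -n veza-production --timeout=120s
kubectl cp /tmp/latest.dump veza-production/restore-helper:/tmp/latest.dump
kubectl exec -n veza-production restore-helper -- \
  env PGPASSWORD=... pg_restore -h postgres-service -U veza_user \
  -d veza_db -F c --clean --if-exists /tmp/latest.dump
kubectl delete pod restore-helper -n veza-production
```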
### Step 4: Deploy Applications
```bash
# Deploy backend API
kubectl apply -f k8s/backend-api/deployment.yaml -n veza-production
kubectl apply -f k8s/backend-api/service.yaml -n veza-production
# Deploy frontend
kubectl apply -f k8s/frontend/deployment.yaml -n veza-production
kubectl apply -f k8s/frontend/service.yaml -n veza-production
# Deploy chat server
kubectl apply -f k8s/chat-server/deployment.yaml -n veza-production
kubectl apply -f k8s/chat-server/service.yaml -n veza-production
# Wait for deployments
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-frontend -n veza-production
kubectl rollout status deployment/veza-chat-server -n veza-production
```
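The three waits can be collapsed into one loop that aborts on the first stalled rollout (a sketch):
```bash
# Fail fast if any rollout takes longer than 5 minutes.
for d in veza-backend-api veza-frontend veza-chat-server; do
  kubectl rollout status "deployment/$d" -n veza-production --timeout=300s
done
```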
### Step 5: Configure Ingress
```bash
# Deploy ingress
kubectl apply -f k8s/ingress.yaml -n veza-production
# Verify ingress
kubectl get ingress -n veza-production
```
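The load balancer address can take several minutes to be assigned, and Step 6 depends on it, so a poll loop helps (a sketch):
```bash
# Poll until the ingress has an external IP.
until kubectl get ingress veza-ingress -n veza-production \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}' | grep -q .; do
  echo "Waiting for ingress IP..."
  sleep 10
done
```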
### Step 6: Update DNS
```bash
# Get DR cluster ingress IP
DR_INGRESS_IP=$(kubectl get ingress veza-ingress -n veza-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Update DNS records
# Option A: Using AWS Route53
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.veza.com",
"Type": "A",
"TTL": 300,
"ResourceRecords": [{"Value": "'$DR_INGRESS_IP'"}]
}
}]
}'
# Option B: Using Cloudflare
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
-H "Authorization: Bearer $CLOUDFLARE_TOKEN" \
-H "Content-Type: application/json" \
--data '{"content":"'$DR_INGRESS_IP'"}'
# Wait for DNS propagation
dig api.veza.com +short
```
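To check propagation beyond the local resolver, query a couple of public resolvers directly and compare against the DR ingress IP (a sketch):
```bash
# Each resolver should return the DR ingress IP once propagation completes.
for ns in 8.8.8.8 1.1.1.1; do
  echo "$ns -> $(dig @"$ns" api.veza.com +short)"
done
echo "expected -> $DR_INGRESS_IP"
```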
### Step 7: Verify Services
```bash
# Check all pods are running
kubectl get pods -n veza-production
# Test health endpoints
curl https://api.veza.com/health
curl https://app.veza.com/health
# Run smoke tests
# (Use your application's test suite)
# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production
```
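If no dedicated suite is at hand, a minimal smoke check of the public health endpoints (a sketch) still catches gross failures:
```bash
# Fail if either health endpoint returns anything other than HTTP 200.
for url in https://api.veza.com/health https://app.veza.com/health; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  echo "$url -> $code"
  [ "$code" = "200" ] || { echo "FAIL: $url"; exit 1; }
done
```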
### Step 8: Restore Redis (if needed)
```bash
# Deploy Redis
kubectl apply -f k8s/redis/deployment.yaml -n veza-production
# Restore Redis backup if available. Copy the dump in, then stop Redis
# WITHOUT saving: a plain pod delete can trigger a shutdown save that
# overwrites the file just copied in.
kubectl cp redis-backup.rdb redis-pod:/data/dump.rdb -n veza-production
kubectl exec redis-pod -n veza-production -- redis-cli SHUTDOWN NOSAVE
# The restarted container loads dump.rdb from disk
```
## Verification Checklist
- [ ] DR cluster is healthy
- [ ] Secrets restored
- [ ] Database restored and verified
- [ ] All applications deployed
- [ ] Ingress configured
- [ ] DNS updated
- [ ] Health checks passing
- [ ] Smoke tests passing
- [ ] Users can access platform
- [ ] Monitoring configured
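Several of these items can be verified in one pass (a sketch):
```bash
# Any pod not in the Running or Succeeded phase needs attention.
kubectl get pods -n veza-production \
  --field-selector=status.phase!=Running,status.phase!=Succeeded
# API health and DNS should agree with Steps 6 and 7.
curl -fsS https://api.veza.com/health > /dev/null && echo "OK: API health"
dig api.veza.com +short
```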
## Post-Failover Tasks
### Immediate (First Hour)
1. **Monitor Platform** (see the sketch after this list)
- Watch application logs
- Monitor error rates
- Check performance metrics
2. **Notify Stakeholders**
- Send status update
- Update status page
- Communicate expected timeline
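A lightweight way to cover the monitoring item (a sketch; run each command in its own terminal):
```bash
# Terminal 1: stream backend logs, surfacing errors
kubectl logs -f deployment/veza-backend-api -n veza-production | grep -i error
# Terminal 2: watch for pod restarts or status changes
kubectl get pods -n veza-production -w
```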
### Short Term (First Day)
1. **Investigate Primary Cluster**
- Assess damage
- Identify root cause
- Estimate recovery time
2. **Optimize DR Cluster**
- Scale resources if needed
- Optimize configurations
- Monitor performance
### Long Term (Recovery Phase)
1. **Restore Primary Cluster**
- Fix issues in primary
- Restore from backups
- Verify functionality
2. **Plan Failback**
- Schedule maintenance window
- Prepare failback procedure
- Test failback process
## Failback Procedure
Once the primary cluster is restored:
```bash
# 1. Sync data from DR to primary
# (Use database replication or restore from DR backup)
# 2. Verify primary cluster
kubectl config use-context veza-primary-cluster
kubectl get pods -n veza-production
# 3. Update DNS back to primary
# (Reverse of Step 6 in failover)
# 4. Monitor both clusters during transition
# 5. Once verified, scale down DR cluster
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production --context=veza-dr-cluster
```
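For step 1, one option is a fresh logical dump from the DR database restored into the primary. A sketch, assuming a `deploy/postgres` deployment name (adjust to your manifests):
```bash
# 1) Dump the current state from the DR database.
kubectl --context=veza-dr-cluster exec -n veza-production deploy/postgres -- \
  pg_dump -U veza_user -F c veza_db > /tmp/failback.dump
# 2) Restore it into the primary database over stdin.
kubectl --context=veza-primary-cluster exec -i -n veza-production deploy/postgres -- \
  pg_restore -U veza_user -d veza_db -F c --clean --if-exists < /tmp/failback.dump
```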
## Troubleshooting
### Database Restore Fails
```bash
# Check backup file integrity
pg_restore --list /backups/latest.dump
# Try restoring specific tables
pg_restore -h postgres-service -U veza_user -d veza_db \
-t users -t tracks /backups/latest.dump
# Check PostgreSQL logs (resolve the pod by label)
kubectl logs -l app=postgres -n veza-production
```
### Applications Not Starting
```bash
# Check pod status
kubectl describe pod <pod-name> -n veza-production
# Check logs
kubectl logs <pod-name> -n veza-production
# Verify secrets
kubectl get secret veza-secrets -n veza-production -o yaml
# Check resource constraints
kubectl top nodes
```
### DNS Not Propagating
```bash
# Check DNS records
dig api.veza.com +short
nslookup api.veza.com
# Verify ingress IP
kubectl get ingress veza-ingress -n veza-production
# Check DNS provider status
# (AWS Route53, Cloudflare, etc.)
```
## References
- [Database Restore Runbook](./data-restore.md)
- [Kubernetes Multi-Cluster Setup](https://kubernetes.io/docs/setup/)
- [DNS Management Best Practices](../README.md)