# Database Failover Runbook
This runbook describes the procedure for failing over from a primary PostgreSQL database to a standby replica.
## Prerequisites
- Standby replica configured and synchronized
- Access to Kubernetes cluster
- Database credentials in Vault/Secrets
- Monitoring alerts configured
## Detection
### Automatic Detection
Monitoring alerts will trigger when:
- Primary database is unreachable
- Replication lag exceeds threshold
- Health checks fail
### Manual Detection
```bash
# Check primary database status
kubectl exec -it postgres-primary -n veza-production -- pg_isready
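# Check replication status from the standby side
# (pg_stat_replication is only populated on the primary; a standby reports
# its WAL receiver in pg_stat_wal_receiver)
kubectl exec -it postgres-standby -n veza-production -- \
psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"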
```
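`pg_isready` reports its result through its exit status, which is easier to script against than its text output. The sketch below maps the exit codes documented for `pg_isready` to a short description. The function name `describe_pg_isready` is illustrative and not part of this repository.

```bash
# Map a pg_isready exit status to a human-readable state.
# Documented exit codes: 0 = accepting connections, 1 = rejecting connections,
# 2 = no response, 3 = no attempt made (for example, invalid parameters).
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections" ;;
    2) echo "no response" ;;
    3) echo "no attempt made" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}
```

For example, run the `pg_isready` check above and then call `describe_pg_isready "$?"`. A non-zero status can also come from `kubectl` itself (for instance, when the pod does not exist), so treat the result as a hint rather than a diagnosis.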
## Failover Procedure
### Step 1: Verify Standby Status
```bash
# Check standby is synchronized
kubectl exec -it postgres-standby -n veza-production -- \
psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"
# Verify replication lag
kubectl exec -it postgres-standby -n veza-production -- \
psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"
```
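**Expected**: The two LSNs should match (all received WAL has been replayed), and lag should be < 60 seconds. Note that `lag_seconds` measures the time since the last replayed transaction, so it keeps growing once the primary is down or idle; the threshold is only meaningful while the primary is still sending WAL.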
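To judge how far replay is behind receipt, the two LSNs from the first query in this step can be compared as byte offsets. The following is a minimal sketch; the helper names `lsn_to_bytes` and `replay_gap_bytes` are hypothetical and the LSN values in the example are made up.

```bash
# Convert a PostgreSQL LSN such as "0/3000060" into a byte offset.
# An LSN is two hexadecimal numbers: the high and low 32 bits of a 64-bit position.
lsn_to_bytes() {
  local hi="${1%/*}" lo="${1#*/}"
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# Bytes of WAL that have been received ($1) but not yet replayed ($2).
replay_gap_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

For example, `replay_gap_bytes 0/3000060 0/3000000` prints `96`. A gap of `0` means the standby has replayed everything it received. The same difference can be computed server-side with `pg_wal_lsn_diff()`.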
### Step 2: Promote Standby to Primary
```bash
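# Promote standby
# (pg_ctl must run as the postgres OS user with PGDATA set; on PostgreSQL 12+,
# `psql -U postgres -c "SELECT pg_promote();"` is an alternative)
kubectl exec -it postgres-standby -n veza-production -- \
pg_ctl promote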
# Verify promotion
kubectl exec -it postgres-standby -n veza-production -- \
psql -U postgres -c "SELECT pg_is_in_recovery();"
```
**Expected**: Returns `f` (false), meaning the instance is no longer in recovery mode
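Promotion is not always instantaneous, so the verification query may need to be repeated. The sketch below polls an arbitrary command until it prints an expected value; `wait_for_output` is a hypothetical helper, and the attempt count and delay are assumptions to tune for your environment.

```bash
# Re-run a command until its stdout equals the expected string.
# Usage: wait_for_output EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on a match, 1 if all attempts are used up.
wait_for_output() {
  local expected="$1" attempts="$2" delay="$3" i=1 out
  shift 3
  while [ "$i" -le "$attempts" ]; do
    # A failing command is treated as "no output yet", not as a fatal error.
    out=$("$@" 2>/dev/null) || out=""
    if [ "$out" = "$expected" ]; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

A possible invocation: `wait_for_output f 30 2 kubectl exec postgres-standby -n veza-production -- psql -U postgres -tAc "SELECT pg_is_in_recovery();"`. The `-tA` flags make `psql` print only the bare value, so the comparison against `f` is exact.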
### Step 3: Update Service Endpoint
```bash
# Point the postgres service at pods labeled role=primary
# (the promoted pod must carry this label for the service to route to it)
kubectl patch service postgres -n veza-production \
-p '{"spec":{"selector":{"role":"primary"}}}'
# Alternatively, list the standby pod's labels as key=value pairs
# to build a selector that matches it
kubectl get pod postgres-standby -n veza-production -o json | \
jq -r '.metadata.labels | to_entries | map("\(.key)=\(.value)") | join(",")'
```
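The label-listing command prints a comma-separated `key=value` string. As a sketch, that string can be turned into a patch body for `kubectl patch service`. The function name `selector_patch` is hypothetical, and it assumes label keys and values need no JSON escaping, which holds for the character set Kubernetes allows in labels.

```bash
# Build a service selector patch from a "key=value,key=value" label string.
# Assumes keys and values contain no quotes, backslashes, or glob characters.
selector_patch() {
  local json="" kv old_ifs="$IFS"
  IFS=','
  for kv in $1; do
    # Split each pair at the first "=" and append it as a JSON member.
    json="${json}${json:+,}\"${kv%%=*}\":\"${kv#*=}\""
  done
  IFS="$old_ifs"
  printf '{"spec":{"selector":{%s}}}\n' "$json"
}
```

For example, `kubectl patch service postgres -n veza-production -p "$(selector_patch "app=postgres,role=standby")"`. The label names here are placeholders. With the default patch type, selector keys that are not mentioned in the patch are likely to remain on the service, so check the result with `kubectl get service postgres -n veza-production -o yaml`. Pods often carry controller-generated labels too, so selecting on a small, stable subset is usually safer than using every label.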
### Step 4: Restart Application Pods
```bash
# Restart to pick up new database connection
kubectl rollout restart deployment/veza-backend-api -n veza-production
kubectl rollout restart deployment/veza-chat-server -n veza-production
# Verify pods are healthy
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-chat-server -n veza-production
```
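When more than one deployment depends on the database, the restart and the status check can be wrapped in a loop so that no deployment is forgotten. This is a sketch only; `restart_and_wait` is a hypothetical helper and the timeout value is an assumption.

```bash
# Restart each deployment in turn and block until its rollout completes.
# Usage: restart_and_wait NAMESPACE TIMEOUT DEPLOYMENT...
# Stops at the first failure so a broken rollout is noticed immediately.
restart_and_wait() {
  local ns="$1" timeout="$2" d
  shift 2
  for d in "$@"; do
    kubectl rollout restart "deployment/$d" -n "$ns" || return 1
    kubectl rollout status "deployment/$d" -n "$ns" --timeout="$timeout" || return 1
  done
}
```

For example: `restart_and_wait veza-production 180s veza-backend-api veza-chat-server`.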
### Step 5: Verify Application Health
```bash
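# Check recent application logs (add -f to follow)
kubectl logs deployment/veza-backend-api -n veza-production --tail=100
# Test database connectivity
# (sh -c makes $DATABASE_URL expand inside the container rather than in the
# local shell; assumes psql is installed in the application image)
kubectl exec deployment/veza-backend-api -n veza-production -- \
sh -c 'psql "$DATABASE_URL" -c "SELECT 1;"'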
# Check health endpoint
curl https://api.veza.com/health
```
### Step 6: Set Up New Standby
```bash
# Once primary is recovered, set up new standby
# Follow PostgreSQL replication setup guide
```
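One common way to rebuild a standby is to re-seed it from the current primary with `pg_basebackup`. The sketch below is not a substitute for the replication setup guide mentioned above; the function name `seed_standby` and its arguments (primary host, replication user, data directory) are assumptions that must match your environment.

```bash
# Re-seed a standby from the current primary with pg_basebackup.
# Must run as the postgres OS user against an EMPTY data directory.
# Usage: seed_standby PRIMARY_HOST REPLICATION_USER DATA_DIR
seed_standby() {
  local host="$1" user="$2" datadir="$3"
  # -R writes standby.signal and the connection settings (PostgreSQL 12+),
  # -X stream also copies WAL generated during the backup, -P shows progress.
  pg_basebackup -h "$host" -U "$user" -D "$datadir" -R -X stream -P
}
```

After the base backup completes and the instance starts, it should come up in recovery mode, which can be confirmed with the `pg_is_in_recovery()` query from Step 2 (expected result `t`).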
## Rollback Procedure
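If the failover was triggered in error, or the original primary recovers, revert as follows. Any writes the promoted standby accepted are not present on the original primary, so reconcile or export them before switching back: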
```bash
# Stop applications (repeat for veza-chat-server, which also connects to the database)
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production
# Revert service endpoint
kubectl patch service postgres -n veza-production \
-p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'
# Restart applications
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```
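The rollback above restores a fixed replica count. As a sketch, the count can instead be read before scaling down and restored afterwards. `with_deployment_stopped` is a hypothetical helper; it restores the deployment even if the wrapped command fails, which is an assumption about the desired behavior.

```bash
# Scale a deployment to zero, run a command, then restore the replica count
# that was configured beforehand instead of a hard-coded number.
# Usage: with_deployment_stopped NAMESPACE DEPLOYMENT COMMAND [ARGS...]
with_deployment_stopped() {
  local ns="$1" deploy="$2" replicas rc=0
  shift 2
  replicas=$(kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.replicas}') || return 1
  kubectl scale deployment "$deploy" --replicas=0 -n "$ns" || return 1
  # Run the wrapped command but remember its status so it can be reported.
  "$@" || rc=$?
  kubectl scale deployment "$deploy" --replicas="$replicas" -n "$ns" || return 1
  return "$rc"
}
```

For example, the service revert can be passed as the wrapped command: `with_deployment_stopped veza-production veza-backend-api kubectl patch service postgres -n veza-production -p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'`.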
## Verification Checklist
- [ ] Standby promoted successfully
- [ ] Service endpoint updated
- [ ] Application pods restarted
- [ ] Database connectivity verified
- [ ] Application health checks passing
- [ ] No data loss detected
- [ ] Monitoring alerts cleared
- [ ] Documentation updated
## Troubleshooting
### Standby Not Synchronized
```bash
# Check the WAL receiver on the standby
# (pg_stat_replication is only populated on the primary)
kubectl exec -it postgres-standby -n veza-production -- \
psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"
# If replication is broken, rebuild standby
# (See PostgreSQL replication setup guide)
```
### Application Cannot Connect
```bash
# Verify service selector
kubectl get service postgres -n veza-production -o yaml
# Check pod labels
kubectl get pod postgres-standby -n veza-production --show-labels
# Verify network connectivity
kubectl run test-connection --rm -it --image=postgres:15-alpine \
--restart=Never \
-- psql -h postgres.veza-production.svc.cluster.local -U veza_user -d veza_db
```
## Post-Failover Tasks
1. **Investigate Root Cause**
   - Review primary database logs
   - Check system resources
   - Identify failure reason
2. **Set Up New Standby**
   - Configure replication from new primary
   - Verify synchronization
   - Update monitoring
3. **Document Incident**
   - Document failover procedure
   - Note any issues encountered
   - Update runbook if needed
4. **Notify Stakeholders**
   - Send incident report
   - Update status page
   - Schedule post-mortem
## References
- [PostgreSQL Replication Documentation](https://www.postgresql.org/docs/current/high-availability.html)
- [Kubernetes Service Documentation](https://kubernetes.io/docs/concepts/services-networking/service/)