veza/k8s/disaster-recovery/runbooks/database-failover.md

# Database Failover Runbook

This runbook describes the procedure for failing over from a primary PostgreSQL database to a standby replica.

## Prerequisites

- Standby replica configured and synchronized
- Access to Kubernetes cluster
- Database credentials in Vault/Secrets
- Monitoring alerts configured

## Detection

### Automatic Detection

Monitoring alerts will trigger when:
- Primary database is unreachable
- Replication lag exceeds threshold
- Health checks fail

### Manual Detection

```bash
# Check primary database status
kubectl exec -it postgres-primary -n veza-production -- pg_isready

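# Check replication status on the standby
# (pg_stat_replication is only populated on the sending server, so query the WAL receiver here)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"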
```
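
`pg_isready` reports its result through its exit status, which makes the check easy to script. The sketch below maps the documented exit codes to messages. The function name `describe_pg_isready` is illustrative and not part of any existing tooling.

```bash
# Translate a pg_isready exit status into a human-readable message.
# Exit codes are those documented for pg_isready:
#   0 = accepting connections, 1 = rejecting connections (e.g. during startup),
#   2 = no response, 3 = no attempt made (e.g. invalid parameters).
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections" ;;
    2) echo "no response" ;;
    3) echo "no attempt made" ;;
    *) echo "unknown status $1" ;;
  esac
}

# Example (not run here):
#   kubectl exec postgres-primary -n veza-production -- pg_isready
#   describe_pg_isready "$?"
```

Note that `kubectl` itself exits non-zero when it cannot reach the pod, so a status of 1 may also mean the exec failed rather than that PostgreSQL is rejecting connections.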

## Failover Procedure

### Step 1: Verify Standby Status

```bash
# Check standby is synchronized
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Verify replication lag
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"
```

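**Expected**: Lag should be < 60 seconds. If the primary is already down or idle, this value keeps growing because no new transactions are being replayed. In that case, compare the two LSNs from the first query instead: matching values mean the standby has replayed everything it received.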
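
To judge how far replay trails behind the received WAL, the two LSNs can be converted to byte positions and subtracted. The following is a minimal sketch; the helper names `lsn_to_bytes` and `replay_gap_bytes` are hypothetical. PostgreSQL's built-in `pg_wal_lsn_diff()` performs the same subtraction server-side.

```bash
# Convert a PostgreSQL LSN such as "0/3000060" into a byte position.
# An LSN is two hexadecimal numbers: the high and low 32 bits of a 64-bit offset.
lsn_to_bytes() {
  local hi="${1%/*}" lo="${1#*/}"
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# Print how many bytes of received WAL have not been replayed yet.
# Usage: replay_gap_bytes RECEIVE_LSN REPLAY_LSN
replay_gap_bytes() {
  local received replayed
  received=$(lsn_to_bytes "$1")
  replayed=$(lsn_to_bytes "$2")
  echo $(( received - replayed ))
}

# Example: replay_gap_bytes 0/3000060 0/3000000   # prints 96
```

A gap of 0 means replay has caught up with everything the standby received. It does not prove that the standby received everything the primary committed.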

### Step 2: Promote Standby to Primary

```bash
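# Promote standby (pg_ctl must run as the postgres OS user with PGDATA set;
# on PostgreSQL 12+, `psql -U postgres -c "SELECT pg_promote();"` is an alternative)
kubectl exec -it postgres-standby -n veza-production -- \
  pg_ctl promote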

# Verify promotion
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"
```

**Expected**: Returns `f` (false), meaning the server is no longer in recovery mode
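
Promotion is not instantaneous, so the verification query may need to be repeated. The sketch below polls any command until it prints `f`. The function name `wait_for_promotion` and the `POLL_INTERVAL` variable are assumptions made for illustration.

```bash
# Poll a command until it prints "f" (pg_is_in_recovery() = false).
# Usage: wait_for_promotion MAX_ATTEMPTS COMMAND [ARGS...]
# POLL_INTERVAL (seconds between attempts) defaults to 2.
wait_for_promotion() {
  local attempts="$1" i=0 state
  shift
  while [ "$i" -lt "$attempts" ]; do
    # Strip whitespace so that "f\n" or " f" compare equal to "f".
    state=$("$@" 2>/dev/null | tr -d '[:space:]')
    if [ "$state" = "f" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  return 1
}

# Example (not run here); -At makes psql print the bare value:
#   wait_for_promotion 30 kubectl exec postgres-standby -n veza-production -- \
#     psql -U postgres -Atc "SELECT pg_is_in_recovery();"
```

Passing the probe as arguments keeps the loop independent of how the database is reached, which also makes it easy to dry-run with a stub command.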

### Step 3: Update Service Endpoint

```bash
# Update postgres service to select the new primary
# (only takes effect once the promoted pod carries the label role=primary)
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary"}}}'

# List the standby pod's labels as key=value pairs to choose selector values
kubectl get pod postgres-standby -n veza-production -o jsonpath='{.metadata.labels}' | \
  jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")'
```
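
A Service selector only routes to pods whose labels match it, so the promoted pod needs the `role=primary` label before the Service follows it. The sketch below assumes the Service selects on the `role` label, as in the patch above. The function name `relabel_primary` is hypothetical.

```bash
# Move the role=primary label to the promoted pod so the Service selector matches it.
# Assumes pods are distinguished by a `role` label (an assumption based on the selector above).
# Usage: relabel_primary NAMESPACE OLD_PRIMARY_POD NEW_PRIMARY_POD
relabel_primary() {
  local ns="$1" old_pod="$2" new_pod="$3"
  # Remove the label from the old primary; ignore errors if that pod no longer exists.
  kubectl label pod "$old_pod" -n "$ns" role- || true
  # Mark the promoted pod as primary, replacing any existing role value.
  kubectl label pod "$new_pod" -n "$ns" role=primary --overwrite
}

# Example (not run here):
#   relabel_primary veza-production postgres-primary postgres-standby
#   kubectl get endpoints postgres -n veza-production
```

If the pods are managed by a StatefulSet or an operator, labels set by hand may be reset when the pod is recreated, so the controller's template may need the same change.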

### Step 4: Restart Application Pods

```bash
# Restart to pick up new database connection
kubectl rollout restart deployment/veza-backend-api -n veza-production
kubectl rollout restart deployment/veza-chat-server -n veza-production

# Verify pods are healthy
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-chat-server -n veza-production
```
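
The restart and status checks can be wrapped in one helper so that no deployment is forgotten and a stuck rollout fails after a bounded wait. This is a sketch only: the function name `restart_and_wait` and the 180-second timeout are assumptions, not values taken from this runbook.

```bash
# Restart every listed deployment, then wait for each rollout to finish.
# Usage: restart_and_wait NAMESPACE DEPLOYMENT [DEPLOYMENT...]
restart_and_wait() {
  local ns="$1" d
  shift
  # Trigger all restarts first so the rollouts proceed in parallel.
  for d in "$@"; do
    kubectl rollout restart "deployment/$d" -n "$ns" || return 1
  done
  # Then block on each rollout; --timeout makes a stuck rollout return an error.
  for d in "$@"; do
    kubectl rollout status "deployment/$d" -n "$ns" --timeout=180s || return 1
  done
}

# Example (not run here):
#   restart_and_wait veza-production veza-backend-api veza-chat-server
```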

### Step 5: Verify Application Health

```bash
# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production

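# Test database connectivity (sh -c makes $DATABASE_URL expand inside the container,
# not in the local shell; assumes psql is installed in the application image)
kubectl exec -it deployment/veza-backend-api -n veza-production -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1;"'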

# Check health endpoint
curl https://api.veza.com/health
```

### Step 6: Set Up New Standby

```bash
# Once primary is recovered, set up new standby
# Follow PostgreSQL replication setup guide
```

## Rollback Procedure

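If the failover was triggered in error, or the original primary recovers, traffic can be pointed back at it. Any writes accepted by the promoted standby are not present on the original primary, so stop all writers first and confirm that no data will be lost before reverting: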

```bash
# Stop applications (scale down veza-chat-server the same way if it writes to the database)
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# Revert service endpoint
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'

# Restart applications
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```
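
The rollback above hardcodes `--replicas=3`, which may not match the count that was running before. A minimal sketch of recording the current replica count before scaling down and restoring it afterwards follows. The function names and the state file path are hypothetical.

```bash
# Record a deployment's current replica count, then scale it to zero.
# Usage: scale_down_saving_replicas NAMESPACE DEPLOYMENT STATE_FILE
scale_down_saving_replicas() {
  local ns="$1" deploy="$2" state_file="$3" replicas
  replicas=$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}') || return 1
  # One "name count" pair per line so several deployments can share the file.
  printf '%s %s\n' "$deploy" "$replicas" >> "$state_file"
  kubectl scale deployment "$deploy" --replicas=0 -n "$ns"
}

# Scale every recorded deployment back to its saved replica count.
# Usage: restore_replicas NAMESPACE STATE_FILE
restore_replicas() {
  local ns="$1" state_file="$2" deploy replicas
  while read -r deploy replicas; do
    # </dev/null keeps kubectl from consuming the rest of the state file.
    kubectl scale deployment "$deploy" --replicas="$replicas" -n "$ns" </dev/null || return 1
  done < "$state_file"
}

# Example (not run here; /tmp/veza-replicas.txt is an arbitrary path):
#   scale_down_saving_replicas veza-production veza-backend-api /tmp/veza-replicas.txt
#   restore_replicas veza-production /tmp/veza-replicas.txt
```

If a HorizontalPodAutoscaler manages the deployment, it may change the replica count independently of these commands.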

## Verification Checklist

- [ ] Standby promoted successfully
- [ ] Service endpoint updated
- [ ] Application pods restarted
- [ ] Database connectivity verified
- [ ] Application health checks passing
- [ ] No data loss detected
- [ ] Monitoring alerts cleared
- [ ] Documentation updated

## Troubleshooting

### Standby Not Synchronized

```bash
# Check the WAL receiver on the standby (an empty result means it is not streaming)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"

# If replication is broken, rebuild standby
# (See PostgreSQL replication setup guide)
```

### Application Cannot Connect

```bash
# Verify service selector
kubectl get service postgres -n veza-production -o yaml

# Check pod labels
kubectl get pod postgres-standby -n veza-production --show-labels

# Verify network connectivity
kubectl run test-connection --rm -it --image=postgres:15-alpine \
  --restart=Never \
  -- psql -h postgres.veza-production.svc.cluster.local -U veza_user -d veza_db
```
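
A common cause of connection failures after failover is a Service whose endpoints are empty or still point at the old pod. The sketch below compares the Service's endpoint addresses with the promoted pod's IP. The function name `endpoints_point_to` is illustrative.

```bash
# Succeed only when the Service's endpoint list is exactly the expected pod IP.
# Usage: endpoints_point_to "ENDPOINT_IPS" "POD_IP"
endpoints_point_to() {
  local endpoint_ips="$1" pod_ip="$2"
  # An empty pod IP never matches, so a missing pod is reported as a failure.
  [ -n "$pod_ip" ] && [ "$endpoint_ips" = "$pod_ip" ]
}

# Example (not run here):
#   ips=$(kubectl get endpoints postgres -n veza-production \
#     -o jsonpath='{.subsets[*].addresses[*].ip}')
#   pod_ip=$(kubectl get pod postgres-standby -n veza-production \
#     -o jsonpath='{.status.podIP}')
#   endpoints_point_to "$ips" "$pod_ip" || echo "Service does not resolve to the promoted pod"
```

The comparison is deliberately strict: several addresses behind the Service would mean more than one pod matches the selector, which is worth investigating before sending writes.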

## Post-Failover Tasks

1. **Investigate Root Cause**
   - Review primary database logs
   - Check system resources
   - Identify failure reason

2. **Set Up New Standby**
   - Configure replication from new primary
   - Verify synchronization
   - Update monitoring

3. **Document Incident**
   - Record the failover timeline and the actions taken
   - Note any issues encountered
   - Update runbook if needed

4. **Notify Stakeholders**
   - Send incident report
   - Update status page
   - Schedule post-mortem

## References

- [PostgreSQL Replication Documentation](https://www.postgresql.org/docs/current/high-availability.html)
- [Kubernetes Service Documentation](https://kubernetes.io/docs/concepts/services-networking/service/)

[INFRA-010] infra: Set up disaster recovery plan 2025-12-25 20:40:31 +00:00
