# Database Failover Runbook

This runbook describes the procedure for failing over from a primary PostgreSQL database to a standby replica.

## Prerequisites

- Standby replica configured and synchronized
- Access to Kubernetes cluster
- Database credentials in Vault/Secrets
- Monitoring alerts configured

## Detection

### Automatic Detection

Monitoring alerts will trigger when:

- Primary database is unreachable
- Replication lag exceeds threshold
- Health checks fail

### Manual Detection

```bash
# Check primary database status
kubectl exec -it postgres-primary -n veza-production -- pg_isready

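# Check replication status from the standby
# (pg_stat_replication is only populated on the primary; the standby exposes pg_stat_wal_receiver)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"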
```
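`pg_isready` reports its result through its exit status, which helps when the check above is scripted. A minimal sketch; the function name `describe_pg_isready` is illustrative and not part of any existing tooling. A failure of `kubectl exec` itself also yields a non-zero status, so anything other than 0 should be treated as unhealthy.

```bash
# Map a pg_isready exit status to a human-readable state.
# Exit statuses 0-3 are the ones documented for pg_isready.
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, during startup)" ;;
    2) echo "no response" ;;
    3) echo "no attempt made (invalid parameters)" ;;
    *) echo "unknown status: $1" ;;
  esac
}

# Usage (pod name and namespace as used in this runbook):
#   kubectl exec postgres-primary -n veza-production -- pg_isready
#   describe_pg_isready "$?"
```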

## Failover Procedure

### Step 1: Verify Standby Status

```bash
# Check standby is synchronized
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Verify replication lag
# (overstates lag if the primary has been idle; returns NULL if nothing has been replayed yet)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"
```

**Expected**: Lag is under 60 seconds, and the receive and replay LSNs match or are close to each other.
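The 60-second threshold can be enforced as a gate before promotion. A minimal sketch, assuming the lag value is fetched with the query above using `psql -At` so that only the number is printed; `lag_within_threshold` is an illustrative name. Empty or malformed values are treated as unsafe.

```bash
# Succeed only if the replay lag (in seconds) is a plain number below the threshold.
# An empty value means pg_last_xact_replay_timestamp() returned NULL.
lag_within_threshold() {
  local lag="$1" threshold="${2:-60}"
  case "$lag" in
    ''|.|*[!0-9.]*|*.*.*) return 1 ;;  # empty, negative, or non-numeric: unsafe
  esac
  # awk handles the fractional comparison that [ -lt ] cannot.
  awk -v lag="$lag" -v max="$threshold" 'BEGIN { exit !(lag + 0 < max + 0) }'
}

# Usage:
#   lag=$(kubectl exec postgres-standby -n veza-production -- \
#     psql -U postgres -Atc "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));")
#   lag_within_threshold "$lag" 60 || echo "standby is too far behind; do not promote"
```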

### Step 2: Promote Standby to Primary

```bash
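# Promote standby
# (pg_ctl needs PGDATA set and must run as the postgres OS user, not root;
# on PostgreSQL 12+, running "SELECT pg_promote();" through psql is an alternative)
kubectl exec -it postgres-standby -n veza-production -- \
  pg_ctl promote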

# Verify promotion
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"
```

**Expected**: Returns `f` (false), meaning the server is no longer in recovery mode
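Promotion is not instantaneous, so a single check of `pg_is_in_recovery()` can run too early. A minimal polling sketch, assuming the pod name and namespace used in this runbook; `query_recovery_state` and `wait_until_promoted` are illustrative names.

```bash
# Print the standby's recovery state: "t" while in recovery, "f" once promoted.
query_recovery_state() {
  kubectl exec postgres-standby -n veza-production -- \
    psql -U postgres -Atc "SELECT pg_is_in_recovery();"
}

# Poll until the standby reports "f" or the attempts run out.
# Arguments: number of attempts (default 30), delay in seconds (default 2).
wait_until_promoted() {
  local attempts="${1:-30}" delay="${2:-2}" i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(query_recovery_state 2>/dev/null)" = "f" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_until_promoted 30 2 || echo "standby is still in recovery"
```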

### Step 3: Update Service Endpoint

```bash
# Point the postgres service at pods labeled role=primary.
# The promoted pod must carry this label, otherwise the service has no endpoints.
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary"}}}'

# Alternatively, list the promoted pod's labels as key=value pairs
# to choose a selector that matches it
kubectl get pod postgres-standby -n veza-production -o jsonpath='{.metadata.labels}' | \
  jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")'
```
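After changing the selector, it is worth confirming that the service actually routes to the promoted pod. A minimal sketch that compares the pod IP against the service's endpoint addresses; `ip_in_endpoints` is an illustrative helper, and the `kubectl` calls in the usage comment assume the resource names from this runbook.

```bash
# Succeed if the given IP appears in a space-separated list of endpoint IPs.
ip_in_endpoints() {
  local ip="$1" endpoints="$2" candidate
  for candidate in $endpoints; do
    [ "$candidate" = "$ip" ] && return 0
  done
  return 1
}

# Usage:
#   pod_ip=$(kubectl get pod postgres-standby -n veza-production \
#     -o jsonpath='{.status.podIP}')
#   endpoints=$(kubectl get endpoints postgres -n veza-production \
#     -o jsonpath='{.subsets[*].addresses[*].ip}')
#   ip_in_endpoints "$pod_ip" "$endpoints" || echo "service does not route to the promoted pod"
```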

### Step 4: Restart Application Pods

```bash
# Restart to pick up new database connection
# (backend-api handles chat since v0.502 merge — no separate chat-server deployment)
kubectl rollout restart deployment/veza-backend-api -n veza-production

# Verify pods are healthy
kubectl rollout status deployment/veza-backend-api -n veza-production
```

### Step 5: Verify Application Health

```bash
# Check recent application logs (add -f to stream them)
kubectl logs deployment/veza-backend-api -n veza-production --tail=100

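# Test database connectivity
# (sh -c expands DATABASE_URL inside the pod rather than in the local shell;
# assumes psql is installed in the application image)
kubectl exec -it deployment/veza-backend-api -n veza-production -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1;"'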

# Check health endpoint (-f makes curl exit non-zero on HTTP errors)
curl -fsS https://api.veza.com/health
```
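Pods may need a few seconds after the rollout before the health endpoint responds, so a single request can fail spuriously. A minimal retry sketch; `retry` is an illustrative helper, not an existing tool.

```bash
# Run a command until it succeeds, up to a maximum number of attempts.
# Arguments: attempts, delay in seconds, then the command to run.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 0
}

# Usage:
#   retry 10 5 curl -fsS https://api.veza.com/health
```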

### Step 6: Set Up New Standby

```bash
# Once primary is recovered, set up new standby
# Follow PostgreSQL replication setup guide
```

## Rollback Procedure

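If the failover was triggered in error or the original primary recovers, route traffic back to it. Writes accepted by the promoted standby after the failover do not exist on the original primary, so they are lost to the application unless they are reconciled first: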

```bash
# Stop applications
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# Revert service endpoint
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'

# Restart applications (adjust --replicas to the count in use before scaling down)
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```
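The rollback above restores a fixed replica count. A sketch that records the current count before scaling down and restores it afterwards, assuming no autoscaler manages the deployment; `current_replicas` and `with_deployment_stopped` are illustrative names.

```bash
# Print the deployment's current desired replica count.
current_replicas() {
  kubectl get deployment "$1" -n "$2" -o jsonpath='{.spec.replicas}'
}

# Scale a deployment to zero, run the given command, then restore the saved count.
# Arguments: deployment name, namespace, then the command to run while stopped.
with_deployment_stopped() {
  local deployment="$1" namespace="$2" saved rc
  shift 2
  saved=$(current_replicas "$deployment" "$namespace") || return 1
  [ -n "$saved" ] || return 1
  kubectl scale deployment "$deployment" --replicas=0 -n "$namespace" || return 1
  "$@"
  rc=$?
  # Restore the replica count even if the command failed, then report its status.
  kubectl scale deployment "$deployment" --replicas="$saved" -n "$namespace" || return 1
  return "$rc"
}

# Usage:
#   with_deployment_stopped veza-backend-api veza-production \
#     kubectl patch service postgres -n veza-production \
#     -p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'
```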

## Verification Checklist

- [ ] Standby promoted successfully
- [ ] Service endpoint updated
- [ ] Application pods restarted
- [ ] Database connectivity verified
- [ ] Application health checks passing
- [ ] No data loss detected
- [ ] Monitoring alerts cleared
- [ ] Documentation updated

## Troubleshooting

### Standby Not Synchronized

```bash
# Check the WAL receiver on the standby
# (pg_stat_replication is only populated on the primary)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"

# If replication is broken, rebuild standby
# (See PostgreSQL replication setup guide)
```

### Application Cannot Connect

```bash
# Verify service selector
kubectl get service postgres -n veza-production -o yaml

# Check pod labels
kubectl get pod postgres-standby -n veza-production --show-labels

# Verify network connectivity
kubectl run test-connection --rm -it --image=postgres:15-alpine \
  --restart=Never \
  -- psql -h postgres.veza-production.svc.cluster.local -U veza_user -d veza_db
```

## Post-Failover Tasks

1. **Investigate Root Cause**
    - Review primary database logs
    - Check system resources
    - Identify failure reason

2. **Set Up New Standby**
    - Configure replication from new primary
    - Verify synchronization
    - Update monitoring

3. **Document Incident**
    - Record the failover timeline and the actions taken
    - Note any issues encountered
    - Update runbook if needed

4. **Notify Stakeholders**
    - Send incident report
    - Update status page
    - Schedule post-mortem

## References

- [PostgreSQL Replication Documentation](https://www.postgresql.org/docs/current/high-availability.html)
- [Kubernetes Service Documentation](https://kubernetes.io/docs/concepts/services-networking/service/)