Cluster Failover Runbook

This runbook describes the procedure for failing over to a disaster recovery (DR) region when the primary cluster is completely unavailable.

Prerequisites

  • DR cluster provisioned and ready
  • Backups available in DR region
  • DNS access for failover
  • Access to both primary and DR clusters
  • Disaster declared and approved

Pre-Failover Checklist

  • Disaster declared and documented
  • Stakeholders notified
  • DR cluster resources verified (see the check script after this list)
  • Latest backups available in DR
  • DNS access confirmed
  • Team assembled and ready
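
Most of the technical items above can be verified from a terminal before anyone touches DNS. A minimal sketch, assuming the context and bucket names used throughout this runbook:

# Pre-failover sanity check (context and bucket names assumed from this runbook)
kubectl --context veza-dr-cluster get nodes || echo "DR cluster unreachable"
kubectl --context veza-dr-cluster get namespace veza-production || echo "Namespace missing"

# Confirm a recent PostgreSQL backup exists in the DR region
aws s3 ls s3://veza-backups/postgres/ | sort | tail -n 1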

Failover Procedure

Step 1: Verify DR Cluster Status

# Switch kubectl context to DR cluster
kubectl config use-context veza-dr-cluster

# Verify cluster is healthy
kubectl cluster-info
kubectl get nodes

# Verify namespaces exist
kubectl get namespaces | grep veza

Step 2: Restore Secrets

# Restore secrets from Vault or backup
# Option A: From Vault
kubectl create secret generic veza-secrets \
  --from-literal=database-url="$(vault kv get -field=database-url secret/veza/production)" \
  --from-literal=jwt-secret="$(vault kv get -field=jwt-secret secret/veza/production)" \
  --from-literal=redis-url="$(vault kv get -field=redis-url secret/veza/production)" \
  -n veza-production \
  --dry-run=client -o yaml | kubectl apply -f -

# Option B: From backup file
kubectl create secret generic veza-secrets \
  --from-env-file=secrets-backup.env \
  -n veza-production
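
Whichever option is used, confirm the keys landed before deploying anything that mounts them (key names below assume Option A):

# List the secret's keys without printing values
kubectl describe secret veza-secrets -n veza-production

# Spot-check that one value decodes cleanly (this prints the secret — mind your terminal)
kubectl get secret veza-secrets -n veza-production \
  -o jsonpath='{.data.database-url}' | base64 -d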

Step 3: Restore Database

# 1. Deploy PostgreSQL in DR cluster
kubectl apply -f k8s/database/postgres-deployment.yaml -n veza-production

# 2. Wait for PostgreSQL to be ready
kubectl wait --for=condition=ready pod \
  -l app=postgres \
  -n veza-production \
  --timeout=300s

# 3. Restore from latest backup
# Get backup from S3 or backup storage
# Note: the restore pod below mounts the node's /tmp via hostPath, so the dump
# must end up on the node that runs the pod (or use the kubectl cp alternative below)
aws s3 cp s3://veza-backups/postgres/latest.dump /tmp/latest.dump

# Restore database
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=..." \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "postgres-restore",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/latest.dump --clean --if-exists"],
      "volumeMounts": [{
        "name": "backup",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup",
      "hostPath": {
        "path": "/tmp"
      }
    }]
  }
}' \
  -n veza-production
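
# Alternative to the hostPath mount above — a sketch that copies the dump
# straight into a throwaway helper pod (pod name and paths are illustrative)
kubectl run postgres-restore-helper --image=postgres:15-alpine \
  --restart=Never -n veza-production -- sleep 3600
kubectl wait --for=condition=ready pod/postgres-restore-helper \
  -n veza-production --timeout=120s
kubectl cp /tmp/latest.dump veza-production/postgres-restore-helper:/tmp/latest.dump
kubectl exec postgres-restore-helper -n veza-production -- \
  env PGPASSWORD=... pg_restore -h postgres-service -U veza_user -d veza_db \
  -F c --clean --if-exists /tmp/latest.dump
kubectl delete pod postgres-restore-helper -n veza-production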

# 4. Verify database restore
kubectl exec -it <postgres-pod> -n veza-production -- \
  psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"

Step 4: Deploy Applications

# Deploy backend API (includes chat since v0.502 merge)
kubectl apply -f k8s/backend-api/deployment.yaml -n veza-production
kubectl apply -f k8s/backend-api/service.yaml -n veza-production

# Deploy frontend
kubectl apply -f k8s/frontend/deployment.yaml -n veza-production
kubectl apply -f k8s/frontend/service.yaml -n veza-production

# Deploy stream server
kubectl apply -f k8s/stream-server/deployment.yaml -n veza-production
kubectl apply -f k8s/stream-server/service.yaml -n veza-production

# Wait for deployments
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-frontend -n veza-production
kubectl rollout status deployment/veza-stream-server -n veza-production

Step 5: Configure Ingress

# Deploy ingress
kubectl apply -f k8s/ingress.yaml -n veza-production

# Verify ingress
kubectl get ingress -n veza-production
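
Freshly created ingresses can take a few minutes to be assigned an address, and Step 6 depends on it. A small wait loop using the names from the manifests above:

# Poll until the ingress has a load-balancer IP (up to ~5 minutes)
for i in $(seq 1 30); do
  IP=$(kubectl get ingress veza-ingress -n veza-production \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  [ -n "$IP" ] && echo "Ingress ready: $IP" && break
  sleep 10
done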

Step 6: Update DNS

# Get DR cluster ingress IP
DR_INGRESS_IP=$(kubectl get ingress veza-ingress -n veza-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Update DNS records
# Option A: Using AWS Route53
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.veza.com",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{"Value": "'$DR_INGRESS_IP'"}]
      }
    }]
  }'

# Option B: Using Cloudflare
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
  -H "Authorization: Bearer $CLOUDFLARE_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"content":"'$DR_INGRESS_IP'"}'

# Wait for DNS propagation
dig api.veza.com +short
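
A single dig returning the old address is expected until the 300-second TTL above expires. A sketch that polls a public resolver until the record matches the DR ingress:

# Poll until DNS resolves to the DR ingress IP
while [ "$(dig +short api.veza.com @1.1.1.1 | head -n1)" != "$DR_INGRESS_IP" ]; do
  echo "Waiting for DNS propagation..."
  sleep 30
done
echo "api.veza.com now resolves to $DR_INGRESS_IP"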

Step 7: Verify Services

# Check all pods are running
kubectl get pods -n veza-production

# Test health endpoints
curl https://api.veza.com/health
curl https://app.veza.com/health

# Run smoke tests
# (Use your application's test suite)
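
# A minimal smoke-test sketch until the real suite is wired in
# (only the health endpoints above are assumed to exist)
for ep in https://api.veza.com/health https://app.veza.com/health; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$ep")
  [ "$code" = "200" ] && echo "OK   $ep" || echo "FAIL $ep ($code)"
done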

# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production

Step 8: Restore Redis (if needed)

# Deploy Redis
kubectl apply -f k8s/redis/deployment.yaml -n veza-production

# Restore Redis backup if available
# Note: Redis may rewrite dump.rdb on shutdown; disable RDB saves first so the
# copied file survives the restart
kubectl exec <redis-pod> -n veza-production -- redis-cli config set save ""
kubectl cp redis-backup.rdb <redis-pod>:/data/dump.rdb -n veza-production
kubectl delete pod <redis-pod> -n veza-production  # Restart to load backup
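
After the pod restarts, check that the data actually loaded:

# Verify Redis is up and repopulated
kubectl exec -it <redis-pod> -n veza-production -- redis-cli ping
kubectl exec -it <redis-pod> -n veza-production -- redis-cli dbsize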

Verification Checklist

  • DR cluster is healthy
  • Secrets restored
  • Database restored and verified
  • All applications deployed
  • Ingress configured
  • DNS updated
  • Health checks passing
  • Smoke tests passing
  • Users can access platform
  • Monitoring configured

Post-Failover Tasks

Immediate (First Hour)

  1. Monitor Platform

    • Watch application logs
    • Monitor error rates
    • Check performance metrics (see the log sketch after this list)

  2. Notify Stakeholders

    • Send status update
    • Update status page
    • Communicate expected timeline
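
A minimal way to watch for error spikes from a terminal while dashboards come up (the grep pattern is an assumption about the backend's log format):

# Surface recent errors from the backend
kubectl logs deployment/veza-backend-api -n veza-production \
  --since=10m --all-containers | grep -iE 'error|panic|fatal'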

Short Term (First Day)

  1. Investigate Primary Cluster

    • Assess damage
    • Identify root cause
    • Estimate recovery time

  2. Optimize DR Cluster

    • Scale resources if needed
    • Optimize configurations
    • Monitor performance

Long Term (Recovery Phase)

  1. Restore Primary Cluster

    • Fix issues in primary
    • Restore from backups
    • Verify functionality

  2. Plan Failback

    • Schedule maintenance window
    • Prepare failback procedure
    • Test failback process

Failback Procedure

Once primary cluster is restored:

# 1. Sync data from DR to primary
# (Use database replication or restore from DR backup)
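
# A minimal sketch, assuming pg_dump/pg_restore access to both clusters
# (<postgres-pod> placeholders as elsewhere in this runbook; credentials elided)
kubectl --context=veza-dr-cluster exec <postgres-pod> -n veza-production -- \
  env PGPASSWORD=... pg_dump -h localhost -U veza_user -F c veza_db > /tmp/failback.dump
kubectl --context=veza-primary-cluster exec -i <postgres-pod> -n veza-production -- \
  env PGPASSWORD=... pg_restore -h localhost -U veza_user -d veza_db \
    --clean --if-exists < /tmp/failback.dump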

# 2. Verify primary cluster
kubectl config use-context veza-primary-cluster
kubectl get pods -n veza-production

# 3. Update DNS back to primary
# (Reverse of Step 6 in failover)

# 4. Monitor both clusters during transition

# 5. Once verified, scale down DR cluster
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production --context=veza-dr-cluster

Troubleshooting

Database Restore Fails

# Check backup file integrity
pg_restore --list /backups/latest.dump

# Try restoring specific tables
pg_restore -h postgres-service -U veza_user -d veza_db \
  -t users -t tracks /backups/latest.dump

# Check PostgreSQL logs
kubectl logs -l app=postgres -n veza-production

Applications Not Starting

# Check pod status
kubectl describe pod <pod-name> -n veza-production

# Check logs
kubectl logs <pod-name> -n veza-production

# Verify secrets
kubectl get secret veza-secrets -n veza-production -o yaml

# Check resource constraints
kubectl top nodes
kubectl top pods -n veza-production

DNS Not Propagating

# Check DNS records
dig api.veza.com +short
nslookup api.veza.com

# Verify ingress IP
kubectl get ingress veza-ingress -n veza-production

# Check DNS provider status
# (AWS Route53, Cloudflare, etc.)

References