Completes Day 2 of the v1.0.3 → v1.0.4 cleanup sprint. The documentation now describes the actual repo layout instead of a fictional one.

CLAUDE.md — complete rewrite

The old version referenced paths that don't exist and a protocol aimed at implementing v0.11.0 (current tag: v1.0.3). The agent was following a map for a city that had been rebuilt.
- backend/ → veza-backend-api/
- frontend/ → apps/web/
- ORIGIN/ (root) → veza-docs/ORIGIN/
- veza-chat-server → merged into backend-api (v0.502, commit 279a10d31)
- apps/desktop/ → never existed

Also refreshed: stack versions (Go 1.25, Vite 5, React 18.2, Axum 0.8), commands, conventions, and hook bypasses (SKIP_TYPES/SKIP_TESTS/SKIP_E2E; see the sketch at the end of this note). Scope rules kept as immutable: no AI/ML, no Web3, no gamification, no dark patterns, no public popularity metrics.

README.md — targeted fixes
- "Version cible: v0.101" → "Version courante: v1.0.4"
- "Development Setup (v0.9.3)" → "Development Setup"
- Removed Desktop (Electron) section — never implemented
- Removed veza-chat-server from structure — merged into backend
- Removed deprecated compose files section (nothing is DEPRECATED now)

k8s runbooks — remove stale chat-server references

The disaster-recovery runbooks still scaled/restarted a deployment that no longer exists. In a real failover these commands would have failed silently and blocked the procedure. Files patched:
- k8s/disaster-recovery/runbooks/cluster-failover.md
- k8s/disaster-recovery/runbooks/data-restore.md
- k8s/disaster-recovery/runbooks/database-failover.md
- k8s/disaster-recovery/runbooks/rollback-procedure.md
- k8s/network-policies/README.md
- k8s/secrets/README.md
- k8s/secrets.yaml.example

Each reference is replaced by a short inline note pointing to v0.502 (commit 279a10d31) so future readers understand the history.

.env.example — remove CHAT_JWT_SECRET

Legacy env var for the deleted chat server. Replaced by an explanatory comment.

Not in this commit (user handles on Forgejo):
- Closing the 5 open dependabot PRs on veza-chat-server/* branches
- Deleting those 5 remote branches after the PRs are closed

Refs: AUDIT_REPORT.md §5.1, §7.1, §10 P1, §10 P4
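On the hook bypasses listed above: assuming the git hooks read them as ordinary environment variables (the exact wiring is defined in CLAUDE.md), a docs-only change might skip the expensive checks like this (hypothetical invocations, shown for illustration only):
# Skip test and e2e hooks for a documentation-only change
SKIP_TESTS=1 SKIP_E2E=1 git commit -m "docs: fix stale paths"
# Skip type-checking on push
SKIP_TYPES=1 git push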
Cluster Failover Runbook
This runbook describes the procedure for failing over to a disaster recovery (DR) region when the primary cluster is completely unavailable.
Prerequisites
- DR cluster provisioned and ready
- Backups available in DR region
- DNS access for failover
- Access to both primary and DR clusters
- Disaster declared and approved
Pre-Failover Checklist
- Disaster declared and documented
- Stakeholders notified
- DR cluster resources verified
- Latest backups available in DR
- DNS access confirmed
- Team assembled and ready
Failover Procedure
Step 1: Verify DR Cluster Status
# Switch kubectl context to DR cluster
kubectl config use-context veza-dr-cluster
# Verify cluster is healthy
kubectl cluster-info
kubectl get nodes
# Verify namespaces exist
kubectl get namespaces | grep veza
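For a quick pass/fail on node health instead of eyeballing the output, a one-liner like this works (a sketch; it assumes the standard kubectl STATUS column and treats anything other than plain "Ready" as a failure):
# Fail if any node is not Ready
kubectl get nodes --no-headers | awk '$2 != "Ready" {print "NotReady:", $1; bad=1} END {exit bad}'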
Step 2: Restore Secrets
# Restore secrets from Vault or backup
# Option A: From Vault
kubectl create secret generic veza-secrets \
  --from-literal=database-url="$(vault kv get -field=database-url secret/veza/production)" \
  --from-literal=jwt-secret="$(vault kv get -field=jwt-secret secret/veza/production)" \
  --from-literal=redis-url="$(vault kv get -field=redis-url secret/veza/production)" \
  -n veza-production \
  --dry-run=client -o yaml | kubectl apply -f -
# Option B: From backup file
kubectl create secret generic veza-secrets \
--from-env-file=secrets-backup.env \
-n veza-production
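Before moving on, confirm the secret actually contains the expected keys; a wrong or empty value here surfaces much later as opaque application errors. A minimal check (the key names match those created above):
# List the keys stored in veza-secrets (values stay base64-encoded)
kubectl get secret veza-secrets -n veza-production -o jsonpath='{.data}'
# Spot-check that one value decodes to something sane (prints the first 20 characters only)
kubectl get secret veza-secrets -n veza-production \
  -o jsonpath='{.data.database-url}' | base64 -d | head -c 20; echo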
Step 3: Restore Database
# 1. Deploy PostgreSQL in DR cluster
kubectl apply -f k8s/database/postgres-deployment.yaml -n veza-production
# 2. Wait for PostgreSQL to be ready
kubectl wait --for=condition=ready pod \
-l app=postgres \
-n veza-production \
--timeout=300s
# 3. Restore from latest backup
# Download the backup from S3. Note: the restore pod below mounts the node's
# /tmp via hostPath, so this file must end up on the node that runs the pod
# (download it there, or use `kubectl cp` instead).
aws s3 cp s3://veza-backups/postgres/latest.dump /tmp/latest.dump
# Restore database
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
--restart=Never \
--env="PGPASSWORD=..." \
--env="POSTGRES_HOST=postgres-service" \
--env="POSTGRES_USER=veza_user" \
--env="POSTGRES_DB=veza_db" \
--overrides='
{
"spec": {
"containers": [{
"name": "postgres-restore",
"image": "postgres:15-alpine",
"command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/latest.dump --clean --if-exists"],
"volumeMounts": [{
"name": "backup",
"mountPath": "/backups"
}]
}],
"volumes": [{
"name": "backup",
"hostPath": {
"path": "/tmp"
}
}]
}
}' \
-n veza-production
# 4. Verify database restore (replace postgres-pod with the actual pod name)
kubectl exec -it postgres-pod -n veza-production -- \
psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"
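A single row count can pass even after a partial restore, so it is safer to compare the live-tuple counts of the largest tables against the last known-good numbers (a sketch using PostgreSQL's built-in statistics, so no schema assumptions are needed):
# Top tables by estimated row count, for comparison with the pre-disaster baseline
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c \
  "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"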
Step 4: Deploy Applications
# Deploy backend API (includes chat since v0.502 merge)
kubectl apply -f k8s/backend-api/deployment.yaml -n veza-production
kubectl apply -f k8s/backend-api/service.yaml -n veza-production
# Deploy frontend
kubectl apply -f k8s/frontend/deployment.yaml -n veza-production
kubectl apply -f k8s/frontend/service.yaml -n veza-production
# Deploy stream server
kubectl apply -f k8s/stream-server/deployment.yaml -n veza-production
kubectl apply -f k8s/stream-server/service.yaml -n veza-production
# Wait for deployments
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-frontend -n veza-production
kubectl rollout status deployment/veza-stream-server -n veza-production
Step 5: Configure Ingress
# Deploy ingress
kubectl apply -f k8s/ingress.yaml -n veza-production
# Verify ingress
kubectl get ingress -n veza-production
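The load balancer address can take a few minutes to appear, and Step 6 reads it from the ingress status; a small wait loop avoids capturing an empty value (a sketch):
# Poll until the ingress has a load balancer IP (up to ~5 minutes)
for i in $(seq 1 30); do
  IP=$(kubectl get ingress veza-ingress -n veza-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  [ -n "$IP" ] && echo "Ingress ready: $IP" && break
  sleep 10
done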
Step 6: Update DNS
# Get DR cluster ingress IP
DR_INGRESS_IP=$(kubectl get ingress veza-ingress -n veza-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Update DNS records
# Option A: Using AWS Route53
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.veza.com",
"Type": "A",
"TTL": 300,
"ResourceRecords": [{"Value": "'$DR_INGRESS_IP'"}]
}
}]
}'
# Option B: Using Cloudflare
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
-H "Authorization: Bearer $CLOUDFLARE_TOKEN" \
-H "Content-Type: application/json" \
--data '{"content":"'$DR_INGRESS_IP'"}'
# Wait for DNS propagation
dig api.veza.com +short
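dig returning an answer is not the same as it returning the DR address; comparing against $DR_INGRESS_IP makes propagation explicit (a sketch that polls one public resolver):
# Poll until a public resolver returns the DR ingress IP
until [ "$(dig +short api.veza.com @1.1.1.1 | head -n1)" = "$DR_INGRESS_IP" ]; do
  echo "Waiting for DNS propagation..."
  sleep 30
done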
Step 7: Verify Services
# Check all pods are running
kubectl get pods -n veza-production
# Test health endpoints
curl https://api.veza.com/health
curl https://app.veza.com/health
# Run smoke tests
# (Use your application's test suite)
# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production
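If no dedicated suite is available, even a minimal loop over the health endpoints beats ad-hoc curls, because it gives a single pass/fail signal (a sketch using only the endpoints shown above):
# Minimal smoke test: every endpoint must return HTTP 200
for url in https://api.veza.com/health https://app.veza.com/health; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  [ "$code" = "200" ] && echo "OK   $url" || { echo "FAIL $url ($code)"; exit 1; }
done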
Step 8: Restore Redis (if needed)
# Deploy Redis
kubectl apply -f k8s/redis/deployment.yaml -n veza-production
# Restore Redis backup if available
kubectl cp redis-backup.rdb redis-pod:/data/dump.rdb -n veza-production
kubectl delete pod redis-pod -n veza-production # Restart so Redis loads the copied dump (requires /data to be a persistent volume)
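After the restart, confirm Redis actually loaded the dump rather than starting empty (a sketch; replace redis-pod with the actual pod name):
# A non-zero key count suggests the RDB snapshot was loaded
kubectl exec -it redis-pod -n veza-production -- redis-cli DBSIZE
kubectl exec -it redis-pod -n veza-production -- redis-cli INFO keyspace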
Verification Checklist
- DR cluster is healthy
- Secrets restored
- Database restored and verified
- All applications deployed
- Ingress configured
- DNS updated
- Health checks passing
- Smoke tests passing
- Users can access platform
- Monitoring configured
Post-Failover Tasks
Immediate (First Hour)
- Monitor Platform
  - Watch application logs
  - Monitor error rates
  - Check performance metrics
- Notify Stakeholders
  - Send status update
  - Update status page
  - Communicate expected timeline
Short Term (First Day)
- Investigate Primary Cluster
  - Assess damage
  - Identify root cause
  - Estimate recovery time
- Optimize DR Cluster
  - Scale resources if needed
  - Optimize configurations
  - Monitor performance
Long Term (Recovery Phase)
- Restore Primary Cluster
  - Fix issues in primary
  - Restore from backups
  - Verify functionality
- Plan Failback
  - Schedule maintenance window
  - Prepare failback procedure
  - Test failback process
Failback Procedure
Once the primary cluster is restored:
# 1. Sync data from DR to primary
# (Use database replication or restore from DR backup)
# 2. Verify primary cluster
kubectl config use-context veza-primary-cluster
kubectl get pods -n veza-production
# 3. Update DNS back to primary
# (Reverse of Step 6 in failover)
# 4. Monitor both clusters during transition
# 5. Once verified, scale down DR cluster
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production --context=veza-dr-cluster
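For step 1, one low-tech option is a fresh logical dump streamed from DR into the primary during the maintenance window (a sketch; it assumes writes to DR are stopped first and that the postgres-pod placeholders are replaced with real pod names):
# Dump the DR database to the operator host, then restore it into the primary
kubectl --context=veza-dr-cluster exec postgres-pod -n veza-production -- \
  pg_dump -U veza_user -F c veza_db > /tmp/dr-final.dump
kubectl --context=veza-primary-cluster cp /tmp/dr-final.dump postgres-pod:/tmp/dr-final.dump -n veza-production
kubectl --context=veza-primary-cluster exec postgres-pod -n veza-production -- \
  pg_restore -U veza_user -d veza_db --clean --if-exists /tmp/dr-final.dump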
Troubleshooting
Database Restore Fails
# Check backup file integrity
pg_restore --list /backups/latest.dump
# Try restoring specific tables
pg_restore -h postgres-service -U veza_user -d veza_db \
-t users -t tracks /backups/latest.dump
# Check PostgreSQL logs
kubectl logs postgres-pod -n veza-production
Applications Not Starting
# Check pod status
kubectl describe pod <pod-name> -n veza-production
# Check logs
kubectl logs <pod-name> -n veza-production
# Verify secrets
kubectl get secret veza-secrets -n veza-production -o yaml
# Check resource constraints
kubectl top nodes
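describe and logs can miss scheduler and image-pull failures; the namespace event stream usually names the culprit directly (a sketch):
# Recent events, most recent last
kubectl get events -n veza-production --sort-by='.lastTimestamp' | tail -n 20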
DNS Not Propagating
# Check DNS records
dig api.veza.com +short
nslookup api.veza.com
# Verify ingress IP
kubectl get ingress veza-ingress -n veza-production
# Check DNS provider status
# (AWS Route53, Cloudflare, etc.)
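To tell a propagation delay apart from a bad record, bypass caches and query the authoritative chain directly (a sketch):
# Follow the delegation from the root servers to the authoritative answer
dig +trace api.veza.com
# Ask a specific public resolver instead of the local cache
dig api.veza.com @8.8.8.8 +short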