Completes Day 2 of the v1.0.3 → v1.0.4 cleanup sprint. The documentation now describes the actual repo layout instead of a fictional one.

CLAUDE.md — complete rewrite
The old version referenced paths that don't exist and a protocol aimed at implementing v0.11.0 (current tag: v1.0.3). The agent was following a map for a city that had been rebuilt.
- backend/ → veza-backend-api/
- frontend/ → apps/web/
- ORIGIN/ (root) → veza-docs/ORIGIN/
- veza-chat-server → merged into backend-api (v0.502, commit 279a10d31)
- apps/desktop/ → never existed
Also refreshed: stack versions (Go 1.25, Vite 5, React 18.2, Axum 0.8), commands, conventions, hook bypasses (SKIP_TYPES/SKIP_TESTS/SKIP_E2E). Scope rules kept as immutable: no AI/ML, no Web3, no gamification, no dark patterns, no public popularity metrics.

README.md — targeted fixes
- "Version cible: v0.101" → "Version courante: v1.0.4"
- "Development Setup (v0.9.3)" → "Development Setup"
- Removed Desktop (Electron) section — never implemented
- Removed veza-chat-server from structure — merged into backend
- Removed deprecated compose files section (nothing is DEPRECATED now)

k8s runbooks — remove stale chat-server references
The disaster-recovery runbooks still scaled/restarted a deployment that no longer exists. In a real failover these commands would have failed silently and blocked the procedure. Files patched:
- k8s/disaster-recovery/runbooks/cluster-failover.md
- k8s/disaster-recovery/runbooks/data-restore.md
- k8s/disaster-recovery/runbooks/database-failover.md
- k8s/disaster-recovery/runbooks/rollback-procedure.md
- k8s/network-policies/README.md
- k8s/secrets/README.md
- k8s/secrets.yaml.example
Each reference is replaced by a short inline note pointing to v0.502 (commit 279a10d31) so future readers understand the history.

.env.example — remove CHAT_JWT_SECRET
Legacy env var for the deleted chat server. Replaced by an explanatory comment.

Not in this commit (user handles on Forgejo):
- Closing the 5 open dependabot PRs on veza-chat-server/* branches
- Deleting those 5 remote branches after the PRs are closed

Refs: AUDIT_REPORT.md §5.1, §7.1, §10 P1, §10 P4
Database Failover Runbook
This runbook describes the procedure for failing over from a primary PostgreSQL database to a standby replica.
Prerequisites
- Standby replica configured and synchronized
- Access to Kubernetes cluster
- Database credentials in Vault/Secrets
- Monitoring alerts configured
Detection
Automatic Detection
Monitoring alerts will trigger when:
- Primary database is unreachable
- Replication lag exceeds threshold
- Health checks fail
Manual Detection
# Check primary database status
kubectl exec -it postgres-primary -n veza-production -- pg_isready
# Check replication status from the standby's side (pg_stat_replication is only
# populated on the primary; pg_stat_wal_receiver works even when the primary is down)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"
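When paging is in doubt, the two manual checks can be wrapped into one ad-hoc probe. A minimal sketch, assuming the pod names used in this runbook and the 60-second lag threshold from Step 1:
# Ad-hoc probe mirroring the alert conditions above
if ! kubectl exec postgres-primary -n veza-production -- pg_isready -q; then
  echo "ALERT: primary database unreachable"
fi
LAG=$(kubectl exec postgres-standby -n veza-production -- \
  psql -U postgres -tA -c \
  "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0)::int;")
if [ "$LAG" -gt 60 ]; then
  echo "ALERT: replication lag ${LAG}s exceeds threshold"
fi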
Failover Procedure
Step 1: Verify Standby Status
# Check standby is synchronized
kubectl exec -it postgres-standby -n veza-production -- \
psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"
# Verify replication lag
kubectl exec -it postgres-standby -n veza-production -- \
psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"
Expected: Lag should be < 60 seconds
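Before promoting, both checks can be combined into a single go/no-go gate. A sketch under the same assumptions (pod name, 60-second threshold); an abort means stay on Step 1 and investigate:
#!/usr/bin/env bash
set -euo pipefail
q() { kubectl exec postgres-standby -n veza-production -- psql -U postgres -tA -c "$1"; }
RECV=$(q "SELECT pg_last_wal_receive_lsn();")
REPL=$(q "SELECT pg_last_wal_replay_lsn();")
LAG=$(q "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0)::int;")
if [ "$RECV" != "$REPL" ]; then
  echo "ABORT: unreplayed WAL on standby (received $RECV, replayed $REPL)"; exit 1
fi
if [ "$LAG" -ge 60 ]; then
  echo "ABORT: replay lag ${LAG}s is over the 60s threshold"; exit 1
fi
echo "Standby is in sync; safe to promote"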
Step 2: Promote Standby to Primary
# Promote standby ($PGDATA must resolve inside the pod, hence the sh -c wrapper)
kubectl exec -it postgres-standby -n veza-production -- \
  sh -c 'pg_ctl promote -D "$PGDATA"'
# Verify promotion
kubectl exec -it postgres-standby -n veza-production -- \
psql -U postgres -c "SELECT pg_is_in_recovery();"
Expected: Returns false (no longer in recovery mode)
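Promotion is fast but not instantaneous, so a short wait loop avoids racing Step 3. A sketch (psql -tA prints t/f):
# Poll until the promoted node reports it has left recovery
until [ "$(kubectl exec postgres-standby -n veza-production -- \
  psql -U postgres -tA -c 'SELECT pg_is_in_recovery();')" = "f" ]; do
  echo "still in recovery, retrying..."
  sleep 2
done
echo "promotion complete"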
Step 3: Update Service Endpoint
# Update postgres service to point to new primary
kubectl patch service postgres -n veza-production \
-p '{"spec":{"selector":{"role":"primary"}}}'
# Alternatively, inspect the standby pod's labels to build a matching selector
kubectl get pod postgres-standby -n veza-production -o json | \
  jq -r '.metadata.labels | to_entries | map("\(.key)=\(.value)") | join(",")'
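Note that the patched selector (role=primary) only matches pods that actually carry that label. Assuming the promoted pod still has its old role=standby label, relabel it so the Service routes to it:
# Relabel the promoted pod so the role=primary selector matches it
kubectl label pod postgres-standby -n veza-production role=primary --overwrite
# Confirm the Service now has a ready endpoint
kubectl get endpoints postgres -n veza-production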
Step 4: Restart Application Pods
# Restart to pick up new database connection
# (backend-api handles chat since v0.502 merge — no separate chat-server deployment)
kubectl rollout restart deployment/veza-backend-api -n veza-production
# Verify pods are healthy
kubectl rollout status deployment/veza-backend-api -n veza-production
Step 5: Verify Application Health
# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production
# Test database connectivity (DATABASE_URL must expand inside the pod, hence the
# sh -c wrapper; assumes psql is available in the backend-api image)
kubectl exec -it deployment/veza-backend-api -n veza-production -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1;"'
# Check health endpoint
curl https://api.veza.com/health
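Right after the restart the endpoint may fail briefly while pods cycle, so a retry loop gives a cleaner signal than a single curl. A sketch with assumed timeout values:
# Wait up to 2 minutes for the health endpoint to answer
for i in $(seq 1 24); do
  curl -fsS https://api.veza.com/health >/dev/null && { echo "API healthy"; break; }
  [ "$i" -eq 24 ] && echo "API still unhealthy after 2 minutes"
  sleep 5
done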
Step 6: Set Up New Standby
# Once primary is recovered, set up new standby
# Follow PostgreSQL replication setup guide
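For reference, a minimal sketch of the rebuild using pg_basebackup. The replicator role and the service DNS name are assumptions, and the old primary's PostgreSQL must be stopped first (e.g., run this from a maintenance pod mounting its data volume); the replication setup guide remains authoritative:
# Clone the new primary into the empty data directory of the future standby
pg_basebackup \
  -h postgres.veza-production.svc.cluster.local \
  -U replicator \
  -D "$PGDATA" \
  -X stream -P -R
# -R writes standby.signal plus primary_conninfo (PostgreSQL 12+), so the node
# starts up as a streaming standby of the new primary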
Rollback Procedure
If failover was incorrect or primary recovers:
# Stop applications
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production
# Revert service endpoint
kubectl patch service postgres -n veza-production \
-p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'
# Restart applications
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
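After rolling back, the same verifications as Steps 3-5 apply; the quick version:
# Confirm the selector reverted and the original primary is serving again
kubectl get service postgres -n veza-production -o jsonpath='{.spec.selector}'; echo
kubectl exec postgres-primary -n veza-production -- pg_isready
kubectl rollout status deployment/veza-backend-api -n veza-production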
Verification Checklist
- Standby promoted successfully
- Service endpoint updated
- Application pods restarted
- Database connectivity verified
- Application health checks passing
- No data loss detected
- Monitoring alerts cleared
- Documentation updated
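The first five items lend themselves to a single spot-check script. A sketch reusing the commands from Steps 2-5 (names as used throughout this runbook):
#!/usr/bin/env bash
set -euo pipefail
NS=veza-production
# Standby promoted: the promoted node must report 'f' (not in recovery)
test "$(kubectl exec postgres-standby -n "$NS" -- psql -U postgres -tA -c 'SELECT pg_is_in_recovery();')" = "f"
# Service endpoint updated: selector should show role=primary
kubectl get service postgres -n "$NS" -o jsonpath='{.spec.selector}'; echo
# Application pods restarted and healthy
kubectl rollout status deployment/veza-backend-api -n "$NS" --timeout=120s
# Database connectivity and health endpoint
kubectl exec deployment/veza-backend-api -n "$NS" -- sh -c 'psql "$DATABASE_URL" -c "SELECT 1;"'
curl -fsS https://api.veza.com/health >/dev/null && echo "all checks passed"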
Troubleshooting
Standby Not Synchronized
# Check replication status on the current primary (after failover that is the
# promoted postgres-standby pod; pg_stat_replication lists attached standbys)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_replication;"
# If replication is broken, rebuild standby
# (See PostgreSQL replication setup guide)
Application Cannot Connect
# Verify service selector
kubectl get service postgres -n veza-production -o yaml
# Check pod labels
kubectl get pod postgres-standby -n veza-production --show-labels
# Verify network connectivity
kubectl run test-connection --rm -it --image=postgres:15-alpine \
--restart=Never \
-- psql -h postgres.veza-production.svc.cluster.local -U veza_user -d veza_db
Post-Failover Tasks
1. Investigate Root Cause
   - Review primary database logs
   - Check system resources
   - Identify failure reason
2. Set Up New Standby
   - Configure replication from new primary
   - Verify synchronization
   - Update monitoring
3. Document Incident
   - Document failover procedure
   - Note any issues encountered
   - Update runbook if needed
4. Notify Stakeholders
   - Send incident report
   - Update status page
   - Schedule post-mortem