# Cluster Failover Runbook

This runbook describes the procedure for failing over to a disaster recovery (DR) region when the primary cluster is completely unavailable.

## Prerequisites

- DR cluster provisioned and ready
- Backups available in DR region
- DNS access for failover
- Access to both primary and DR clusters
- Disaster declared and approved

## Pre-Failover Checklist

- [ ] Disaster declared and documented
- [ ] Stakeholders notified
- [ ] DR cluster resources verified
- [ ] Latest backups available in DR
- [ ] DNS access confirmed
- [ ] Team assembled and ready

## Failover Procedure

### Step 1: Verify DR Cluster Status

```bash
# Switch kubectl context to DR cluster
kubectl config use-context veza-dr-cluster

# Verify cluster is healthy
kubectl cluster-info
kubectl get nodes

# Verify namespaces exist
kubectl get namespaces | grep veza
```
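
To make this verification scriptable rather than eyeballed, the check can fail fast unless every DR node reports `Ready`. A minimal sketch; `all_nodes_ready` is a helper name introduced here, not an existing command:

```shell
# all_nodes_ready reads `kubectl get nodes --no-headers` output on stdin and
# fails if any node's STATUS column is not exactly "Ready" (this also catches
# "NotReady" and "Ready,SchedulingDisabled").
all_nodes_ready() {
  local not_ready
  not_ready=$(awk '$2 != "Ready" {print $1}')
  if [ -n "$not_ready" ]; then
    echo "Nodes not ready: $not_ready" >&2
    return 1
  fi
}

# Against the live DR cluster:
#   kubectl get nodes --no-headers | all_nodes_ready || exit 1
```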

### Step 2: Restore Secrets

```bash
# Restore secrets from Vault or backup

# Option A: From Vault
kubectl create secret generic veza-secrets \
  --from-literal=database-url="$(vault kv get -field=database-url secret/veza/production)" \
  --from-literal=jwt-secret="$(vault kv get -field=jwt-secret secret/veza/production)" \
  --from-literal=redis-url="$(vault kv get -field=redis-url secret/veza/production)" \
  -n veza-production \
  --dry-run=client -o yaml | kubectl apply -f -

# Option B: From backup file
kubectl create secret generic veza-secrets \
  --from-env-file=secrets-backup.env \
  -n veza-production
```
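
Before deploying anything, it is worth confirming the restored secret actually contains every key the applications read. A sketch, under the assumption that the three keys above are the full required set; `missing_keys` is a helper name introduced here:

```shell
# Keys the applications are known to read (extend as needed).
required_keys="database-url jwt-secret redis-url"

# missing_keys PRESENT — PRESENT is a whitespace-separated list of key names
# found in the secret; prints any required key that is absent.
missing_keys() {
  local present="$1" missing="" key
  for key in $required_keys; do
    case " $present " in
      *" $key "*) ;;
      *) missing="$missing $key" ;;
    esac
  done
  echo "$missing" | xargs
}

# Against the live cluster:
#   present=$(kubectl get secret veza-secrets -n veza-production \
#     -o go-template='{{range $k, $v := .data}}{{$k}} {{end}}')
#   [ -z "$(missing_keys "$present")" ] || echo "Secret incomplete: $(missing_keys "$present")"
```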

### Step 3: Restore Database

```bash
# 1. Deploy PostgreSQL in DR cluster
kubectl apply -f k8s/database/postgres-deployment.yaml -n veza-production

# 2. Wait for PostgreSQL to be ready
kubectl wait --for=condition=ready pod \
  -l app=postgres \
  -n veza-production \
  --timeout=300s

# 3. Restore from latest backup
# Get backup from S3 or backup storage
aws s3 cp s3://veza-backups/postgres/latest.dump /tmp/latest.dump

# Restore database
# The hostPath volume below exposes the node's /tmp as /backups, so the dump
# must be present on the node where this pod is scheduled
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=..." \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "postgres-restore",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/latest.dump --clean --if-exists"],
      "volumeMounts": [{
        "name": "backup",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup",
      "hostPath": {
        "path": "/tmp"
      }
    }]
  }
}' \
  -n veza-production

# 4. Verify database restore
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"
```
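
A restore that exits successfully can still leave a critical table empty, so the verification query is worth turning into an explicit non-empty check. A sketch; `row_count_ok` is a helper name introduced here, and `users`/`tracks` are assumed to be the critical tables:

```shell
# row_count_ok OUTPUT — OUTPUT is the raw result of
# `psql -t -c "SELECT COUNT(*) ..."`; succeeds only if it parses to a
# number greater than zero.
row_count_ok() {
  local n
  n=$(echo "$1" | tr -d '[:space:]')
  [ -n "$n" ] && [ "$n" -gt 0 ] 2>/dev/null
}

# Against the live cluster:
#   for t in users tracks; do
#     out=$(kubectl exec postgres-pod -n veza-production -- \
#       psql -U veza_user -d veza_db -t -c "SELECT COUNT(*) FROM $t;")
#     row_count_ok "$out" || echo "WARNING: table $t looks empty after restore"
#   done
```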

### Step 4: Deploy Applications

```bash
# Deploy backend API (includes chat since v0.502 merge)
kubectl apply -f k8s/backend-api/deployment.yaml -n veza-production
kubectl apply -f k8s/backend-api/service.yaml -n veza-production

# Deploy frontend
kubectl apply -f k8s/frontend/deployment.yaml -n veza-production
kubectl apply -f k8s/frontend/service.yaml -n veza-production

# Deploy stream server
kubectl apply -f k8s/stream-server/deployment.yaml -n veza-production
kubectl apply -f k8s/stream-server/service.yaml -n veza-production

# Wait for deployments
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-frontend -n veza-production
kubectl rollout status deployment/veza-stream-server -n veza-production
```
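
During a failover it helps to learn about every failed rollout in one pass rather than stopping at the first. A sketch; `wait_all` and `rollout_check` are helper names introduced here, with the checker passed in as an argument so it can be swapped out:

```shell
deployments="veza-backend-api veza-frontend veza-stream-server"

# wait_all CHECK — runs CHECK with each deployment name as its argument,
# collects the failures, and returns non-zero if any rollout did not finish.
wait_all() {
  local check="$1" d failed=""
  for d in $deployments; do
    $check "$d" || failed="$failed $d"
  done
  if [ -n "$failed" ]; then
    echo "Rollouts failed:$failed" >&2
    return 1
  fi
}

rollout_check() {
  kubectl rollout status "deployment/$1" -n veza-production --timeout=300s
}

# Against the live cluster:
#   wait_all rollout_check || exit 1
```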

### Step 5: Configure Ingress

```bash
# Deploy ingress
kubectl apply -f k8s/ingress.yaml -n veza-production

# Verify ingress
kubectl get ingress -n veza-production
```

### Step 6: Update DNS

```bash
# Get DR cluster ingress IP
DR_INGRESS_IP=$(kubectl get ingress veza-ingress -n veza-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Update DNS records
# Option A: Using AWS Route53
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.veza.com",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{"Value": "'$DR_INGRESS_IP'"}]
      }
    }]
  }'

# Option B: Using Cloudflare
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
  -H "Authorization: Bearer $CLOUDFLARE_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"content":"'$DR_INGRESS_IP'"}'

# Wait for DNS propagation
dig api.veza.com +short
```
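
With a TTL of 300 seconds, resolvers can keep returning the old IP for several minutes, so a single `dig` proves little. A polling sketch; `wait_for_dns` and `resolve_api` are names introduced here, with the resolver passed in as an argument:

```shell
# wait_for_dns RESOLVE EXPECTED [ATTEMPTS] [DELAY] — calls RESOLVE until it
# prints EXPECTED, sleeping DELAY seconds between attempts.
wait_for_dns() {
  local resolve="$1" expected="$2" attempts="${3:-30}" delay="${4:-10}" i=0 current=""
  while [ "$i" -lt "$attempts" ]; do
    current=$($resolve)
    if [ "$current" = "$expected" ]; then
      echo "DNS now points at $expected"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "DNS still resolves to '$current' after $attempts attempts" >&2
  return 1
}

resolve_api() { dig +short api.veza.com | head -n1; }

# Usage:
#   wait_for_dns resolve_api "$DR_INGRESS_IP" 60 10
```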

### Step 7: Verify Services

```bash
# Check all pods are running
kubectl get pods -n veza-production

# Test health endpoints
curl https://api.veza.com/health
curl https://app.veza.com/health

# Run smoke tests
# (Use your application's test suite)

# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production
```
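
Right after the DNS cutover a single `curl` may fail transiently while pods warm up, so the health checks are better expressed as a bounded retry. A sketch; `probe_until_healthy` and `api_healthy` are names introduced here:

```shell
# probe_until_healthy CMD [ATTEMPTS] [DELAY] — retries CMD (which must exit 0
# when the endpoint is healthy) until it succeeds or attempts run out.
probe_until_healthy() {
  local cmd="$1" attempts="${2:-12}" delay="${3:-10}" i=0
  while [ "$i" -lt "$attempts" ]; do
    if $cmd; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

api_healthy() { curl -fsS https://api.veza.com/health >/dev/null; }

# Usage:
#   probe_until_healthy api_healthy || echo "API still unhealthy after cutover"
```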

### Step 8: Restore Redis (if needed)

```bash
# Deploy Redis
kubectl apply -f k8s/redis/deployment.yaml -n veza-production

# Restore Redis backup if available
kubectl cp redis-backup.rdb redis-pod:/data/dump.rdb -n veza-production
kubectl delete pod redis-pod -n veza-production  # Restart to load backup
```
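
Redis only loads `dump.rdb` at startup, and an empty instance also starts cleanly, so the restart alone proves nothing. A quick check on `DBSIZE` confirms the dump was picked up; a sketch, with `dbsize_ok` introduced here:

```shell
# dbsize_ok OUTPUT — OUTPUT is the result of `redis-cli DBSIZE`, which looks
# like "(integer) 1042" (or just "1042"); succeeds only if the count is > 0.
dbsize_ok() {
  local n
  n=$(echo "$1" | sed 's/[^0-9]//g')
  [ -n "$n" ] && [ "$n" -gt 0 ]
}

# Against the live cluster:
#   out=$(kubectl exec redis-pod -n veza-production -- redis-cli DBSIZE)
#   dbsize_ok "$out" || echo "WARNING: Redis is empty after restore"
```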

## Verification Checklist

- [ ] DR cluster is healthy
- [ ] Secrets restored
- [ ] Database restored and verified
- [ ] All applications deployed
- [ ] Ingress configured
- [ ] DNS updated
- [ ] Health checks passing
- [ ] Smoke tests passing
- [ ] Users can access platform
- [ ] Monitoring configured

## Post-Failover Tasks

### Immediate (First Hour)

1. **Monitor Platform**
   - Watch application logs
   - Monitor error rates
   - Check performance metrics

2. **Notify Stakeholders**
   - Send status update
   - Update status page
   - Communicate expected timeline

### Short Term (First Day)

1. **Investigate Primary Cluster**
   - Assess damage
   - Identify root cause
   - Estimate recovery time

2. **Optimize DR Cluster**
   - Scale resources if needed
   - Optimize configurations
   - Monitor performance

### Long Term (Recovery Phase)

1. **Restore Primary Cluster**
   - Fix issues in primary
   - Restore from backups
   - Verify functionality

2. **Plan Failback**
   - Schedule maintenance window
   - Prepare failback procedure
   - Test failback process

## Failback Procedure

Once the primary cluster is restored:

```bash
# 1. Sync data from DR to primary
# (Use database replication or restore from DR backup)

# 2. Verify primary cluster
kubectl config use-context veza-primary-cluster
kubectl get pods -n veza-production

# 3. Update DNS back to primary
# (Reverse of Step 6 in failover)

# 4. Monitor both clusters during transition

# 5. Once verified, scale down DR cluster
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production --context=veza-dr-cluster
```

## Troubleshooting

### Database Restore Fails

```bash
# Check backup file integrity
pg_restore --list /backups/latest.dump

# Try restoring specific tables
pg_restore -h postgres-service -U veza_user -d veza_db \
  -t users -t tracks /backups/latest.dump

# Check PostgreSQL logs
kubectl logs postgres-pod -n veza-production
```

### Applications Not Starting

```bash
# Check pod status
kubectl describe pod <pod-name> -n veza-production

# Check logs
kubectl logs <pod-name> -n veza-production

# Verify secrets
kubectl get secret veza-secrets -n veza-production -o yaml

# Check resource constraints
kubectl top nodes
```

### DNS Not Propagating

```bash
# Check DNS records
dig api.veza.com +short
nslookup api.veza.com

# Verify ingress IP
kubectl get ingress veza-ingress -n veza-production

# Check DNS provider status
# (AWS Route53, Cloudflare, etc.)
```

## References

- [Database Restore Runbook](./data-restore.md)
- [Kubernetes Multi-Cluster Setup](https://kubernetes.io/docs/setup/)
- [DNS Management Best Practices](../README.md)