2025-12-25 20:40:31 +00:00

# Database Failover Runbook

This runbook describes the procedure for failing over from a primary PostgreSQL database to a standby replica.

## Prerequisites

- Standby replica configured and synchronized
- Access to Kubernetes cluster
- Database credentials in Vault/Secrets
- Monitoring alerts configured

## Detection

### Automatic Detection

Monitoring alerts will trigger when:

docs(J2): align docs with reality — rewrite CLAUDE.md, fix README, purge chat-server refs
Completes Day 2 of the v1.0.3 → v1.0.4 cleanup sprint. The documentation
now describes the actual repo layout instead of a fictional one.
CLAUDE.md — complete rewrite
Old version referenced paths that don't exist and a protocol aimed at
implementing v0.11.0 (current tag: v1.0.3). The agent was following a
map for a city that had been rebuilt.
- backend/ → veza-backend-api/
- frontend/ → apps/web/
- ORIGIN/ (root) → veza-docs/ORIGIN/
- veza-chat-server → merged into backend-api (v0.502, commit 279a10d31)
- apps/desktop/ → never existed
Also refreshed: stack versions (Go 1.25, Vite 5, React 18.2, Axum 0.8),
commands, conventions, hook bypasses (SKIP_TYPES/SKIP_TESTS/SKIP_E2E),
scope rules kept as immutable (no AI/ML, no Web3, no gamification, no
dark patterns, no public popularity metrics).
README.md — targeted fixes
- "Version cible: v0.101" → "Version courante: v1.0.4"
- "Development Setup (v0.9.3)" → "Development Setup"
- Removed Desktop (Electron) section — never implemented
- Removed veza-chat-server from structure — merged into backend
- Removed deprecated compose files section (nothing is DEPRECATED now)
k8s runbooks — remove stale chat-server references
The disaster-recovery runbooks still scaled/restarted a deployment
that no longer exists. In a real failover these commands would have
failed silently and blocked the procedure. Files patched:
- k8s/disaster-recovery/runbooks/cluster-failover.md
- k8s/disaster-recovery/runbooks/data-restore.md
- k8s/disaster-recovery/runbooks/database-failover.md
- k8s/disaster-recovery/runbooks/rollback-procedure.md
- k8s/network-policies/README.md
- k8s/secrets/README.md
- k8s/secrets.yaml.example
Each reference is replaced by a short inline note pointing to v0.502
(commit 279a10d31) so future readers understand the history.
.env.example — remove CHAT_JWT_SECRET
Legacy env var for the deleted chat server. Replaced by an explanatory
comment.
Not in this commit (user handles on Forgejo):
- Closing the 5 open dependabot PRs on veza-chat-server/* branches
- Deleting those 5 remote branches after the PRs are closed
Refs: AUDIT_REPORT.md §5.1, §7.1, §10 P1, §10 P4
2026-04-14 15:23:50 +00:00

- Primary database is unreachable
- Replication lag exceeds threshold
- Health checks fail

### Manual Detection

```bash
# Check primary database status
kubectl exec -it postgres-primary -n veza-production -- pg_isready

# Check replication status from the standby side
# (pg_stat_replication is only populated on the sending server, so query
# pg_stat_wal_receiver when connected to the standby)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"
```

## Failover Procedure

### Step 1: Verify Standby Status

```bash
# Check standby is synchronized (received vs. replayed WAL positions)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Verify replication lag
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"
```

**Expected**: Lag should be under 60 seconds. The value is measured from the last replayed transaction, so it also grows while the primary is idle or already down.

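
The 60-second expectation above can be turned into a scripted gate. The following is a minimal sketch, not part of the existing tooling: `lag_ok` is a hypothetical helper name, and the default threshold simply mirrors the value stated in this step.

```bash
# Hypothetical helper: succeeds when LAG_SECONDS is below THRESHOLD (default 60).
lag_ok() {
  local lag="$1" threshold="${2:-60}"
  # An empty value means the query returned NULL (nothing replayed yet);
  # treat that as "not safe" rather than as zero lag.
  [ -n "$lag" ] || return 1
  # awk handles fractional seconds, which plain shell arithmetic cannot.
  awk -v lag="$lag" -v max="$threshold" 'BEGIN { exit !(lag + 0 < max + 0) }'
}

# Example use (assumes the pod and namespace names used in this runbook):
# lag=$(kubectl exec postgres-standby -n veza-production -- \
#   psql -U postgres -tAc "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));")
# lag_ok "$lag" 60 && echo "standby is fresh enough to promote"
```

The `-tA` flags make `psql` print only the bare value, which keeps the comparison free of header and alignment noise.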
### Step 2: Promote Standby to Primary

```bash
# Promote standby
# (pg_ctl needs PGDATA set and refuses to run as root; on PostgreSQL 12+,
#  `psql -U postgres -c "SELECT pg_promote();"` is an equivalent alternative)
kubectl exec -it postgres-standby -n veza-production -- \
  pg_ctl promote

# Verify promotion
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"
```

**Expected**: Returns `f` (false): the server is no longer in recovery mode.

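
Promotion is not instantaneous, so it may help to poll the recovery flag rather than check it once. This is a rough sketch under the assumption that the pod and namespace names match the ones used in this runbook; `is_promoted` and `wait_for_promotion` are hypothetical helper names.

```bash
# Hypothetical helper: interprets the output of
#   psql -tAc "SELECT pg_is_in_recovery();"
# which prints a bare `t` or `f`.
is_promoted() {
  case "$(printf '%s' "$1" | tr -d '[:space:]')" in
    f) return 0 ;;  # not in recovery: the server accepts writes
    *) return 1 ;;  # 't', empty, or unexpected output: treat as not promoted
  esac
}

# Hypothetical helper: polls the standby until it leaves recovery.
# Arguments: number of attempts (default 30), delay in seconds (default 2).
wait_for_promotion() {
  local attempts="${1:-30}" delay="${2:-2}" i=0 out
  while [ "$i" -lt "$attempts" ]; do
    out=$(kubectl exec postgres-standby -n veza-production -- \
      psql -U postgres -tAc "SELECT pg_is_in_recovery();" 2>/dev/null) || out=""
    if is_promoted "$out"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Calling `wait_for_promotion 30 2` right after the promote command gives the standby up to about a minute before the procedure is treated as failed.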
### Step 3: Update Service Endpoint

```bash
# Update postgres service to point to new primary
# (the promoted pod must carry the role=primary label for this selector to match)
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary"}}}'

# Or list the standby pod's labels as key=value pairs and use them to build
# a selector that matches the standby pod
kubectl get pod postgres-standby -n veza-production -o json | \
  jq -r '.metadata.labels | to_entries | map("\(.key)=\(.value)") | join(",")'
```

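
A selector of `role=primary` only routes traffic to the promoted pod if that pod actually carries the label. The sketch below shows one way to swap the labels; it assumes the Service selects on a `role` label as the patch above implies, and `promote_labels` and the `role=demoted` value are hypothetical names chosen for illustration.

```bash
# Hypothetical helper: moves the role=primary label to the promoted pod.
# Argument: namespace (defaults to the one used in this runbook).
promote_labels() {
  local ns="${1:-veza-production}"
  # Relabel the old primary first so the Service never matches both pods.
  # The old pod may already be gone, so a failure here is tolerated.
  kubectl label pod postgres-primary -n "$ns" role=demoted --overwrite || true
  kubectl label pod postgres-standby -n "$ns" role=primary --overwrite
}
```

Labels set this way live on the pod object, so a pod recreated by its controller comes back with the labels from its template. Running `kubectl get pod -n veza-production --show-labels` afterwards confirms which pod the selector now matches.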
### Step 4: Restart Application Pods

```bash
# Restart to pick up new database connection
# (backend-api handles chat since v0.502 merge; no separate chat-server deployment)
kubectl rollout restart deployment/veza-backend-api -n veza-production

# Verify pods are healthy
kubectl rollout status deployment/veza-backend-api -n veza-production
```

### Step 5: Verify Application Health

```bash
# Check application logs (Ctrl-C to stop following)
kubectl logs -f deployment/veza-backend-api -n veza-production

# Test database connectivity
# (run through `sh -c` with single quotes so $DATABASE_URL is expanded inside
# the pod, not by the local shell; assumes psql is installed in the image)
kubectl exec -it deployment/veza-backend-api -n veza-production -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1;"'

# Check health endpoint
curl https://api.veza.com/health
```

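
Right after a rollout the health endpoint can fail for a few seconds, so a single `curl` may give a misleading result. A small retry loop is sketched here; `wait_for_health` is a hypothetical helper, and the attempt count and delay are illustrative defaults rather than values taken from this runbook.

```bash
# Hypothetical helper: polls URL until it returns a successful HTTP status.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 5).
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-5}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors (>= 400) as failures; -sS: quiet, but still print errors
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example use:
# wait_for_health https://api.veza.com/health 30 5 || echo "API still unhealthy"
```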
### Step 6: Set Up New Standby

```bash
# Once primary is recovered, set up new standby
# Follow PostgreSQL replication setup guide
```

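
For orientation, re-seeding a standby from the new primary is commonly done with `pg_basebackup`. The sketch below is one possible shape, not this project's documented procedure: `rebuild_standby`, the `replicator` role, and the host argument are all assumptions, and the function is meant to run on the machine or container that will host the new standby, with PostgreSQL stopped.

```bash
# Hypothetical helper: seeds an empty data directory from the new primary.
# Arguments: primary host, replication user (default "replicator", an assumed
# role name), data directory (default $PGDATA).
rebuild_standby() {
  local primary_host="$1" repl_user="${2:-replicator}" data_dir="${3:-${PGDATA:-}}"
  [ -n "$primary_host" ] && [ -n "$data_dir" ] || {
    echo "usage: rebuild_standby PRIMARY_HOST [REPL_USER] [DATA_DIR]" >&2
    return 2
  }
  # pg_basebackup requires an empty target directory; refuse to clobber data.
  if [ -d "$data_dir" ] && [ -n "$(ls -A "$data_dir")" ]; then
    echo "refusing to overwrite non-empty $data_dir" >&2
    return 1
  fi
  # -R writes standby.signal and primary_conninfo (PostgreSQL 12+),
  # -X stream copies WAL while the backup runs, -P shows progress.
  pg_basebackup -h "$primary_host" -U "$repl_user" -D "$data_dir" -R -X stream -P
}
```

Once the server is started on the copied directory, the checks from Step 1 can be reused to confirm that the new standby is receiving and replaying WAL.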
## Rollback Procedure

If the failover was triggered in error, or the original primary has recovered and should serve traffic again:

```bash
# Stop applications
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# Revert service endpoint
# (writes accepted by the promoted standby are not present on the old primary;
# confirm none need to be preserved before switching back)
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'

# Restart applications
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```

## Verification Checklist

- [ ] Standby promoted successfully
- [ ] Service endpoint updated
- [ ] Application pods restarted
- [ ] Database connectivity verified
- [ ] Application health checks passing
- [ ] No data loss detected
- [ ] Monitoring alerts cleared
- [ ] Documentation updated

## Troubleshooting

### Standby Not Synchronized

```bash
# Check replication status
# (on the standby, query pg_stat_wal_receiver; pg_stat_replication is only
# populated on the primary)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"

# If replication is broken, rebuild standby
# (See PostgreSQL replication setup guide)
```

### Application Cannot Connect

```bash
# Verify service selector
kubectl get service postgres -n veza-production -o yaml

# Check pod labels
kubectl get pod postgres-standby -n veza-production --show-labels

# Verify network connectivity
kubectl run test-connection --rm -it --image=postgres:15-alpine \
  --restart=Never \
  -- psql -h postgres.veza-production.svc.cluster.local -U veza_user -d veza_db
```

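
A selector that matches no pod leaves the Service with no endpoints, which shows up to the application as a connection failure. The check below is a minimal sketch of how to confirm that; `service_has_endpoints` is a hypothetical helper name, and it relies on the Endpoints object that Kubernetes maintains alongside a Service with a selector.

```bash
# Hypothetical helper: succeeds when the Service has at least one ready endpoint.
# Arguments: service name, namespace (defaults to the one used in this runbook).
service_has_endpoints() {
  local svc="$1" ns="${2:-veza-production}" ips
  ips=$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}') || return 1
  # An empty result means the selector matches no ready pod.
  [ -n "$ips" ]
}

# Example use:
# service_has_endpoints postgres || echo "postgres Service has no ready endpoints"
```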
## Post-Failover Tasks

1. **Investigate Root Cause**
   - Review primary database logs
   - Check system resources
   - Identify failure reason

2. **Set Up New Standby**
   - Configure replication from new primary
   - Verify synchronization
   - Update monitoring

3. **Document Incident**
   - Document failover procedure
   - Note any issues encountered
   - Update runbook if needed

4. **Notify Stakeholders**
   - Send incident report
   - Update status page
   - Schedule post-mortem

## References

- [PostgreSQL Replication Documentation](https://www.postgresql.org/docs/current/high-availability.html)
- [Kubernetes Service Documentation](https://kubernetes.io/docs/concepts/services-networking/service/)