Database Failover Runbook

This runbook describes the procedure for failing over from a primary PostgreSQL database to a standby replica.

Prerequisites

  • Standby replica configured and synchronized
  • Access to Kubernetes cluster
  • Database credentials in Vault/Secrets
  • Monitoring alerts configured
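
A quick pre-flight check can confirm these before starting (the secret name postgres-credentials is an assumption; use whatever your Vault/Secrets setup actually provisions):

# Confirm the standby pod and database credentials exist before proceeding
kubectl get pod postgres-standby -n veza-production
kubectl get secret postgres-credentials -n veza-production   # hypothetical secret name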

Detection

Automatic Detection

Monitoring alerts will trigger when:

  • Primary database is unreachable
  • Replication lag exceeds threshold
  • Health checks fail
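
As a sketch, the same conditions can be queried directly against the Prometheus API (the service URL and metric name are assumptions and depend on your exporter configuration):

# Query Prometheus for standby replication lag above the failover threshold
# (pg_replication_lag is a common postgres_exporter metric; adjust to your setup)
curl -s 'http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query' \
  --data-urlencode 'query=pg_replication_lag > 60'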

Manual Detection

# Check primary database status
kubectl exec -it postgres-primary -n veza-production -- pg_isready

# Check replication status
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_replication;"

Failover Procedure

Step 1: Verify Standby Status

# Check standby is synchronized
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Verify replication lag
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"

Expected: Lag should be < 60 seconds
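
A minimal gate script, assuming the pod names above, that aborts the procedure when the replay lag is over the threshold:

# Abort if replay lag exceeds the 60-second threshold
# Note: on an idle primary this reading can be high even when fully caught up
LAG=$(kubectl exec postgres-standby -n veza-production -- \
  psql -U postgres -Atc "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0);")
echo "Replay lag: ${LAG}s"
awk -v lag="$LAG" 'BEGIN { exit !(lag < 60) }' || { echo "Lag too high, do not promote"; exit 1; }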

Step 2: Promote Standby to Primary

# Promote standby (pg_ctl refuses to run as root; the single quotes make
# $PGDATA expand inside the container, not in your local shell)
kubectl exec -it postgres-standby -n veza-production -- \
  sh -c 'pg_ctl promote -D "$PGDATA"'

# Verify promotion
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"

Expected: pg_is_in_recovery() returns f (false), meaning the node is no longer in recovery mode
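
If kubectl exec lands you in the container as root, pg_ctl will refuse to run; on PostgreSQL 12+ the same promotion can be done from SQL instead:

# Alternative promotion via SQL (PostgreSQL 12+)
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT pg_promote(wait => true, wait_seconds => 60);"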

Step 3: Update Service Endpoint

# Check the standby pod's labels first and adjust the selector below to match
kubectl get pod postgres-standby -n veza-production -o jsonpath='{.metadata.labels}' | \
  jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")'

# Label the promoted standby as the new primary, then point the service at it
kubectl label pod postgres-standby -n veza-production role=primary --overwrite
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary"}}}'
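
To confirm the switch took effect, the Service's endpoints should now resolve to the promoted pod's IP:

# The endpoint IP should match the promoted standby's pod IP
kubectl get endpoints postgres -n veza-production
kubectl get pod postgres-standby -n veza-production -o jsonpath='{.status.podIP}{"\n"}'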

Step 4: Restart Application Pods

# Restart to pick up new database connection
# (backend-api handles chat since v0.502 merge — no separate chat-server deployment)
kubectl rollout restart deployment/veza-backend-api -n veza-production

# Verify pods are healthy
kubectl rollout status deployment/veza-backend-api -n veza-production

Step 5: Verify Application Health

# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production

# Test database connectivity (requires psql in the app image; single quotes so
# $DATABASE_URL expands inside the container, not in your local shell)
kubectl exec -it deployment/veza-backend-api -n veza-production -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1;"'

# Check health endpoint
curl https://api.veza.com/health

Step 6: Set Up New Standby

# Once primary is recovered, set up new standby
# Follow PostgreSQL replication setup guide
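
A minimal sketch of that rebuild, assuming streaming replication, a replication role named replicator (hypothetical), and an old data directory that can be discarded. Run it from a maintenance pod with the old primary's server stopped:

# Take a fresh base backup from the new primary and configure this node as a standby
pg_basebackup \
  -h postgres-standby.veza-production.svc.cluster.local \
  -U replicator \
  -D "$PGDATA" \
  -R -X stream -P
# -R writes standby.signal and primary_conninfo so the node starts as a replica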

Rollback Procedure

If the failover was triggered in error and the original primary is intact. Caution: once the standby has been promoted and has accepted writes, re-pointing traffic at the old primary risks split-brain and data loss.

# Stop applications
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# Re-label the demoted node and revert the service endpoint
# (assumes the original primary pod carries labels role=primary,pod=postgres-primary)
kubectl label pod postgres-standby -n veza-production role=standby --overwrite
kubectl patch service postgres -n veza-production \
  -p '{"spec":{"selector":{"role":"primary","pod":"postgres-primary"}}}'

# Restart applications
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
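
Then confirm traffic is back on the original primary:

# Endpoints should resolve to the original primary again, and it should accept connections
kubectl get endpoints postgres -n veza-production
kubectl exec -it postgres-primary -n veza-production -- pg_isready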

Verification Checklist

  • Standby promoted successfully
  • Service endpoint updated
  • Application pods restarted
  • Database connectivity verified
  • Application health checks passing
  • No data loss detected
  • Monitoring alerts cleared
  • Documentation updated

Troubleshooting

Standby Not Synchronized

# Check replication status
kubectl exec -it postgres-standby -n veza-production -- \
  psql -U postgres -c "SELECT * FROM pg_stat_replication;"

# If replication is broken, rebuild standby
# (See PostgreSQL replication setup guide)

Application Cannot Connect

# Verify service selector
kubectl get service postgres -n veza-production -o yaml

# Check pod labels
kubectl get pod postgres-standby -n veza-production --show-labels

# Verify network connectivity
kubectl run test-connection --rm -it --image=postgres:15-alpine \
  --restart=Never \
  -- psql -h postgres.veza-production.svc.cluster.local -U veza_user -d veza_db
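
It can also help to inspect the connection string the application actually sees (masking the password keeps it out of terminal history and logs):

# Print DATABASE_URL from inside an app pod with the password masked
kubectl exec deployment/veza-backend-api -n veza-production -- \
  sh -c 'echo "$DATABASE_URL"' | sed -E 's#://([^:]+):[^@]+@#://\1:***@#'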

Post-Failover Tasks

  1. Investigate Root Cause

    • Review primary database logs
    • Check system resources
    • Identify failure reason
  2. Set Up New Standby

    • Configure replication from new primary
    • Verify synchronization
    • Update monitoring
  3. Document Incident

    • Document failover procedure
    • Note any issues encountered
    • Update runbook if needed
  4. Notify Stakeholders

    • Send incident report
    • Update status page
    • Schedule post-mortem
