
Data Restore Runbook

This runbook describes the procedure for restoring data from backups after data loss or corruption.

Prerequisites

  • Access to backup storage
  • Database credentials
  • kubectl access to cluster
  • Backup file identified

Pre-Restore Checklist

  • Backup file identified and verified
  • Backup integrity checked (see the sketch after this list)
  • Restore point confirmed
  • Applications stopped (to prevent writes)
  • Current data backed up (if possible)
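
Two quick integrity checks before committing to the restore (a sketch; the .sha256 sidecar file is an assumption about what the backup job writes, while pg_restore --list works on any custom-format dump):

# Verify the checksum, if the backup job wrote one alongside the dump
sha256sum -c /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump.sha256

# Parse the archive's table of contents without touching the database;
# a truncated or corrupt custom-format dump fails here
pg_restore --list /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump > /dev/null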

Restore Procedure

Step 1: Identify Backup

# List available backups
kubectl get pvc postgres-backup-storage -n veza-production

# List backups in storage
kubectl run backup-lister --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "backup-lister",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "ls -lh /backups/postgres/"],
      "volumeMounts": [{
        "name": "backup-storage",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup-storage",
      "persistentVolumeClaim": {
        "claimName": "postgres-backup-storage"
      }
    }]
  }
}' \
  -n veza-production

# Or from S3
aws s3 ls s3://veza-backups/postgres/ --recursive | sort
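
To avoid copying a timestamp by hand, the newest dump can be picked programmatically (a sketch; assumes backup names sort lexicographically by their embedded timestamp):

# Select the most recent backup key from the sorted listing
LATEST=$(aws s3 ls s3://veza-backups/postgres/ --recursive | sort | tail -n 1 | awk '{print $4}')
echo "Most recent backup: $LATEST"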

Step 2: Stop Applications

# Scale down applications to prevent writes
# (backend-api handles chat since v0.502 merge — no separate chat-server deployment)
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# Verify pods are stopped
kubectl get pods -n veza-production -l app=veza-backend-api
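
Scaling the deployment to zero does not end sessions opened elsewhere (cron jobs, ad-hoc psql). Before a destructive restore with --clean, it is worth confirming the database is quiet; a sketch:

# Count sessions other than our own; expect 0 before restoring
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT COUNT(*) FROM pg_stat_activity
    WHERE datname = 'veza_db' AND pid <> pg_backend_pid();
  "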

Step 3: Backup Current State (Optional)

# Create backup of current state before restore
JOB_NAME="postgres-backup-pre-restore-$(date +%s)"
kubectl create job --from=cronjob/postgres-backup "$JOB_NAME" \
  -n veza-production

# Wait for backup to complete (kubectl wait does not expand "*" globs,
# so reference the job by its exact name)
kubectl wait --for=condition=complete "job/$JOB_NAME" \
  -n veza-production \
  --timeout=600s

Step 4: Restore Database

Full Database Restore

# Get database credentials (the extraction below assumes the URL
# carries a "password=" query parameter; adjust the pattern if the
# secret stores the postgres://user:password@host form instead)
DB_PASSWORD=$(kubectl get secret veza-secrets -n veza-production \
  -o jsonpath='{.data.database-url}' | \
  base64 -d | grep -oP 'password=\K[^&]+')

# Restore database
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$DB_PASSWORD" \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "postgres-restore",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump --clean --if-exists --verbose"],
      "env": [
        {"name": "PGPASSWORD", "value": "'$DB_PASSWORD'"},
        {"name": "POSTGRES_HOST", "value": "postgres-service"},
        {"name": "POSTGRES_USER", "value": "veza_user"},
        {"name": "POSTGRES_DB", "value": "veza_db"}
      ],
      "volumeMounts": [{
        "name": "backup-storage",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup-storage",
      "persistentVolumeClaim": {
        "claimName": "postgres-backup-storage"
      }
    }]
  }
}' \
  -n veza-production
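
On a large database, pg_restore can run the data and index phases of a custom-format archive in parallel; a hedged variant of the command above (tune --jobs to the database server's CPU count):

pg_restore -h postgres-service -U veza_user -d veza_db -F c \
  --clean --if-exists --jobs=4 --verbose \
  /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump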

Restore from S3

# Download backup from S3
# NOTE: the restore pod below mounts the *node's* /tmp via hostPath,
# so this file must land on the node where the pod schedules (run the
# download there or pin the pod with a nodeSelector). On multi-node
# clusters, prefer the kubectl cp variant shown after this block.
aws s3 cp s3://veza-backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump /tmp/backup.dump

# Restore
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$DB_PASSWORD" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "postgres-restore",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "pg_restore -h postgres-service -U veza_user -d veza_db -F c /backups/backup.dump --clean --if-exists"],
      "env": [{"name": "PGPASSWORD", "value": "'$DB_PASSWORD'"}],
      "volumeMounts": [{
        "name": "backup",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup",
      "hostPath": {
        "path": "/tmp"
      }
    }]
  }
}' \
  -n veza-production
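
An alternative sketch that avoids hostPath entirely (assumes the dump fits in the helper pod's writable layer; kubectl cp also needs tar in the image, which postgres:15-alpine provides via busybox):

# Start a throwaway helper pod that just sleeps
kubectl run postgres-restore-helper --image=postgres:15-alpine \
  --restart=Never -n veza-production -- sleep 3600
kubectl wait --for=condition=Ready pod/postgres-restore-helper -n veza-production

# Copy the dump in, restore, then clean up
kubectl cp /tmp/backup.dump veza-production/postgres-restore-helper:/tmp/backup.dump
kubectl exec -it postgres-restore-helper -n veza-production -- \
  sh -c "PGPASSWORD=$DB_PASSWORD pg_restore -h postgres-service \
    -U veza_user -d veza_db -F c /tmp/backup.dump --clean --if-exists"
kubectl delete pod postgres-restore-helper -n veza-production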

Point-in-Time Recovery

# pg_restore cannot replay WAL; PITR instead restores a physical base
# backup into a fresh data directory, then replays archived WAL up to
# the target. The /backups/wal path is an assumption; match your
# archive_command.
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
restore_command = 'cp /backups/wal/%f %p'
recovery_target_time = '2025-01-01 12:00:00'
EOF
touch "$PGDATA/recovery.signal"
# Start PostgreSQL; it replays WAL up to the target time, then pauses

Step 5: Verify Data Integrity

# Check table counts
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT
      'users' as table_name, COUNT(*) as count FROM users
    UNION ALL
    SELECT 'tracks', COUNT(*) FROM tracks
    UNION ALL
    SELECT 'playlists', COUNT(*) FROM playlists;
  "

# Verify specific data
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT id, username, email, created_at
    FROM users
    ORDER BY created_at DESC
    LIMIT 10;
  "

# Check database size
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT pg_size_pretty(pg_database_size('veza_db'));
  "

Step 6: Restart Applications

# Scale up applications (backend-api handles chat since v0.502)
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production

# Wait for pods to be ready
kubectl rollout status deployment/veza-backend-api -n veza-production

Step 7: Verify Application Functionality

# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production

# Test health endpoint
curl https://api.veza.com/health

# Test API endpoints
curl https://api.veza.com/api/v1/tracks
curl https://api.veza.com/api/v1/users/me

# Run smoke tests
# (Use your application's test suite)
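
A minimal stand-in until the full suite runs (the endpoint list is illustrative; substitute the routes your suite covers):

for ep in /health /api/v1/tracks /api/v1/users/me; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "https://api.veza.com$ep")
  echo "$ep -> $code"
done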

Partial Restore

Restore Specific Tables

# Restore only specific tables
pg_restore -h postgres-service -U veza_user -d veza_db \
  -t users -t tracks \
  /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump

Restore Specific Schema

# Restore only specific schema
pg_restore -h postgres-service -U veza_user -d veza_db \
  -n public \
  /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump
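
A partial restore that names a non-existent object may simply restore nothing, so confirm exact table and schema names against the archive first (a sketch):

pg_restore --list /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump \
  | grep -E 'TABLE|SCHEMA'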

Verification Checklist

  • Backup file identified and verified
  • Applications stopped
  • Current state backed up (if possible)
  • Database restored successfully
  • Data integrity verified
  • Applications restarted
  • Health checks passing
  • API endpoints responding
  • Smoke tests passing
  • Users can access platform

Troubleshooting

Restore Fails with Permission Error

# Check database user permissions
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U postgres -c "\du veza_user"

# Grant necessary permissions
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE veza_db TO veza_user;"

Restore Fails with Connection Error

# Verify database is accessible
kubectl exec -it postgres-pod -n veza-production -- \
  pg_isready -U veza_user -d veza_db

# Check service endpoint
kubectl get svc postgres-service -n veza-production

# Test connection
kubectl run test-connection --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$DB_PASSWORD" \
  -- psql -h postgres-service -U veza_user -d veza_db -c "SELECT 1;"

Data Inconsistencies After Restore

# Compare record counts with expected values
# Check application logs for errors
kubectl logs -f deployment/veza-backend-api -n veza-production

# Verify foreign key constraints
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT conname, conrelid::regclass, confrelid::regclass
    FROM pg_constraint
    WHERE contype = 'f';
  "

Post-Restore Tasks

  1. Monitor Platform
     • Watch application logs
     • Monitor error rates
     • Check performance metrics
  2. Verify Data
     • Run data integrity checks
     • Compare with expected values
     • Test critical user flows
  3. Document Incident
     • Document restore procedure
     • Note any issues encountered
     • Update runbook if needed
  4. Investigate Root Cause
     • Review logs and events
     • Identify what caused data loss
     • Implement prevention measures
