senke/veza

senke d8f613b9b3 [INFRA-010] infra: Set up disaster recovery plan

2025-12-25 21:40:31 +01:00

Data Restore Runbook

This runbook describes the procedure for restoring data from backups after data loss or corruption.

Prerequisites

- Access to backup storage
- Database credentials
- kubectl access to the cluster
- Backup file identified

Pre-Restore Checklist

- Backup file identified and verified
- Backup integrity checked
- Restore point confirmed
- Applications stopped (to prevent writes)
- Current data backed up (if possible)
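
The "Backup integrity checked" item can be partly automated. The sketch below is a cheap first-pass check, not a full verification: it relies on the fact that pg_dump custom-format archives begin with the bytes `PGDMP`. The function name `looks_like_custom_dump` is hypothetical, and the dump path in the usage comment is the placeholder used elsewhere in this runbook.

```shell
# Hypothetical helper: return 0 if FILE is non-empty and starts with the
# "PGDMP" magic bytes of a pg_dump custom-format archive.
looks_like_custom_dump() {
  [ -s "$1" ] && [ "$(dd if="$1" bs=5 count=1 2>/dev/null)" = "PGDMP" ]
}

# Usage (placeholder path). pg_restore --list reads the archive's table of
# contents, which catches truncated headers but does not prove that every
# data block is intact:
#   looks_like_custom_dump /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump \
#     && pg_restore --list /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump > /dev/null
```

A test restore into a scratch database remains the only strong integrity check.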

Restore Procedure

Step 1: Identify Backup

# Confirm the backup volume claim exists and is bound
kubectl get pvc postgres-backup-storage -n veza-production

# List backups in storage
kubectl run backup-lister --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "backup-lister",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "ls -lh /backups/postgres/"],
      "volumeMounts": [{
        "name": "backup-storage",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup-storage",
      "persistentVolumeClaim": {
        "claimName": "postgres-backup-storage"
      }
    }]
  }
}' \
  -n veza-production

# Or from S3
aws s3 ls s3://veza-backups/postgres/ --recursive | sort
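
Picking the newest backup out of either listing can be scripted. This is a minimal sketch that assumes backups follow the `veza_db_YYYYMMDD_HHMMSS.dump` naming used in the restore commands of this runbook; the function name `latest_backup` is hypothetical.

```shell
# Hypothetical helper: read a listing on stdin (ls or "aws s3 ls" output)
# and print the newest backup file name. Because the timestamp is embedded
# as YYYYMMDD_HHMMSS, a plain lexical sort is also a chronological sort.
latest_backup() {
  grep -oE 'veza_db_[0-9]{8}_[0-9]{6}\.dump' | sort | tail -n 1
}

# Usage:
#   aws s3 ls s3://veza-backups/postgres/ --recursive | latest_backup
```

The function prints nothing when no file matches, so an empty result should be treated as "no backup found".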

Step 2: Stop Applications

# Scale down applications to prevent writes
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production
kubectl scale deployment veza-chat-server --replicas=0 -n veza-production

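# Verify pods are stopped (both commands should list no running pods).
# The chat server label is assumed to follow the same app=<name> pattern.
kubectl get pods -n veza-production -l app=veza-backend-api
kubectl get pods -n veza-production -l app=veza-chat-server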
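
Step 6 scales the deployments back to fixed replica counts. If the live counts may differ from those values, one option is to record them before scaling down. The sketch below assumes nothing beyond standard `kubectl get` and `kubectl scale`; the function names and the `/tmp/replicas.txt` path are hypothetical.

```shell
# Hypothetical helper: print "<deployment> <replicas>" for each deployment.
save_replicas() {
  ns=$1
  shift
  for d in "$@"; do
    printf '%s %s\n' "$d" \
      "$(kubectl get deployment "$d" -n "$ns" -o jsonpath='{.spec.replicas}')"
  done
}

# Hypothetical helper: read "<deployment> <replicas>" lines on stdin and
# scale each deployment back to its recorded count.
restore_replicas() {
  ns=$1
  while read -r d n; do
    kubectl scale deployment "$d" --replicas="$n" -n "$ns"
  done
}

# Usage (run the first line BEFORE scaling to zero):
#   save_replicas veza-production veza-backend-api veza-chat-server > /tmp/replicas.txt
#   restore_replicas veza-production < /tmp/replicas.txt
```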

Step 3: Backup Current State (Optional)

# Create backup of current state before restore
JOB_NAME="postgres-backup-pre-restore-$(date +%s)"
kubectl create job --from=cronjob/postgres-backup "$JOB_NAME" \
  -n veza-production

# Wait for backup to complete.
# kubectl wait does not expand wildcards in resource names, so the exact
# job name is reused here.
kubectl wait --for=condition=complete "job/$JOB_NAME" \
  -n veza-production \
  --timeout=600s

Step 4: Restore Database

Full Database Restore

# Get database credentials.
# Assumes the database-url value contains a password=... parameter;
# grep -P requires GNU grep.
DB_PASSWORD=$(kubectl get secret veza-secrets -n veza-production \
  -o jsonpath='{.data.database-url}' | \
  base64 -d | grep -oP 'password=\K[^&]+')

# Restore database
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$DB_PASSWORD" \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "postgres-restore",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c --clean --if-exists --verbose /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump"],
      "env": [
        {"name": "PGPASSWORD", "value": "'$DB_PASSWORD'"},
        {"name": "POSTGRES_HOST", "value": "postgres-service"},
        {"name": "POSTGRES_USER", "value": "veza_user"},
        {"name": "POSTGRES_DB", "value": "veza_db"}
      ],
      "volumeMounts": [{
        "name": "backup-storage",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup-storage",
      "persistentVolumeClaim": {
        "claimName": "postgres-backup-storage"
      }
    }]
  }
}' \
  -n veza-production
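
The credential lookup at the start of this step only matches a URL that carries a `password=` parameter. If the `database-url` secret uses the `postgres://user:password@host/db` form instead, the lookup returns an empty string. The sketch below handles both forms with POSIX `sed`; the function name `password_from_url` is hypothetical, and the actual format of the secret is not stated in this runbook.

```shell
# Hypothetical helper: print the password embedded in a connection URL.
# Percent-encoded characters are returned as-is (no URL decoding).
password_from_url() {
  url=$1
  case "$url" in
    *password=*)
      # Query-parameter form: ...?user=u&password=secret&...
      printf '%s\n' "$url" | sed -n 's/.*password=\([^&]*\).*/\1/p'
      ;;
    *://*:*@*)
      # Userinfo form: scheme://user:secret@host/db
      printf '%s\n' "$url" | sed -n 's|^[a-z]*://[^:/@]*:\([^@]*\)@.*|\1|p'
      ;;
  esac
}

# Usage:
#   DB_PASSWORD=$(password_from_url "$(kubectl get secret veza-secrets \
#     -n veza-production -o jsonpath='{.data.database-url}' | base64 -d)")
```

Check that `DB_PASSWORD` is non-empty before starting the restore pod.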

Restore from S3

# Download backup from S3 to the local workstation
aws s3 cp s3://veza-backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump /tmp/backup.dump

# Restore by streaming the local file to pg_restore over stdin.
# A hostPath volume would mount /tmp on the cluster node, not on the
# workstation that ran the download, so the file would not be found there.
# With no file argument, pg_restore reads the archive from standard input.
# Requires DB_PASSWORD from the previous section.
kubectl run postgres-restore --rm -i --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$DB_PASSWORD" \
  -n veza-production \
  -- pg_restore -h postgres-service -U veza_user -d veza_db \
     -F c --clean --if-exists \
  < /tmp/backup.dump

Point-in-Time Recovery

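# Point-in-time recovery cannot be done with pg_restore: a logical dump
# contains no WAL, and pg_restore has no --recovery-target-time option.
# PITR needs a physical base backup (pg_basebackup) plus archived WAL.
# Outline for PostgreSQL 15 (the WAL archive path is a placeholder):
#   1. Stop PostgreSQL and restore the base backup into the data directory.
#   2. Add to postgresql.conf:
#        restore_command = 'cp /path/to/wal_archive/%f "%p"'
#        recovery_target_time = '2025-01-01 12:00:00'
#   3. Create recovery.signal in the data directory:
touch "$PGDATA/recovery.signal"
#   4. Start PostgreSQL; it replays WAL up to the target time.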

Step 5: Verify Data Integrity

# Check table counts
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT 
      'users' as table_name, COUNT(*) as count FROM users
    UNION ALL
    SELECT 'tracks', COUNT(*) FROM tracks
    UNION ALL
    SELECT 'playlists', COUNT(*) FROM playlists;
  "

# Verify specific data
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT id, username, email, created_at 
    FROM users 
    ORDER BY created_at DESC 
    LIMIT 10;
  "

# Check database size
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT pg_size_pretty(pg_database_size('veza_db'));
  "

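The table counts above are only meaningful against a baseline. As a sketch, assuming a baseline file of `table|count` lines was recorded before the incident (the file names and the function name `counts_below_baseline` are hypothetical), the comparison can be automated:

```shell
# Hypothetical helper: compare two files of "table|count" lines.
# Prints every table whose restored count is below the baseline or that is
# missing from the restored output; returns 1 if any such table is found.
counts_below_baseline() {
  baseline=$1
  actual=$2
  awk -F'|' '
    NR == FNR { min[$1] = $2; next }   # first file: baseline counts
    {
      seen[$1] = 1
      if (($1 in min) && $2 + 0 < min[$1] + 0) {
        print $1 ": " $2 " < " min[$1]
        bad = 1
      }
    }
    END {
      for (t in min) if (!(t in seen)) { print t ": missing"; bad = 1 }
      exit bad + 0
    }
  ' "$baseline" "$actual"
}

# Usage: psql -At prints unaligned, tuples-only rows separated by "|".
#   kubectl exec postgres-pod -n veza-production -- \
#     psql -U veza_user -d veza_db -At -c "SELECT 'users', COUNT(*) FROM users" > actual.txt
#   counts_below_baseline baseline.txt actual.txt
```

A lower count is expected when restoring to an earlier point, so treat the output as a prompt for review rather than a hard failure.
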
Step 6: Restart Applications

# Scale up applications
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
kubectl scale deployment veza-chat-server --replicas=2 -n veza-production

# Wait for pods to be ready
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-chat-server -n veza-production

Step 7: Verify Application Functionality

# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production

# Test health endpoint
curl https://api.veza.com/health

# Test API endpoints
curl https://api.veza.com/api/v1/tracks
curl https://api.veza.com/api/v1/users/me

# Run smoke tests
# (Use your application's test suite)
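
Pods can report ready slightly before the public endpoint answers, so a single `curl` may fail spuriously. A minimal retry sketch follows; the function name `wait_for_healthy` and the default attempt count and delay are assumptions, not values from this runbook.

```shell
# Hypothetical helper: poll URL until it returns a successful HTTP status.
# Arguments: URL [ATTEMPTS=10] [DELAY_SECONDS=5]. Returns 1 if all attempts fail.
wait_for_healthy() {
  url=$1
  attempts=${2:-10}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (status >= 400)
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_for_healthy https://api.veza.com/health 12 5 || echo "health check failed"
```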

Partial Restore

Restore Specific Tables

# Restore only specific tables
pg_restore -h postgres-service -U veza_user -d veza_db \
  -t users -t tracks \
  /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump

Restore Specific Schema

# Restore only specific schema
pg_restore -h postgres-service -U veza_user -d veza_db \
  -n public \
  /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump

Verification Checklist

- Backup file identified and verified
- Applications stopped
- Current state backed up (if possible)
- Database restored successfully
- Data integrity verified
- Applications restarted
- Health checks passing
- API endpoints responding
- Smoke tests passing
- Users can access platform

Troubleshooting

Restore Fails with Permission Error

# Check database user permissions
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U postgres -c "\du veza_user"

# Grant necessary permissions
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE veza_db TO veza_user;"

Restore Fails with Connection Error

# Verify database is accessible
kubectl exec -it postgres-pod -n veza-production -- \
  pg_isready -U veza_user -d veza_db

# Check service endpoint (the restore commands connect to postgres-service)
kubectl get svc postgres-service -n veza-production

# Test connection
kubectl run test-connection --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$DB_PASSWORD" \
  -- psql -h postgres-service -U veza_user -d veza_db -c "SELECT 1;"

Data Inconsistencies After Restore

# Compare record counts with expected values
# Check application logs for errors
kubectl logs -f deployment/veza-backend-api -n veza-production

# List foreign key constraints to confirm they were restored
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT conname, conrelid::regclass, confrelid::regclass
    FROM pg_constraint
    WHERE contype = 'f';
  "

Post-Restore Tasks

1. Monitor Platform
   - Watch application logs
   - Monitor error rates
   - Check performance metrics
2. Verify Data
   - Run data integrity checks
   - Compare with expected values
   - Test critical user flows
3. Document Incident
   - Document the restore steps taken
   - Note any issues encountered
   - Update runbook if needed
4. Investigate Root Cause
   - Review logs and events
   - Identify what caused data loss
   - Implement prevention measures
