
# Data Restore Runbook

This runbook describes the procedure for restoring data from backups after data loss or corruption.

## Prerequisites

- Access to backup storage
- Database credentials
- kubectl access to cluster
- Backup file identified

## Pre-Restore Checklist

- [ ] Backup file identified and verified
- [ ] Backup integrity checked
- [ ] Restore point confirmed
- [ ] Applications stopped (to prevent writes)
- [ ] Current data backed up (if possible)
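
Part of the integrity check can be scripted. The sketch below assumes the backups are `pg_dump` custom-format archives, which is what the `-F c` flag used in Step 4 implies; the helper name `check_dump_header` is hypothetical. It only confirms that the file is non-empty and starts with the `PGDMP` magic bytes, so it catches truncated-to-zero or wrong-type files but is no substitute for a test restore.

```bash
# check_dump_header FILE   (hypothetical helper, not part of any existing tooling)
# Returns 0 if FILE is non-empty and begins with "PGDMP", the magic string
# at the start of a pg_dump custom-format archive; returns 1 otherwise.
check_dump_header() {
  file="$1"
  # -s: the file exists and has a size greater than zero
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  # Read only the first five bytes of the archive
  magic=$(head -c 5 "$file")
  if [ "$magic" != "PGDMP" ]; then
    echo "not a custom-format dump: $file" >&2
    return 1
  fi
  return 0
}
```

A stronger follow-up is `pg_restore --list FILE`, which reads the archive's table of contents and fails if it cannot be read.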

## Restore Procedure

### Step 1: Identify Backup

```bash
# Confirm the backup PVC exists and is bound
kubectl get pvc postgres-backup-storage -n veza-production

# List backups in storage
kubectl run backup-lister --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "backup-lister",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "ls -lh /backups/postgres/"],
      "volumeMounts": [{
        "name": "backup-storage",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup-storage",
      "persistentVolumeClaim": {
        "claimName": "postgres-backup-storage"
      }
    }]
  }
}' \
  -n veza-production

# Or from S3
aws s3 ls s3://veza-backups/postgres/ --recursive | sort
```
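
The dump names used in this runbook embed a sortable timestamp (`veza_db_YYYYMMDD_HHMMSS.dump`), so the most recent backup can be picked by sorting the names. This is a minimal sketch: `latest_backup` is a hypothetical helper, and it assumes every backup follows that naming pattern.

```bash
# latest_backup   (hypothetical helper)
# Reads a listing on stdin (for example the output of `ls` or `aws s3 ls`)
# and prints the newest veza_db_YYYYMMDD_HHMMSS.dump name it contains.
# Prints nothing if no line matches the pattern.
latest_backup() {
  # Extract only the matching file names; tolerate "no match" from grep
  { grep -oE 'veza_db_[0-9]{8}_[0-9]{6}\.dump' || true; } |
    sort |        # timestamps sort correctly as plain text
    tail -n 1     # the last name is the most recent
}
```

For example, `aws s3 ls s3://veza-backups/postgres/ --recursive | latest_backup` would print the newest dump name. The newest backup is not always the right restore point, so confirm it against the time of the data loss.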

### Step 2: Stop Applications

```bash
# Scale down applications to prevent writes
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production
kubectl scale deployment veza-chat-server --replicas=0 -n veza-production

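# Verify pods are stopped (both deployments should report 0/0 ready)
kubectl get pods -n veza-production -l app=veza-backend-api
kubectl get deployment veza-backend-api veza-chat-server -n veza-production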
```
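
Step 6 scales back up to fixed replica counts. If the live counts may have drifted, for example after manual scaling, it can help to record them before scaling down. The following is a sketch only: `save_replicas`, `restore_replicas`, and the state-file format are hypothetical, and it does not account for autoscalers.

```bash
# save_replicas NAMESPACE FILE DEPLOYMENT...   (hypothetical helper)
# Writes one "deployment replicas" line per deployment to FILE.
save_replicas() {
  ns="$1"; out="$2"; shift 2
  : > "$out"
  for d in "$@"; do
    n=$(kubectl get deployment "$d" -n "$ns" -o jsonpath='{.spec.replicas}') || return 1
    echo "$d $n" >> "$out"
  done
}

# restore_replicas NAMESPACE FILE   (hypothetical helper)
# Scales each deployment listed in FILE back to its recorded replica count.
restore_replicas() {
  ns="$1"
  while read -r d n; do
    # </dev/null keeps kubectl from consuming the loop's input
    kubectl scale deployment "$d" --replicas="$n" -n "$ns" </dev/null || return 1
  done < "$2"
}
```

Usage would be `save_replicas veza-production /tmp/replicas.txt veza-backend-api veza-chat-server` before scaling down, and `restore_replicas veza-production /tmp/replicas.txt` once the restore is verified.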

### Step 3: Backup Current State (Optional)

```bash
# Create backup of current state before restore
# (store the job name so the wait below targets exactly this job)
JOB_NAME="postgres-backup-pre-restore-$(date +%s)"
kubectl create job --from=cronjob/postgres-backup \
  "$JOB_NAME" \
  -n veza-production

# Wait for backup to complete
# (kubectl wait does not expand wildcards in resource names)
kubectl wait --for=condition=complete "job/$JOB_NAME" \
  -n veza-production \
  --timeout=600s
```

### Step 4: Restore Database

#### Full Database Restore

```bash
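# Get database credentials
# (assumes the database-url value carries the password as a password=... parameter;
# sed is used instead of grep -P, which is not available in every grep)
DB_PASSWORD=$(kubectl get secret veza-secrets -n veza-production \
  -o jsonpath='{.data.database-url}' | \
  base64 -d | sed -n 's/.*password=\([^&]*\).*/\1/p')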

# Restore database
# (replace veza_db_YYYYMMDD_HHMMSS.dump with the backup file identified in Step 1)
# DB_PASSWORD is spliced into the JSON below; it is double-quoted to prevent word
# splitting, but a password containing " or \ would still need JSON escaping.
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$DB_PASSWORD" \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "postgres-restore",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump --clean --if-exists --verbose"],
      "env": [
        {"name": "PGPASSWORD", "value": "'"$DB_PASSWORD"'"},
        {"name": "POSTGRES_HOST", "value": "postgres-service"},
        {"name": "POSTGRES_USER", "value": "veza_user"},
        {"name": "POSTGRES_DB", "value": "veza_db"}
      ],
      "volumeMounts": [{
        "name": "backup-storage",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup-storage",
      "persistentVolumeClaim": {
        "claimName": "postgres-backup-storage"
      }
    }]
  }
}' \
  -n veza-production
```

#### Restore from S3

```bash
# Download backup from S3
# NOTE: the hostPath volume below mounts /tmp of the node that runs the restore pod,
# not /tmp on your workstation, so run this download on that node.
aws s3 cp s3://veza-backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump /tmp/backup.dump

# Restore (DB_PASSWORD must already be set, as in the full restore above)
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$DB_PASSWORD" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "postgres-restore",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "pg_restore -h postgres-service -U veza_user -d veza_db -F c /backups/backup.dump --clean --if-exists"],
      "env": [{"name": "PGPASSWORD", "value": "'"$DB_PASSWORD"'"}],
      "volumeMounts": [{
        "name": "backup",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup",
      "hostPath": {
        "path": "/tmp"
      }
    }]
  }
}' \
  -n veza-production
```

#### Point-in-Time Recovery

```bash
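# Restore to a specific timestamp using WAL archives.
# NOTE: pg_restore has no --recovery-target-time option, and a logical dump
# cannot be replayed to a timestamp. PITR requires a physical base backup
# (pg_basebackup) plus archived WAL. On PostgreSQL 12+, after restoring the
# base backup into the data directory, set these in postgresql.conf:
#   restore_command = 'cp /path/to/wal_archive/%f "%p"'
#   recovery_target_time = '2025-01-01 12:00:00'
# then create recovery.signal in the data directory and start PostgreSQL.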
```

### Step 5: Verify Data Integrity

```bash
# Check table counts
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT 
      'users' as table_name, COUNT(*) as count FROM users
    UNION ALL
    SELECT 'tracks', COUNT(*) FROM tracks
    UNION ALL
    SELECT 'playlists', COUNT(*) FROM playlists;
  "

# Verify specific data
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT id, username, email, created_at 
    FROM users 
    ORDER BY created_at DESC 
    LIMIT 10;
  "

# Check database size
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT pg_size_pretty(pg_database_size('veza_db'));
  "
```
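
Row counts only mean something when compared with a baseline. As a sketch, assuming expected counts were recorded beforehand in a plain-text file of `table count` pairs (a hypothetical format, not something the backup job is known to produce), the comparison could look like this. The helper name `compare_counts` is also hypothetical.

```bash
# compare_counts EXPECTED ACTUAL   (hypothetical helper)
# Each file holds one "table count" pair per line, and EXPECTED must not be empty.
# Prints every table whose count differs or is missing from ACTUAL;
# returns 1 if any mismatch is found, 0 otherwise.
compare_counts() {
  awk '
    NR == FNR { expected[$1] = $2; next }   # first file: expected counts
    { actual[$1] = $2 }                     # second file: restored counts
    END {
      bad = 0
      for (t in expected) {
        if (!(t in actual)) {
          printf "%s: missing (expected %s)\n", t, expected[t]; bad = 1
        } else if (actual[t] != expected[t]) {
          printf "%s: expected %s, got %s\n", t, expected[t], actual[t]; bad = 1
        }
      }
      exit bad
    }
  ' "$1" "$2"
}
```

The ACTUAL file can be produced by running the count query with `psql -At -F ' '`, which prints unaligned, tuples-only rows separated by a space.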

### Step 6: Restart Applications

```bash
# Scale up applications
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
kubectl scale deployment veza-chat-server --replicas=2 -n veza-production

# Wait for pods to be ready
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-chat-server -n veza-production
```

### Step 7: Verify Application Functionality

```bash
# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production

# Test health endpoint
curl https://api.veza.com/health

# Test API endpoints
curl https://api.veza.com/api/v1/tracks
curl https://api.veza.com/api/v1/users/me

# Run smoke tests
# (Use your application's test suite)
```
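
Pods can report ready shortly before the API answers successfully, so a single `curl` right after scale-up may fail spuriously. The sketch below retries a check command until it succeeds; `retry_until_ok` and its argument order are hypothetical, and the attempt count and delay are arbitrary choices.

```bash
# retry_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND until it exits 0 or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until_ok() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # Sleep between attempts, but not after the final one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `retry_until_ok 10 5 curl -fsS https://api.veza.com/health` tries the health endpoint up to ten times, five seconds apart; `-f` makes `curl` exit non-zero on HTTP error responses.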

## Partial Restore

### Restore Specific Tables

```bash
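# Restore only specific tables
# (run from a pod that can reach postgres-service and has the backup volume mounted;
# -t does not restore objects the tables depend on, and if the tables already exist,
# add --clean --if-exists to replace them or --data-only to load rows only)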
pg_restore -h postgres-service -U veza_user -d veza_db \
  -t users -t tracks \
  /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump
```

### Restore Specific Schema

```bash
# Restore only specific schema
pg_restore -h postgres-service -U veza_user -d veza_db \
  -n public \
  /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump
```

## Verification Checklist

- [ ] Backup file identified and verified
- [ ] Applications stopped
- [ ] Current state backed up (if possible)
- [ ] Database restored successfully
- [ ] Data integrity verified
- [ ] Applications restarted
- [ ] Health checks passing
- [ ] API endpoints responding
- [ ] Smoke tests passing
- [ ] Users can access platform

## Troubleshooting

### Restore Fails with Permission Error

```bash
# Check database user permissions
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U postgres -c "\du veza_user"

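# Grant necessary permissions
# (database-level privileges only cover CONNECT, CREATE, and TEMPORARY; dropping and
# recreating objects with --clean also requires veza_user to own them)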
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE veza_db TO veza_user;"
```

### Restore Fails with Connection Error

```bash
# Verify database is accessible
kubectl exec -it postgres-pod -n veza-production -- \
  pg_isready -U veza_user -d veza_db

# Check service endpoint (the restore commands connect to postgres-service)
kubectl get svc postgres-service -n veza-production

# Test connection
kubectl run test-connection --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$DB_PASSWORD" \
  -- psql -h postgres-service -U veza_user -d veza_db -c "SELECT 1;"
```

### Data Inconsistencies After Restore

```bash
# Compare record counts with expected values
# Check application logs for errors
kubectl logs -f deployment/veza-backend-api -n veza-production

# List foreign key constraints and compare against the expected set
# (listing them does not re-validate the data)
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "
    SELECT conname, conrelid::regclass, confrelid::regclass
    FROM pg_constraint
    WHERE contype = 'f';
  "
```

## Post-Restore Tasks

1. **Monitor Platform**
   - Watch application logs
   - Monitor error rates
   - Check performance metrics

2. **Verify Data**
   - Run data integrity checks
   - Compare with expected values
   - Test critical user flows

3. **Document Incident**
   - Document restore procedure
   - Note any issues encountered
   - Update runbook if needed

4. **Investigate Root Cause**
   - Review logs and events
   - Identify what caused data loss
   - Implement prevention measures

## References

- [Backup Strategy](../backups/README.md)
- [PostgreSQL Restore Documentation](https://www.postgresql.org/docs/current/app-pgrestore.html)

[INFRA-010] infra: Set up disaster recovery plan 2025-12-25 20:40:31 +00:00
