# Disaster Recovery Plan for Veza Platform

This document outlines the comprehensive disaster recovery plan for the Veza platform, including recovery procedures, testing protocols, and operational runbooks.

## Executive Summary

- **Recovery Time Objective (RTO)**: < 4 hours
- **Recovery Point Objective (RPO)**: < 1 hour
- **Maximum Acceptable Downtime**: 4 hours
- **Data Loss Tolerance**: 1 hour

## Table of Contents

1. [Overview](#overview)
2. [Recovery Objectives](#recovery-objectives)
3. [Disaster Scenarios](#disaster-scenarios)
4. [Recovery Procedures](#recovery-procedures)
5. [Testing Procedures](#testing-procedures)
6. [Communication Plan](#communication-plan)
7. [Runbooks](#runbooks)
8. [Maintenance and Updates](#maintenance-and-updates)

## Overview

The Veza platform disaster recovery plan ensures business continuity and data protection in the event of various failure scenarios. The plan covers:

- **Infrastructure Failures**: Node failures, cluster failures, network outages
- **Application Failures**: Service crashes, deployment failures, configuration errors
- **Data Failures**: Database corruption, data loss, backup failures
- **Regional Failures**: Complete datacenter or region outages
- **Security Incidents**: Breaches, ransomware, unauthorized access

## Recovery Objectives

### RTO (Recovery Time Objective)

| Component | RTO | Description |
|-----------|-----|-------------|
| Critical Services | < 1 hour | Backend API, frontend, authentication |
| Database | < 2 hours | PostgreSQL with failover to standby |
| Complete Platform | < 4 hours | Full platform recovery in DR region |
| Non-Critical Services | < 8 hours | Chat server, monitoring, etc. |

### RPO (Recovery Point Objective)

| Data Type | RPO | Backup Frequency |
|-----------|-----|------------------|
| Database | < 1 hour | Hourly incremental + daily full |
| Application State | < 15 minutes | Real-time replication |
| User Uploads | < 1 hour | Hourly sync to S3 |
| Configuration | < 5 minutes | Git-based version control |

## Disaster Scenarios

### Scenario 1: Single Node Failure

- **Impact**: Minimal - pods rescheduled to other nodes
- **RTO**: < 5 minutes
- **Procedure**: Automatic pod rescheduling by Kubernetes

**Recovery Steps**:
1. Kubernetes automatically detects the node failure
2. Pods are rescheduled to healthy nodes
3. Services continue with minimal interruption
4. Monitor for any pod startup issues

### Scenario 2: Database Primary Failure

- **Impact**: High - application cannot write data
- **RTO**: < 5 minutes (with standby)
- **Procedure**: Failover to standby replica

**Recovery Steps**:
1. Detect primary database failure
2. Promote the standby replica to primary (see the sketch below)
3. Update connection strings in ConfigMaps/Secrets
4. Restart application pods to pick up the new connection
5. Verify data integrity
6. Set up a new standby replica

**Runbook**: See `runbooks/database-failover.md`
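Step 2 is the critical manual action. Below is a minimal promotion sketch, assuming a streaming-replication standby reachable as pod `postgres-standby-0` and a `postgres-service` Service that selects the primary by a `role` label; these names are illustrative, not the platform's actual manifests.

```bash
# Illustrative names (postgres-standby-0, role labels); adjust to the
# actual manifests before use.

# 1. Confirm the primary is actually down before promoting
kubectl get pods -n veza-production -l app=postgres

# 2. Promote the standby; pg_promote() is available in PostgreSQL 12+
kubectl exec -it postgres-standby-0 -n veza-production -- \
  psql -U postgres -c "SELECT pg_promote();"

# 3. Verify the promoted node accepts writes ('f' = no longer in recovery)
kubectl exec -it postgres-standby-0 -n veza-production -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"

# 4. Repoint the service selector at the promoted pod, then restart the API
kubectl patch service postgres-service -n veza-production \
  -p '{"spec":{"selector":{"role":"standby"}}}'
kubectl rollout restart deployment/veza-backend-api -n veza-production
```

Note that the old primary cannot simply rejoin after promotion: it must be re-cloned from the new primary (for example with `pg_basebackup`) before it can serve as the new standby in step 6.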
### Scenario 3: Application Deployment Failure

- **Impact**: Medium - service degradation or outage
- **RTO**: < 5 minutes
- **Procedure**: Automatic rollback

**Recovery Steps**:
1. Health checks detect the failure
2. Automatic rollback to the previous version
3. Verify service health
4. Investigate root cause
5. Fix and redeploy

**Runbook**: See `runbooks/rollback-procedure.md`

### Scenario 4: Complete Cluster Failure

- **Impact**: Critical - complete platform outage
- **RTO**: < 4 hours
- **Procedure**: Failover to DR region

**Recovery Steps**:
1. Declare disaster
2. Activate the DR region
3. Restore the database from the latest backup
4. Deploy applications
5. Update DNS to point to the DR region
6. Verify all services
7. Begin recovery of the primary region

**Runbook**: See `runbooks/cluster-failover.md`

### Scenario 5: Data Corruption or Loss

- **Impact**: Critical - data integrity compromised
- **RTO**: < 2 hours
- **Procedure**: Restore from backup

**Recovery Steps**:
1. Identify affected data
2. Stop writes to the affected database
3. Restore from the most recent clean backup
4. Verify data integrity
5. Resume operations
6. Investigate root cause

**Runbook**: See `runbooks/data-restore.md`

### Scenario 6: Security Breach

- **Impact**: Critical - security and data integrity
- **RTO**: < 1 hour (containment)
- **Procedure**: Incident response and isolation

**Recovery Steps**:
1. Isolate affected systems
2. Preserve evidence
3. Assess the scope of the breach
4. Revoke compromised credentials
5. Restore from a clean backup if needed
6. Patch vulnerabilities
7. Resume operations with enhanced monitoring

**Runbook**: See `runbooks/security-incident.md`

## Recovery Procedures

### Pre-Recovery Checklist

Before initiating any recovery procedure (a backup-verification sketch follows this list):

- [ ] Verify backup availability and integrity
- [ ] Confirm DR resources are available
- [ ] Notify stakeholders
- [ ] Document incident details
- [ ] Prepare recovery runbook
- [ ] Verify access credentials
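The first checklist item deserves a concrete check before committing to a restore. A minimal verification sketch follows, reusing the pod and backup path from the restore procedure below; the `.sha256` sidecar file is an assumption about how the backup job writes checksums.

```bash
# Locate the newest dump on the backup volume
LATEST=$(kubectl exec postgres-pod -n veza-production -- \
  sh -c 'ls -1t /backups/postgres/*.dump | head -n 1')
echo "Latest backup: $LATEST"

# Verify the checksum; assumes the backup job writes a .sha256 file
# alongside each dump (adjust if checksums are stored elsewhere)
kubectl exec postgres-pod -n veza-production -- \
  sh -c "cd /backups/postgres && sha256sum -c $(basename "$LATEST").sha256"

# A custom-format dump whose table of contents pg_restore can list is
# structurally intact; this reads metadata only and touches no database
kubectl exec postgres-pod -n veza-production -- \
  pg_restore --list "$LATEST" | head -n 20
```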
### Database Recovery

#### Full Database Restore

```bash
# 1. Stop the application to prevent writes
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# 2. Identify the backup to restore
kubectl get pvc postgres-backup-storage -n veza-production

# List available backups
kubectl exec -it postgres-pod -n veza-production -- ls -lh /backups/postgres/

# 3. Restore the database. --overrides replaces the generated container
#    spec, so credentials are set inside the override rather than via
#    --env flags, which are not reliably merged.
DB_PASSWORD=$(kubectl get secret veza-secrets -n veza-production \
  -o jsonpath='{.data.database-url}' | base64 -d | grep -oP 'password=\K[^&]+')

kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  -n veza-production \
  --overrides="
{
  \"apiVersion\": \"v1\",
  \"spec\": {
    \"containers\": [{
      \"name\": \"postgres-restore\",
      \"image\": \"postgres:15-alpine\",
      \"command\": [\"/bin/sh\", \"-c\",
        \"pg_restore -h postgres-service -U veza_user -d veza_db -F c --clean --if-exists /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump\"],
      \"env\": [{\"name\": \"PGPASSWORD\", \"value\": \"$DB_PASSWORD\"}],
      \"volumeMounts\": [{\"name\": \"backup-storage\", \"mountPath\": \"/backups\"}]
    }],
    \"volumes\": [{
      \"name\": \"backup-storage\",
      \"persistentVolumeClaim\": {\"claimName\": \"postgres-backup-storage\"}
    }]
  }
}"

# 4. Verify data integrity
kubectl exec -it postgres-pod -n veza-production -- \
  psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"

# 5. Restart the application
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```

#### Point-in-Time Recovery

```bash
# pg_restore cannot replay to a timestamp; point-in-time recovery requires
# a physical base backup plus archived WAL. Run inside the database
# container/host with the server stopped: restore the base backup into an
# empty data directory, then set the recovery target (PostgreSQL 12+).
tar -xf /backups/postgres/base_backup.tar -C "$PGDATA"

cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'cp /backups/postgres/wal/%f %p'
recovery_target_time = '2025-01-01 12:00:00'
EOF

touch "$PGDATA/recovery.signal"   # request recovery mode on next start
pg_ctl start -D "$PGDATA"
```

### Application Recovery

#### Rollback Deployment

```bash
# Rollback to the previous version
kubectl rollout undo deployment/veza-backend-api -n veza-production

# Verify the rollback
kubectl rollout status deployment/veza-backend-api -n veza-production

# Check logs
kubectl logs -f deployment/veza-backend-api -n veza-production
```

#### Redeploy from Git

```bash
# Get the latest code
git checkout main
git pull origin main

# Build and push the image
docker build -t veza-backend-api:latest .
docker push veza-backend-api:latest

# Update the deployment
kubectl set image deployment/veza-backend-api \
  veza-backend-api=veza-backend-api:latest \
  -n veza-production

# Verify
kubectl rollout status deployment/veza-backend-api -n veza-production
```

### Infrastructure Recovery

#### Cluster Rebuild

```bash
# 1. Provision a new cluster (Terraform / Infrastructure as Code)
cd infrastructure/terraform
terraform apply

# 2. Install prerequisites
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secrets/  # Restore secrets from Vault

# 3. Deploy applications
kubectl apply -f k8s/backend-api/
kubectl apply -f k8s/frontend/

# 4. Restore data
# Follow the database recovery procedure above

# 5. Verify services
kubectl get pods -n veza-production
kubectl get svc -n veza-production
```

## Testing Procedures

### Test Schedule

| Test Type | Frequency | Duration |
|-----------|-----------|----------|
| Database Restore | Monthly | 2 hours |
| Application Rollback | After each deployment | 15 minutes |
| Full DR Drill | Quarterly | 4 hours |
| Security Incident Drill | Semi-annually | 2 hours |

### Database Restore Test

**Objective**: Verify database backups are valid and can be restored

**Procedure**:
1. Create a test namespace: `kubectl create namespace veza-dr-test`
2. Deploy a test PostgreSQL instance
3. Restore the latest backup to the test instance
4. Verify data integrity
5. Run smoke tests against the restored database
6. Document results
7. Clean up the test namespace

**Success Criteria**:
- Backup restores successfully
- All tables present and accessible
- Data integrity verified
- Restore completes within RTO

### Application Rollback Test

**Objective**: Verify the rollback procedure works correctly

**Procedure**:
1. Deploy a test version with a known issue
2. Verify health checks fail
3. Execute the rollback procedure
4. Verify the service returns to a healthy state
5. Document the rollback time

**Success Criteria**:
- Rollback completes within 5 minutes
- Service returns to a healthy state
- No data loss
- Users experience minimal disruption

### Full DR Drill

**Objective**: Test complete disaster recovery in the DR region

**Procedure**:
1. Schedule a maintenance window
2. Simulate a primary-region failure
3. Activate the DR region
4. Restore all services
5. Verify platform functionality
6. Document lessons learned
7. Restore the primary region

**Success Criteria**:
- DR activation within 4 hours
- All critical services operational
- Data integrity maintained
- Users can access the platform

## Communication Plan

### Incident Notification

**Immediate (P0 - Critical)**:
- DevOps Lead: +1-XXX-XXX-XXXX
- CTO: +1-XXX-XXX-XXXX
- On-Call Engineer: PagerDuty

**Escalation (P1 - High)**:
- Engineering Manager
- Product Manager

**Status Updates**:
- Internal: Slack #incidents channel
- External: Status page (status.veza.com)
- Customers: Email notification (if downtime exceeds 1 hour)

### Communication Templates

#### Initial Incident Notification

```
Subject: [INCIDENT] Veza Platform - Severity: P0/P1/P2

Status: Investigating
Impact:
ETA:
```