Disaster Recovery Plan for Veza Platform
This document outlines the comprehensive disaster recovery plan for the Veza platform, including recovery procedures, testing protocols, and operational runbooks.
Executive Summary
Recovery Time Objective (RTO): < 4 hours
Recovery Point Objective (RPO): < 1 hour
Maximum Acceptable Downtime: 4 hours
Data Loss Tolerance: 1 hour
Table of Contents
- Overview
- Recovery Objectives
- Disaster Scenarios
- Recovery Procedures
- Testing Procedures
- Communication Plan
- Runbooks
- Maintenance and Updates
Overview
The Veza platform disaster recovery plan ensures business continuity and data protection in the event of various failure scenarios. The plan covers:
- Infrastructure Failures: Node failures, cluster failures, network outages
- Application Failures: Service crashes, deployment failures, configuration errors
- Data Failures: Database corruption, data loss, backup failures
- Regional Failures: Complete datacenter or region outages
- Security Incidents: Breaches, ransomware, unauthorized access
Recovery Objectives
RTO (Recovery Time Objective)
| Component | RTO | Description |
|---|---|---|
| Critical Services | < 1 hour | Backend API, Frontend, Authentication |
| Database | < 2 hours | PostgreSQL with failover to standby |
| Complete Platform | < 4 hours | Full platform recovery in DR region |
| Non-Critical Services | < 8 hours | Chat server, monitoring, etc. |
RPO (Recovery Point Objective)
| Data Type | RPO | Backup Frequency |
|---|---|---|
| Database | < 1 hour | Hourly incremental + Daily full |
| Application State | < 15 minutes | Real-time replication |
| User Uploads | < 1 hour | Hourly sync to S3 |
| Configuration | < 5 minutes | Git-based version control |
Disaster Scenarios
Scenario 1: Single Node Failure
Impact: Minimal - Pods rescheduled to other nodes
RTO: < 5 minutes
Procedure: Automatic pod rescheduling by Kubernetes
Recovery Steps:
- Kubernetes automatically detects node failure
- Pods are rescheduled to healthy nodes
- Services continue with minimal interruption
- Monitor for any pod startup issues
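A quick verification sketch for this scenario, using read-only kubectl checks (the namespace matches the one used throughout this document; the pod name is a placeholder):

```bash
# Confirm which node failed and watch rescheduling happen.
kubectl get nodes                                         # the failed node reports NotReady
kubectl get pods -n veza-production -o wide               # pods should move to healthy nodes
kubectl get events -n veza-production --sort-by=.lastTimestamp | tail -n 20

# If a pod stays Pending, inspect scheduling constraints and resource pressure.
kubectl describe pod <stuck-pod-name> -n veza-production
```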
Scenario 2: Database Primary Failure
Impact: High - Application cannot write data
RTO: < 5 minutes (with standby)
Procedure: Failover to standby replica
Recovery Steps:
- Detect primary database failure
- Promote standby replica to primary
- Update connection strings in ConfigMaps/Secrets
- Restart application pods to pick up new connection
- Verify data integrity
- Set up new standby replica
Runbook: See runbooks/database-failover.md
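A minimal sketch of steps 2-4 above, assuming a plain streaming-replication standby in a pod named `postgres-standby-0` (hypothetical) and the `veza-secrets`/`database-url` secret layout used in the restore procedure later in this document; a replication operator such as Patroni would handle promotion differently:

```bash
# Promote the standby (PostgreSQL 12+).
kubectl exec -it postgres-standby-0 -n veza-production -- \
  psql -U veza_user -d veza_db -c "SELECT pg_promote();"

# Point the application at the promoted instance (host and password are placeholders),
# then restart pods so they pick up the new connection string.
kubectl patch secret veza-secrets -n veza-production \
  -p '{"stringData":{"database-url":"postgres://veza_user:<password>@postgres-standby:5432/veza_db"}}'
kubectl rollout restart deployment/veza-backend-api -n veza-production
```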
Scenario 3: Application Deployment Failure
Impact: Medium - Service degradation or outage
RTO: < 5 minutes
Procedure: Automatic rollback
Recovery Steps:
- Health checks detect failure
- Automatic rollback to previous version
- Verify service health
- Investigate root cause
- Fix and redeploy
Runbook: See runbooks/rollback-procedure.md
Scenario 4: Complete Cluster Failure
Impact: Critical - Complete platform outage
RTO: < 4 hours
Procedure: Failover to DR region
Recovery Steps:
- Declare disaster
- Activate DR region
- Restore database from latest backup
- Deploy applications
- Update DNS to point to DR region
- Verify all services
- Begin recovery of primary region
Runbook: See runbooks/cluster-failover.md
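Step 5 (the DNS cutover) depends on the DNS provider; as an illustration only, assuming the zone is hosted in AWS Route 53, with a hypothetical zone ID, record name, and DR load balancer hostname:

```bash
# Point the application record at the DR region with a short TTL.
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.veza.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "<dr-region-load-balancer-hostname>"}]
      }
    }]
  }'
```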
Scenario 5: Data Corruption or Loss
Impact: Critical - Data integrity compromised
RTO: < 2 hours
Procedure: Restore from backup
Recovery Steps:
- Identify affected data
- Stop writes to affected database
- Restore from most recent clean backup
- Verify data integrity
- Resume operations
- Investigate root cause
Runbook: See runbooks/data-restore.md
Scenario 6: Security Breach
Impact: Critical - Security and data integrity
RTO: < 1 hour (containment)
Procedure: Incident response and isolation
Recovery Steps:
- Isolate affected systems
- Preserve evidence
- Assess scope of breach
- Revoke compromised credentials
- Restore from clean backup if needed
- Patch vulnerabilities
- Resume operations with enhanced monitoring
Runbook: See runbooks/security-incident.md
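A containment sketch for steps 1 and 4 above; the node name and secret values are placeholders, and evidence on the affected node should be preserved before any workloads are removed from it:

```bash
# Keep new workloads off the compromised node; leave existing pods in place for forensics.
kubectl cordon <compromised-node>

# Rotate credentials held in the shared application secret, then restart workloads
# so no pod keeps using the old values.
kubectl patch secret veza-secrets -n veza-production \
  -p '{"stringData":{"database-url":"<rotated-connection-string>"}}'
kubectl rollout restart deployment -n veza-production
```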
Recovery Procedures
Pre-Recovery Checklist
Before initiating any recovery procedure:
- Verify backup availability and integrity
- Confirm DR resources are available
- Notify stakeholders
- Document incident details
- Prepare recovery runbook
- Verify access credentials
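A sketch for the first checklist item, assuming custom-format dumps in the backup location used by the restore procedure below (`postgres-pod` is the same placeholder used there):

```bash
# Confirm the newest dump exists, is non-empty, and has a readable table of contents.
LATEST=$(kubectl exec postgres-pod -n veza-production -- \
  sh -c 'ls -t /backups/postgres/*.dump | head -n 1')
kubectl exec postgres-pod -n veza-production -- sh -c "test -s '$LATEST'" \
  && kubectl exec postgres-pod -n veza-production -- pg_restore --list "$LATEST" > /dev/null \
  && echo "Backup verified: $LATEST" \
  || echo "Backup verification FAILED: $LATEST"
```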
Database Recovery
Full Database Restore
```bash
# 1. Stop application to prevent writes
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# 2. Identify backup to restore
kubectl get pvc postgres-backup-storage -n veza-production

# List available backups
kubectl exec -it postgres-pod -n veza-production -- ls -lh /backups/postgres/

# 3. Restore database
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$(kubectl get secret veza-secrets -n veza-production -o jsonpath='{.data.database-url}' | base64 -d | grep -oP 'password=\K[^&]+')" \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
  {
    "spec": {
      "containers": [{
        "name": "postgres-restore",
        "image": "postgres:15-alpine",
        "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump --clean --if-exists"],
        "volumeMounts": [{
          "name": "backup-storage",
          "mountPath": "/backups"
        }]
      }],
      "volumes": [{
        "name": "backup-storage",
        "persistentVolumeClaim": {
          "claimName": "postgres-backup-storage"
        }
      }]
    }
  }' \
  -n veza-production

# 4. Verify data integrity
kubectl exec -it postgres-pod -n veza-production -- psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"

# 5. Restart application
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```
Point-in-Time Recovery
```bash
# Point-in-time recovery uses a base backup plus archived WAL, not pg_restore.
# Run on the database host/pod after restoring the base backup into an empty $PGDATA.
cat >> "$PGDATA/postgresql.conf" <<EOF
restore_command = 'cp /backups/postgres/wal/%f %p'
recovery_target_time = '2025-01-01 12:00:00'
EOF
touch "$PGDATA/recovery.signal"   # PostgreSQL 12+ recovery signal file
pg_ctl start -D "$PGDATA"         # replays WAL up to the target time
```
Application Recovery
Rollback Deployment
```bash
# Rollback to previous version
kubectl rollout undo deployment/veza-backend-api -n veza-production

# Verify rollback
kubectl rollout status deployment/veza-backend-api -n veza-production

# Check logs
kubectl logs -f deployment/veza-backend-api -n veza-production
```
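If the immediately previous revision is also unhealthy, roll back to a specific known-good revision instead:

```bash
# Inspect revision history, then target a specific revision.
kubectl rollout history deployment/veza-backend-api -n veza-production
kubectl rollout undo deployment/veza-backend-api --to-revision=<revision-number> -n veza-production
```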
Redeploy from Git
```bash
# Get latest code
git checkout main
git pull origin main

# Build and push image
docker build -t veza-backend-api:latest .
docker push veza-backend-api:latest

# Update deployment
kubectl set image deployment/veza-backend-api \
  veza-backend-api=veza-backend-api:latest \
  -n veza-production

# Verify
kubectl rollout status deployment/veza-backend-api -n veza-production
```
Infrastructure Recovery
Cluster Rebuild
```bash
# 1. Provision new cluster (using Terraform/Infrastructure as Code)
cd infrastructure/terraform
terraform apply

# 2. Install prerequisites
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secrets/   # Restore secrets from Vault

# 3. Deploy applications
kubectl apply -f k8s/backend-api/
kubectl apply -f k8s/frontend/
kubectl apply -f k8s/chat-server/

# 4. Restore data
# Follow database recovery procedure

# 5. Verify services
kubectl get pods -n veza-production
kubectl get svc -n veza-production
```
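A basic post-rebuild smoke test; the Service name and `/health` path are assumptions and should be adjusted to the real API:

```bash
# Everything should be Running; this lists anything that is not.
kubectl get pods -n veza-production --field-selector=status.phase!=Running

# Hit the backend health endpoint from inside the cluster.
kubectl run smoke-test --rm -it --restart=Never --image=curlimages/curl -n veza-production -- \
  curl -fsS http://veza-backend-api/health
```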
Testing Procedures
Test Schedule
| Test Type | Frequency | Duration |
|---|---|---|
| Database Restore | Monthly | 2 hours |
| Application Rollback | After each deployment | 15 minutes |
| Full DR Drill | Quarterly | 4 hours |
| Security Incident Drill | Semi-annually | 2 hours |
Database Restore Test
Objective: Verify database backups are valid and can be restored
Procedure:
- Create test namespace: `kubectl create namespace veza-dr-test`
- Deploy test PostgreSQL instance
- Restore latest backup to test instance
- Verify data integrity
- Run smoke tests against restored database
- Document results
- Clean up test namespace
Success Criteria:
- Backup restores successfully
- All tables present and accessible
- Data integrity verified
- Restore completes within RTO
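A sketch of this test end to end, using a throwaway pod in the test namespace; the backup filename is a placeholder, and the image matches the production PostgreSQL version used elsewhere in this document:

```bash
# Throwaway PostgreSQL instance for the restore test.
kubectl create namespace veza-dr-test
kubectl run postgres-test -n veza-dr-test --image=postgres:15-alpine \
  --env="POSTGRES_USER=veza_user" --env="POSTGRES_PASSWORD=test-only" --env="POSTGRES_DB=veza_db"
kubectl wait pod/postgres-test -n veza-dr-test --for=condition=Ready --timeout=120s

# Copy the latest backup over and restore it.
kubectl cp veza-production/postgres-pod:/backups/postgres/<latest-backup>.dump /tmp/restore-test.dump
kubectl cp /tmp/restore-test.dump veza-dr-test/postgres-test:/tmp/restore-test.dump
kubectl exec postgres-test -n veza-dr-test -- \
  pg_restore -U veza_user -d veza_db /tmp/restore-test.dump

# Spot-check the data, then clean up.
kubectl exec postgres-test -n veza-dr-test -- \
  psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"
kubectl delete namespace veza-dr-test
```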
Application Rollback Test
Objective: Verify rollback procedure works correctly
Procedure:
- Deploy test version with known issue
- Verify health checks fail
- Execute rollback procedure
- Verify service returns to healthy state
- Document rollback time
Success Criteria:
- Rollback completes within 5 minutes
- Service returns to healthy state
- No data loss
- Users experience minimal disruption
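A sketch of the test in a controlled window, using a deliberately bad image tag (placeholder) to trip the health checks:

```bash
# Roll out a known-bad version and confirm it never becomes healthy.
kubectl set image deployment/veza-backend-api veza-backend-api=veza-backend-api:<known-bad-tag> \
  -n veza-production
kubectl rollout status deployment/veza-backend-api -n veza-production --timeout=120s || true

# Measure the rollback against the 5-minute target.
time ( kubectl rollout undo deployment/veza-backend-api -n veza-production && \
       kubectl rollout status deployment/veza-backend-api -n veza-production )
```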
Full DR Drill
Objective: Test complete disaster recovery in DR region
Procedure:
- Schedule maintenance window
- Simulate primary region failure
- Activate DR region
- Restore all services
- Verify platform functionality
- Document lessons learned
- Restore primary region
Success Criteria:
- DR activation within 4 hours
- All critical services operational
- Data integrity maintained
- Users can access platform
Communication Plan
Incident Notification
Immediate (P0 - Critical):
- DevOps Lead: +1-XXX-XXX-XXXX
- CTO: +1-XXX-XXX-XXXX
- On-Call Engineer: PagerDuty
Escalation (P1 - High):
- Engineering Manager
- Product Manager
Status Updates:
- Internal: Slack #incidents channel
- External: Status page (status.veza.com)
- Customers: Email notification (if > 1 hour downtime)
Communication Templates
Initial Incident Notification
Subject: [INCIDENT] Veza Platform - <Issue Description>
Severity: P0/P1/P2
Status: Investigating
Impact: <Description>
ETA: <Time>
Team is actively working on resolution.
Updates will be posted every 15 minutes.
Resolution Notification
Subject: [RESOLVED] Veza Platform - <Issue Description>
Status: Resolved
Duration: <Time>
Root Cause: <Description>
Prevention: <Actions taken>
Platform is fully operational.
Runbooks
Detailed runbooks are available in the runbooks/ directory:
- `database-failover.md` - Database failover procedure
- `rollback-procedure.md` - Application rollback steps
- `cluster-failover.md` - Complete cluster failover
- `data-restore.md` - Data restoration procedures
- `security-incident.md` - Security incident response
Maintenance and Updates
Regular Reviews
- Monthly: Review and update recovery procedures
- Quarterly: Full DR drill and documentation update
- Annually: Complete disaster recovery plan review
Backup Verification
- Daily: Automated backup verification
- Weekly: Manual backup integrity check
- Monthly: Full restore test
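A sketch of the daily automated check, run wherever the backup volume is mounted (for example as a Kubernetes CronJob); it only confirms that the newest dump is present, non-empty, and parseable by pg_restore, not that the data is semantically correct:

```bash
#!/bin/sh
set -e
LATEST=$(ls -t /backups/postgres/*.dump | head -n 1)
test -s "$LATEST"                          # dump exists and is non-empty
pg_restore --list "$LATEST" > /dev/null    # table of contents parses
echo "OK: verified $LATEST at $(date -u)"
```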
Documentation Updates
All changes to infrastructure, applications, or procedures must be reflected in this disaster recovery plan within 7 days.
Contact Information
Disaster Recovery Team:
- DevOps Lead: devops@veza.com
- SRE Team: sre@veza.com
- On-Call: PagerDuty rotation
Last Updated: 2025-12-25
Next Review: 2026-03-25
Owner: DevOps Team