Disaster Recovery Plan for Veza Platform
This document outlines the comprehensive disaster recovery plan for the Veza platform, including recovery procedures, testing protocols, and operational runbooks.
Executive Summary
Recovery Time Objective (RTO): < 4 hours
Recovery Point Objective (RPO): < 1 hour
Maximum Acceptable Downtime: 4 hours
Data Loss Tolerance: 1 hour
Table of Contents
- Overview
- Recovery Objectives
- Disaster Scenarios
- Recovery Procedures
- Testing Procedures
- Communication Plan
- Runbooks
- Maintenance and Updates
Overview
The Veza platform disaster recovery plan ensures business continuity and data protection in the event of various failure scenarios. The plan covers:
- Infrastructure Failures: Node failures, cluster failures, network outages
- Application Failures: Service crashes, deployment failures, configuration errors
- Data Failures: Database corruption, data loss, backup failures
- Regional Failures: Complete datacenter or region outages
- Security Incidents: Breaches, ransomware, unauthorized access
Recovery Objectives
RTO (Recovery Time Objective)
| Component | RTO | Description |
|---|---|---|
| Critical Services | < 1 hour | Backend API, Frontend, Authentication |
| Database | < 2 hours | PostgreSQL with failover to standby |
| Complete Platform | < 4 hours | Full platform recovery in DR region |
| Non-Critical Services | < 8 hours | Chat server, monitoring, etc. |
RPO (Recovery Point Objective)
| Data Type | RPO | Backup Frequency |
|---|---|---|
| Database | < 1 hour | Hourly incremental + Daily full |
| Application State | < 15 minutes | Real-time replication |
| User Uploads | < 1 hour | Hourly sync to S3 |
| Configuration | < 5 minutes | Git-based version control |
Disaster Scenarios
Scenario 1: Single Node Failure
Impact: Minimal - Pods rescheduled to other nodes
RTO: < 5 minutes
Procedure: Automatic pod rescheduling by Kubernetes
Recovery Steps:
- Kubernetes automatically detects node failure
- Pods are rescheduled to healthy nodes
- Services continue with minimal interruption
- Monitor for any pod startup issues
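A quick verification sketch for this scenario, using read-only kubectl checks (the namespace matches the one used throughout this document; the pod name is a placeholder):

```bash
# Confirm which node failed and watch rescheduling happen.
kubectl get nodes                                         # the failed node reports NotReady
kubectl get pods -n veza-production -o wide               # pods should move to healthy nodes
kubectl get events -n veza-production --sort-by=.lastTimestamp | tail -n 20

# If a pod stays Pending, inspect scheduling constraints and resource pressure.
kubectl describe pod <stuck-pod-name> -n veza-production
```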
Scenario 2: Database Primary Failure
Impact: High - Application cannot write data
RTO: < 5 minutes (with standby)
Procedure: Failover to standby replica
Recovery Steps:
- Detect primary database failure
- Promote standby replica to primary
- Update connection strings in ConfigMaps/Secrets
- Restart application pods to pick up new connection
- Verify data integrity
- Set up new standby replica
Runbook: See runbooks/database-failover.md
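A minimal sketch of steps 2-4 above, assuming a plain streaming-replication standby in a pod named `postgres-standby-0` (hypothetical) and the `veza-secrets`/`database-url` secret layout used in the restore procedure later in this document; a replication operator such as Patroni would handle promotion differently:

```bash
# Promote the standby (PostgreSQL 12+).
kubectl exec -it postgres-standby-0 -n veza-production -- \
  psql -U veza_user -d veza_db -c "SELECT pg_promote();"

# Point the application at the promoted instance (host and password are placeholders),
# then restart pods so they pick up the new connection string.
kubectl patch secret veza-secrets -n veza-production \
  -p '{"stringData":{"database-url":"postgres://veza_user:<password>@postgres-standby:5432/veza_db"}}'
kubectl rollout restart deployment/veza-backend-api -n veza-production
```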
Scenario 3: Application Deployment Failure
Impact: Medium - Service degradation or outage
RTO: < 5 minutes
Procedure: Automatic rollback
Recovery Steps:
- Health checks detect failure
- Automatic rollback to previous version
- Verify service health
- Investigate root cause
- Fix and redeploy
Runbook: See runbooks/rollback-procedure.md
Scenario 4: Complete Cluster Failure
Impact: Critical - Complete platform outage
RTO: < 4 hours
Procedure: Failover to DR region
Recovery Steps:
- Declare disaster
- Activate DR region
- Restore database from latest backup
- Deploy applications
- Update DNS to point to DR region
- Verify all services
- Begin recovery of primary region
Runbook: See runbooks/cluster-failover.md
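Step 5 (the DNS cutover) depends on the DNS provider; as an illustration only, assuming the zone is hosted in AWS Route 53, with a hypothetical zone ID, record name, and DR load balancer hostname:

```bash
# Point the application record at the DR region with a short TTL.
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.veza.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "<dr-region-load-balancer-hostname>"}]
      }
    }]
  }'
```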
Scenario 5: Data Corruption or Loss
Impact: Critical - Data integrity compromised
RTO: < 2 hours
Procedure: Restore from backup
Recovery Steps:
- Identify affected data
- Stop writes to affected database
- Restore from most recent clean backup
- Verify data integrity
- Resume operations
- Investigate root cause
Runbook: See runbooks/data-restore.md
Scenario 6: Security Breach
Impact: Critical - Security and data integrity
RTO: < 1 hour (containment)
Procedure: Incident response and isolation
Recovery Steps:
- Isolate affected systems
- Preserve evidence
- Assess scope of breach
- Revoke compromised credentials
- Restore from clean backup if needed
- Patch vulnerabilities
- Resume operations with enhanced monitoring
Runbook: See runbooks/security-incident.md
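A containment sketch for steps 1 and 4 above; the node name and secret values are placeholders, and evidence on the affected node should be preserved before any workloads are removed from it:

```bash
# Keep new workloads off the compromised node; leave existing pods in place for forensics.
kubectl cordon <compromised-node>

# Rotate credentials held in the shared application secret, then restart workloads
# so no pod keeps using the old values.
kubectl patch secret veza-secrets -n veza-production \
  -p '{"stringData":{"database-url":"<rotated-connection-string>"}}'
kubectl rollout restart deployment -n veza-production
```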
Recovery Procedures
Pre-Recovery Checklist
Before initiating any recovery procedure:
- Verify backup availability and integrity
- Confirm DR resources are available
- Notify stakeholders
- Document incident details
- Prepare recovery runbook
- Verify access credentials
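A sketch for the first checklist item, assuming custom-format dumps in the backup location used by the restore procedure below (`postgres-pod` is the same placeholder used there):

```bash
# Confirm the newest dump exists, is non-empty, and has a readable table of contents.
LATEST=$(kubectl exec postgres-pod -n veza-production -- \
  sh -c 'ls -t /backups/postgres/*.dump | head -n 1')
kubectl exec postgres-pod -n veza-production -- sh -c "test -s '$LATEST'" \
  && kubectl exec postgres-pod -n veza-production -- pg_restore --list "$LATEST" > /dev/null \
  && echo "Backup verified: $LATEST" \
  || echo "Backup verification FAILED: $LATEST"
```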
Database Recovery
Full Database Restore
```bash
# 1. Stop application to prevent writes
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# 2. Identify backup to restore
kubectl get pvc postgres-backup-storage -n veza-production

# List available backups
kubectl exec -it postgres-pod -n veza-production -- ls -lh /backups/postgres/

# 3. Restore database
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$(kubectl get secret veza-secrets -n veza-production -o jsonpath='{.data.database-url}' | base64 -d | grep -oP 'password=\K[^&]+')" \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
  {
    "spec": {
      "containers": [{
        "name": "postgres-restore",
        "image": "postgres:15-alpine",
        "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump --clean --if-exists"],
        "volumeMounts": [{
          "name": "backup-storage",
          "mountPath": "/backups"
        }]
      }],
      "volumes": [{
        "name": "backup-storage",
        "persistentVolumeClaim": {
          "claimName": "postgres-backup-storage"
        }
      }]
    }
  }' \
  -n veza-production

# 4. Verify data integrity
kubectl exec -it postgres-pod -n veza-production -- psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"

# 5. Restart application
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```
Point-in-Time Recovery
```bash
# Point-in-time recovery uses a base backup plus archived WAL, not pg_restore.
# Run on the database host/pod after restoring the base backup into an empty $PGDATA.
cat >> "$PGDATA/postgresql.conf" <<EOF
restore_command = 'cp /backups/postgres/wal/%f %p'
recovery_target_time = '2025-01-01 12:00:00'
EOF
touch "$PGDATA/recovery.signal"   # PostgreSQL 12+ recovery signal file
pg_ctl start -D "$PGDATA"         # replays WAL up to the target time
```
Application Recovery
Rollback Deployment
```bash
# Rollback to previous version
kubectl rollout undo deployment/veza-backend-api -n veza-production

# Verify rollback
kubectl rollout status deployment/veza-backend-api -n veza-production

# Check logs
kubectl logs -f deployment/veza-backend-api -n veza-production
```
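If the immediately previous revision is also unhealthy, roll back to a specific known-good revision instead:

```bash
# Inspect revision history, then target a specific revision.
kubectl rollout history deployment/veza-backend-api -n veza-production
kubectl rollout undo deployment/veza-backend-api --to-revision=<revision-number> -n veza-production
```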
Redeploy from Git
```bash
# Get latest code
git checkout main
git pull origin main

# Build and push image
docker build -t veza-backend-api:latest .
docker push veza-backend-api:latest

# Update deployment
kubectl set image deployment/veza-backend-api \
  veza-backend-api=veza-backend-api:latest \
  -n veza-production

# Verify
kubectl rollout status deployment/veza-backend-api -n veza-production
```
Infrastructure Recovery
Cluster Rebuild
```bash
# 1. Provision new cluster (using Terraform/Infrastructure as Code)
cd infrastructure/terraform
terraform apply

# 2. Install prerequisites
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secrets/   # Restore secrets from Vault

# 3. Deploy applications
kubectl apply -f k8s/backend-api/
kubectl apply -f k8s/frontend/
kubectl apply -f k8s/chat-server/

# 4. Restore data
# Follow database recovery procedure

# 5. Verify services
kubectl get pods -n veza-production
kubectl get svc -n veza-production
```
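A basic post-rebuild smoke test; the Service name and `/health` path are assumptions and should be adjusted to the real API:

```bash
# Everything should be Running; this lists anything that is not.
kubectl get pods -n veza-production --field-selector=status.phase!=Running

# Hit the backend health endpoint from inside the cluster.
kubectl run smoke-test --rm -it --restart=Never --image=curlimages/curl -n veza-production -- \
  curl -fsS http://veza-backend-api/health
```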
Testing Procedures
Test Schedule
| Test Type | Frequency | Duration |
|---|---|---|
| Database Restore | Monthly | 2 hours |
| Application Rollback | After each deployment | 15 minutes |
| Full DR Drill | Quarterly | 4 hours |
| Security Incident Drill | Semi-annually | 2 hours |
Database Restore Test
Objective: Verify database backups are valid and can be restored
Procedure:
- Create test namespace: `kubectl create namespace veza-dr-test`
- Deploy test PostgreSQL instance
- Restore latest backup to test instance
- Verify data integrity
- Run smoke tests against restored database
- Document results
- Clean up test namespace
Success Criteria:
- Backup restores successfully
- All tables present and accessible
- Data integrity verified
- Restore completes within RTO
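A sketch of this test end to end, using a throwaway pod in the test namespace; the backup filename is a placeholder, and the image matches the production PostgreSQL version used elsewhere in this document:

```bash
# Throwaway PostgreSQL instance for the restore test.
kubectl create namespace veza-dr-test
kubectl run postgres-test -n veza-dr-test --image=postgres:15-alpine \
  --env="POSTGRES_USER=veza_user" --env="POSTGRES_PASSWORD=test-only" --env="POSTGRES_DB=veza_db"
kubectl wait pod/postgres-test -n veza-dr-test --for=condition=Ready --timeout=120s

# Copy the latest backup over and restore it.
kubectl cp veza-production/postgres-pod:/backups/postgres/<latest-backup>.dump /tmp/restore-test.dump
kubectl cp /tmp/restore-test.dump veza-dr-test/postgres-test:/tmp/restore-test.dump
kubectl exec postgres-test -n veza-dr-test -- \
  pg_restore -U veza_user -d veza_db /tmp/restore-test.dump

# Spot-check the data, then clean up.
kubectl exec postgres-test -n veza-dr-test -- \
  psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"
kubectl delete namespace veza-dr-test
```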
Application Rollback Test
Objective: Verify rollback procedure works correctly
Procedure:
- Deploy test version with known issue
- Verify health checks fail
- Execute rollback procedure
- Verify service returns to healthy state
- Document rollback time
Success Criteria:
- Rollback completes within 5 minutes
- Service returns to healthy state
- No data loss
- Users experience minimal disruption
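A sketch of the test in a controlled window, using a deliberately bad image tag (placeholder) to trip the health checks:

```bash
# Roll out a known-bad version and confirm it never becomes healthy.
kubectl set image deployment/veza-backend-api veza-backend-api=veza-backend-api:<known-bad-tag> \
  -n veza-production
kubectl rollout status deployment/veza-backend-api -n veza-production --timeout=120s || true

# Measure the rollback against the 5-minute target.
time ( kubectl rollout undo deployment/veza-backend-api -n veza-production && \
       kubectl rollout status deployment/veza-backend-api -n veza-production )
```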
Full DR Drill
Objective: Test complete disaster recovery in DR region
Procedure:
- Schedule maintenance window
- Simulate primary region failure
- Activate DR region
- Restore all services
- Verify platform functionality
- Document lessons learned
- Restore primary region
Success Criteria:
- DR activation within 4 hours
- All critical services operational
- Data integrity maintained
- Users can access platform
Communication Plan
Incident Notification
Immediate (P0 - Critical):
- DevOps Lead: +1-XXX-XXX-XXXX
- CTO: +1-XXX-XXX-XXXX
- On-Call Engineer: PagerDuty
Escalation (P1 - High):
- Engineering Manager
- Product Manager
Status Updates:
- Internal: Slack #incidents channel
- External: Status page (status.veza.com)
- Customers: Email notification (if > 1 hour downtime)
Communication Templates
Initial Incident Notification
Subject: [INCIDENT] Veza Platform - <Issue Description>
Severity: P0/P1/P2
Status: Investigating
Impact: <Description>
ETA: <Time>
Team is actively working on resolution.
Updates will be posted every 15 minutes.
Resolution Notification
Subject: [RESOLVED] Veza Platform - <Issue Description>
Status: Resolved
Duration: <Time>
Root Cause: <Description>
Prevention: <Actions taken>
Platform is fully operational.
Runbooks
Detailed runbooks are available in the runbooks/ directory:
- `database-failover.md` - Database failover procedure
- `rollback-procedure.md` - Application rollback steps
- `cluster-failover.md` - Complete cluster failover
- `data-restore.md` - Data restoration procedures
- `security-incident.md` - Security incident response
Maintenance and Updates
Regular Reviews
- Monthly: Review and update recovery procedures
- Quarterly: Full DR drill and documentation update
- Annually: Complete disaster recovery plan review
Backup Verification
- Daily: Automated backup verification
- Weekly: Manual backup integrity check
- Monthly: Full restore test
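A sketch of the daily automated check, run wherever the backup volume is mounted (for example as a Kubernetes CronJob); it only confirms that the newest dump is present, non-empty, and parseable by pg_restore, not that the data is semantically correct:

```bash
#!/bin/sh
set -e
LATEST=$(ls -t /backups/postgres/*.dump | head -n 1)
test -s "$LATEST"                          # dump exists and is non-empty
pg_restore --list "$LATEST" > /dev/null    # table of contents parses
echo "OK: verified $LATEST at $(date -u)"
```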
Documentation Updates
All changes to infrastructure, applications, or procedures must be reflected in this disaster recovery plan within 7 days.
Contact Information
Disaster Recovery Team:
- DevOps Lead: devops@veza.com
- SRE Team: sre@veza.com
- On-Call: PagerDuty rotation
Last Updated: 2025-12-25
Next Review: 2026-03-25
Owner: DevOps Team