# Disaster Recovery Plan for Veza Platform
This document outlines the comprehensive disaster recovery plan for the Veza platform, including recovery procedures, testing protocols, and operational runbooks.
## Executive Summary
**Recovery Time Objective (RTO)**: < 4 hours
**Recovery Point Objective (RPO)**: < 1 hour
**Maximum Acceptable Downtime**: 4 hours
**Data Loss Tolerance**: 1 hour
## Table of Contents
1. [Overview](#overview)
2. [Recovery Objectives](#recovery-objectives)
3. [Disaster Scenarios](#disaster-scenarios)
4. [Recovery Procedures](#recovery-procedures)
5. [Testing Procedures](#testing-procedures)
6. [Communication Plan](#communication-plan)
7. [Runbooks](#runbooks)
8. [Maintenance and Updates](#maintenance-and-updates)
## Overview
The Veza platform disaster recovery plan ensures business continuity and data protection in the event of various failure scenarios. The plan covers:
- **Infrastructure Failures**: Node failures, cluster failures, network outages
- **Application Failures**: Service crashes, deployment failures, configuration errors
- **Data Failures**: Database corruption, data loss, backup failures
- **Regional Failures**: Complete datacenter or region outages
- **Security Incidents**: Breaches, ransomware, unauthorized access
## Recovery Objectives
### RTO (Recovery Time Objective)
| Component | RTO | Description |
|-----------|-----|-------------|
| Critical Services | < 1 hour | Backend API, Frontend, Authentication |
| Database | < 2 hours | PostgreSQL with failover to standby |
| Complete Platform | < 4 hours | Full platform recovery in DR region |
| Non-Critical Services | < 8 hours | Chat server, monitoring, etc. |
### RPO (Recovery Point Objective)
| Data Type | RPO | Backup Frequency |
|-----------|-----|------------------|
| Database | < 1 hour | Hourly incremental + Daily full |
| Application State | < 15 minutes | Real-time replication |
| User Uploads | < 1 hour | Hourly sync to S3 |
| Configuration | < 5 minutes | Git-based version control |
## Disaster Scenarios
### Scenario 1: Single Node Failure
**Impact**: Minimal - Pods rescheduled to other nodes
**RTO**: < 5 minutes
**Procedure**: Automatic pod rescheduling by Kubernetes
**Recovery Steps**:
1. Kubernetes automatically detects node failure
2. Pods are rescheduled to healthy nodes
3. Services continue with minimal interruption
4. Monitor for any pod startup issues
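The automatic rescheduling above can be observed, and nudged along if it stalls, with standard kubectl commands (a hedged sketch; the node name `worker-3` is illustrative):

```bash
# Confirm which node is NotReady
kubectl get nodes
# List pods still bound to the failed node
kubectl get pods -n veza-production -o wide --field-selector spec.nodeName=worker-3
# If rescheduling stalls, delete the dead node object so its pods are freed immediately
kubectl delete node worker-3
# Watch replacement pods come up
kubectl get pods -n veza-production -w
```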
### Scenario 2: Database Primary Failure
**Impact**: High - Application cannot write data
**RTO**: < 5 minutes (with standby)
**Procedure**: Failover to standby replica
**Recovery Steps**:
1. Detect primary database failure
2. Promote standby replica to primary
3. Update connection strings in ConfigMaps/Secrets
4. Restart application pods to pick up new connection
5. Verify data integrity
6. Set up new standby replica
**Runbook**: See `runbooks/database-failover.md`
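Steps 2–4 can be sketched for plain streaming replication (a hedged example; the standby pod name `postgres-standby-0`, data directory, and `role` label are assumptions — a managed operator such as Patroni automates this sequence):

```bash
# Promote the standby to primary (PostgreSQL 12+)
kubectl exec -it postgres-standby-0 -n veza-production -- \
  pg_ctl promote -D /var/lib/postgresql/data
# Quick cutover: retarget the service selector at the promoted pod's label
kubectl patch service postgres-service -n veza-production \
  -p '{"spec":{"selector":{"role":"standby"}}}'
# Restart the API so pods re-establish connections against the new primary
kubectl rollout restart deployment/veza-backend-api -n veza-production
```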
### Scenario 3: Application Deployment Failure
**Impact**: Medium - Service degradation or outage
**RTO**: < 5 minutes
**Procedure**: Automatic rollback
**Recovery Steps**:
1. Health checks detect failure
2. Automatic rollback to previous version
3. Verify service health
4. Investigate root cause
5. Fix and redeploy
**Runbook**: See `runbooks/rollback-procedure.md`
### Scenario 4: Complete Cluster Failure
**Impact**: Critical - Complete platform outage
**RTO**: < 4 hours
**Procedure**: Failover to DR region
**Recovery Steps**:
1. Declare disaster
2. Activate DR region
3. Restore database from latest backup
4. Deploy applications
5. Update DNS to point to DR region
6. Verify all services
7. Begin recovery of primary region
**Runbook**: See `runbooks/cluster-failover.md`
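Step 5 (DNS cutover) might look like this with Route 53 (a hedged sketch; the hosted zone ID, record name, and DR load-balancer hostname are hypothetical):

```bash
# Repoint api.veza.com at the DR region's load balancer with a short TTL
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000000 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.veza.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "dr-lb.example.elb.amazonaws.com"}]
      }
    }]
  }'
```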
### Scenario 5: Data Corruption or Loss
**Impact**: Critical - Data integrity compromised
**RTO**: < 2 hours
**Procedure**: Restore from backup
**Recovery Steps**:
1. Identify affected data
2. Stop writes to affected database
3. Restore from most recent clean backup
4. Verify data integrity
5. Resume operations
6. Investigate root cause
**Runbook**: See `runbooks/data-restore.md`
### Scenario 6: Security Breach
**Impact**: Critical - Security and data integrity
**RTO**: < 1 hour (containment)
**Procedure**: Incident response and isolation
**Recovery Steps**:
1. Isolate affected systems
2. Preserve evidence
3. Assess scope of breach
4. Revoke compromised credentials
5. Restore from clean backup if needed
6. Patch vulnerabilities
7. Resume operations with enhanced monitoring
**Runbook**: See `runbooks/security-incident.md`
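Step 4 (revoking compromised credentials) can be sketched as follows (hedged; assumes secrets are re-sourced from Vault as in the cluster-rebuild procedure, and the `jwt-secret` key is an illustrative assumption — `database-url` appears elsewhere in this document):

```bash
# Replace the compromised secret with freshly rotated values
kubectl delete secret veza-secrets -n veza-production
kubectl create secret generic veza-secrets -n veza-production \
  --from-literal=database-url="$NEW_DATABASE_URL" \
  --from-literal=jwt-secret="$NEW_JWT_SECRET"
# Restart all workloads so they pick up the new credentials
kubectl rollout restart deployment -n veza-production
```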
## Recovery Procedures
### Pre-Recovery Checklist
Before initiating any recovery procedure:
- [ ] Verify backup availability and integrity
- [ ] Confirm DR resources are available
- [ ] Notify stakeholders
- [ ] Document incident details
- [ ] Prepare recovery runbook
- [ ] Verify access credentials
### Database Recovery
#### Full Database Restore
```bash
# 1. Stop application to prevent writes
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production
# 2. Identify backup to restore
kubectl get pvc postgres-backup-storage -n veza-production
# List available backups
kubectl exec -it postgres-pod -n veza-production -- ls -lh /backups/postgres/
# 3. Restore database
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$(kubectl get secret veza-secrets -n veza-production -o jsonpath='{.data.database-url}' | base64 -d | grep -oP 'password=\K[^&]+')" \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
  {
    "spec": {
      "containers": [{
        "name": "postgres-restore",
        "image": "postgres:15-alpine",
        "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump --clean --if-exists"],
        "volumeMounts": [{
          "name": "backup-storage",
          "mountPath": "/backups"
        }]
      }],
      "volumes": [{
        "name": "backup-storage",
        "persistentVolumeClaim": {
          "claimName": "postgres-backup-storage"
        }
      }]
    }
  }' \
  -n veza-production
# 4. Verify data integrity
kubectl exec -it postgres-pod -n veza-production -- psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"
# 5. Restart application
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```
#### Point-in-Time Recovery
Point-in-time recovery replays archived WAL on top of a base backup; it is driven by recovery settings on the server, not by `pg_restore` flags (which have no `--recovery-target-time` option). A sketch, assuming the base backup was taken with `pg_basebackup` and WAL archives live under `/backups/postgres/wal/`:
```bash
# 1. Restore the base backup into an empty data directory
tar -xf /backups/postgres/base_backup.tar -C "$PGDATA"
# 2. Configure the recovery target (PostgreSQL 12+)
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'cp /backups/postgres/wal/%f "%p"'
recovery_target_time = '2025-01-01 12:00:00'
recovery_target_action = 'promote'
EOF
touch "$PGDATA/recovery.signal"
# 3. Start PostgreSQL; it replays WAL up to the target time, then promotes
pg_ctl -D "$PGDATA" start
```
### Application Recovery
#### Rollback Deployment
```bash
# Rollback to previous version
kubectl rollout undo deployment/veza-backend-api -n veza-production
# Verify rollback
kubectl rollout status deployment/veza-backend-api -n veza-production
# Check logs
kubectl logs -f deployment/veza-backend-api -n veza-production
```
#### Redeploy from Git
```bash
# Get latest code
git checkout main
git pull origin main
# Build and push image
docker build -t veza-backend-api:latest .
docker push veza-backend-api:latest
# Update deployment
kubectl set image deployment/veza-backend-api \
veza-backend-api=veza-backend-api:latest \
-n veza-production
# Verify
kubectl rollout status deployment/veza-backend-api -n veza-production
```
### Infrastructure Recovery
#### Cluster Rebuild
```bash
# 1. Provision new cluster (using Terraform/Infrastructure as Code)
cd infrastructure/terraform
terraform apply
# 2. Install prerequisites
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secrets/ # Restore secrets from Vault
# 3. Deploy applications
kubectl apply -f k8s/backend-api/
kubectl apply -f k8s/frontend/
# 4. Restore data
# Follow database recovery procedure
# 5. Verify services
kubectl get pods -n veza-production
kubectl get svc -n veza-production
```
## Testing Procedures
### Test Schedule
| Test Type | Frequency | Duration |
|-----------|-----------|----------|
| Database Restore | Monthly | 2 hours |
| Application Rollback | After each deployment | 15 minutes |
| Full DR Drill | Quarterly | 4 hours |
| Security Incident Drill | Semi-annually | 2 hours |
### Database Restore Test
**Objective**: Verify database backups are valid and can be restored
**Procedure**:
1. Create test namespace: `kubectl create namespace veza-dr-test`
2. Deploy test PostgreSQL instance
3. Restore latest backup to test instance
4. Verify data integrity
5. Run smoke tests against restored database
6. Document results
7. Clean up test namespace
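The procedure above can be sketched end-to-end (hedged; the image tag, throwaway password, and backup filename are illustrative assumptions):

```bash
# 1. Isolated namespace for the drill
kubectl create namespace veza-dr-test
# 2. Throwaway PostgreSQL instance
kubectl run postgres-test --image=postgres:15-alpine -n veza-dr-test \
  --env="POSTGRES_PASSWORD=dr-test" --restart=Never
kubectl wait pod/postgres-test -n veza-dr-test --for=condition=Ready --timeout=120s
# 3. Copy in and restore the latest backup
kubectl cp /backups/postgres/latest.dump veza-dr-test/postgres-test:/tmp/latest.dump
kubectl exec postgres-test -n veza-dr-test -- \
  pg_restore -U postgres -d postgres -F c /tmp/latest.dump
# 4-5. Spot-check integrity and record the result
kubectl exec postgres-test -n veza-dr-test -- \
  psql -U postgres -d postgres -c "SELECT COUNT(*) FROM users;"
# 7. Clean up
kubectl delete namespace veza-dr-test
```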
**Success Criteria**:
- Backup restores successfully
- All tables present and accessible
- Data integrity verified
- Restore completes within RTO
### Application Rollback Test
**Objective**: Verify rollback procedure works correctly
**Procedure**:
1. Deploy test version with known issue
2. Verify health checks fail
3. Execute rollback procedure
4. Verify service returns to healthy state
5. Document rollback time
**Success Criteria**:
- Rollback completes within 5 minutes
- Service returns to healthy state
- No data loss
- Users experience minimal disruption
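One way to run this drill without touching production (hedged; the `veza-staging` namespace and the deliberately broken image tag are assumptions):

```bash
# 1. Deploy a deliberately broken image tag
kubectl set image deployment/veza-backend-api \
  veza-backend-api=veza-backend-api:does-not-exist -n veza-staging
# 2. Confirm the rollout stalls on failing health checks
kubectl rollout status deployment/veza-backend-api -n veza-staging --timeout=120s || true
# 3. Roll back and record how long it takes
time kubectl rollout undo deployment/veza-backend-api -n veza-staging
kubectl rollout status deployment/veza-backend-api -n veza-staging
```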
### Full DR Drill
**Objective**: Test complete disaster recovery in DR region
**Procedure**:
1. Schedule maintenance window
2. Simulate primary region failure
3. Activate DR region
4. Restore all services
5. Verify platform functionality
6. Document lessons learned
7. Restore primary region
**Success Criteria**:
- DR activation within 4 hours
- All critical services operational
- Data integrity maintained
- Users can access platform
## Communication Plan
### Incident Notification
**Immediate (P0 - Critical)**:
- DevOps Lead: +1-XXX-XXX-XXXX
- CTO: +1-XXX-XXX-XXXX
- On-Call Engineer: PagerDuty
**Escalation (P1 - High)**:
- Engineering Manager
- Product Manager
**Status Updates**:
- Internal: Slack #incidents channel
- External: Status page (status.veza.com)
- Customers: Email notification (if > 1 hour downtime)
### Communication Templates
#### Initial Incident Notification
```
Subject: [INCIDENT] Veza Platform - <Issue Description>
Severity: P0/P1/P2
Status: Investigating
Impact: <Description>
ETA: <Time>
Team is actively working on resolution.
Updates will be posted every 15 minutes.
```
#### Resolution Notification
```
Subject: [RESOLVED] Veza Platform - <Issue Description>
Status: Resolved
Duration: <Time>
Root Cause: <Description>
Prevention: <Actions taken>
Platform is fully operational.
```
## Runbooks
Detailed runbooks are available in the `runbooks/` directory:
- `database-failover.md` - Database failover procedure
- `rollback-procedure.md` - Application rollback steps
- `cluster-failover.md` - Complete cluster failover
- `data-restore.md` - Data restoration procedures
- `security-incident.md` - Security incident response
## Maintenance and Updates
### Regular Reviews
- **Monthly**: Review and update recovery procedures
- **Quarterly**: Full DR drill and documentation update
- **Annually**: Complete disaster recovery plan review
### Backup Verification
- **Daily**: Automated backup verification
- **Weekly**: Manual backup integrity check
- **Monthly**: Full restore test
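The daily automated check can be as simple as verifying a checksum written alongside each dump (a minimal sketch; assumes every `*.dump` has a sibling `*.dump.sha256` file produced at backup time):

```shell
#!/bin/sh
# verify_backups DIR: check each dump against its recorded sha256.
# Prints OK/CORRUPT per file; returns non-zero if any check fails.
verify_backups() {
  dir="$1"
  failed=0
  for f in "$dir"/*.dump; do
    [ -e "$f" ] || continue
    if sha256sum -c --status "${f}.sha256" 2>/dev/null; then
      echo "OK: $f"
    else
      echo "CORRUPT: $f"
      failed=1
    fi
  done
  return "$failed"
}
```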
### Documentation Updates
All changes to infrastructure, applications, or procedures must be reflected in this disaster recovery plan within 7 days.
## References
- [Backup Strategy](../backups/README.md)
- [Secrets Management](../secrets/README.md)
- [Monitoring Setup](../monitoring/README.md)
- [Kubernetes Deployment](../README.md)
## Contact Information
**Disaster Recovery Team**:
- DevOps Lead: devops@veza.com
- SRE Team: sre@veza.com
- On-Call: PagerDuty rotation
**Last Updated**: 2025-12-25
**Next Review**: 2026-03-25
**Owner**: DevOps Team