
# Disaster Recovery Plan for Veza Platform

This document outlines the comprehensive disaster recovery plan for the Veza platform, including recovery procedures, testing protocols, and operational runbooks.

## Executive Summary

- **Recovery Time Objective (RTO):** < 4 hours
- **Recovery Point Objective (RPO):** < 1 hour
- **Maximum Acceptable Downtime:** 4 hours
- **Data Loss Tolerance:** 1 hour

## Table of Contents

1. Overview
2. Recovery Objectives
3. Disaster Scenarios
4. Recovery Procedures
5. Testing Procedures
6. Communication Plan
7. Runbooks
8. Maintenance and Updates

## Overview

The Veza platform disaster recovery plan ensures business continuity and data protection in the event of various failure scenarios. The plan covers:

- **Infrastructure Failures:** Node failures, cluster failures, network outages
- **Application Failures:** Service crashes, deployment failures, configuration errors
- **Data Failures:** Database corruption, data loss, backup failures
- **Regional Failures:** Complete datacenter or region outages
- **Security Incidents:** Breaches, ransomware, unauthorized access

## Recovery Objectives

### RTO (Recovery Time Objective)

| Component | RTO | Description |
|---|---|---|
| Critical Services | < 1 hour | Backend API, Frontend, Authentication |
| Database | < 2 hours | PostgreSQL with failover to standby |
| Complete Platform | < 4 hours | Full platform recovery in DR region |
| Non-Critical Services | < 8 hours | Chat server, monitoring, etc. |

### RPO (Recovery Point Objective)

| Data Type | RPO | Backup Frequency |
|---|---|---|
| Database | < 1 hour | Hourly incremental + daily full |
| Application State | < 15 minutes | Real-time replication |
| User Uploads | < 1 hour | Hourly sync to S3 |
| Configuration | < 5 minutes | Git-based version control |
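
The hourly database backups in the table above can be produced by a Kubernetes CronJob. The sketch below is illustrative, not the platform's actual job definition: the image, schedule, secret key, and claim name are assumptions.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-hourly-backup
  namespace: veza-production
spec:
  schedule: "0 * * * *"           # hourly, matching the < 1 hour database RPO
  concurrencyPolicy: Forbid       # never run two backups at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:15-alpine
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump -h postgres-service -U veza_user -F c
                  -f /backups/postgres/veza_db_$(date +%Y%m%d_%H%M%S).dump veza_db
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: veza-secrets        # assumed secret layout
                      key: postgres-password    # hypothetical key name
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: postgres-backup-storage
```

The timestamped filename matches the `veza_db_YYYYMMDD_HHMMSS.dump` pattern expected by the restore procedure later in this document.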

Disaster Scenarios

### Scenario 1: Single Node Failure

- **Impact:** Minimal; pods are rescheduled to other nodes
- **RTO:** < 5 minutes
- **Procedure:** Automatic pod rescheduling by Kubernetes

**Recovery Steps:**

  1. Kubernetes automatically detects node failure
  2. Pods are rescheduled to healthy nodes
  3. Services continue with minimal interruption
  4. Monitor for any pod startup issues
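
Automatic rescheduling only preserves availability if enough replicas survive the disruption. A PodDisruptionBudget is one way to enforce that; the sketch below assumes a 3-replica deployment and an `app: veza-backend-api` pod label, which may differ from the actual manifests.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: veza-backend-api-pdb
  namespace: veza-production
spec:
  minAvailable: 2              # keep 2 of 3 replicas up during disruptions
  selector:
    matchLabels:
      app: veza-backend-api    # assumed pod label
```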

### Scenario 2: Database Primary Failure

- **Impact:** High; the application cannot write data
- **RTO:** < 5 minutes (with standby)
- **Procedure:** Failover to the standby replica

**Recovery Steps:**

  1. Detect primary database failure
  2. Promote standby replica to primary
  3. Update connection strings in ConfigMaps/Secrets
  4. Restart application pods to pick up new connection
  5. Verify data integrity
  6. Set up new standby replica

**Runbook:** see [runbooks/database-failover.md](runbooks/database-failover.md)
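
Step 3 amounts to rewriting the host in the stored connection string so the application points at the promoted standby. That rewrite can be sketched and tested in isolation; the URL and host names below are dummies, not the platform's real values.

```bash
#!/bin/sh
# Rewrite the host portion of a postgres:// connection string so the
# application points at the promoted standby. Replaces the first
# "@<host>" segment, leaving user, password, port, and database intact.
promote_url() {
  # $1 = current DATABASE_URL, $2 = new primary host
  echo "$1" | sed -E "s#@[^:/]+#@$2#"
}

promote_url "postgres://veza_user:secret@postgres-primary:5432/veza_db" \
  "postgres-standby"
# prints: postgres://veza_user:secret@postgres-standby:5432/veza_db
```

In practice the rewritten value would be written back into the Secret referenced by the ConfigMaps in step 3 before restarting the pods.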

### Scenario 3: Application Deployment Failure

- **Impact:** Medium; service degradation or outage
- **RTO:** < 5 minutes
- **Procedure:** Automatic rollback

**Recovery Steps:**

  1. Health checks detect failure
  2. Automatic rollback to previous version
  3. Verify service health
  4. Investigate root cause
  5. Fix and redeploy

**Runbook:** see [runbooks/rollback-procedure.md](runbooks/rollback-procedure.md)
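
The automatic detection in step 1 presupposes health probes on the deployment. A sketch of what they might look like follows; the endpoint paths, port, and thresholds are assumptions, not the backend's actual configuration.

```yaml
# Container-level probes inside the veza-backend-api Deployment spec.
# A failing liveness probe restarts the container; a failing readiness
# probe removes the pod from the Service and stalls the rollout.
livenessProbe:
  httpGet:
    path: /health            # assumed health endpoint
    port: 8080               # assumed container port
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready             # assumed readiness endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```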

### Scenario 4: Complete Cluster Failure

- **Impact:** Critical; complete platform outage
- **RTO:** < 4 hours
- **Procedure:** Failover to the DR region

**Recovery Steps:**

  1. Declare disaster
  2. Activate DR region
  3. Restore database from latest backup
  4. Deploy applications
  5. Update DNS to point to DR region
  6. Verify all services
  7. Begin recovery of primary region

**Runbook:** see [runbooks/cluster-failover.md](runbooks/cluster-failover.md)
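
Step 5 (pointing DNS at the DR region) can be expressed as a DNS change batch. The example below uses the AWS Route 53 change-batch format purely as an illustration; record names and the target load balancer are hypothetical, and any provider with low-TTL records works the same way.

```json
{
  "Comment": "Fail over api.veza.com to the DR region load balancer",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.veza.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "dr-region-lb.example.com" }]
      }
    }
  ]
}
```

The 60-second TTL keeps client caches short so the cutover completes well inside the 4-hour RTO.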

### Scenario 5: Data Corruption or Loss

- **Impact:** Critical; data integrity compromised
- **RTO:** < 2 hours
- **Procedure:** Restore from backup

**Recovery Steps:**

  1. Identify affected data
  2. Stop writes to affected database
  3. Restore from most recent clean backup
  4. Verify data integrity
  5. Resume operations
  6. Investigate root cause

**Runbook:** see [runbooks/data-restore.md](runbooks/data-restore.md)
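
Step 3 depends on identifying the most recent backup. With timestamped dump names of the form `veza_db_YYYYMMDD_HHMMSS.dump`, lexical order is chronological order, so a plain sort suffices; the directory below is a throwaway demo.

```bash
#!/bin/sh
# YYYYMMDD_HHMMSS timestamps sort lexically in chronological order,
# so the last entry of a sorted listing is the newest dump.
latest_backup() {
  ls "$1"/veza_db_*.dump 2>/dev/null | sort | tail -n 1
}

# Demo with dummy backup files
dir=$(mktemp -d)
touch "$dir/veza_db_20250101_000000.dump" "$dir/veza_db_20250102_120000.dump"
latest_backup "$dir"   # prints the 20250102 dump
```

Note this selects the *newest* backup; for corruption scenarios the operator must still verify it predates the corruption before restoring.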

### Scenario 6: Security Breach

- **Impact:** Critical; security and data integrity at risk
- **RTO:** < 1 hour (containment)
- **Procedure:** Incident response and isolation

**Recovery Steps:**

  1. Isolate affected systems
  2. Preserve evidence
  3. Assess scope of breach
  4. Revoke compromised credentials
  5. Restore from clean backup if needed
  6. Patch vulnerabilities
  7. Resume operations with enhanced monitoring

**Runbook:** see [runbooks/security-incident.md](runbooks/security-incident.md)

## Recovery Procedures

### Pre-Recovery Checklist

Before initiating any recovery procedure:

- Verify backup availability and integrity
- Confirm DR resources are available
- Notify stakeholders
- Document incident details
- Prepare the relevant recovery runbook
- Verify access credentials

### Database Recovery

#### Full Database Restore

```bash
# 1. Stop the application to prevent writes
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# 2. Identify the backup to restore
kubectl get pvc postgres-backup-storage -n veza-production
# List available backups
kubectl exec -it postgres-pod -n veza-production -- ls -lh /backups/postgres/

# 3. Restore the database
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$(kubectl get secret veza-secrets -n veza-production -o jsonpath='{.data.database-url}' | base64 -d | grep -oP 'password=\K[^&]+')" \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "postgres-restore",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump --clean --if-exists"],
      "volumeMounts": [{
        "name": "backup-storage",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup-storage",
      "persistentVolumeClaim": {
        "claimName": "postgres-backup-storage"
      }
    }]
  }
}' \
  -n veza-production

# 4. Verify data integrity
kubectl exec -it postgres-pod -n veza-production -- psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"

# 5. Restart the application
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```
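
The restore command above derives `PGPASSWORD` by grepping the password out of the stored database URL. That extraction can be checked in isolation with a dummy URL (requires GNU grep's `-P` Perl-regex mode):

```bash
#!/bin/sh
# \K discards everything matched so far, so only the password value is
# printed; [^&]+ stops at the next query parameter. The URL is a dummy.
url="postgres://postgres-service:5432/veza_db?user=veza_user&password=s3cr3t&sslmode=require"
echo "$url" | grep -oP 'password=\K[^&]+'   # prints: s3cr3t
```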

#### Point-in-Time Recovery

`pg_restore` cannot replay to a timestamp. Point-in-time recovery restores a physical base backup and replays archived WAL up to a recovery target set in the server configuration (PostgreSQL 12+; the archive paths below are environment-specific):

```bash
# Restore the base backup into an empty data directory
tar -xf /backups/postgres/base_backup.tar -C /var/lib/postgresql/data

# Set the WAL restore command and recovery target, then request recovery mode
cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'cp /backups/postgres/wal/%f "%p"'
recovery_target_time = '2025-01-01 12:00:00'
EOF
touch /var/lib/postgresql/data/recovery.signal

# Start PostgreSQL; it replays WAL up to the target time
pg_ctl start -D /var/lib/postgresql/data
```

### Application Recovery

#### Rollback Deployment

```bash
# Roll back to the previous version
kubectl rollout undo deployment/veza-backend-api -n veza-production

# Verify the rollback
kubectl rollout status deployment/veza-backend-api -n veza-production

# Check logs
kubectl logs -f deployment/veza-backend-api -n veza-production
```

#### Redeploy from Git

```bash
# Get the latest code
git checkout main
git pull origin main

# Build and push the image
docker build -t veza-backend-api:latest .
docker push veza-backend-api:latest

# Update the deployment
kubectl set image deployment/veza-backend-api \
  veza-backend-api=veza-backend-api:latest \
  -n veza-production

# Verify
kubectl rollout status deployment/veza-backend-api -n veza-production
```

### Infrastructure Recovery

#### Cluster Rebuild

```bash
# 1. Provision a new cluster (Terraform / Infrastructure as Code)
cd infrastructure/terraform
terraform apply

# 2. Install prerequisites
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secrets/   # Restore secrets from Vault

# 3. Deploy applications
kubectl apply -f k8s/backend-api/
kubectl apply -f k8s/frontend/

# 4. Restore data (follow the database recovery procedure above)

# 5. Verify services
kubectl get pods -n veza-production
kubectl get svc -n veza-production
```

## Testing Procedures

### Test Schedule

| Test Type | Frequency | Duration |
|---|---|---|
| Database Restore | Monthly | 2 hours |
| Application Rollback | After each deployment | 15 minutes |
| Full DR Drill | Quarterly | 4 hours |
| Security Incident Drill | Semi-annually | 2 hours |

### Database Restore Test

**Objective:** Verify that database backups are valid and can be restored.

**Procedure:**

  1. Create a test namespace: `kubectl create namespace veza-dr-test`
  2. Deploy test PostgreSQL instance
  3. Restore latest backup to test instance
  4. Verify data integrity
  5. Run smoke tests against restored database
  6. Document results
  7. Clean up test namespace

**Success Criteria:**

- Backup restores successfully
- All tables present and accessible
- Data integrity verified
- Restore completes within the RTO
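
The data-integrity criterion can be made concrete by comparing known row counts against the restored instance. A minimal sketch follows; the table name and counts are illustrative, and in a real run both numbers would come from `psql`.

```bash
#!/bin/sh
# Compare an expected row count against the count observed on the
# restored instance; a non-zero exit marks the restore test as failed.
check_counts() {
  # $1 = expected, $2 = actual, $3 = table name
  if [ "$1" -eq "$2" ]; then
    echo "OK $3 ($2 rows)"
  else
    echo "MISMATCH $3 expected=$1 actual=$2"
    return 1
  fi
}

check_counts 1500 1500 users   # prints: OK users (1500 rows)
```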

### Application Rollback Test

**Objective:** Verify that the rollback procedure works correctly.

**Procedure:**

  1. Deploy test version with known issue
  2. Verify health checks fail
  3. Execute rollback procedure
  4. Verify service returns to healthy state
  5. Document rollback time

**Success Criteria:**

- Rollback completes within 5 minutes
- Service returns to a healthy state
- No data loss
- Users experience minimal disruption

### Full DR Drill

**Objective:** Test complete disaster recovery in the DR region.

**Procedure:**

  1. Schedule maintenance window
  2. Simulate primary region failure
  3. Activate DR region
  4. Restore all services
  5. Verify platform functionality
  6. Document lessons learned
  7. Restore primary region

**Success Criteria:**

- DR activation within 4 hours
- All critical services operational
- Data integrity maintained
- Users can access the platform

## Communication Plan

### Incident Notification

**Immediate (P0 - Critical):**

- DevOps Lead: +1-XXX-XXX-XXXX
- CTO: +1-XXX-XXX-XXXX
- On-Call Engineer: PagerDuty

**Escalation (P1 - High):**

- Engineering Manager
- Product Manager

**Status Updates:**

- Internal: Slack #incidents channel
- External: Status page (status.veza.com)
- Customers: Email notification (if downtime exceeds 1 hour)

### Communication Templates

#### Initial Incident Notification

```text
Subject: [INCIDENT] Veza Platform - <Issue Description>

Severity: P0/P1/P2
Status: Investigating
Impact: <Description>
ETA: <Time>

The team is actively working on resolution.
Updates will be posted every 15 minutes.
```

#### Resolution Notification

```text
Subject: [RESOLVED] Veza Platform - <Issue Description>

Status: Resolved
Duration: <Time>
Root Cause: <Description>
Prevention: <Actions taken>

The platform is fully operational.
```

## Runbooks

Detailed runbooks are available in the `runbooks/` directory:

- [database-failover.md](runbooks/database-failover.md) - Database failover procedure
- [rollback-procedure.md](runbooks/rollback-procedure.md) - Application rollback steps
- [cluster-failover.md](runbooks/cluster-failover.md) - Complete cluster failover
- [data-restore.md](runbooks/data-restore.md) - Data restoration procedures
- [security-incident.md](runbooks/security-incident.md) - Security incident response

## Maintenance and Updates

### Regular Reviews

- **Monthly:** Review and update recovery procedures
- **Quarterly:** Full DR drill and documentation update
- **Annually:** Complete disaster recovery plan review

### Backup Verification

- **Daily:** Automated backup verification
- **Weekly:** Manual backup integrity check
- **Monthly:** Full restore test
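
The daily automated check can verify both that a backup exists and that it is younger than the RPO window. A sketch using file modification times follows (GNU `stat`/`touch`; the file here is a throwaway, not a real dump):

```bash
#!/bin/sh
# Fail if the given backup file is older than the allowed RPO window.
backup_fresh() {
  # $1 = backup file, $2 = maximum age in seconds
  age=$(( $(date +%s) - $(stat -c %Y "$1") ))
  [ "$age" -le "$2" ]
}

f=$(mktemp)                               # dummy "backup", freshly created
backup_fresh "$f" 3600 && echo "backup within RPO"
```

Wired into a monitoring check, a non-zero exit here would page the on-call engineer before the RPO is actually violated.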

### Documentation Updates

All changes to infrastructure, applications, or procedures must be reflected in this disaster recovery plan within 7 days.

## Contact Information

**Disaster Recovery Team:**

---

**Last Updated:** 2025-12-25
**Next Review:** 2026-03-25
**Owner:** DevOps Team