# Disaster Recovery Plan for Veza Platform

This document outlines the comprehensive disaster recovery plan for the Veza platform, including recovery procedures, testing protocols, and operational runbooks.

## Executive Summary

**Recovery Time Objective (RTO)**: < 4 hours

**Recovery Point Objective (RPO)**: < 1 hour

**Maximum Acceptable Downtime**: 4 hours

**Data Loss Tolerance**: 1 hour
## Table of Contents

1. [Overview](#overview)
2. [Recovery Objectives](#recovery-objectives)
3. [Disaster Scenarios](#disaster-scenarios)
4. [Recovery Procedures](#recovery-procedures)
5. [Testing Procedures](#testing-procedures)
6. [Communication Plan](#communication-plan)
7. [Runbooks](#runbooks)
8. [Maintenance and Updates](#maintenance-and-updates)
## Overview

The Veza platform disaster recovery plan ensures business continuity and data protection in the event of various failure scenarios. The plan covers:

- **Infrastructure Failures**: Node failures, cluster failures, network outages
- **Application Failures**: Service crashes, deployment failures, configuration errors
- **Data Failures**: Database corruption, data loss, backup failures
- **Regional Failures**: Complete datacenter or region outages
- **Security Incidents**: Breaches, ransomware, unauthorized access
## Recovery Objectives

### RTO (Recovery Time Objective)

| Component | RTO | Description |
|-----------|-----|-------------|
| Critical Services | < 1 hour | Backend API, Frontend, Authentication |
| Database | < 2 hours | PostgreSQL with failover to standby |
| Complete Platform | < 4 hours | Full platform recovery in DR region |
| Non-Critical Services | < 8 hours | Chat server, monitoring, etc. |
### RPO (Recovery Point Objective)

| Data Type | RPO | Backup Frequency |
|-----------|-----|------------------|
| Database | < 1 hour | Hourly incremental + Daily full |
| Application State | < 15 minutes | Real-time replication |
| User Uploads | < 1 hour | Hourly sync to S3 |
| Configuration | < 5 minutes | Git-based version control |

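The hourly database backup cadence in the table above could be implemented with a CronJob along these lines. This is a minimal sketch assuming the `postgres-backup-storage` PVC used later in this plan and a `postgres-password` key in `veza-secrets`; both names are assumptions to adapt to the actual manifests in `k8s/`.

```bash
# Sketch of an hourly logical backup job: dumps veza_db to the backup PVC.
kubectl apply -n veza-production -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-hourly-backup
spec:
  schedule: "0 * * * *"            # top of every hour
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: postgres:15-alpine
            # Timestamped custom-format dump, restorable with pg_restore
            command: ["/bin/sh", "-c", "pg_dump -h postgres-service -U veza_user -F c veza_db > /backups/postgres/veza_db_$(date +%Y%m%d_%H%M%S).dump"]
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef: {name: veza-secrets, key: postgres-password}  # assumed key
            volumeMounts:
            - {name: backup-storage, mountPath: /backups}
          volumes:
          - name: backup-storage
            persistentVolumeClaim: {claimName: postgres-backup-storage}
EOF
```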
## Disaster Scenarios

### Scenario 1: Single Node Failure

**Impact**: Minimal - Pods rescheduled to other nodes

**RTO**: < 5 minutes

**Procedure**: Automatic pod rescheduling by Kubernetes

**Recovery Steps**:

1. Kubernetes automatically detects node failure
2. Pods are rescheduled to healthy nodes
3. Services continue with minimal interruption
4. Monitor for any pod startup issues

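A few read-only checks for step 4, to confirm that pods actually land on healthy nodes after a node failure:

```bash
# Confirm which node failed and why
kubectl get nodes
kubectl describe node <failed-node> | grep -A5 Conditions

# Watch pods being rescheduled onto healthy nodes
kubectl get pods -n veza-production -o wide --watch

# Flag pods that are stuck rather than rescheduling
kubectl get pods -n veza-production --field-selector=status.phase!=Running
```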
### Scenario 2: Database Primary Failure

**Impact**: High - Application cannot write data

**RTO**: < 5 minutes (with standby)

**Procedure**: Failover to standby replica

**Recovery Steps**:

1. Detect primary database failure
2. Promote standby replica to primary
3. Update connection strings in ConfigMaps/Secrets
4. Restart application pods to pick up new connection
5. Verify data integrity
6. Set up new standby replica

**Runbook**: See `runbooks/database-failover.md`

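A condensed sketch of steps 2-4 above, assuming a streaming-replication standby pod named `postgres-standby-0` behind a `postgres-standby-service` (both names are assumptions; the authoritative commands live in `runbooks/database-failover.md`):

```bash
# 2. Promote the standby to primary (PostgreSQL 12+)
kubectl exec -it postgres-standby-0 -n veza-production -- \
  psql -U veza_user -d postgres -c "SELECT pg_promote();"

# 3. Point the application at the promoted instance (placeholder connection string)
kubectl patch secret veza-secrets -n veza-production --type merge \
  -p '{"stringData":{"database-url":"postgres://veza_user:<password>@postgres-standby-service:5432/veza_db"}}'

# 4. Restart application pods so they pick up the new connection string
kubectl rollout restart deployment/veza-backend-api -n veza-production
```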
### Scenario 3: Application Deployment Failure

**Impact**: Medium - Service degradation or outage

**RTO**: < 5 minutes

**Procedure**: Automatic rollback

**Recovery Steps**:

1. Health checks detect failure
2. Automatic rollback to previous version
3. Verify service health
4. Investigate root cause
5. Fix and redeploy

**Runbook**: See `runbooks/rollback-procedure.md`

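One way a deployment pipeline can wire up steps 1-2, a sketch that uses the rollout status exit code as the health signal:

```bash
# Wait for the rollout to become healthy; undo automatically if it does not
if ! kubectl rollout status deployment/veza-backend-api -n veza-production --timeout=120s; then
  kubectl rollout undo deployment/veza-backend-api -n veza-production
  kubectl rollout status deployment/veza-backend-api -n veza-production
fi
```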
### Scenario 4: Complete Cluster Failure

**Impact**: Critical - Complete platform outage

**RTO**: < 4 hours

**Procedure**: Failover to DR region

**Recovery Steps**:

1. Declare disaster
2. Activate DR region
3. Restore database from latest backup
4. Deploy applications
5. Update DNS to point to DR region
6. Verify all services
7. Begin recovery of primary region

**Runbook**: See `runbooks/cluster-failover.md`

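For step 5, a hedged example of the DNS cutover, assuming the zone is hosted in Route 53; the hosted-zone ID, record name, and DR load-balancer hostname are placeholders:

```bash
# Repoint the API record at the DR region's ingress/load balancer
aws route53 change-resource-record-sets \
  --hosted-zone-id <ZONE_ID> \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.veza.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "<dr-region-lb.example.com>"}]
      }
    }]
  }'
```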
### Scenario 5: Data Corruption or Loss

**Impact**: Critical - Data integrity compromised

**RTO**: < 2 hours

**Procedure**: Restore from backup

**Recovery Steps**:

1. Identify affected data
2. Stop writes to affected database
3. Restore from most recent clean backup
4. Verify data integrity
5. Resume operations
6. Investigate root cause

**Runbook**: See `runbooks/data-restore.md`
### Scenario 6: Security Breach

**Impact**: Critical - Security and data integrity

**RTO**: < 1 hour (containment)

**Procedure**: Incident response and isolation

**Recovery Steps**:

1. Isolate affected systems
2. Preserve evidence
3. Assess scope of breach
4. Revoke compromised credentials
5. Restore from clean backup if needed
6. Patch vulnerabilities
7. Resume operations with enhanced monitoring

**Runbook**: See `runbooks/security-incident.md`

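A containment sketch for steps 1 and 4, assuming the cluster's CNI enforces NetworkPolicies and the compromised workload carries the `app: veza-backend-api` label (the label and the rotated secret value are assumptions; the full procedure is `runbooks/security-incident.md`):

```bash
# 1. Isolate the compromised workload: deny all ingress and egress to its pods
kubectl apply -n veza-production -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-backend-api
spec:
  podSelector:
    matchLabels:
      app: veza-backend-api
  policyTypes: ["Ingress", "Egress"]
EOF

# 4. Revoke compromised credentials: rotate the secret and restart consumers
kubectl create secret generic veza-secrets -n veza-production \
  --from-literal=database-url='<new-rotated-connection-string>' \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/veza-backend-api -n veza-production
```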
## Recovery Procedures

### Pre-Recovery Checklist

Before initiating any recovery procedure:

- [ ] Verify backup availability and integrity
- [ ] Confirm DR resources are available
- [ ] Notify stakeholders
- [ ] Document incident details
- [ ] Prepare recovery runbook
- [ ] Verify access credentials

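For the first checklist item, a quick integrity check that does not touch production data, assuming the backup layout used in the restore procedure below:

```bash
# Confirm the newest dump exists and that pg_restore can read its table of contents
LATEST=$(kubectl exec postgres-pod -n veza-production -- sh -c 'ls -t /backups/postgres/*.dump | head -1')
kubectl exec postgres-pod -n veza-production -- pg_restore --list "$LATEST" > /dev/null \
  && echo "Backup readable: $LATEST"
```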
### Database Recovery

#### Full Database Restore

```bash
# 1. Stop application to prevent writes
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# 2. Identify backup to restore
kubectl get pvc postgres-backup-storage -n veza-production
# List available backups
kubectl exec -it postgres-pod -n veza-production -- ls -lh /backups/postgres/

# 3. Restore database
kubectl run postgres-restore --rm -it --image=postgres:15-alpine \
  --restart=Never \
  --env="PGPASSWORD=$(kubectl get secret veza-secrets -n veza-production -o jsonpath='{.data.database-url}' | base64 -d | grep -oP 'password=\K[^&]+')" \
  --env="POSTGRES_HOST=postgres-service" \
  --env="POSTGRES_USER=veza_user" \
  --env="POSTGRES_DB=veza_db" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "postgres-restore",
      "image": "postgres:15-alpine",
      "command": ["/bin/sh", "-c", "pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -F c /backups/postgres/veza_db_YYYYMMDD_HHMMSS.dump --clean --if-exists"],
      "volumeMounts": [{
        "name": "backup-storage",
        "mountPath": "/backups"
      }]
    }],
    "volumes": [{
      "name": "backup-storage",
      "persistentVolumeClaim": {
        "claimName": "postgres-backup-storage"
      }
    }]
  }
}' \
  -n veza-production

# 4. Verify data integrity
kubectl exec -it postgres-pod -n veza-production -- psql -U veza_user -d veza_db -c "SELECT COUNT(*) FROM users;"

# 5. Restart application
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```
#### Point-in-Time Recovery

`pg_restore` cannot replay WAL to a timestamp; point-in-time recovery is configured on the PostgreSQL server itself. A sketch, assuming the base backup has already been restored into the data directory and WAL archives live under `/backups/postgres/wal/` (both paths are placeholders):

```bash
# Point-in-time recovery replays archived WAL over a physical base backup;
# it is configured on the server (PostgreSQL 12+), not via pg_restore.
cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'cp /backups/postgres/wal/%f "%p"'
recovery_target_time = '2025-01-01 12:00:00'
EOF
touch /var/lib/postgresql/data/recovery.signal
pg_ctl start -D /var/lib/postgresql/data   # replays WAL up to the target time
```
### Application Recovery

#### Rollback Deployment

```bash
# Rollback to previous version
kubectl rollout undo deployment/veza-backend-api -n veza-production

# Verify rollback
kubectl rollout status deployment/veza-backend-api -n veza-production

# Check logs
kubectl logs -f deployment/veza-backend-api -n veza-production
```

#### Redeploy from Git

```bash
# Get latest code
git checkout main
git pull origin main

# Build and push image
docker build -t veza-backend-api:latest .
docker push veza-backend-api:latest

# Update deployment
kubectl set image deployment/veza-backend-api \
  veza-backend-api=veza-backend-api:latest \
  -n veza-production

# Verify
kubectl rollout status deployment/veza-backend-api -n veza-production
```
### Infrastructure Recovery

#### Cluster Rebuild

```bash
# 1. Provision new cluster (using Terraform/Infrastructure as Code)
cd infrastructure/terraform
terraform apply

# 2. Install prerequisites
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secrets/  # Restore secrets from Vault

# 3. Deploy applications
kubectl apply -f k8s/backend-api/
kubectl apply -f k8s/frontend/

# 4. Restore data
# Follow database recovery procedure

# 5. Verify services
kubectl get pods -n veza-production
kubectl get svc -n veza-production
```
## Testing Procedures

### Test Schedule

| Test Type | Frequency | Duration |
|-----------|-----------|----------|
| Database Restore | Monthly | 2 hours |
| Application Rollback | After each deployment | 15 minutes |
| Full DR Drill | Quarterly | 4 hours |
| Security Incident Drill | Semi-annually | 2 hours |
### Database Restore Test

**Objective**: Verify database backups are valid and can be restored

**Procedure**:

1. Create test namespace: `kubectl create namespace veza-dr-test`
2. Deploy test PostgreSQL instance
3. Restore latest backup to test instance
4. Verify data integrity
5. Run smoke tests against restored database
6. Document results
7. Clean up test namespace

**Success Criteria**:

- Backup restores successfully
- All tables present and accessible
- Data integrity verified
- Restore completes within RTO

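A condensed sketch of the monthly drill, assuming a plain `postgres:15-alpine` pod and a copied dump file; the dump filename is a placeholder, pick the newest file in `/backups/postgres/`:

```bash
# 1. Isolated namespace for the drill
kubectl create namespace veza-dr-test

# 2. Throwaway PostgreSQL instance (ephemeral storage is fine for a drill)
kubectl run pg-test -n veza-dr-test --image=postgres:15-alpine \
  --env="POSTGRES_PASSWORD=drill" --env="POSTGRES_DB=veza_db" --restart=Never
kubectl wait -n veza-dr-test --for=condition=Ready pod/pg-test --timeout=120s

# 3. Copy the latest dump over and restore it
kubectl cp veza-production/postgres-pod:/backups/postgres/veza_db_latest.dump /tmp/veza_db_latest.dump
kubectl cp /tmp/veza_db_latest.dump veza-dr-test/pg-test:/tmp/restore.dump
kubectl exec -n veza-dr-test pg-test -- pg_restore -U postgres -d veza_db -F c /tmp/restore.dump

# 4-5. Smoke checks: tables exist and key counts look sane
kubectl exec -n veza-dr-test pg-test -- psql -U postgres -d veza_db -c "\dt"
kubectl exec -n veza-dr-test pg-test -- psql -U postgres -d veza_db -c "SELECT COUNT(*) FROM users;"

# 7. Clean up
kubectl delete namespace veza-dr-test
```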
### Application Rollback Test

**Objective**: Verify rollback procedure works correctly

**Procedure**:

1. Deploy test version with known issue
2. Verify health checks fail
3. Execute rollback procedure
4. Verify service returns to healthy state
5. Document rollback time

**Success Criteria**:

- Rollback completes within 5 minutes
- Service returns to healthy state
- No data loss
- Users experience minimal disruption
### Full DR Drill

**Objective**: Test complete disaster recovery in DR region

**Procedure**:

1. Schedule maintenance window
2. Simulate primary region failure
3. Activate DR region
4. Restore all services
5. Verify platform functionality
6. Document lessons learned
7. Restore primary region

**Success Criteria**:

- DR activation within 4 hours
- All critical services operational
- Data integrity maintained
- Users can access platform
## Communication Plan

### Incident Notification

**Immediate (P0 - Critical)**:

- DevOps Lead: +1-XXX-XXX-XXXX
- CTO: +1-XXX-XXX-XXXX
- On-Call Engineer: PagerDuty

**Escalation (P1 - High)**:

- Engineering Manager
- Product Manager

**Status Updates**:

- Internal: Slack #incidents channel
- External: Status page (status.veza.com)
- Customers: Email notification (if > 1 hour downtime)
### Communication Templates

#### Initial Incident Notification

```
Subject: [INCIDENT] Veza Platform - <Issue Description>

Severity: P0/P1/P2
Status: Investigating
Impact: <Description>
ETA: <Time>

Team is actively working on resolution.
Updates will be posted every 15 minutes.
```

#### Resolution Notification

```
Subject: [RESOLVED] Veza Platform - <Issue Description>

Status: Resolved
Duration: <Time>
Root Cause: <Description>
Prevention: <Actions taken>

Platform is fully operational.
```
## Runbooks

Detailed runbooks are available in the `runbooks/` directory:

- `database-failover.md` - Database failover procedure
- `rollback-procedure.md` - Application rollback steps
- `cluster-failover.md` - Complete cluster failover
- `data-restore.md` - Data restoration procedures
- `security-incident.md` - Security incident response
## Maintenance and Updates

### Regular Reviews

- **Monthly**: Review and update recovery procedures
- **Quarterly**: Full DR drill and documentation update
- **Annually**: Complete disaster recovery plan review

### Backup Verification

- **Daily**: Automated backup verification
- **Weekly**: Manual backup integrity check
- **Monthly**: Full restore test

### Documentation Updates

All changes to infrastructure, applications, or procedures must be reflected in this disaster recovery plan within 7 days.
## References

- [Backup Strategy](../backups/README.md)
- [Secrets Management](../secrets/README.md)
- [Monitoring Setup](../monitoring/README.md)
- [Kubernetes Deployment](../README.md)
## Contact Information

**Disaster Recovery Team**:

- DevOps Lead: devops@veza.com
- SRE Team: sre@veza.com
- On-Call: PagerDuty rotation

**Last Updated**: 2025-12-25

**Next Review**: 2026-03-25

**Owner**: DevOps Team