# Application Rollback Runbook
This runbook describes the procedure for rolling back a failed application deployment.
## Prerequisites
- Access to Kubernetes cluster
- kubectl configured
- A previous deployment revision retained in the rollout history (bounded by the Deployment's `revisionHistoryLimit`)
## Detection
### Automatic Detection
Health checks will automatically detect:
- Application crashes
- High error rates
- Slow response times
- Failed readiness probes
### Manual Detection
```bash
# Check pod status
kubectl get pods -n veza-production -l app=veza-backend-api
# Check deployment status
kubectl rollout status deployment/veza-backend-api -n veza-production
# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production
# Check metrics
kubectl top pods -n veza-production
```
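When many pods are listed, it can help to filter the output down to the ones that need attention. The following is a minimal sketch; `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```bash
# Reads `kubectl get pods --no-headers` output on stdin and prints the names
# of pods that are not fully ready or whose STATUS is not Running.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")          # READY column looks like "1/1"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}
```

Possible usage: `kubectl get pods -n veza-production -l app=veza-backend-api --no-headers | unhealthy_pods`. Empty output means every listed pod is Running and ready.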
## Rollback Procedure
### Step 1: Verify Issue
```bash
# Check current deployment
kubectl get deployment veza-backend-api -n veza-production -o yaml
# Check recent events
kubectl get events -n veza-production --sort-by='.lastTimestamp' | tail -20
# Verify health endpoint
curl https://api.veza.com/health
```
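A single failed request can be a transient blip, so it may be worth probing the health endpoint a few times before deciding to roll back. The sketch below assumes the endpoint answers with a non-2xx status when unhealthy; `check_health` and its default retry values are illustrative, not part of the existing tooling.

```bash
# check_health URL [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as one request succeeds, 1 if every attempt fails.
check_health() {
  local url=$1 attempts=${2:-5} delay=${3:-2} i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx; --max-time bounds each request.
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    # Pause between attempts, but not after the last one.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

For example, `check_health https://api.veza.com/health 5 2 || echo "health check failing"`.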
### Step 2: Check Rollback History
```bash
# View deployment history
kubectl rollout history deployment/veza-backend-api -n veza-production
# View details of previous revision
kubectl rollout history deployment/veza-backend-api -n veza-production --revision=<N>
```
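Revision numbers are not always consecutive, because a rollback re-numbers the revision it restores. To find the value for `<N>` without reading the table by eye, the history output can be parsed. This is a sketch under the assumption that the output keeps its usual `REVISION  CHANGE-CAUSE` table layout; `previous_revision` is a hypothetical helper.

```bash
# Reads `kubectl rollout history` output on stdin and prints the revision
# immediately below the highest one. Exits non-zero if fewer than two
# revisions are listed (nothing to roll back to).
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ {
         n++
         rev = $1 + 0
         if (rev > max) { prev = max; max = rev }
         else if (rev > prev) { prev = rev }
       }
       END { if (n >= 2) { print prev } else { exit 1 } }'
}
```

Possible usage: `kubectl rollout history deployment/veza-backend-api -n veza-production | previous_revision`.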
### Step 3: Execute Rollback
#### Option A: Rollback to Previous Version
```bash
# Roll back to the previous revision
kubectl rollout undo deployment/veza-backend-api -n veza-production
# Monitor rollback progress
kubectl rollout status deployment/veza-backend-api -n veza-production
```
#### Option B: Rollback to Specific Revision
```bash
# Roll back to a specific revision (replace <N> with a revision number from Step 2)
kubectl rollout undo deployment/veza-backend-api -n veza-production --to-revision=<N>
# Monitor rollback progress
kubectl rollout status deployment/veza-backend-api -n veza-production
```
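During an incident it can be useful to bound how long the status check waits, so that a stuck rollback surfaces as an error rather than a hanging terminal. The sketch below combines the undo and the status check with `--timeout`, a standard `kubectl rollout status` flag; the function name and the `5m` default are assumptions.

```bash
# rollback_and_wait DEPLOYMENT NAMESPACE [TIMEOUT]
# Rolls back to the previous revision and waits for it, up to TIMEOUT.
rollback_and_wait() {
  local deployment=$1 namespace=$2 timeout=${3:-5m}
  kubectl rollout undo "deployment/$deployment" -n "$namespace" || return 1
  # With --timeout, `rollout status` exits non-zero once the limit is reached.
  kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"
}
```

For example, `rollback_and_wait veza-backend-api veza-production 3m`. A non-zero exit status is the cue to move to the Troubleshooting section.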
### Step 4: Verify Rollback
```bash
# Check pod status
kubectl get pods -n veza-production -l app=veza-backend-api
# Check deployment status
kubectl get deployment veza-backend-api -n veza-production
# Verify pods are ready
kubectl wait --for=condition=ready pod \
  -l app=veza-backend-api \
  -n veza-production \
  --timeout=300s
# Check recent application logs (no -f, so the remaining checks are not blocked)
kubectl logs deployment/veza-backend-api -n veza-production --tail=100
# Test health endpoint
curl https://api.veza.com/health
# Test critical endpoints
curl https://api.veza.com/api/v1/tracks
```
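The readiness check can also be expressed as a single pass/fail test by comparing ready replicas with desired replicas on the Deployment itself. This is a sketch only; `deployment_ready` is a hypothetical helper, and it relies on `status.readyReplicas` being omitted from the object when no pod is ready.

```bash
# deployment_ready DEPLOYMENT NAMESPACE
# Prints "ready: R/D" and returns 0 only if all D (> 0) desired replicas are ready.
deployment_ready() {
  local deployment=$1 namespace=$2 counts ready desired
  counts=$(kubectl get deployment "$deployment" -n "$namespace" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}') || return 1
  ready=${counts%%/*}
  desired=${counts##*/}
  # An absent field renders as an empty string; treat it as zero.
  ready=${ready:-0}
  desired=${desired:-0}
  echo "ready: $ready/$desired"
  [ "$ready" -eq "$desired" ] && [ "$desired" -gt 0 ]
}
```

For example, `deployment_ready veza-backend-api veza-production || echo "rollback not yet healthy"`.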
### Step 5: Verify Application Functionality
```bash
# Run smoke tests
# (Use your application's test suite)
# Check metrics
kubectl top pods -n veza-production
# Monitor error rates
# (Check monitoring dashboard)
```
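Where no dedicated smoke-test suite exists yet, a small loop over critical endpoints can serve as a stopgap. The sketch below treats any 2xx response as a pass; `smoke_test` is a hypothetical helper, and the list of paths to check is an assumption to adapt per service.

```bash
# smoke_test BASE_URL PATH...
# Prints one line per path; returns non-zero if any path does not answer 2xx.
smoke_test() {
  local base=$1 failed=0 path code
  shift
  for path in "$@"; do
    # -w '%{http_code}' prints only the status code; the body is discarded.
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$base$path") || code=000
    case $code in
      2??) echo "OK   $code $path" ;;
      *)   echo "FAIL $code $path"; failed=1 ;;
    esac
  done
  return "$failed"
}
```

For example, `smoke_test https://api.veza.com /health /api/v1/tracks`.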
## Multi-Service Rollback
If multiple services need to be rolled back:
```bash
# Rollback backend API
kubectl rollout undo deployment/veza-backend-api -n veza-production
# Rollback frontend
kubectl rollout undo deployment/veza-frontend -n veza-production
# Rollback chat server
kubectl rollout undo deployment/veza-chat-server -n veza-production
# Monitor all rollbacks
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-frontend -n veza-production
kubectl rollout status deployment/veza-chat-server -n veza-production
```
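The same sequence can be written as a loop that keeps going when one service fails and reports the overall result at the end. This is a sketch, not the established procedure; `rollback_all` is a hypothetical helper, and the fixed `5m` timeout is an assumption.

```bash
# rollback_all NAMESPACE DEPLOYMENT...
# Issues every undo first, then waits for each rollout. Returns non-zero
# if any step failed, after attempting all of them.
rollback_all() {
  local namespace=$1 failed=0 deployment
  shift
  for deployment in "$@"; do
    kubectl rollout undo "deployment/$deployment" -n "$namespace" \
      || { echo "undo failed: $deployment" >&2; failed=1; }
  done
  for deployment in "$@"; do
    kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=5m \
      || { echo "rollout not complete: $deployment" >&2; failed=1; }
  done
  return "$failed"
}
```

For example, `rollback_all veza-production veza-backend-api veza-frontend veza-chat-server`. If the services depend on each other's versions, the order of the arguments may matter.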
## Database Migration Rollback
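If the failed deployment included database schema changes, revert them while the application is stopped:
```bash
# 1. Stop application (note the current replica count first so it can be restored in step 4)
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production
# 2. Roll back database migration
# (Use your migration tool)
# Example with migrate tool. The image must contain the down migration being
# reverted; adjust the tag if the previous image does not include it.
kubectl run migrate-rollback --rm -it --image=veza-backend-api:previous \
  --restart=Never \
  -n veza-production \
  --env="DATABASE_URL=$DATABASE_URL" \
  -- migrate -path /migrations -database "$DATABASE_URL" down 1
# 3. Roll back application
kubectl rollout undo deployment/veza-backend-api -n veza-production
# 4. Restart application (use the replica count noted in step 1)
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```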
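The replica count restored at the end should match what was running before the application was stopped, which may not be 3. One way to avoid guessing is to read the value before scaling down. A minimal sketch, with `scale_down_saving_replicas` as a hypothetical helper:

```bash
# scale_down_saving_replicas DEPLOYMENT NAMESPACE
# Scales the deployment to zero and prints the replica count it had before.
scale_down_saving_replicas() {
  local deployment=$1 namespace=$2 replicas
  replicas=$(kubectl get deployment "$deployment" -n "$namespace" \
    -o jsonpath='{.spec.replicas}') || return 1
  # Silence kubectl's confirmation so stdout carries only the saved count.
  kubectl scale deployment "$deployment" --replicas=0 -n "$namespace" >/dev/null || return 1
  echo "$replicas"
}
```

Possible usage: `saved=$(scale_down_saving_replicas veza-backend-api veza-production)`, then after the migration and application rollback, `kubectl scale deployment veza-backend-api --replicas="$saved" -n veza-production`.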
## Verification Checklist
- [ ] Previous version identified
- [ ] Rollback executed
- [ ] Pods are running and ready
- [ ] Health checks passing
- [ ] Application logs show no errors
- [ ] Critical endpoints responding
- [ ] Metrics normalized
- [ ] Users can access platform
- [ ] Monitoring alerts cleared
## Troubleshooting
### Rollback Fails
```bash
# Check deployment status
kubectl describe deployment veza-backend-api -n veza-production
# Check pod events
kubectl describe pod <pod-name> -n veza-production
# Check image availability
kubectl get pod <pod-name> -n veza-production -o jsonpath='{.spec.containers[0].image}'
# If the image is missing from the registry, rebuild it or roll back to a revision whose image still exists
```
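A common reason for a failed rollback is that the older image can no longer be pulled. Pods in that state can be picked out of the pod listing by their STATUS column. This sketch assumes the usual `ImagePullBackOff` and `ErrImagePull` status strings; `image_pull_failures` is a hypothetical helper.

```bash
# Reads `kubectl get pods --no-headers` output on stdin and prints the names
# of pods whose STATUS ends in ImagePullBackOff or ErrImagePull
# (this also matches init-container variants such as "Init:ErrImagePull").
image_pull_failures() {
  awk '$3 ~ /(ImagePullBackOff|ErrImagePull)$/ { print $1 }'
}
```

Possible usage: `kubectl get pods -n veza-production -l app=veza-backend-api --no-headers | image_pull_failures`.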
### Pods Not Starting
```bash
# Check pod logs
kubectl logs <pod-name> -n veza-production
# Check resource constraints
kubectl describe pod <pod-name> -n veza-production | grep -A 5 "Limits\|Requests"
# Check node resources
kubectl top nodes
```
### Application Still Failing After Rollback
```bash
# Verify correct version is deployed
kubectl get deployment veza-backend-api -n veza-production -o jsonpath='{.spec.template.spec.containers[0].image}'
# Check if issue is in previous version too
kubectl logs <pod-name> -n veza-production
# If the previous version fails too, roll back further (Step 3, Option B) or investigate the root cause
```
## Post-Rollback Tasks
1. **Investigate Root Cause**
   - Review deployment logs
   - Check application logs
   - Identify what caused the failure
2. **Fix Issue**
   - Address root cause
   - Test fix in staging
   - Prepare new deployment
3. **Document Incident**
   - Record the rollback steps that were taken
   - Note any issues encountered
   - Update deployment process if needed
4. **Notify Stakeholders**
   - Send incident report
   - Update status page
   - Schedule post-mortem if needed
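Part of documenting the incident can live in the cluster itself: the `CHANGE-CAUSE` column of `kubectl rollout history` is read from the `kubernetes.io/change-cause` annotation. The sketch below sets that annotation after a rollback; `record_rollback_cause` is a hypothetical helper, and the message wording is an assumption.

```bash
# record_rollback_cause DEPLOYMENT NAMESPACE MESSAGE
# Annotates the deployment so the rollback reason appears in rollout history.
record_rollback_cause() {
  local deployment=$1 namespace=$2 message=$3
  kubectl annotate "deployment/$deployment" -n "$namespace" --overwrite \
    "kubernetes.io/change-cause=$message"
}
```

For example, `record_rollback_cause veza-backend-api veza-production "rollback: elevated error rate after deploy"`.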
## Prevention
To reduce the need for future rollbacks:
- **Automated Testing**: Run the full test suite before deployment
- **Staged Rollouts**: Use canary or blue-green deployments
- **Health Checks**: Expose comprehensive health check endpoints
- **Monitoring**: Set up real-time monitoring and alerting
- **Gradual Rollout**: Deploy to a small percentage of traffic first
## References
- [Kubernetes Rollout Documentation](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-back-a-deployment)
- [Deployment Best Practices](../README.md)