# Application Rollback Runbook

This runbook describes the procedure for rolling back a failed application deployment.

## Prerequisites

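- Access to the Kubernetes cluster, with permission to update deployments in the `veza-production` namespace
- `kubectl` configured for that cluster
- A previous deployment revision available in the rollout history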

## Detection

### Automatic Detection

Health checks will automatically detect:

- Application crashes
- High error rates
- Slow response times
- Failed readiness probes

### Manual Detection

```bash
# Check pod status
kubectl get pods -n veza-production -l app=veza-backend-api

# Check deployment status
kubectl rollout status deployment/veza-backend-api -n veza-production

# Check application logs (add -f to stream; Ctrl+C to stop)
kubectl logs deployment/veza-backend-api -n veza-production --tail=100

# Check metrics
kubectl top pods -n veza-production
```
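
When many pods are listed, it can help to filter the output down to the ones that need attention. The following is a minimal sketch, not part of the existing tooling: `unready_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` column layout (`NAME`, `READY`, `STATUS`, ...).

```bash
# Hypothetical helper: reads `kubectl get pods` table output on stdin and
# prints the name of every pod that is not fully ready or not Running.
unready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY column, e.g. "0/1"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}
```

Example use: `kubectl get pods -n veza-production -l app=veza-backend-api | unready_pods`. Empty output means every listed pod is Running and fully ready.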

## Rollback Procedure

### Step 1: Verify Issue

```bash
# Check current deployment
kubectl get deployment veza-backend-api -n veza-production -o yaml

# Check recent events
kubectl get events -n veza-production --sort-by='.lastTimestamp' | tail -20

# Verify health endpoint
curl https://api.veza.com/health
```

### Step 2: Check Rollback History

```bash
# View deployment history
kubectl rollout history deployment/veza-backend-api -n veza-production

# View details of previous revision
kubectl rollout history deployment/veza-backend-api -n veza-production --revision=<N>
```
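
To fill in `<N>` without reading the table by eye, the history output can be parsed. This is a sketch under assumptions: `previous_revision` is a hypothetical helper, and it relies on the tabular output of `kubectl rollout history`, where the revision number is the first column. It prints the second-highest revision, which is normally the one a plain `rollout undo` would target.

```bash
# Hypothetical helper: reads `kubectl rollout history` output on stdin and
# prints the second-highest revision number. Exits non-zero with no output
# if fewer than two revisions are listed.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ {
         r = $1 + 0
         if (r > max) { prev = max; max = r }   # new highest revision
         else if (r > prev) { prev = r }        # new second-highest
         n++
       }
       END { if (n < 2) exit 1; print prev }'
}
```

Example use: `kubectl rollout history deployment/veza-backend-api -n veza-production | previous_revision`. Inspect the printed revision with `--revision=<N>` before rolling back to it.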

### Step 3: Execute Rollback

#### Option A: Rollback to Previous Version

```bash
# Rollback to previous version
kubectl rollout undo deployment/veza-backend-api -n veza-production

# Monitor rollback progress
kubectl rollout status deployment/veza-backend-api -n veza-production
```

#### Option B: Rollback to Specific Revision

```bash
# Rollback to specific revision
kubectl rollout undo deployment/veza-backend-api -n veza-production --to-revision=<N>

# Monitor rollback progress
kubectl rollout status deployment/veza-backend-api -n veza-production
```

### Step 4: Verify Rollback

```bash
# Check pod status
kubectl get pods -n veza-production -l app=veza-backend-api

# Check deployment status
kubectl get deployment veza-backend-api -n veza-production

# Verify pods are ready
kubectl wait --for=condition=ready pod \
  -l app=veza-backend-api \
  -n veza-production \
  --timeout=300s

# Check application logs (add -f to stream; Ctrl+C to stop)
kubectl logs deployment/veza-backend-api -n veza-production --tail=100

# Test health endpoint
curl https://api.veza.com/health

# Test critical endpoints
curl https://api.veza.com/api/v1/tracks
```
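
Right after a rollback the health endpoint may fail for a short time while pods restart, so a single `curl` can be misleading. The sketch below retries the check; `wait_for_health` is a hypothetical helper, and it assumes the health endpoint answers with HTTP 200 when the service is healthy, which this runbook does not state.

```bash
# Hypothetical helper: polls a URL until it returns HTTP 200 or attempts run out.
# Arguments: URL, max attempts (default 10), seconds between attempts (default 5).
wait_for_health() {
  url="$1"
  attempts="${2:-10}"
  delay="${3:-5}"
  i=1
  code=""
  while [ "$i" -le "$attempts" ]; do
    # -s hides progress, -o discards the body, -w prints only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || true
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s), last status: $code" >&2
  return 1
}
```

Example use: `wait_for_health https://api.veza.com/health 12 5`. A non-zero exit status means the endpoint never returned 200 within the allowed attempts.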

### Step 5: Verify Application Functionality

```bash
# Run smoke tests
# (Use your application's test suite)

# Check metrics
kubectl top pods -n veza-production

# Monitor error rates
# (Check monitoring dashboard)
```

## Multi-Service Rollback

If multiple services need to be rolled back:

```bash
# Rollback backend API (handles chat since v0.502 merge)
kubectl rollout undo deployment/veza-backend-api -n veza-production

# Rollback frontend
kubectl rollout undo deployment/veza-frontend -n veza-production

# Rollback stream server (if media layer affected)
kubectl rollout undo deployment/veza-stream-server -n veza-production

# Monitor all rollbacks
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-frontend -n veza-production
kubectl rollout status deployment/veza-stream-server -n veza-production
```
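
The same sequence can be expressed as a loop, which avoids missing a service when the list grows. This is a minimal sketch: `rollback_all` is a hypothetical helper, and the 300-second timeout is an assumed value chosen so that a stuck rollout cannot block the loop indefinitely.

```bash
# Hypothetical helper: rolls back every named deployment, then waits for each.
# Usage: rollback_all <namespace> <deployment> [<deployment>...]
rollback_all() {
  namespace="$1"
  shift
  failed=""
  # Start every rollback first so the services recover in parallel.
  for name in "$@"; do
    kubectl rollout undo "deployment/$name" -n "$namespace" ||
      failed="$failed $name"
  done
  # Then wait for each rollout, bounded by a timeout.
  for name in "$@"; do
    kubectl rollout status "deployment/$name" -n "$namespace" --timeout=300s ||
      failed="$failed $name"
  done
  if [ -n "$failed" ]; then
    echo "rollback problems:$failed" >&2
    return 1
  fi
}
```

Example use: `rollback_all veza-production veza-backend-api veza-frontend veza-stream-server`. Failures are collected rather than aborting the loop, so one broken service does not prevent the others from being rolled back.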

## Database Migration Rollback

If the rollback includes database changes, roll back the migration while the application is stopped:

```bash
# 1. Stop application
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# 2. Rollback database migration
# (Use your migration tool)
# Example with migrate tool. DATABASE_URL must be set in your local shell,
# and the image tag is a placeholder for the previous release:
kubectl run migrate-rollback --rm -it --image=veza-backend-api:previous \
  --restart=Never \
  -n veza-production \
  --env="DATABASE_URL=$DATABASE_URL" \
  -- migrate -path /migrations -database "$DATABASE_URL" down 1

# 3. Rollback application
kubectl rollout undo deployment/veza-backend-api -n veza-production

# 4. Restart application (use the replica count that was in place before step 1)
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
```
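
The steps above scale back up to a fixed replica count. One way to avoid hardcoding it is to record the count before stopping the application and restore it afterwards. The sketch below assumes this approach: `with_app_stopped` is a hypothetical helper, and the command passed to it stands in for steps 2 and 3.

```bash
# Hypothetical helper: scales a deployment to zero, runs the given command,
# then restores the replica count recorded beforehand.
# Usage: with_app_stopped <deployment> <namespace> <command> [args...]
with_app_stopped() {
  deployment="$1"
  namespace="$2"
  shift 2
  # Record the desired replica count before touching the deployment.
  replicas=$(kubectl get deployment "$deployment" -n "$namespace" \
    -o jsonpath='{.spec.replicas}') || return 1
  kubectl scale deployment "$deployment" --replicas=0 -n "$namespace" || return 1
  if ! "$@"; then
    # Leave the application stopped so an operator can inspect the database.
    echo "command failed; $deployment left at 0 replicas (was $replicas)" >&2
    return 1
  fi
  kubectl scale deployment "$deployment" --replicas="$replicas" -n "$namespace"
}
```

Example use: `with_app_stopped veza-backend-api veza-production run_migration_rollback`, where `run_migration_rollback` is a hypothetical shell function wrapping the migration and `rollout undo` commands. Leaving the application stopped on failure is a design choice for this sketch; restoring it immediately may be preferable if downtime matters more than schema consistency.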

## Verification Checklist

- [ ] Previous version identified
- [ ] Rollback executed
- [ ] Pods are running and ready
- [ ] Health checks passing
- [ ] Application logs show no errors
- [ ] Critical endpoints responding
- [ ] Metrics normalized
- [ ] Users can access platform
- [ ] Monitoring alerts cleared

## Troubleshooting

### Rollback Fails

```bash
# Check deployment status
kubectl describe deployment veza-backend-api -n veza-production

# Check pod events
kubectl describe pod <pod-name> -n veza-production

# Check image availability
kubectl get pod <pod-name> -n veza-production -o jsonpath='{.spec.containers[0].image}'

# If the image is missing from the registry, rebuild it or roll back to a revision whose image still exists
```

### Pods Not Starting

```bash
# Check pod logs
kubectl logs <pod-name> -n veza-production

# Check resource constraints
kubectl describe pod <pod-name> -n veza-production | grep -E -A 5 "Limits|Requests"

# Check node resources
kubectl top nodes
```

### Application Still Failing After Rollback

```bash
# Verify correct version is deployed
kubectl get deployment veza-backend-api -n veza-production -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check if issue is in previous version too
kubectl logs <pod-name> -n veza-production

# If so, roll back further (Step 3, Option B) or investigate the root cause
```
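
The image check above can be turned into a pass/fail comparison, which is easier to read under pressure than a raw image string. This is a sketch only: `check_deployed_image` is a hypothetical helper, and the expected image reference must come from your own release records.

```bash
# Hypothetical helper: compares the image in the deployment's pod template
# with the image you expect to be running after the rollback.
# Usage: check_deployed_image <expected-image>
check_deployed_image() {
  expected="$1"
  actual=$(kubectl get deployment veza-backend-api -n veza-production \
    -o jsonpath='{.spec.template.spec.containers[0].image}') || return 1
  if [ "$actual" = "$expected" ]; then
    echo "image matches: $actual"
  else
    echo "image mismatch: expected $expected, got $actual" >&2
    return 1
  fi
}
```

A mismatch suggests the rollback targeted a different revision than intended. In that case, re-check the rollout history before rolling back again.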

## Post-Rollback Tasks

1. **Investigate Root Cause**
    - Review deployment logs
    - Check application logs
    - Identify what caused failure

2. **Fix Issue**
    - Address root cause
    - Test fix in staging
    - Prepare new deployment

3. **Document Incident**
    - Record the rollback steps that were taken
    - Note any issues encountered
    - Update deployment process if needed

4. **Notify Stakeholders**
    - Send incident report
    - Update status page
    - Schedule post-mortem if needed

## Prevention

To reduce the need for future rollbacks:

- **Automated Testing**: Run the full test suite before every deployment
- **Staged Rollouts**: Use canary or blue-green deployments
- **Health Checks**: Expose comprehensive health check endpoints
- **Monitoring**: Set up real-time monitoring and alerting
- **Gradual Rollout**: Deploy to a small percentage first

## References

- [Kubernetes Rollout Documentation](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-back-a-deployment)
- [Deployment Best Practices](../README.md)