Application Rollback Runbook

This runbook describes the procedure for rolling back a failed application deployment.

Prerequisites

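  • Access to the Kubernetes cluster, with permission to update Deployments in the veza-production namespace
  • kubectl configured for that cluster
  • A previous Deployment revision available in the rollout history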

Detection

Automatic Detection

Health checks will automatically detect:

  • Application crashes
  • High error rates
  • Slow response times
  • Failed readiness probes

Manual Detection

# Check pod status
kubectl get pods -n veza-production -l app=veza-backend-api

# Check deployment status
kubectl rollout status deployment/veza-backend-api -n veza-production

# Check application logs
kubectl logs -f deployment/veza-backend-api -n veza-production

# Check resource usage (requires metrics-server in the cluster)
kubectl top pods -n veza-production
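
The pod check above can be narrowed to the pods that are not Ready. This is a minimal sketch, assuming a POSIX-style shell with awk; the function name unready_pods is hypothetical and not part of the Veza tooling.

```shell
# Hypothetical helper: print the names of pods whose Ready condition is not "True".
# Arguments: namespace, label selector.
unready_pods() {
  local namespace="$1" selector="$2"
  # One line per pod: <name><TAB><status of the Ready condition>
  kubectl get pods -n "$namespace" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    awk -F'\t' '$1 != "" && $2 != "True" { print $1 }'
}
```

Running `unready_pods veza-production app=veza-backend-api` prints nothing when every matching pod is Ready.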

Rollback Procedure

Step 1: Verify Issue

# Check current deployment
kubectl get deployment veza-backend-api -n veza-production -o yaml

# Check recent events
kubectl get events -n veza-production --sort-by='.lastTimestamp' | tail -20

# Verify health endpoint
curl https://api.veza.com/health

Step 2: Check Rollback History

# View deployment history
kubectl rollout history deployment/veza-backend-api -n veza-production

# View details of a specific revision (replace <N> with a revision number from the history)
kubectl rollout history deployment/veza-backend-api -n veza-production --revision=<N>
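
To find the revision number to inspect, the history output can be parsed. The sketch below assumes the usual tabular output of `kubectl rollout history` (a REVISION column of integers); previous_revision is a hypothetical helper name.

```shell
# Hypothetical helper: print the second-highest revision number of a Deployment,
# or fail if fewer than two revisions are recorded.
# Arguments: namespace, deployment name.
previous_revision() {
  local namespace="$1" deployment="$2"
  kubectl rollout history "deployment/$deployment" -n "$namespace" |
    awk '$1 ~ /^[0-9]+$/ { print $1 }' |
    sort -n |
    awk '{ revs[NR] = $1 } END { if (NR >= 2) print revs[NR - 1]; else exit 1 }'
}
```

For example, `kubectl rollout history deployment/veza-backend-api -n veza-production --revision="$(previous_revision veza-production veza-backend-api)"` shows the pod template of the revision just before the current one.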

Step 3: Execute Rollback

Option A: Roll Back to the Previous Version

# Roll back to the previous revision
kubectl rollout undo deployment/veza-backend-api -n veza-production

# Monitor rollback progress
kubectl rollout status deployment/veza-backend-api -n veza-production

Option B: Roll Back to a Specific Revision

# Roll back to a specific revision (replace <N> with a revision number from Step 2)
kubectl rollout undo deployment/veza-backend-api -n veza-production --to-revision=<N>

# Monitor rollback progress
kubectl rollout status deployment/veza-backend-api -n veza-production

Step 4: Verify Rollback

# Check pod status
kubectl get pods -n veza-production -l app=veza-backend-api

# Check deployment status
kubectl get deployment veza-backend-api -n veza-production

# Verify pods are ready
kubectl wait --for=condition=ready pod \
  -l app=veza-backend-api \
  -n veza-production \
  --timeout=300s

# Check recent application logs (without -f, so the command returns)
kubectl logs deployment/veza-backend-api -n veza-production --tail=100

# Test health endpoint
curl https://api.veza.com/health

# Test critical endpoints
curl https://api.veza.com/api/v1/tracks
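
The health endpoint may take a short time to recover after the rollback, so a single curl can give a false negative. This is a sketch of a retry loop, assuming the endpoint returns a 2xx status when healthy; wait_for_healthy and its default values are hypothetical.

```shell
# Hypothetical helper: poll a URL until it answers with a successful HTTP status.
# Arguments: url, attempts (default 10), delay in seconds (default 5).
wait_for_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error statuses (4xx/5xx)
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}
```

For example: `wait_for_healthy https://api.veza.com/health 12 5`.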

Step 5: Verify Application Functionality

# Run smoke tests
# (Use your application's test suite)

# Check metrics
kubectl top pods -n veza-production

# Monitor error rates
# (Check monitoring dashboard)
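
If no dedicated smoke test suite is available, a basic status-code check over a few endpoints can stand in. This is a sketch only: smoke_test is a hypothetical helper, and it assumes any 2xx status counts as a pass.

```shell
# Hypothetical helper: request each path and report its HTTP status.
# Arguments: base URL, then one or more paths.
# Returns non-zero if any path answers with a non-2xx status.
smoke_test() {
  local base_url="$1"
  shift
  local failures=0 path code
  for path in "$@"; do
    # On connection errors curl exits non-zero; record the status as 000
    code="$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$base_url$path")" || code="000"
    case "$code" in
      2??) echo "OK   $path $code" ;;
      *)   echo "FAIL $path $code"; failures=$((failures + 1)) ;;
    esac
  done
  [ "$failures" -eq 0 ]
}
```

For example: `smoke_test https://api.veza.com /health /api/v1/tracks`.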

Multi-Service Rollback

If multiple services need to be rolled back:

# Roll back the backend API
kubectl rollout undo deployment/veza-backend-api -n veza-production

# Roll back the frontend
kubectl rollout undo deployment/veza-frontend -n veza-production

# Roll back the chat server
kubectl rollout undo deployment/veza-chat-server -n veza-production

# Monitor all rollbacks
kubectl rollout status deployment/veza-backend-api -n veza-production
kubectl rollout status deployment/veza-frontend -n veza-production
kubectl rollout status deployment/veza-chat-server -n veza-production
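
The same sequence can be expressed as a loop, which keeps the undo and status steps in sync when the list of services changes. This is a sketch; rollback_all is a hypothetical helper and the 300s timeout is an assumed value.

```shell
# Hypothetical helper: undo the latest rollout of several Deployments,
# then wait for each one to finish.
# Arguments: namespace, then one or more deployment names.
rollback_all() {
  local namespace="$1"
  shift
  local failed=0 name
  # Trigger every rollback first so they proceed in parallel
  for name in "$@"; do
    kubectl rollout undo "deployment/$name" -n "$namespace" || failed=1
  done
  # Then wait for each rollout to complete
  for name in "$@"; do
    kubectl rollout status "deployment/$name" -n "$namespace" --timeout=300s || failed=1
  done
  return "$failed"
}
```

For example: `rollback_all veza-production veza-backend-api veza-frontend veza-chat-server`.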

Database Migration Rollback

If the rollback includes database changes:

# 1. Stop the application
kubectl scale deployment veza-backend-api --replicas=0 -n veza-production

# 2. Roll back the database migration
# (Use your migration tool)
# Example with the migrate tool. The image must contain the migration files
# being reverted, and DATABASE_URL must be set in your local shell.
kubectl run migrate-rollback --rm -it --image=veza-backend-api:previous \
  -n veza-production \
  --restart=Never \
  --env="DATABASE_URL=$DATABASE_URL" \
  -- migrate -path /migrations -database "$DATABASE_URL" down 1

# 3. Roll back the application
kubectl rollout undo deployment/veza-backend-api -n veza-production

# 4. Restart the application (use the replica count from before step 1)
kubectl scale deployment veza-backend-api --replicas=3 -n veza-production
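
The replica count in step 4 is hard-coded, so it can drift from what was actually running. One way to avoid that is to record the count before scaling down. This is a sketch; stop_app and start_app are hypothetical helper names.

```shell
# Hypothetical helper: scale a Deployment to zero and print the replica
# count it had before, so the count can be restored later.
# Arguments: namespace, deployment name.
stop_app() {
  local namespace="$1" deployment="$2" replicas
  replicas="$(kubectl get deployment "$deployment" -n "$namespace" -o jsonpath='{.spec.replicas}')" || return 1
  # Send kubectl's own message to stderr so stdout carries only the count
  kubectl scale deployment "$deployment" --replicas=0 -n "$namespace" >&2 || return 1
  echo "$replicas"
}

# Hypothetical helper: restore a previously recorded replica count.
# Arguments: namespace, deployment name, replica count.
start_app() {
  kubectl scale deployment "$2" --replicas="$3" -n "$1"
}
```

Usage: run `replicas="$(stop_app veza-production veza-backend-api)"` in place of step 1, and `start_app veza-production veza-backend-api "$replicas"` in place of step 4.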

Verification Checklist

  • Previous version identified
  • Rollback executed
  • Pods are running and ready
  • Health checks passing
  • Application logs show no errors
  • Critical endpoints responding
  • Metrics normalized
  • Users can access platform
  • Monitoring alerts cleared

Troubleshooting

Rollback Fails

# Check deployment status
kubectl describe deployment veza-backend-api -n veza-production

# Check pod events
kubectl describe pod <pod-name> -n veza-production

# Check image availability
kubectl get pod <pod-name> -n veza-production -o jsonpath='{.spec.containers[0].image}'

# If the image is missing, rebuild it or roll back to a revision whose image is still available
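
A missing image usually shows up as a container stuck in a waiting state. The sketch below lists pods whose containers report ImagePullBackOff or ErrImagePull; image_pull_failures is a hypothetical helper name.

```shell
# Hypothetical helper: print pods with a container waiting on an image pull.
# Arguments: namespace, label selector.
image_pull_failures() {
  local namespace="$1" selector="$2"
  # One line per pod: <name><TAB><waiting reasons of its containers>
  kubectl get pods -n "$namespace" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' |
    awk -F'\t' '$2 ~ /ImagePullBackOff|ErrImagePull/ { print $1 "\t" $2 }'
}
```

For example: `image_pull_failures veza-production app=veza-backend-api`.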

Pods Not Starting

# Check pod logs
kubectl logs <pod-name> -n veza-production

# Check resource constraints
kubectl describe pod <pod-name> -n veza-production | grep -A 5 "Limits\|Requests"

# Check node resources
kubectl top nodes

Application Still Failing After Rollback

# Verify correct version is deployed
kubectl get deployment veza-backend-api -n veza-production -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check if issue is in previous version too
kubectl logs <pod-name> -n veza-production

# If the previous version fails too, roll back further (--to-revision) or investigate the root cause

Post-Rollback Tasks

  1. Investigate Root Cause

    • Review deployment logs
    • Check application logs
    • Identify what caused the failure
  2. Fix Issue

    • Address root cause
    • Test fix in staging
    • Prepare new deployment
  3. Document Incident

    • Document the rollback steps taken
    • Note any issues encountered
    • Update deployment process if needed
  4. Notify Stakeholders

    • Send incident report
    • Update status page
    • Schedule post-mortem if needed

Prevention

To reduce the need for future rollbacks:

  • Automated Testing: Run the full test suite before each deployment
  • Staged Rollouts: Use canary or blue-green deployments
  • Health Checks: Expose comprehensive health check endpoints
  • Monitoring: Use real-time monitoring and alerting
  • Gradual Rollout: Deploy to a small percentage of traffic first
