# ORIGIN_DEPLOYMENT_GUIDE.md

## πŸ“‹ EXECUTIVE SUMMARY

This document defines the complete production deployment guide for the Veza platform. It covers Infrastructure as Code (Terraform/Ansible), containerization (Docker/Incus), orchestration (Kubernetes), CI/CD pipelines, zero-downtime strategies, disaster recovery, monitoring, and operational procedures for secure, automated, and reversible deployments over a 24-month horizon.

## 🎯 OBJECTIVES

### Primary Objective

Establish an automated, secure, reproducible, zero-downtime deployment process for production, with rollback in under 5 minutes, multiple deployments per day, and an RTO under 4 hours in case of disaster.

### Secondary Objectives

- Full automation (Infrastructure as Code)
- Zero-downtime deployments (blue-green, canary)
- Automatic rollback on failure (< 5 min)
- Operational disaster recovery plan (RTO < 4h, RPO < 1h)
- Real-time monitoring and alerting (Prometheus + Grafana)

## πŸ“– TABLE OF CONTENTS

1. [Deployment Philosophy](#1-deployment-philosophy)
2. [Infrastructure as Code](#2-infrastructure-as-code)
3. [Containerization](#3-containerization)
4. [Kubernetes Orchestration](#4-kubernetes-orchestration)
5. [CI/CD Pipelines](#5-cicd-pipelines)
6. [Zero-Downtime Strategies](#6-zero-downtime-strategies)
7. [Configuration Management](#7-configuration-management)
8. [Secrets Management](#8-secrets-management)
9. [Monitoring & Observability](#9-monitoring--observability)
10. [Backup & Disaster Recovery](#10-backup--disaster-recovery)
11. [Scaling Strategy](#11-scaling-strategy)
12. [Operational Procedures](#12-operational-procedures)

## πŸ”’ IMMUTABLE RULES

1. **Infrastructure as Code**: 100% of infrastructure versioned (Terraform), no manual changes
2. **Immutable Infrastructure**: Never modify existing servers; always redeploy
3. **Zero Downtime**: No deployment may interrupt service (blue-green or canary is mandatory)
4. **Automated Rollback**: Automatic rollback if health checks fail (< 5 min)
5. **Version Control**: All configuration versioned in Git, no exceptions
6. **Secrets in Vault**: No plaintext secrets (HashiCorp Vault or equivalent)
7. **Testing in Staging**: Every deployment is tested in staging first
8. **Monitoring Required**: Alerting configured before going to production
9. **Backup Verification**: Backups tested monthly (restore test)
10. **Documentation**: Up-to-date runbooks for all critical procedures

## 1. DEPLOYMENT PHILOSOPHY

### 1.1 Deployment Principles

**Twelve-Factor App**:

1. **Codebase**: One codebase tracked in Git, many deploys
2. **Dependencies**: Explicitly declare and isolate (go.mod, Cargo.lock, package-lock.json)
3. **Config**: Store config in the environment (never in code)
4. **Backing Services**: Treat as attached resources (DB, Redis, S3)
5. **Build, Release, Run**: Strictly separate build and run stages
6. **Processes**: Execute the app as stateless processes
7. **Port Binding**: Export services via port binding
8. **Concurrency**: Scale out via the process model
9. **Disposability**: Fast startup and graceful shutdown
10. **Dev/Prod Parity**: Keep development, staging, and production similar
11. **Logs**: Treat logs as event streams
12. **Admin Processes**: Run admin/management tasks as one-off processes

### 1.2 Deployment Environments

| Environment | Purpose | Update Frequency | Users |
|-------------|---------|------------------|-------|
| **Development** | Local development | Continuous | Developers |
| **Staging** | Pre-production testing | Daily | QA, Product Team |
| **Production** | Live users | Multiple/day | All users |

### 1.3 Deployment Workflow

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Develop   β”‚ ─── git push ───> CI/CD Triggered
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Build    β”‚ ─── Tests, Linting, Security Scan
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Staging   β”‚ ─── Deploy to staging, E2E tests
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Production  β”‚ ─── Blue-Green / Canary deployment
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Monitor   β”‚ ─── Health checks, metrics, logs
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό (if issues)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Rollback   β”‚ ─── Automatic rollback < 5 min
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## 2. INFRASTRUCTURE AS CODE

### 2.1 Terraform Configuration

**Project Structure**:

```
terraform/
β”œβ”€β”€ environments/
β”‚   β”œβ”€β”€ production/
β”‚   β”‚   β”œβ”€β”€ main.tf
β”‚   β”‚   β”œβ”€β”€ variables.tf
β”‚   β”‚   β”œβ”€β”€ terraform.tfvars (encrypted)
β”‚   β”‚   └── outputs.tf
β”‚   └── staging/
β”‚       β”œβ”€β”€ main.tf
β”‚       β”œβ”€β”€ variables.tf
β”‚       β”œβ”€β”€ terraform.tfvars
β”‚       └── outputs.tf
β”œβ”€β”€ modules/
β”‚   β”œβ”€β”€ compute/
β”‚   β”‚   β”œβ”€β”€ main.tf
β”‚   β”‚   β”œβ”€β”€ variables.tf
β”‚   β”‚   └── outputs.tf
β”‚   β”œβ”€β”€ database/
β”‚   β”œβ”€β”€ networking/
β”‚   β”œβ”€β”€ storage/
β”‚   └── kubernetes/
└── backend.tf (Terraform state in S3)
```

**Example: Compute Module**:

```hcl
# terraform/modules/compute/main.tf
resource "aws_instance" "app_server" {
  count         = var.instance_count
  ami           = var.ami_id
  instance_type = var.instance_type

  vpc_security_group_ids = [aws_security_group.app.id]
  subnet_id              = var.subnet_ids[count.index % length(var.subnet_ids)]

  user_data = templatefile("${path.module}/user_data.sh", {
    environment = var.environment
  })

  tags = {
    Name        = "veza-app-${var.environment}-${count.index + 1}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_security_group" "app" {
  name        = "veza-app-${var.environment}"
  description = "Security group for Veza application servers"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

**Database Module**:

```hcl
# terraform/modules/database/main.tf
resource "aws_db_instance" "postgres" {
  identifier     = "veza-db-${var.environment}"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = var.instance_class

  allocated_storage     = var.allocated_storage
  max_allocated_storage = var.max_allocated_storage
  storage_encrypted     = true
  kms_key_id            = var.kms_key_id
  db_name  = var.database_name
  username = var.master_username
  password = var.master_password # From Vault

  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.database.name

  backup_retention_period = var.backup_retention_days
  backup_window           = "03:00-04:00"
  maintenance_window      = "mon:04:00-mon:05:00"

  multi_az            = var.multi_az
  publicly_accessible = false

  skip_final_snapshot = false
  # Note: timestamp() is re-evaluated on every plan, so this attribute shows a
  # perpetual diff; acceptable here since the value only matters at destroy time.
  final_snapshot_identifier = "veza-db-${var.environment}-final-snapshot-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  tags = {
    Name        = "veza-db-${var.environment}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }
}
```

**Terraform Workflow**:

```bash
# Initialize
cd terraform/environments/production
terraform init

# Plan (review changes)
terraform plan -out=tfplan

# Apply (execute changes)
terraform apply tfplan

# Destroy (cleanup)
terraform destroy
```

### 2.2 Ansible Configuration

**Playbook Structure**:

```
ansible/
β”œβ”€β”€ inventory/
β”‚   β”œβ”€β”€ production/
β”‚   β”‚   β”œβ”€β”€ hosts.yml
β”‚   β”‚   └── group_vars/
β”‚   └── staging/
β”‚       β”œβ”€β”€ hosts.yml
β”‚       └── group_vars/
β”œβ”€β”€ playbooks/
β”‚   β”œβ”€β”€ deploy-backend.yml
β”‚   β”œβ”€β”€ deploy-chat-server.yml
β”‚   β”œβ”€β”€ deploy-stream-server.yml
β”‚   └── deploy-frontend.yml
β”œβ”€β”€ roles/
β”‚   β”œβ”€β”€ common/
β”‚   β”œβ”€β”€ docker/
β”‚   β”œβ”€β”€ nginx/
β”‚   β”œβ”€β”€ postgres/
β”‚   └── monitoring/
└── ansible.cfg
```

**Deployment Playbook**:

```yaml
# ansible/playbooks/deploy-backend.yml
---
- name: Deploy Veza Backend API
  hosts: backend_servers
  become: yes

  vars:
    app_name: veza-backend-api
    app_version: "{{ lookup('env', 'VERSION') | default('latest') }}"
    docker_image: "registry.veza.app/{{ app_name }}:{{ app_version }}"

  tasks:
    - name: Pull Docker image
      docker_image:
        name: "{{ docker_image }}"
        source: pull

    - name: Stop old container
      docker_container:
        name: "{{ app_name }}"
        state: stopped
      ignore_errors: yes

    - name: Remove old container
      docker_container:
        name: "{{ app_name }}"
        state: absent
      ignore_errors: yes

    - name: Start new container
      docker_container:
        name: "{{ app_name }}"
        image: "{{ docker_image }}"
        state: started
        restart_policy: unless-stopped
        ports:
          - "8080:8080"
        env:
          DATABASE_URL: "{{ database_url }}"
          REDIS_URL: "{{ redis_url }}"
          JWT_SECRET: "{{ jwt_secret }}"
        volumes:
          - "/var/log/{{ app_name }}:/var/log/app"
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s

    - name: Wait for application to be healthy
      uri:
        url: http://localhost:8080/health
        status_code: 200
      register: result
      until: result.status == 200
      retries: 10
      delay: 5

    - name: Verify deployment
      debug:
        msg: "{{ app_name }} version {{ app_version }} deployed successfully"
```

## 3. CONTAINERIZATION

### 3.1 Docker Images

**Multi-Stage Build (Go)**:

```dockerfile
# veza-backend-api/Dockerfile
# Stage 1: Builder
FROM golang:1.21.5-alpine3.18 AS builder

WORKDIR /app

# Copy dependencies
COPY go.mod go.sum ./
RUN go mod download

# Copy source
COPY . .

# Build binary
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -ldflags="-w -s" -o main ./cmd/api

# Stage 2: Runner
FROM alpine:3.18

# Install CA certificates for HTTPS
RUN apk --no-cache add ca-certificates

# Use /app rather than /root: /root is not traversable by the non-root user below
WORKDIR /app

# Copy binary from builder
COPY --from=builder /app/main .

# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser
USER appuser

# Expose port
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
  CMD ["/app/main", "healthcheck"]

# Run
ENTRYPOINT ["./main"]
```

**Multi-Stage Build (Rust)**:

```dockerfile
# veza-chat-server/Dockerfile
FROM rust:1.75-alpine AS builder

WORKDIR /app

RUN apk add --no-cache musl-dev

# Cache dependencies with a dummy main.rs
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs && cargo build --release && rm -rf src

# Copy source
COPY . .
# Build binary (touch main.rs so cargo rebuilds it instead of reusing the dummy artifact)
RUN touch src/main.rs && cargo build --release

# Stage 2: Runner
FROM alpine:3.18

WORKDIR /app

# Copy binary
COPY --from=builder /app/target/release/veza-chat-server .

# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser
USER appuser

EXPOSE 8081

HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
  CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost:8081/health"]

ENTRYPOINT ["./veza-chat-server"]
```

**Frontend (React/Vite)**:

```dockerfile
# apps/web/Dockerfile
FROM node:20-alpine AS builder

WORKDIR /app

COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build

# Stage 2: Nginx
FROM nginx:1.25-alpine

COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf

EXPOSE 80

HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost/health"]

CMD ["nginx", "-g", "daemon off;"]
```

### 3.2 Docker Compose (Development)

```yaml
# docker-compose.yml
version: '3.9'

services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: veza_db
      POSTGRES_USER: veza
      POSTGRES_PASSWORD: ${DB_PASSWORD:-password}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U veza"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

  backend:
    build:
      context: ./veza-backend-api
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
      JWT_SECRET: ${JWT_SECRET}
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  chat-server:
    build:
      context: ./veza-chat-server
      dockerfile: Dockerfile
    ports:
      - "8081:8081"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  frontend:
    build:
      context: ./apps/web
      dockerfile: Dockerfile
    ports:
      - "3000:80"
    depends_on:
      - backend

volumes:
  postgres_data:
  redis_data:
```

## 4. KUBERNETES ORCHESTRATION

### 4.1 Kubernetes Manifests

**Deployment (Backend)**:

```yaml
# k8s/backend/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veza-backend
  namespace: veza-production
  labels:
    app: veza-backend
    version: v1.0.0
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: veza-backend
  template:
    metadata:
      labels:
        app: veza-backend
        version: v1.0.0
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: backend
          image: registry.veza.app/veza-backend-api:v1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: database-url
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: redis-url
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: jwt-secret
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
      imagePullSecrets:
        - name: registry-credentials
```

**Service**:

```yaml
# k8s/backend/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  type: ClusterIP
  selector:
    app: veza-backend
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
```

**Ingress**:

```yaml
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: veza-ingress
  namespace: veza-production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.veza.app
        - veza.app
      secretName: veza-tls
  rules:
    - host: api.veza.app
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: veza-backend
                port:
                  number: 80
    - host: veza.app
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: veza-frontend
                port:
                  number: 80
```

**HorizontalPodAutoscaler**:

```yaml
# k8s/backend/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: veza-backend-hpa
  namespace: veza-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
```
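The liveness and readiness probes in the Deployment above expect `/health` and `/ready` endpoints. A minimal Go sketch of what those handlers might look like (the handler names and the `ready` flag are illustrative, not taken from the Veza codebase):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync/atomic"
)

// ready flips to true once dependencies (DB, Redis) are confirmed reachable.
var ready atomic.Bool

// healthHandler answers the liveness probe: the process is up and serving.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}

// readyStatus returns the HTTP status the readiness probe should report:
// 503 until dependencies are up, so Kubernetes keeps the pod out of the
// Service endpoints; 200 once it may receive traffic.
func readyStatus() int {
	if !ready.Load() {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func readyHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(readyStatus())
}

func main() {
	http.HandleFunc("/health", healthHandler)
	http.HandleFunc("/ready", readyHandler)
	ready.Store(true) // flip after successful DB/Redis pings in a real service
	// http.ListenAndServe(":8080", nil) // serve the probes in a real deployment
	fmt.Println(readyStatus())
}
```

Separating the two probes matters with `maxUnavailable: 0`: liveness restarts a wedged process, while readiness only gates traffic during startup and rollout.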
## 5. CI/CD PIPELINES

### 5.1 GitHub Actions Workflow

```yaml
# .github/workflows/deploy-production.yml
name: Deploy to Production

on:
  push:
    branches:
      - main
    tags:
      - 'v*'

env:
  REGISTRY: registry.veza.app
  KUBE_NAMESPACE: veza-production

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run tests
        run: |
          make test-all

      - name: Security scan
        run: |
          make security-scan

  build-backend:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Login to Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_PASSWORD }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/veza-backend-api
          # type=sha uses format=long with an empty prefix so the tag matches
          # the full github.sha used by the deploy jobs below
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,format=long,prefix=

      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: ./veza-backend-api
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache,mode=max

  deploy-staging:
    needs: [build-backend]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v3

      - name: Deploy to Staging
        run: |
          kubectl set image deployment/veza-backend \
            backend=${{ env.REGISTRY }}/veza-backend-api:${{ github.sha }} \
            -n veza-staging
          kubectl rollout status deployment/veza-backend -n veza-staging --timeout=5m

      - name: Run E2E tests
        run: |
          npm run test:e2e -- --env=staging

  deploy-production:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v3

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'

      - name: Configure kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
          # Persist for the following steps; a plain `export KUBECONFIG=...`
          # would only last for this step
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"
      - name: Deploy to Production (Blue-Green)
        run: |
          # Deploy green environment
          kubectl apply -f k8s/backend/deployment-green.yaml
          kubectl rollout status deployment/veza-backend-green -n ${{ env.KUBE_NAMESPACE }} --timeout=10m

          # Run smoke tests
          make smoke-tests ENDPOINT=https://green.api.veza.app

          # Switch traffic to green
          kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
            -p '{"spec":{"selector":{"version":"green"}}}'

          # Wait for validation
          sleep 60

          # Monitor metrics
          if ! make verify-deployment; then
            echo "Deployment verification failed, rolling back..."
            kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
              -p '{"spec":{"selector":{"version":"blue"}}}'
            exit 1
          fi

          # Delete old blue deployment
          kubectl delete deployment veza-backend-blue -n ${{ env.KUBE_NAMESPACE }}

      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment ${{ job.status }}: ${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```

## 6. ZERO-DOWNTIME STRATEGIES

### 6.1 Blue-Green Deployment

**Process**:

1. **Blue** (current production) serves all traffic
2. Deploy **Green** (new version) in parallel
3. Test Green thoroughly (smoke tests, health checks)
4. Switch the load balancer from Blue to Green (instant cutover)
5. Monitor Green for issues (5-10 min)
6. If issues: roll back to Blue (instant)
7. If stable: Decommission Blue
**Kubernetes Implementation**:

```bash
# Deploy green
kubectl apply -f k8s/backend/deployment-green.yaml

# Wait for readiness
kubectl wait --for=condition=available --timeout=10m deployment/veza-backend-green

# Switch service selector
kubectl patch service veza-backend -p '{"spec":{"selector":{"version":"green"}}}'

# Monitor
watch kubectl get pods -l app=veza-backend

# Rollback if needed
kubectl patch service veza-backend -p '{"spec":{"selector":{"version":"blue"}}}'
```

### 6.2 Canary Deployment

**Process**:

1. Deploy the new version (canary) with 5% of traffic
2. Monitor metrics (error rate, latency)
3. Gradually increase traffic: 5% β†’ 25% β†’ 50% β†’ 100%
4. At each stage, verify metrics are healthy
5. If issues are detected: roll back immediately

**Kubernetes with Istio**:

```yaml
# k8s/canary/virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: veza-backend
spec:
  hosts:
    - veza-backend
  http:
    - match:
        - headers:
            canary:
              exact: "true"
      route:
        - destination:
            host: veza-backend
            subset: canary
    - route:
        - destination:
            host: veza-backend
            subset: stable
          weight: 95
        - destination:
            host: veza-backend
            subset: canary
          weight: 5
```

**Automated Canary with Flagger**:

```yaml
# k8s/canary/flagger-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -s http://veza-backend-canary/health | grep -q ok"
```
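The staged promotion above (5% β†’ 25% β†’ 50% β†’ 100%, with immediate rollback on unhealthy metrics) can be sketched as a small decision function. This is illustrative only, not Flagger's actual algorithm; the `healthy` input stands in for the metric gates (error rate, latency):

```go
package main

import "fmt"

// stages of the canary rollout described above: 5% β†’ 25% β†’ 50% β†’ 100%.
var stages = []int{5, 25, 50, 100}

// nextWeight returns the canary weight to apply after the current stage,
// or -1 to signal an immediate rollback when the health gate fails.
func nextWeight(current int, healthy bool) int {
	if !healthy {
		return -1 // rollback: shift all traffic back to stable
	}
	for _, s := range stages {
		if s > current {
			return s
		}
	}
	return 100 // already fully promoted
}

func main() {
	w := 5
	for w != 100 {
		w = nextWeight(w, true)
		fmt.Println("promoting canary to", w, "%")
	}
}
```

The key property is that rollback is a single decision at every stage, never a multi-step unwind: any unhealthy reading routes 100% of traffic back to stable.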
## 7. CONFIGURATION MANAGEMENT

### 7.1 ConfigMap (Non-Sensitive Config)

```yaml
# k8s/backend/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: veza-backend-config
  namespace: veza-production
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  API_RATE_LIMIT: "300"
  MAX_UPLOAD_SIZE: "500MB"
  CORS_ORIGINS: "https://veza.app,https://www.veza.app"
```

### 7.2 Secrets (Sensitive Data)

```yaml
# k8s/backend/secret.yaml (encrypted with SOPS or sealed-secrets)
apiVersion: v1
kind: Secret
metadata:
  name: veza-secrets
  namespace: veza-production
type: Opaque
data:
  # base64-encoded values omitted
  database-url: ""
  redis-url: ""
  jwt-secret: ""
  stripe-api-key: ""
```

**Create Secret from Vault**:

```bash
# Fetch from Vault and create the K8s secret
# (kubectl base64-encodes the value itself; piping through base64 would double-encode it)
kubectl create secret generic veza-secrets \
  --from-literal=database-url="$(vault kv get -field=database_url secret/veza/production)" \
  -n veza-production
```

## 8. SECRETS MANAGEMENT

### 8.1 HashiCorp Vault

**Vault Structure**:

```
secret/
└── veza/
    β”œβ”€β”€ production/
    β”‚   β”œβ”€β”€ database_url
    β”‚   β”œβ”€β”€ redis_url
    β”‚   β”œβ”€β”€ jwt_secret
    β”‚   β”œβ”€β”€ stripe_api_key
    β”‚   β”œβ”€β”€ aws_access_key
    β”‚   └── aws_secret_key
    └── staging/
        └── ...
```

**Store Secret**:

```bash
# Write secret
vault kv put secret/veza/production \
  database_url="postgresql://..." \
  redis_url="redis://..." \
  jwt_secret="..."

# Read secret
vault kv get secret/veza/production

# Rotate secret (new version)
vault kv put secret/veza/production jwt_secret="new-secret"
```

**Vault Agent Injector (Kubernetes)**:

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "veza-backend"
    vault.hashicorp.com/agent-inject-secret-database: "secret/data/veza/production"
    vault.hashicorp.com/agent-inject-template-database: |
      {{- with secret "secret/data/veza/production" -}}
      export DATABASE_URL="{{ .Data.data.database_url }}"
      {{- end }}
```
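The agent-inject template above renders secrets as `export KEY="value"` lines into a file the container sources at startup. A small Go sketch of parsing that rendered format, for services that prefer reading the file directly (the `parseVaultEnv` helper is illustrative, not part of any Vault SDK):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseVaultEnv parses lines like `export DATABASE_URL="postgres://..."`,
// as rendered by the Vault agent template above, into a key/value map.
// Lines that are not `export KEY=VALUE` pairs are ignored.
func parseVaultEnv(content string) map[string]string {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "export ") {
			continue
		}
		kv := strings.SplitN(strings.TrimPrefix(line, "export "), "=", 2)
		if len(kv) != 2 {
			continue
		}
		out[kv[0]] = strings.Trim(kv[1], `"`) // strip surrounding quotes
	}
	return out
}

func main() {
	env := parseVaultEnv(`export DATABASE_URL="postgresql://veza@db:5432/veza_db"`)
	fmt.Println(env["DATABASE_URL"])
}
```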
## 9. MONITORING & OBSERVABILITY

### 9.1 Prometheus + Grafana

**Prometheus Configuration**:

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'veza-backend'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: veza-backend
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: $1:8080

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
```

**Grafana Dashboards**:

- **API Latency**: p50, p95, p99 response times
- **Throughput**: Requests per second
- **Error Rate**: 4xx, 5xx errors
- **Database**: Query time, connections, slow queries
- **Cache Hit Rate**: Redis hit/miss ratio

### 9.2 Logging (ELK Stack)

**Filebeat Configuration**:

```yaml
# filebeat/filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/lib/docker/containers/"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "veza-logs-%{+yyyy.MM.dd}"
```

### 9.3 Tracing (Jaeger)

**OpenTelemetry Integration**:

```go
// Go - OpenTelemetry setup
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0" // match your SDK's semconv version
)

func initTracer() (*trace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://jaeger:14268/api/traces")))
	if err != nil {
		return nil, err
	}

	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("veza-backend-api"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```
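The latency panels above report p50/p95/p99. In production these come from PromQL (`histogram_quantile` over histogram buckets); purely as a reference for what those numbers mean, here is a nearest-rank percentile sketch in Go (the `percentile` helper is illustrative, not how Prometheus computes quantiles):

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the pth percentile (p in 0..100) of latency samples
// using the nearest-rank method: the value at rank ceil(p/100 * N).
// The rank is computed in integer arithmetic to avoid float rounding.
func percentile(samples []float64, p int) float64 {
	if len(samples) == 0 {
		return 0
	}
	s := append([]float64(nil), samples...) // copy so the caller's slice is untouched
	sort.Float64s(s)
	rank := (p*len(s) + 99) / 100 // integer ceil(p/100 * N)
	if rank < 1 {
		rank = 1
	}
	if rank > len(s) {
		rank = len(s)
	}
	return s[rank-1]
}

func main() {
	latenciesMs := make([]float64, 0, 100)
	for i := 1; i <= 100; i++ {
		latenciesMs = append(latenciesMs, float64(i))
	}
	fmt.Println(percentile(latenciesMs, 95)) // prints 95
}
```

Note that averaging percentiles across pods is not meaningful; aggregate the underlying histogram buckets (as `histogram_quantile` does) before taking the quantile.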
## 10. BACKUP & DISASTER RECOVERY

### 10.1 Database Backups

**Automated Backup Strategy**:

- **Daily**: Full backup (3 AM UTC)
- **Hourly**: Incremental backup
- **Retention**: 30 days daily, 12 weeks weekly, 2 years monthly

**Backup Script**:

```bash
#!/bin/bash
# scripts/backup-database.sh

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/postgres"
DATABASE="veza_db"

# Full backup
pg_dump -Fc -f "$BACKUP_DIR/veza_db_$DATE.dump" "$DATABASE"

# Encrypt, then remove the plaintext dump
gpg --encrypt --recipient backup@veza.app "$BACKUP_DIR/veza_db_$DATE.dump"
rm "$BACKUP_DIR/veza_db_$DATE.dump"

# Upload to S3
aws s3 cp "$BACKUP_DIR/veza_db_$DATE.dump.gpg" s3://veza-backups/postgres/

# Clean up local backups older than 7 days
find "$BACKUP_DIR" -name "*.dump.gpg" -mtime +7 -delete
```

**Restore Procedure**:

```bash
#!/bin/bash
# scripts/restore-database.sh

BACKUP_FILE=$1

# Download from S3
aws s3 cp "s3://veza-backups/postgres/$BACKUP_FILE" /tmp/

# Decrypt
gpg --decrypt "/tmp/$BACKUP_FILE" > "/tmp/backup.dump"

# Restore
pg_restore -d veza_db "/tmp/backup.dump"
```

### 10.2 Disaster Recovery Plan

**RTO (Recovery Time Objective)**: < 4 hours
**RPO (Recovery Point Objective)**: < 1 hour

**Recovery Procedures**:

1. **Database Failure**: Fail over to the standby replica (< 5 min)
2. **Application Failure**: Roll back the deployment (< 5 min)
3. **Complete Region Failure**: Fail over to the DR region (< 4 hours)

## 11. SCALING STRATEGY

### 11.1 Horizontal Scaling

**Auto-Scaling Rules**:

- **CPU > 70%**: Scale up
- **CPU < 30%**: Scale down (after 5 min of stability)
- **Memory > 80%**: Scale up
- **Request queue > 100**: Scale up

### 11.2 Database Scaling

**Read Replicas**:

- 2 read replicas minimum
- Route read queries to replicas
- Write queries go to the primary only

**Connection Pooling** (PgBouncer):

```ini
[databases]
veza_db = host=postgres port=5432 dbname=veza_db

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
```
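The auto-scaling rules above can be read as a pure decision function. The sketch below encodes the stated thresholds and clamps to the HPA's 3-10 replica range; `desiredReplicas` is a hypothetical helper (in a real cluster the HPA computes this, including the 5-minute stabilization window, which the caller would enforce here):

```go
package main

import "fmt"

// desiredReplicas applies the scaling rules above: scale up when CPU > 70%,
// memory > 80%, or the request queue exceeds 100; scale down when CPU < 30%.
// The result is clamped to the HPA's 3..10 replica range.
func desiredReplicas(current int, cpuPct, memPct float64, queueLen int) int {
	const minReplicas, maxReplicas = 3, 10
	next := current
	switch {
	case cpuPct > 70 || memPct > 80 || queueLen > 100:
		next = current + 1
	case cpuPct < 30:
		next = current - 1
	}
	if next < minReplicas {
		next = minReplicas
	}
	if next > maxReplicas {
		next = maxReplicas
	}
	return next
}

func main() {
	fmt.Println(desiredReplicas(3, 85, 50, 10)) // prints 4: CPU above the 70% threshold
}
```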
## 12. OPERATIONAL PROCEDURES

### 12.1 Deployment Checklist

**Pre-Deployment**:

- [ ] All tests pass (unit, integration, E2E)
- [ ] Security scan completed (no critical vulnerabilities)
- [ ] Database migrations tested in staging
- [ ] Rollback plan documented
- [ ] Monitoring dashboards ready
- [ ] On-call engineer notified
- [ ] Deployment window scheduled (low-traffic period)

**During Deployment**:

- [ ] Monitor error rates in real time
- [ ] Monitor response times (p95, p99)
- [ ] Check logs for errors
- [ ] Verify database migrations applied
- [ ] Test critical user flows

**Post-Deployment**:

- [ ] Verify all services healthy
- [ ] Run smoke tests
- [ ] Monitor for 30 minutes
- [ ] Update deployment log
- [ ] Notify stakeholders

### 12.2 Rollback Procedure

**Immediate Rollback** (< 5 min):

```bash
# Kubernetes
kubectl rollout undo deployment/veza-backend -n veza-production

# Verify
kubectl rollout status deployment/veza-backend -n veza-production

# Check logs
kubectl logs -f deployment/veza-backend -n veza-production
```

### 12.3 Incident Response

**Severity Levels**:

- **P0 (Critical)**: Production down, data breach
- **P1 (High)**: Major feature broken, performance degradation
- **P2 (Medium)**: Minor feature broken
- **P3 (Low)**: Cosmetic issues

**Response Procedure**:

1. Acknowledge the incident (< 5 min)
2. Assess severity
3. Notify stakeholders
4. Mitigate (rollback, hotfix, scaling)
5. Root cause analysis
6. Post-mortem
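The post-deployment verification loop (poll health, trigger the rollback when it never passes) can be sketched as follows. `verifyDeployment` is a hypothetical helper mirroring the Ansible `until`/`retries` task earlier in this guide; the check function is injected so the retry logic stays testable:

```go
package main

import (
	"fmt"
	"time"
)

// verifyDeployment polls a health check up to `retries` times, waiting `delay`
// between attempts. It returns false when the service never becomes healthy,
// which is the signal to run the automated rollback (`kubectl rollout undo`).
func verifyDeployment(check func() bool, retries int, delay time.Duration) bool {
	for i := 0; i < retries; i++ {
		if check() {
			return true
		}
		time.Sleep(delay)
	}
	return false
}

func main() {
	// Simulated service that becomes healthy on its third probe.
	attempts := 0
	healthyAfterThree := func() bool {
		attempts++
		return attempts >= 3
	}
	ok := verifyDeployment(healthyAfterThree, 10, 10*time.Millisecond)
	fmt.Println(ok) // prints true: healthy on the third attempt
}
```

With 10 retries at a 5-second delay (the values used in the Ansible task), the worst case is under a minute, comfortably inside the 5-minute rollback budget.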
## βœ… VALIDATION CHECKLIST

### Infrastructure

- [ ] Infrastructure as Code (Terraform) complete
- [ ] All resources versioned in Git
- [ ] Secrets in Vault (no plaintext)
- [ ] Automated provisioning tested

### Deployment

- [ ] CI/CD pipeline functional
- [ ] Zero-downtime deployment strategy (blue-green or canary)
- [ ] Automated rollback configured
- [ ] Health checks implemented

### Monitoring

- [ ] Prometheus + Grafana dashboards
- [ ] Alerting configured (PagerDuty/Slack)
- [ ] Logging centralized (ELK Stack)
- [ ] Tracing implemented (Jaeger)

### Disaster Recovery

- [ ] Automated backups (daily + hourly)
- [ ] Backup restoration tested
- [ ] Failover procedure documented
- [ ] RTO < 4h, RPO < 1h validated

## πŸ“Š SUCCESS METRICS

### Deployment Metrics

- **Deployment Frequency**: Multiple per day
- **Lead Time**: < 1 hour (commit to production)
- **MTTR (Mean Time To Recovery)**: < 5 minutes
- **Change Failure Rate**: < 5%

### Operational Metrics

- **Uptime**: > 99.9%
- **RTO**: < 4 hours
- **RPO**: < 1 hour
- **Deployment Success Rate**: > 95%

## πŸ”„ VERSION HISTORY

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-11-02 | Initial version, complete deployment guide |

---

## ⚠️ WARNING

**THIS GUIDE IS IMMUTABLE**

---

**Document created by**: DevOps Team + SRE
**Creation date**: 2025-11-02
**Next review**: Quarterly (2026-02-01)
**Owner**: DevOps Lead
**Status**: βœ… **APPROVED AND LOCKED**