# ORIGIN_DEPLOYMENT_GUIDE.md

## 📋 EXECUTIVE SUMMARY

This document defines the complete production deployment guide for the Veza platform. It covers Infrastructure as Code (Terraform/Ansible), containerization (Docker/Incus), orchestration (Kubernetes), CI/CD pipelines, zero-downtime strategies, disaster recovery, monitoring, and operational procedures for secure, automated, and reversible deployments over 24 months.

## 🎯 OBJECTIVES

### Primary Objective

Establish an automated, secure, reproducible, zero-downtime production deployment process with rollback in under 5 minutes, multiple deployments per day, and an RTO under 4 hours in case of disaster.

### Secondary Objectives

- Full automation (Infrastructure as Code)
- Zero-downtime deployments (blue-green, canary)
- Automatic rollback on failure (< 5 min)
- Operational disaster recovery plan (RTO < 4h, RPO < 1h)
- Real-time monitoring and alerting (Prometheus + Grafana)

## 📖 TABLE OF CONTENTS

1. [Deployment Philosophy](#1-deployment-philosophy)
2. [Infrastructure as Code](#2-infrastructure-as-code)
3. [Containerization](#3-containerization)
4. [Kubernetes Orchestration](#4-kubernetes-orchestration)
5. [CI/CD Pipelines](#5-cicd-pipelines)
6. [Zero-Downtime Strategies](#6-zero-downtime-strategies)
7. [Configuration Management](#7-configuration-management)
8. [Secrets Management](#8-secrets-management)
9. [Monitoring & Observability](#9-monitoring--observability)
10. [Backup & Disaster Recovery](#10-backup--disaster-recovery)
11. [Scaling Strategy](#11-scaling-strategy)
12. [Operational Procedures](#12-operational-procedures)
13. [Priority Security Fixes](#13-priority-security-fixes)
14. [Ethical Deployment Checklist](#14-ethical-deployment-checklist)
15. [JWT HS256 → RS256 Migration Plan](#15-jwt-hs256--rs256-migration-plan)

## 🔒 IMMUTABLE RULES

1. **Infrastructure as Code**: 100% of infrastructure is versioned (Terraform); no manual changes
2. **Immutable Infrastructure**: Never modify existing servers; always redeploy
3. **Zero Downtime**: No deployment may interrupt service (blue-green or canary required)
4. **Automated Rollback**: Automatic rollback if health checks fail (< 5 min)
5. **Version Control**: All configs versioned in Git; no exceptions
6. **Secrets in Vault**: No secret in plaintext (HashiCorp Vault or equivalent)
7. **Testing in Staging**: Every deployment is tested in staging first
8. **Monitoring Required**: Alerting configured before going to production
9. **Backup Verification**: Backups tested monthly (restore test)
10. **Documentation**: Runbooks kept up to date for all critical procedures

## 1. DEPLOYMENT PHILOSOPHY

### 1.1 Deployment Principles

**Twelve-Factor App**:

1. **Codebase**: One codebase tracked in Git, many deploys
2. **Dependencies**: Explicitly declare and isolate (go.mod, Cargo.lock, package-lock.json)
3. **Config**: Store config in the environment (never in code)
4. **Backing Services**: Treat as attached resources (DB, Redis, S3)
5. **Build, Release, Run**: Strictly separate build and run stages
6. **Processes**: Execute the app as stateless processes
7. **Port Binding**: Export services via port binding
8. **Concurrency**: Scale out via the process model
9. **Disposability**: Fast startup and graceful shutdown
10. **Dev/Prod Parity**: Keep development, staging, and production similar
11. **Logs**: Treat logs as event streams
12. **Admin Processes**: Run admin/management tasks as one-off processes

### 1.2 Deployment Environments

| Environment | Purpose | Update Frequency | Users |
|-------------|---------|------------------|-------|
| **Development** | Local development | Continuous | Developers |
| **Staging** | Pre-production testing | Daily | QA, Product Team |
| **Production** | Live users | Multiple/day | All users |

### 1.3 Deployment Workflow

```
┌─────────────┐
│   Develop   │ ─── git push ───> CI/CD Triggered
└─────────────┘
      │
      ▼
┌─────────────┐
│    Build    │ ─── Tests, Linting, Security Scan
└─────────────┘
      │
      ▼
┌─────────────┐
│   Staging   │ ─── Deploy to staging, E2E tests
└─────────────┘
      │
      ▼
┌─────────────┐
│ Production  │ ─── Blue-Green / Canary deployment
└─────────────┘
      │
      ▼
┌─────────────┐
│   Monitor   │ ─── Health checks, metrics, logs
└─────────────┘
      │
      ▼ (if issues)
┌─────────────┐
│  Rollback   │ ─── Automatic rollback < 5 min
└─────────────┘
```

## 2. INFRASTRUCTURE AS CODE

### 2.1 Terraform Configuration

**Project Structure**:

```
terraform/
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars (encrypted)
│   │   └── outputs.tf
│   └── staging/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── outputs.tf
├── modules/
│   ├── compute/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── database/
│   ├── networking/
│   ├── storage/
│   └── kubernetes/
└── backend.tf (Terraform state in S3)
```

**Example: Compute Module**:

```hcl
# terraform/modules/compute/main.tf
resource "aws_instance" "app_server" {
  count         = var.instance_count
  ami           = var.ami_id
  instance_type = var.instance_type

  vpc_security_group_ids = [aws_security_group.app.id]
  subnet_id              = var.subnet_ids[count.index % length(var.subnet_ids)]

  user_data = templatefile("${path.module}/user_data.sh", {
    environment = var.environment
  })

  tags = {
    Name        = "veza-app-${var.environment}-${count.index + 1}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_security_group" "app" {
  name        = "veza-app-${var.environment}"
  description = "Security group for Veza application servers"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

**Database Module**:

```hcl
# terraform/modules/database/main.tf
resource "aws_db_instance" "postgres" {
  identifier     = "veza-db-${var.environment}"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = var.instance_class

  allocated_storage     = var.allocated_storage
  max_allocated_storage = var.max_allocated_storage
  storage_encrypted     = true
  kms_key_id            = var.kms_key_id

  db_name  = var.database_name
  username = var.master_username
  password = var.master_password # From Vault

  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.database.name

  backup_retention_period = var.backup_retention_days
  backup_window           = "03:00-04:00"
  maintenance_window      = "mon:04:00-mon:05:00"

  multi_az            = var.multi_az
  publicly_accessible = false

  skip_final_snapshot       = false
  final_snapshot_identifier = "veza-db-${var.environment}-final-snapshot-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  tags = {
    Name        = "veza-db-${var.environment}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }
}
```

**Terraform Workflow**:

```bash
# Initialize
cd terraform/environments/production
terraform init

# Plan (review changes)
terraform plan -out=tfplan

# Apply (execute changes)
terraform apply tfplan

# Destroy (cleanup)
terraform destroy
```

### 2.2 Ansible Configuration

**Playbook Structure**:

```
ansible/
├── inventory/
│   ├── production/
│   │   ├── hosts.yml
│   │   └── group_vars/
│   └── staging/
│       ├── hosts.yml
│       └── group_vars/
├── playbooks/
│   ├── deploy-backend.yml
│   ├── deploy-chat-server.yml
│   ├── deploy-stream-server.yml
│   └── deploy-frontend.yml
├── roles/
│   ├── common/
│   ├── docker/
│   ├── nginx/
│   ├── postgres/
│   └── monitoring/
└── ansible.cfg
```

**Deployment Playbook**:

```yaml
# ansible/playbooks/deploy-backend.yml
---
- name: Deploy Veza Backend API
  hosts: backend_servers
  become: yes
  vars:
    app_name: veza-backend-api
    app_version: "{{ lookup('env', 'VERSION') | default('latest') }}"
    docker_image: "registry.veza.app/{{ app_name }}:{{ app_version }}"

  tasks:
    - name: Pull Docker image
      docker_image:
        name: "{{ docker_image }}"
        source: pull

    - name: Stop old container
      docker_container:
        name: "{{ app_name }}"
        state: stopped
      ignore_errors: yes

    - name: Remove old container
      docker_container:
        name: "{{ app_name }}"
        state: absent
      ignore_errors: yes

    - name: Start new container
      docker_container:
        name: "{{ app_name }}"
        image: "{{ docker_image }}"
        state: started
        restart_policy: unless-stopped
        ports:
          - "8080:8080"
        env:
          DATABASE_URL: "{{ database_url }}"
          REDIS_URL: "{{ redis_url }}"
          JWT_SECRET: "{{ jwt_secret }}"
        volumes:
          - "/var/log/{{ app_name }}:/var/log/app"
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s

    - name: Wait for application to be healthy
      uri:
        url: http://localhost:8080/health
        status_code: 200
      register: result
      until: result.status == 200
      retries: 10
      delay: 5

    - name: Verify deployment
      debug:
        msg: "{{ app_name }} version {{ app_version }} deployed successfully"
```

## 3. CONTAINERIZATION

### 3.1 Docker Images

**Multi-Stage Build (Go)**:

```dockerfile
# veza-backend-api/Dockerfile
# Stage 1: Builder
FROM golang:1.21.5-alpine3.18 AS builder

WORKDIR /app

# Copy dependencies
COPY go.mod go.sum ./
RUN go mod download

# Copy source
COPY . .
# Build binary
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -ldflags="-w -s" -o main ./cmd/api

# Stage 2: Runner
FROM alpine:3.18

# Install CA certificates for HTTPS
RUN apk --no-cache add ca-certificates

# Use a world-readable workdir; /root/ is not accessible to a non-root user
WORKDIR /app

# Copy binary from builder
COPY --from=builder /app/main .

# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser
USER appuser

# Expose port
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD ["/app/main", "healthcheck"]

# Run
ENTRYPOINT ["./main"]
```

**Multi-Stage Build (Rust)**:

```dockerfile
# veza-chat-server/Dockerfile
FROM rust:1.75-alpine AS builder

WORKDIR /app

RUN apk add --no-cache musl-dev

# Copy dependencies (dummy main.rs caches the dependency build layer)
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs && cargo build --release && rm -rf src

# Copy source
COPY . .

# Build binary
RUN cargo build --release

# Stage 2: Runner
FROM alpine:3.18

WORKDIR /app

# Copy binary
COPY --from=builder /app/target/release/veza-chat-server .

# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser
USER appuser

EXPOSE 8081

HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
    CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost:8081/health"]

ENTRYPOINT ["./veza-chat-server"]
```

**Frontend (React/Vite)**:

```dockerfile
# apps/web/Dockerfile
FROM node:20-alpine AS builder

WORKDIR /app

COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build

# Stage 2: Nginx
FROM nginx:1.25-alpine

COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf

EXPOSE 80

HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost/health"]

CMD ["nginx", "-g", "daemon off;"]
```

### 3.2 Docker Compose (Development)

```yaml
# docker-compose.yml
version: '3.9'

services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: veza_db
      POSTGRES_USER: veza
      POSTGRES_PASSWORD: ${DB_PASSWORD:-password}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U veza"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

  backend:
    build:
      context: ./veza-backend-api
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
      JWT_SECRET: ${JWT_SECRET}
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  chat-server:
    build:
      context: ./veza-chat-server
      dockerfile: Dockerfile
    ports:
      - "8081:8081"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  frontend:
    build:
      context: ./apps/web
      dockerfile: Dockerfile
    ports:
      - "3000:80"
    depends_on:
      - backend

volumes:
  postgres_data:
  redis_data:
```
## 4. KUBERNETES ORCHESTRATION

### 4.1 Kubernetes Manifests

**Deployment (Backend)**:

```yaml
# k8s/backend/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veza-backend
  namespace: veza-production
  labels:
    app: veza-backend
    version: v1.0.0
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: veza-backend
  template:
    metadata:
      labels:
        app: veza-backend
        version: v1.0.0
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: backend
          image: registry.veza.app/veza-backend-api:v1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: database-url
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: redis-url
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: jwt-secret
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
      imagePullSecrets:
        - name: registry-credentials
```

**Service**:

```yaml
# k8s/backend/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  type: ClusterIP
  selector:
    app: veza-backend
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
```

**Ingress**:

```yaml
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: veza-ingress
  namespace: veza-production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.veza.app
        - veza.app
      secretName: veza-tls
  rules:
    - host: api.veza.app
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: veza-backend
                port:
                  number: 80
    - host: veza.app
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: veza-frontend
                port:
                  number: 80
```

**HorizontalPodAutoscaler**:

```yaml
# k8s/backend/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: veza-backend-hpa
  namespace: veza-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
```

## 5. CI/CD PIPELINES

### 5.1 GitHub Actions Workflow

```yaml
# .github/workflows/deploy-production.yml
name: Deploy to Production

on:
  push:
    branches:
      - main
    tags:
      - 'v*'

env:
  REGISTRY: registry.veza.app
  KUBE_NAMESPACE: veza-production

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run tests
        run: |
          make test-all

      - name: Security scan
        run: |
          make security-scan

  build-backend:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Login to Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_PASSWORD }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/veza-backend-api
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix={{branch}}-

      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: ./veza-backend-api
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache,mode=max

  deploy-staging:
    needs: [build-backend]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v3

      - name: Deploy to Staging
        run: |
          kubectl set image deployment/veza-backend \
            backend=${{ env.REGISTRY }}/veza-backend-api:${{ github.sha }} \
            -n veza-staging
          kubectl rollout status deployment/veza-backend -n veza-staging --timeout=5m

      - name: Run E2E tests
        run: |
          npm run test:e2e -- --env=staging

  deploy-production:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v3

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'

      - name: Configure kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
          # Persist for later steps: a plain `export` does not survive the step boundary
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"

      - name: Deploy to Production (Blue-Green)
        run: |
          # Deploy green environment
          kubectl apply -f k8s/backend/deployment-green.yaml
          kubectl rollout status deployment/veza-backend-green -n ${{ env.KUBE_NAMESPACE }} --timeout=10m

          # Run smoke tests
          make smoke-tests ENDPOINT=https://green.api.veza.app

          # Switch traffic to green
          kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
            -p '{"spec":{"selector":{"version":"green"}}}'

          # Wait for validation
          sleep 60

          # Monitor metrics
          if ! make verify-deployment; then
            echo "Deployment verification failed, rolling back..."
            kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
              -p '{"spec":{"selector":{"version":"blue"}}}'
            exit 1
          fi

          # Delete old blue deployment
          kubectl delete deployment veza-backend-blue -n ${{ env.KUBE_NAMESPACE }}

      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment ${{ job.status }}: ${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```

## 6. ZERO-DOWNTIME STRATEGIES

### 6.1 Blue-Green Deployment

**Process**:

1. **Blue** (current production) serves all traffic
2. Deploy **Green** (new version) in parallel
3. Test Green thoroughly (smoke tests, health checks)
4. Switch load balancer from Blue to Green (instant cutover)
5. Monitor Green for issues (5-10 min)
6. If issues: rollback to Blue (instant)
7. If stable: decommission Blue

**Kubernetes Implementation**:

```bash
# Deploy green
kubectl apply -f k8s/backend/deployment-green.yaml

# Wait for readiness
kubectl wait --for=condition=available --timeout=10m deployment/veza-backend-green

# Switch service selector
kubectl patch service veza-backend -p '{"spec":{"selector":{"version":"green"}}}'

# Monitor
watch kubectl get pods -l app=veza-backend

# Rollback if needed
kubectl patch service veza-backend -p '{"spec":{"selector":{"version":"blue"}}}'
```

### 6.2 Canary Deployment

**Process**:

1. Deploy new version (canary) with 5% traffic
2. Monitor metrics (error rate, latency)
3. Gradually increase traffic: 5% → 25% → 50% → 100%
4. At each stage, verify metrics are healthy
5. If issues detected: Rollback immediately

**Kubernetes with Istio**:

```yaml
# k8s/canary/virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: veza-backend
spec:
  hosts:
    - veza-backend
  http:
    - match:
        - headers:
            canary:
              exact: "true"
      route:
        - destination:
            host: veza-backend
            subset: canary
    - route:
        - destination:
            host: veza-backend
            subset: stable
          weight: 95
        - destination:
            host: veza-backend
            subset: canary
          weight: 5
```

**Automated Canary with Flagger**:

```yaml
# k8s/canary/flagger-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -s http://veza-backend-canary/health | grep -q ok"
```

## 7. CONFIGURATION MANAGEMENT

### 7.1 ConfigMap (Non-Sensitive Config)

```yaml
# k8s/backend/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: veza-backend-config
  namespace: veza-production
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  API_RATE_LIMIT: "300"
  MAX_UPLOAD_SIZE: "500MB"
  CORS_ORIGINS: "https://veza.app,https://www.veza.app"
```

### 7.2 Secrets (Sensitive Data)

```yaml
# k8s/backend/secret.yaml (encrypted with SOPS or sealed-secrets)
apiVersion: v1
kind: Secret
metadata:
  name: veza-secrets
  namespace: veza-production
type: Opaque
data:
  database-url:
  redis-url:
  jwt-secret:
  stripe-api-key:
```

**Create Secret from Vault**:

```bash
# Fetch from Vault and create the K8s secret.
# Note: kubectl base64-encodes --from-literal values itself; do not pre-encode them.
kubectl create secret generic veza-secrets \
  --from-literal=database-url="$(vault kv get -field=database_url secret/veza/production)" \
  -n veza-production
```
## 8. SECRETS MANAGEMENT

### 8.1 HashiCorp Vault

**Vault Structure**:

```
secret/
├── veza/
│   ├── production/
│   │   ├── database_url
│   │   ├── redis_url
│   │   ├── jwt_secret
│   │   ├── stripe_api_key
│   │   ├── aws_access_key
│   │   └── aws_secret_key
│   └── staging/
│       └── ...
```

**Store Secret**:

```bash
# Write secret
vault kv put secret/veza/production \
  database_url="postgresql://..." \
  redis_url="redis://..." \
  jwt_secret="..."

# Read secret
vault kv get secret/veza/production

# Rotate secret (new version)
vault kv put secret/veza/production jwt_secret="new-secret"
```

**Vault Agent Injector (Kubernetes)**:

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "veza-backend"
    vault.hashicorp.com/agent-inject-secret-database: "secret/data/veza/production"
    vault.hashicorp.com/agent-inject-template-database: |
      {{- with secret "secret/data/veza/production" -}}
      export DATABASE_URL="{{ .Data.data.database_url }}"
      {{- end }}
```

## 9. MONITORING & OBSERVABILITY

### 9.1 Prometheus + Grafana

**Prometheus Configuration**:

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'veza-backend'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: veza-backend
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: $1:8080

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
```

**Grafana Dashboards**:

- **API Latency**: p50, p95, p99 response times
- **Throughput**: Requests per second
- **Error Rate**: 4xx, 5xx errors
- **Database**: Query time, connections, slow queries
- **Cache Hit Rate**: Redis hit/miss ratio

### 9.2 Logging (ELK Stack)

**Filebeat Configuration**:

```yaml
# filebeat/filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/lib/docker/containers/"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "veza-logs-%{+yyyy.MM.dd}"
```

### 9.3 Tracing (Jaeger)

**OpenTelemetry Integration**:

```go
// Go - OpenTelemetry setup
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer() (*trace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://jaeger:14268/api/traces")))
	if err != nil {
		return nil, err
	}

	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("veza-backend-api"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```

## 10. BACKUP & DISASTER RECOVERY

### 10.1 Database Backups

**Automated Backup Strategy**:

- **Daily**: Full backup (3 AM UTC)
- **Hourly**: Incremental backup
- **Retention**: 30 days daily, 12 weeks weekly, 2 years monthly

**Backup Script**:

```bash
#!/bin/bash
# scripts/backup-database.sh

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/postgres"
DATABASE="veza_db"

# Full backup
pg_dump -Fc -f "$BACKUP_DIR/veza_db_$DATE.dump" "$DATABASE"

# Encrypt, then remove the plaintext dump
gpg --encrypt --recipient backup@veza.app "$BACKUP_DIR/veza_db_$DATE.dump"
rm "$BACKUP_DIR/veza_db_$DATE.dump"

# Upload to S3
aws s3 cp "$BACKUP_DIR/veza_db_$DATE.dump.gpg" s3://veza-backups/postgres/

# Clean up local backups older than 7 days
find "$BACKUP_DIR" -name "*.dump.gpg" -mtime +7 -delete
```

**Restore Procedure**:

```bash
#!/bin/bash
# scripts/restore-database.sh

BACKUP_FILE=$1

# Download from S3
aws s3 cp "s3://veza-backups/postgres/$BACKUP_FILE" /tmp/

# Decrypt
gpg --decrypt "/tmp/$BACKUP_FILE" > "/tmp/backup.dump"

# Restore
pg_restore -d veza_db "/tmp/backup.dump"
```

### 10.2 Disaster Recovery Plan

- **RTO (Recovery Time Objective)**: < 4 hours
- **RPO (Recovery Point Objective)**: <
1 hour

**Recovery Procedures**:

1. **Database Failure**: Failover to standby replica (< 5 min)
2. **Application Failure**: Rollback deployment (< 5 min)
3. **Complete Region Failure**: Failover to DR region (< 4 hours)

## 11. SCALING STRATEGY

### 11.1 Horizontal Scaling

**Auto-Scaling Rules**:

- **CPU > 70%**: Scale up
- **CPU < 30%**: Scale down (after 5 min stability)
- **Memory > 80%**: Scale up
- **Request queue > 100**: Scale up

### 11.2 Database Scaling

**Read Replicas**:

- 2 read replicas minimum
- Route read queries to replicas
- Write queries to primary only

**Connection Pooling** (PgBouncer):

```ini
[databases]
veza_db = host=postgres port=5432 dbname=veza_db

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
```

## 12. OPERATIONAL PROCEDURES

### 12.1 Deployment Checklist

**Pre-Deployment**:

- [ ] All tests pass (unit, integration, E2E)
- [ ] Security scan completed (no critical vulnerabilities)
- [ ] Database migrations tested in staging
- [ ] Rollback plan documented
- [ ] Monitoring dashboards ready
- [ ] On-call engineer notified
- [ ] Deployment window scheduled (low-traffic period)

**During Deployment**:

- [ ] Monitor error rates in real-time
- [ ] Monitor response times (p95, p99)
- [ ] Check logs for errors
- [ ] Verify database migrations applied
- [ ] Test critical user flows

**Post-Deployment**:

- [ ] Verify all services healthy
- [ ] Run smoke tests
- [ ] Monitor for 30 minutes
- [ ] Update deployment log
- [ ] Notify stakeholders

### 12.2 Rollback Procedure

**Immediate Rollback** (< 5 min):

```bash
# Kubernetes
kubectl rollout undo deployment/veza-backend -n veza-production

# Verify
kubectl rollout status deployment/veza-backend -n veza-production

# Check logs
kubectl logs -f deployment/veza-backend -n veza-production
```

### 12.3 Incident Response

**Severity Levels**:

- **P0 (Critical)**: Production down, data breach
- **P1 (High)**: Major feature broken, performance degradation
- **P2 (Medium)**: Minor feature broken
- **P3 (Low)**: Cosmetic issues

**Response Procedure**:

1. Acknowledge incident (< 5 min)
2. Assess severity
3. Notify stakeholders
4. Mitigate (rollback, hotfix, scaling)
5. Root cause analysis
6. Post-mortem

## 13. PRIORITY SECURITY FIXES

> Identified during the security audit of 2026-03-04. These procedures are **blocking** for any production deployment.

### 13.1 JWT Secret Rotation

The JWT secret must be rotated regularly (at least quarterly) and immediately upon any suspicion of compromise.

**Rotation procedure**:

```bash
#!/bin/bash
# scripts/rotate-jwt-secret.sh
set -euo pipefail

# tr -d '\n' strips the line wrap that openssl adds to long base64 output
NEW_SECRET=$(openssl rand -base64 64 | tr -d '\n')

# 1. Store the new secret in Vault
vault kv put secret/veza/production jwt_secret="$NEW_SECRET"

# 2. Update the Kubernetes secret: patch the jwt-secret key of the existing
#    veza-secrets secret that the deployments actually reference
kubectl patch secret veza-secrets -n veza-production \
  -p "{\"stringData\":{\"jwt-secret\":\"$NEW_SECRET\"}}"

# 3. Rolling restart to load the new secret
kubectl rollout restart deployment/veza-backend -n veza-production
kubectl rollout restart deployment/veza-chat-server -n veza-production

# 4. Wait for the rollouts to complete
kubectl rollout status deployment/veza-backend -n veza-production --timeout=5m
kubectl rollout status deployment/veza-chat-server -n veza-production --timeout=5m

echo "JWT secret rotation complete. Old tokens will expire naturally."
```

**Critical points**:

- During rotation, old tokens remain valid until their natural expiry (configure short lifetimes: 15 min access, 7 days refresh)
- Test in staging before every production rotation
- Log the rotation event (without the secret) in the audit log

### 13.2 JWT Issuer/Audience Alignment Go ↔ Rust

The Go backend and the Rust chat-server must use the same JWT claims (`iss`, `aud`) to avoid cross-service token rejections.
**Aligned configuration**:

```yaml
# Shared configuration (identical for Go and Rust)
jwt:
  issuer: "https://api.veza.app"
  audience: "https://veza.app"
  algorithm: "HS256"  # Migrating to RS256: see section 15
  access_token_ttl: "15m"
  refresh_token_ttl: "7d"
```

**Go side**:

```go
// veza-backend-api/internal/auth/jwt.go
claims := jwt.MapClaims{
	"iss": "https://api.veza.app",
	"aud": "https://veza.app",
	"sub": userID,
	"exp": time.Now().Add(15 * time.Minute).Unix(),
	"iat": time.Now().Unix(),
}
```

**Rust side**:

```rust
// veza-chat-server/src/auth.rs
// `mut` is required: set_issuer/set_audience mutate the Validation struct
let mut validation = Validation::new(Algorithm::HS256);
validation.set_issuer(&["https://api.veza.app"]);
validation.set_audience(&["https://veza.app"]);
```

**Deployment procedure**:

1. Update the Rust config to accept the same `iss`/`aud`
2. Deploy the Rust chat-server first (it accepts existing tokens as well as new ones)
3. Update the Go config
4. Deploy the Go backend
5. Check the logs of both services: zero `invalid issuer` or `invalid audience` errors

### 13.3 Protecting the /metrics Route

The `/metrics` route (Prometheus) exposes internal metrics and must **never** be publicly accessible.
**Nginx/Ingress: block external access**:

```yaml
# k8s/ingress.yaml (added annotation)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: veza-ingress
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      location /metrics {
        deny all;
        return 404;
      }
```

**NetworkPolicy: restrict access to the monitoring namespace**:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metrics-from-prometheus
  namespace: veza-production
spec:
  podSelector:
    matchLabels:
      app: veza-backend
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 8080
          protocol: TCP
```

**Post-deployment verification**:

```bash
# Must return 404 or connection refused from outside
curl -s -o /dev/null -w "%{http_code}" https://api.veza.app/metrics
# Expected: 404

# Must work from the Prometheus pod
kubectl exec -n monitoring deploy/prometheus -- \
  curl -s http://veza-backend.veza-production/metrics | head -5
# Expected: 200 with metrics
```

## 14. ETHICAL DEPLOYMENT CHECKLIST

Before every production deployment, the following points must be verified. This checklist complements the standard operational checklist (section 12.1).
### 14.1 Personal Data Protection

- [ ] **Prometheus metrics**: verify that no metric exposes personal data (emails, IPs, full user agents, user identifiers)

  ```bash
  # Automated check: no metric may contain these patterns
  kubectl exec -n monitoring deploy/prometheus -- \
    curl -s http://veza-backend.veza-production/metrics | \
    grep -iE '(email|user_agent|ip_address|@)' && \
    echo "FAIL: Personal data found in metrics" && exit 1 || \
    echo "PASS: No personal data in metrics"
  ```

- [ ] **Anonymized logs**: confirm that application logs contain no personal data in clear text

  ```bash
  # Check the last 1000 log lines
  kubectl logs deployment/veza-backend -n veza-production --tail=1000 | \
    grep -iE '([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})' && \
    echo "FAIL: Email addresses found in logs" && exit 1 || \
    echo "PASS: No emails in logs"
  ```

- [ ] **HTTP headers**: no tracking or fingerprinting header is emitted by the services

### 14.2 GDPR Compliance

- [ ] **Data export**: the `POST /api/v1/me/data-export` endpoint returns all data belonging to the authenticated user (profile, tracks, playlists, history)
- [ ] **Account deletion**: the `DELETE /api/v1/me` endpoint deletes the user and all associated data (cascade)
- [ ] **Cookie consent**: only strictly necessary cookies (JWT session) are sent without consent

### 14.3 Ethical Integrity

- [ ] **No AI/ML dependency**: no Docker image contains an ML framework (TensorFlow, PyTorch, scikit-learn, ONNX)
- [ ] **No third-party tracking**: no Google Analytics, Facebook Pixel, or equivalent script in the frontend bundle
- [ ] **Discovery algorithm**: the bias tests (section 14 of ORIGIN_TESTING_STRATEGY) pass in CI
## 15. JWT HS256 → RS256 MIGRATION PLAN

### 15.1 Motivation

HS256 (symmetric HMAC) requires sharing the same secret across all services. RS256 (asymmetric RSA) lets the Go backend sign with a private key while the other services (Rust, frontend) verify with the public key, reducing the attack surface.

### 15.2 Target Architecture

```
┌──────────────────┐        ┌──────────────────┐
│    Backend Go    │        │   Chat Server    │
│   (signs JWT)    │        │       Rust       │
│                  │        │  (verifies JWT)  │
│ RS256 PRIVATE key│        │    PUBLIC key    │
└──────────────────┘        └──────────────────┘
         │                           │
         │      same public key      │
         └────────────┬──────────────┘
                      │
              ┌───────▼───────┐
              │   Frontend    │
              │ (verifies JWT,│
              │   optional)   │
              │  PUBLIC key   │
              └───────────────┘
```

### 15.3 Migration Steps

**Phase 1: preparation (week 1)**:

```bash
# Generate a 4096-bit RSA key pair
openssl genrsa -out jwt-private.pem 4096
openssl rsa -in jwt-private.pem -pubout -out jwt-public.pem

# Store it in Vault
vault kv put secret/veza/production \
  jwt_private_key=@jwt-private.pem \
  jwt_public_key=@jwt-public.pem

# Clean up the local files
shred -u jwt-private.pem jwt-public.pem
```

**Phase 2: dual validation (week 2)**:

Modify the services to accept **both algorithms** during the transition:

```go
// veza-backend-api: signs RS256, validates both HS256 and RS256
func (a *Auth) ValidateToken(tokenString string) (*Claims, error) {
	token, err := jwt.Parse(tokenString, func(t *jwt.Token) (interface{}, error) {
		switch t.Method.(type) {
		case *jwt.SigningMethodRSA:
			return a.rsaPublicKey, nil
		case *jwt.SigningMethodHMAC:
			return []byte(a.hmacSecret), nil
		default:
			return nil, fmt.Errorf("unexpected signing method: %v", t.Header["alg"])
		}
	})
	// ...
}
```

```rust
// veza-chat-server: validates both HS256 and RS256
fn validate_token(token: &str, config: &AuthConfig) -> Result<Claims, AuthError> {
    // Try RS256 first, fall back to HS256
    let rs256_result = decode::<Claims>(
        token,
        &DecodingKey::from_rsa_pem(config.rsa_public_key.as_bytes())?,
        &Validation::new(Algorithm::RS256),
    );
    match rs256_result {
        Ok(data) => Ok(data.claims),
        Err(_) => {
            let hs256_result = decode::<Claims>(
                token,
                &DecodingKey::from_secret(config.hmac_secret.as_bytes()),
                &Validation::new(Algorithm::HS256),
            );
            hs256_result.map(|d| d.claims).map_err(AuthError::from)
        }
    }
}
```

**Phase 3: cutover (week 3)**:

1. Deploy the Go backend to sign exclusively with RS256
2. Wait for every HS256 token to expire (at most 7 days for refresh tokens)
3. Remove the HS256 fallback code from the Go and Rust services
4. Remove the HS256 secret from Vault

**Phase 4: cleanup (week 4)**:

```bash
# Remove the old HS256 secret from Vault
vault kv delete secret/veza/production/jwt_secret_hmac

# Verify that no service still uses HS256
kubectl logs -l app=veza-backend -n veza-production --since=24h | \
  grep -i "hs256" && echo "WARNING: HS256 still in use" || echo "CLEAN: No HS256 usage"
```

### 15.4 Rollback Plan

If problems are detected during the migration:

1. Switch the Go backend back to HS256 signing
2. The services keep accepting both formats
3. Investigate and fix before retrying
### 15.5 Success Criteria

- [ ] All services accept RS256
- [ ] No HS256 token in circulation (after natural expiry)
- [ ] The RSA private key is accessible only to the Go backend
- [ ] Cross-service integration tests pass with RS256
- [ ] Performance: RS256 verification < 1 ms (benchmark it)

## ✅ VALIDATION CHECKLIST

### Infrastructure

- [ ] Infrastructure as Code (Terraform) complete
- [ ] All resources versioned in Git
- [ ] Secrets in Vault (no plaintext)
- [ ] Automated provisioning tested

### Deployment

- [ ] CI/CD pipeline functional
- [ ] Zero-downtime deployment strategy (blue-green or canary)
- [ ] Automated rollback configured
- [ ] Health checks implemented

### Monitoring

- [ ] Prometheus + Grafana dashboards
- [ ] Alerting configured (PagerDuty/Slack)
- [ ] Logging centralized (ELK Stack)
- [ ] Tracing implemented (Jaeger)

### Disaster Recovery

- [ ] Automated backups (daily + hourly)
- [ ] Backup restoration tested
- [ ] Failover procedure documented
- [ ] RTO < 4h, RPO < 1h validated

## 📊 SUCCESS METRICS

### Deployment Metrics

- **Deployment Frequency**: multiple per day
- **Lead Time**: < 1 hour (commit to production)
- **MTTR (Mean Time To Recovery)**: < 5 minutes
- **Change Failure Rate**: < 5%

### Operational Metrics

- **Uptime**: > 99.9%
- **RTO**: < 4 hours
- **RPO**: < 1 hour
- **Deployment Success Rate**: > 95%

## 🔄 VERSION HISTORY

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-11-02 | Initial version: complete deployment guide |
| 2.0.0 | 2026-03-04 | Security audit: added priority fixes (JWT rotation, Go↔Rust issuer/audience alignment, /metrics protection), ethical deployment checklist (personal data, GDPR, integrity), JWT HS256→RS256 migration plan |

---

## ⚠️ WARNING

**THIS GUIDE IS IMMUTABLE**

---

**Document created by**: DevOps Team + SRE
**Creation date**: 2025-11-02
**Last revision**: 2026-03-04 (security audit)
**Next revision**: quarterly (2026-06-01)
**Owner**: DevOps Lead
**Status**: ✅ **APPROVED AND LOCKED**