ORIGIN_DEPLOYMENT_GUIDE.md

📋 EXECUTIVE SUMMARY

This document is the complete production deployment guide for the Veza platform. It covers Infrastructure as Code (Terraform/Ansible), containerization (Docker/Incus), orchestration (Kubernetes), CI/CD pipelines, zero-downtime strategies, disaster recovery, monitoring, and operational procedures for secure, automated, and reversible deployments over a 24-month horizon.

🎯 OBJECTIVES

Primary Objective

Establish an automated, secure, reproducible, zero-downtime deployment process for production, with rollback in under 5 minutes, multiple deployments per day, and an RTO under 4 hours in a disaster scenario.

Secondary Objectives

  • Full automation (Infrastructure as Code)
  • Zero-downtime deployments (blue-green, canary)
  • Automatic rollback on failure (< 5 min)
  • Operational disaster recovery plan (RTO < 4h, RPO < 1h)
  • Real-time monitoring and alerting (Prometheus + Grafana)

📖 TABLE OF CONTENTS

  1. Deployment Philosophy
  2. Infrastructure as Code
  3. Containerization
  4. Kubernetes Orchestration
  5. CI/CD Pipelines
  6. Zero-Downtime Strategies
  7. Configuration Management
  8. Secrets Management
  9. Monitoring & Observability
  10. Backup & Disaster Recovery
  11. Scaling Strategy
  12. Operational Procedures

🔒 IMMUTABLE RULES

  1. Infrastructure as Code: 100% of infrastructure versioned (Terraform); no manual changes
  2. Immutable Infrastructure: never modify existing servers, always redeploy
  3. Zero Downtime: no deployment may interrupt service (blue-green or canary mandatory)
  4. Automated Rollback: automatic rollback if health checks fail (< 5 min)
  5. Version Control: all configs versioned in Git; no exceptions
  6. Secrets in Vault: no plaintext secrets (HashiCorp Vault or equivalent)
  7. Testing in Staging: every deployment is tested in staging first
  8. Monitoring Required: alerting configured before anything goes to production
  9. Backup Verification: backups tested monthly (restore test)
  10. Documentation: runbooks kept up to date for all critical procedures

1. DEPLOYMENT PHILOSOPHY

1.1 Deployment Principles

Twelve-Factor App:

  1. Codebase: One codebase tracked in Git, many deploys
  2. Dependencies: Explicitly declare and isolate (go.mod, Cargo.lock, package-lock.json)
  3. Config: Store config in environment (never in code)
  4. Backing Services: Treat as attached resources (DB, Redis, S3)
  5. Build, Release, Run: Strictly separate build and run stages
  6. Processes: Execute app as stateless processes
  7. Port Binding: Export services via port binding
  8. Concurrency: Scale out via process model
  9. Disposability: Fast startup and graceful shutdown
  10. Dev/Prod Parity: Keep development, staging, production similar
  11. Logs: Treat logs as event streams
  12. Admin Processes: Run admin/management tasks as one-off processes

1.2 Deployment Environments

| Environment | Purpose | Update Frequency | Users |
|-------------|---------|------------------|-------|
| Development | Local development | Continuous | Developers |
| Staging | Pre-production testing | Daily | QA, Product Team |
| Production | Live users | Multiple/day | All users |

1.3 Deployment Workflow

┌─────────────┐
│   Develop   │ ─── git push ───> CI/CD Triggered
└─────────────┘
       │
       ▼
┌─────────────┐
│  Build      │ ─── Tests, Linting, Security Scan
└─────────────┘
       │
       ▼
┌─────────────┐
│  Staging    │ ─── Deploy to staging, E2E tests
└─────────────┘
       │
       ▼
┌─────────────┐
│ Production  │ ─── Blue-Green / Canary deployment
└─────────────┘
       │
       ▼
┌─────────────┐
│  Monitor    │ ─── Health checks, metrics, logs
└─────────────┘
       │
       ▼ (if issues)
┌─────────────┐
│  Rollback   │ ─── Automatic rollback < 5 min
└─────────────┘

2. INFRASTRUCTURE AS CODE

2.1 Terraform Configuration

Project Structure:

terraform/
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars (encrypted)
│   │   └── outputs.tf
│   └── staging/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── outputs.tf
├── modules/
│   ├── compute/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── database/
│   ├── networking/
│   ├── storage/
│   └── kubernetes/
└── backend.tf (Terraform state in S3)

Example: Compute Module:

# terraform/modules/compute/main.tf
resource "aws_instance" "app_server" {
  count         = var.instance_count
  ami           = var.ami_id
  instance_type = var.instance_type
  
  vpc_security_group_ids = [aws_security_group.app.id]
  subnet_id              = var.subnet_ids[count.index % length(var.subnet_ids)]
  
  user_data = templatefile("${path.module}/user_data.sh", {
    environment = var.environment
  })
  
  tags = {
    Name        = "veza-app-${var.environment}-${count.index + 1}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }
  
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_security_group" "app" {
  name        = "veza-app-${var.environment}"
  description = "Security group for Veza application servers"
  vpc_id      = var.vpc_id
  
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Database Module:

# terraform/modules/database/main.tf
resource "aws_db_instance" "postgres" {
  identifier     = "veza-db-${var.environment}"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = var.instance_class
  
  allocated_storage     = var.allocated_storage
  max_allocated_storage = var.max_allocated_storage
  storage_encrypted     = true
  kms_key_id           = var.kms_key_id
  
  db_name  = var.database_name
  username = var.master_username
  password = var.master_password # From Vault
  
  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.database.name
  
  backup_retention_period = var.backup_retention_days
  backup_window          = "03:00-04:00"
  maintenance_window     = "mon:04:00-mon:05:00"
  
  multi_az               = var.multi_az
  publicly_accessible    = false
  skip_final_snapshot    = false
  final_snapshot_identifier = "veza-db-${var.environment}-final-snapshot-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"
  
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
  
  tags = {
    Name        = "veza-db-${var.environment}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }
}

Terraform Workflow:

# Initialize
cd terraform/environments/production
terraform init

# Plan (review changes)
terraform plan -out=tfplan

# Apply (execute changes)
terraform apply tfplan

# Destroy (cleanup)
terraform destroy

2.2 Ansible Configuration

Playbook Structure:

ansible/
├── inventory/
│   ├── production/
│   │   ├── hosts.yml
│   │   └── group_vars/
│   └── staging/
│       ├── hosts.yml
│       └── group_vars/
├── playbooks/
│   ├── deploy-backend.yml
│   ├── deploy-chat-server.yml
│   ├── deploy-stream-server.yml
│   └── deploy-frontend.yml
├── roles/
│   ├── common/
│   ├── docker/
│   ├── nginx/
│   ├── postgres/
│   └── monitoring/
└── ansible.cfg

Deployment Playbook:

# ansible/playbooks/deploy-backend.yml
---
- name: Deploy Veza Backend API
  hosts: backend_servers
  become: yes
  
  vars:
    app_name: veza-backend-api
    app_version: "{{ lookup('env', 'VERSION') | default('latest') }}"
    docker_image: "registry.veza.app/{{ app_name }}:{{ app_version }}"
    
  tasks:
    - name: Pull Docker image
      docker_image:
        name: "{{ docker_image }}"
        source: pull
        
    - name: Stop old container
      docker_container:
        name: "{{ app_name }}"
        state: stopped
      ignore_errors: yes
      
    - name: Remove old container
      docker_container:
        name: "{{ app_name }}"
        state: absent
      ignore_errors: yes
      
    - name: Start new container
      docker_container:
        name: "{{ app_name }}"
        image: "{{ docker_image }}"
        state: started
        restart_policy: unless-stopped
        ports:
          - "8080:8080"
        env:
          DATABASE_URL: "{{ database_url }}"
          REDIS_URL: "{{ redis_url }}"
          JWT_SECRET: "{{ jwt_secret }}"
        volumes:
          - "/var/log/{{ app_name }}:/var/log/app"
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s
          
    - name: Wait for application to be healthy
      uri:
        url: http://localhost:8080/health
        status_code: 200
      register: result
      until: result.status == 200
      retries: 10
      delay: 5
      
    - name: Verify deployment
      debug:
        msg: "{{ app_name }} version {{ app_version }} deployed successfully"

3. CONTAINERIZATION

3.1 Docker Images

Multi-Stage Build (Go):

# veza-backend-api/Dockerfile
# Stage 1: Builder
FROM golang:1.21.5-alpine3.18 AS builder

WORKDIR /app

# Copy dependencies
COPY go.mod go.sum ./
RUN go mod download

# Copy source
COPY . .

# Build binary
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -ldflags="-w -s" -o main ./cmd/api

# Stage 2: Runner
FROM alpine:3.18

# Install CA certificates for HTTPS
RUN apk --no-cache add ca-certificates

# Use /app, not /root: the non-root user below cannot read /root
WORKDIR /app

# Copy binary from builder
COPY --from=builder /app/main .

# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser

USER appuser

# Expose port
EXPOSE 8080

# Health check (assumes the binary implements a `healthcheck` subcommand)
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
  CMD ["/app/main", "healthcheck"]

# Run
ENTRYPOINT ["./main"]

Multi-Stage Build (Rust):

# veza-chat-server/Dockerfile
FROM rust:1.75-alpine AS builder

WORKDIR /app

RUN apk add --no-cache musl-dev

# Cache dependencies by compiling a dummy main first
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs && cargo build --release && rm -rf src

# Copy source
COPY . .

# Build binary (touch main.rs so cargo rebuilds it over the cached dummy artifact)
RUN touch src/main.rs && cargo build --release

# Stage 2: Runner
FROM alpine:3.18

WORKDIR /app

# Copy binary
COPY --from=builder /app/target/release/veza-chat-server .

# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser

USER appuser

EXPOSE 8081

HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
  CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost:8081/health"]

ENTRYPOINT ["./veza-chat-server"]

Frontend (React/Vite):

# apps/web/Dockerfile
FROM node:20-alpine AS builder

WORKDIR /app

COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build

# Stage 2: Nginx
FROM nginx:1.25-alpine

COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf

EXPOSE 80

HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost/health"]

CMD ["nginx", "-g", "daemon off;"]

3.2 Docker Compose (Development)

# docker-compose.yml
version: '3.9'

services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: veza_db
      POSTGRES_USER: veza
      POSTGRES_PASSWORD: ${DB_PASSWORD:-password}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U veza"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

  backend:
    build:
      context: ./veza-backend-api
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
      JWT_SECRET: ${JWT_SECRET}
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  chat-server:
    build:
      context: ./veza-chat-server
      dockerfile: Dockerfile
    ports:
      - "8081:8081"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  frontend:
    build:
      context: ./apps/web
      dockerfile: Dockerfile
    ports:
      - "3000:80"
    depends_on:
      - backend

volumes:
  postgres_data:
  redis_data:

4. KUBERNETES ORCHESTRATION

4.1 Kubernetes Manifests

Deployment (Backend):

# k8s/backend/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veza-backend
  namespace: veza-production
  labels:
    app: veza-backend
    version: v1.0.0
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: veza-backend
  template:
    metadata:
      labels:
        app: veza-backend
        version: v1.0.0
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
      - name: backend
        image: registry.veza.app/veza-backend-api:v1.0.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: veza-secrets
              key: database-url
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: veza-secrets
              key: redis-url
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: veza-secrets
              key: jwt-secret
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
      imagePullSecrets:
      - name: registry-credentials

Service:

# k8s/backend/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  type: ClusterIP
  selector:
    app: veza-backend
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP

Ingress:

# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: veza-ingress
  namespace: veza-production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.veza.app
    - veza.app
    secretName: veza-tls
  rules:
  - host: api.veza.app
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: veza-backend
            port:
              number: 80
  - host: veza.app
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: veza-frontend
            port:
              number: 80

HorizontalPodAutoscaler:

# k8s/backend/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: veza-backend-hpa
  namespace: veza-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
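The HPA above follows Kubernetes' documented scaling formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue), clamped to minReplicas/maxReplicas. A worked sketch of that arithmetic with the manifest's numbers (3–10 replicas, 70% CPU target):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the HPA formula:
// desired = ceil(current * currentUtilization / targetUtilization),
// clamped to [minR, maxR] as in the manifest above.
func desiredReplicas(current int, currentUtil, targetUtil float64, minR, maxR int) int {
	d := int(math.Ceil(float64(current) * currentUtil / targetUtil))
	if d < minR {
		d = minR
	}
	if d > maxR {
		d = maxR
	}
	return d
}

func main() {
	// 3 replicas averaging 105% CPU against a 70% target -> ceil(4.5) = 5.
	fmt.Println(desiredReplicas(3, 105, 70, 3, 10))
}
```

In practice the controller also applies the `behavior` stabilization windows shown above before acting on this number, which is why scale-down lags the raw formula by up to 300 seconds.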

5. CI/CD PIPELINES

5.1 GitHub Actions Workflow

# .github/workflows/deploy-production.yml
name: Deploy to Production

on:
  push:
    branches:
      - main
    tags:
      - 'v*'

env:
  REGISTRY: registry.veza.app
  KUBE_NAMESPACE: veza-production

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Run tests
        run: |
          make test-all          
      
      - name: Security scan
        run: |
          make security-scan          

  build-backend:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      
      - name: Login to Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
      
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/veza-backend-api
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix={{branch}}-            
      
      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: ./veza-backend-api
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache,mode=max

  deploy-staging:
    needs: [build-backend]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v3
      
      - name: Deploy to Staging
        run: |
          kubectl set image deployment/veza-backend \
            backend=${{ env.REGISTRY }}/veza-backend-api:${{ github.sha }} \
            -n veza-staging
          kubectl rollout status deployment/veza-backend -n veza-staging --timeout=5m          
      
      - name: Run E2E tests
        run: |
          npm run test:e2e -- --env=staging          

  deploy-production:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'
      
      - name: Configure kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
          # Persist KUBECONFIG via GITHUB_ENV: a plain `export` would not
          # survive the step boundary (each `run` is a separate shell)
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"          
      
      - name: Deploy to Production (Blue-Green)
        run: |
          # Deploy green environment
          kubectl apply -f k8s/backend/deployment-green.yaml
          kubectl rollout status deployment/veza-backend-green -n ${{ env.KUBE_NAMESPACE }} --timeout=10m
          
          # Run smoke tests
          make smoke-tests ENDPOINT=https://green.api.veza.app
          
          # Switch traffic to green
          kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
            -p '{"spec":{"selector":{"version":"green"}}}'
          
          # Wait for validation
          sleep 60
          
          # Monitor metrics
          if ! make verify-deployment; then
            echo "Deployment verification failed, rolling back..."
            kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
              -p '{"spec":{"selector":{"version":"blue"}}}'
            exit 1
          fi
          
          # Delete old blue deployment
          kubectl delete deployment veza-backend-blue -n ${{ env.KUBE_NAMESPACE }}          
      
      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment ${{ job.status }}: ${{ github.sha }}"
            }            
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

6. ZERO-DOWNTIME STRATEGIES

6.1 Blue-Green Deployment

Process:

  1. Blue (current production) serves all traffic
  2. Deploy Green (new version) in parallel
  3. Test Green thoroughly (smoke tests, health checks)
  4. Switch load balancer from Blue to Green (instant cutover)
  5. Monitor Green for issues (5-10 min)
  6. If issues: Rollback to Blue (instant)
  7. If stable: Decommission Blue

Kubernetes Implementation:

# Deploy green
kubectl apply -f k8s/backend/deployment-green.yaml

# Wait for readiness
kubectl wait --for=condition=available --timeout=10m deployment/veza-backend-green

# Switch service selector
kubectl patch service veza-backend -p '{"spec":{"selector":{"version":"green"}}}'

# Monitor
watch kubectl get pods -l app=veza-backend

# Rollback if needed
kubectl patch service veza-backend -p '{"spec":{"selector":{"version":"blue"}}}'

6.2 Canary Deployment

Process:

  1. Deploy new version (canary) with 5% traffic
  2. Monitor metrics (error rate, latency)
  3. Gradually increase traffic: 5% → 25% → 50% → 100%
  4. At each stage, verify metrics are healthy
  5. If issues detected: Rollback immediately

Kubernetes with Istio:

# k8s/canary/virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: veza-backend
spec:
  hosts:
  - veza-backend
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: veza-backend
        subset: canary
  - route:
    - destination:
        host: veza-backend
        subset: stable
      weight: 95
    - destination:
        host: veza-backend
        subset: canary
      weight: 5

Automated Canary with Flagger:

# k8s/canary/flagger-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
  webhooks:
    - name: acceptance-test
      type: pre-rollout
      url: http://flagger-loadtester.test/
      timeout: 30s
      metadata:
        type: bash
        cmd: "curl -s http://veza-backend-canary/health | grep -q ok"
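The control loop Flagger automates above can be sketched as: raise the canary weight by stepWeight, gate each step on the metric checks, and roll everything back to stable on the first failure. In this sketch `healthy` is a hypothetical stand-in for the real checks (success rate ≥ 99%, p99 < 500ms), not a Flagger API:

```go
package main

import "fmt"

// promoteCanary steps traffic toward the canary in stepWeight increments,
// mirroring the analysis above (stepWeight: 10, maxWeight: 50). healthy is
// a placeholder hook for the metric checks run at each interval.
func promoteCanary(stepWeight, maxWeight int, healthy func(weight int) bool) (finalWeight int, promoted bool) {
	weight := 0
	for weight < maxWeight {
		weight += stepWeight
		if weight > maxWeight {
			weight = maxWeight
		}
		if !healthy(weight) {
			return 0, false // roll back: all traffic returns to stable
		}
	}
	return weight, true // canary reached maxWeight; full promotion follows
}

func main() {
	w, ok := promoteCanary(10, 50, func(int) bool { return true })
	fmt.Println(w, ok)
}
```

The `threshold: 5` field above means Flagger tolerates up to five failed checks before rolling back, whereas this sketch rolls back on the first failure; the shape of the loop is otherwise the same.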

7. CONFIGURATION MANAGEMENT

7.1 ConfigMap (Non-Sensitive Config)

# k8s/backend/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: veza-backend-config
  namespace: veza-production
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  API_RATE_LIMIT: "300"
  MAX_UPLOAD_SIZE: "500MB"
  CORS_ORIGINS: "https://veza.app,https://www.veza.app"

7.2 Secrets (Sensitive Data)

# k8s/backend/secret.yaml (encrypted with SOPS or sealed-secrets)
apiVersion: v1
kind: Secret
metadata:
  name: veza-secrets
  namespace: veza-production
type: Opaque
data:
  database-url: <base64-encoded>
  redis-url: <base64-encoded>
  jwt-secret: <base64-encoded>
  stripe-api-key: <base64-encoded>

Create Secret from Vault:

# Fetch from Vault and create the K8s secret
# (kubectl base64-encodes literal values itself; do not pre-encode)
kubectl create secret generic veza-secrets \
  --from-literal=database-url="$(vault kv get -field=database_url secret/veza/production)" \
  -n veza-production

8. SECRETS MANAGEMENT

8.1 HashiCorp Vault

Vault Structure:

secret/
├── veza/
│   ├── production/
│   │   ├── database_url
│   │   ├── redis_url
│   │   ├── jwt_secret
│   │   ├── stripe_api_key
│   │   ├── aws_access_key
│   │   └── aws_secret_key
│   └── staging/
│       └── ...

Store Secret:

# Write secret
vault kv put secret/veza/production \
  database_url="postgresql://..." \
  redis_url="redis://..." \
  jwt_secret="..."

# Read secret
vault kv get secret/veza/production

# Rotate secret (new version)
vault kv put secret/veza/production jwt_secret="new-secret"

Vault Agent Injector (Kubernetes):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "veza-backend"
    vault.hashicorp.com/agent-inject-secret-database: "secret/data/veza/production"
    vault.hashicorp.com/agent-inject-template-database: |
      {{- with secret "secret/data/veza/production" -}}
      export DATABASE_URL="{{ .Data.data.database_url }}"
      {{- end }}      

9. MONITORING & OBSERVABILITY

9.1 Prometheus + Grafana

Prometheus Configuration:

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'veza-backend'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      action: keep
      regex: veza-backend
    - source_labels: [__meta_kubernetes_pod_ip]
      target_label: __address__
      replacement: $1:8080

  - job_name: 'postgres'
    static_configs:
    - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
    - targets: ['redis-exporter:9121']

Grafana Dashboard:

  • API Latency: p50, p95, p99 response times
  • Throughput: Requests per second
  • Error Rate: 4xx, 5xx errors
  • Database: Query time, connections, slow queries
  • Cache Hit Rate: Redis hit/miss ratio
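The p50/p95/p99 panels above plot latency percentiles. As a reminder of what those numbers mean, here is the nearest-rank percentile computed over a raw latency sample (one of several common percentile definitions; Prometheus's `histogram_quantile` interpolates over buckets instead):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile of samples (p in (0, 100]):
// sort ascending, take the value at rank ceil(p/100 * N).
func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...) // copy; leave caller's slice unsorted
	sort.Float64s(s)
	rank := int(math.Ceil(p / 100 * float64(len(s))))
	if rank < 1 {
		rank = 1
	}
	return s[rank-1]
}

func main() {
	// Response times in ms for 10 requests; one slow outlier.
	latencies := []float64{12, 15, 11, 90, 14, 13, 500, 16, 12, 18}
	fmt.Println(percentile(latencies, 50), percentile(latencies, 95))
}
```

Note how a single 500ms outlier dominates p95 while leaving p50 untouched, which is exactly why the dashboards track both.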

9.2 Logging (ELK Stack)

Filebeat Configuration:

# filebeat/filebeat.yml
filebeat.inputs:
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  processors:
    - add_kubernetes_metadata:
        host: ${NODE_NAME}
        matchers:
        - logs_path:
            logs_path: "/var/lib/docker/containers/"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "veza-logs-%{+yyyy.MM.dd}"

9.3 Tracing (Jaeger)

OpenTelemetry Integration:

// Go - OpenTelemetry setup
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func initTracer() (*trace.TracerProvider, error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://jaeger:14268/api/traces")))
    if err != nil {
        return nil, err
    }
    
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("veza-backend-api"),
        )),
    )
    
    otel.SetTracerProvider(tp)
    return tp, nil
}

10. BACKUP & DISASTER RECOVERY

10.1 Database Backups

Automated Backup Strategy:

  • Daily: Full backup (3 AM UTC)
  • Hourly: Incremental backup
  • Retention: 30 days daily, 12 weeks weekly, 2 years monthly

Backup Script:

#!/bin/bash
# scripts/backup-database.sh

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/postgres"
DATABASE="veza_db"

# Full backup
pg_dump -Fc -f "$BACKUP_DIR/veza_db_$DATE.dump" "$DATABASE"

# Encrypt
gpg --encrypt --recipient backup@veza.app "$BACKUP_DIR/veza_db_$DATE.dump"

# Upload to S3
aws s3 cp "$BACKUP_DIR/veza_db_$DATE.dump.gpg" s3://veza-backups/postgres/

# Cleanup local backups > 7 days
find "$BACKUP_DIR" -name "*.dump.gpg" -mtime +7 -delete

Restore Procedure:

#!/bin/bash
# scripts/restore-database.sh

BACKUP_FILE=$1

# Download from S3
aws s3 cp "s3://veza-backups/postgres/$BACKUP_FILE" /tmp/

# Decrypt
gpg --decrypt "/tmp/$BACKUP_FILE" > "/tmp/backup.dump"

# Restore
pg_restore -d veza_db "/tmp/backup.dump"

10.2 Disaster Recovery Plan

RTO (Recovery Time Objective): < 4 hours
RPO (Recovery Point Objective): < 1 hour

Recovery Procedures:

  1. Database Failure: Failover to standby replica (< 5 min)
  2. Application Failure: Rollback deployment (< 5 min)
  3. Complete Region Failure: Failover to DR region (< 4 hours)

11. SCALING STRATEGY

11.1 Horizontal Scaling

Auto-Scaling Rules:

  • CPU > 70%: Scale up
  • CPU < 30%: Scale down (after 5 min stability)
  • Memory > 80%: Scale up
  • Request queue > 100: Scale up

11.2 Database Scaling

Read Replicas:

  • 2 read replicas minimum
  • Route read queries to replicas
  • Write queries to primary only
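The routing rules above can be sketched as a small primary/replica router: writes always go to the primary, reads round-robin across replicas and fall back to the primary when none are configured. The `Router` type and the DSN strings are illustrative, not part of the Veza codebase:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Router sends writes to the primary and round-robins reads across
// replicas, per the rules above.
type Router struct {
	primary  string
	replicas []string
	next     atomic.Uint64 // round-robin cursor, safe for concurrent use
}

// Write always returns the primary's DSN: replicas are read-only.
func (r *Router) Write() string { return r.primary }

// Read returns the next replica's DSN, degrading to the primary
// when no replica is available.
func (r *Router) Read() string {
	if len(r.replicas) == 0 {
		return r.primary
	}
	i := r.next.Add(1) - 1
	return r.replicas[i%uint64(len(r.replicas))]
}

func main() {
	r := &Router{
		primary:  "postgres://primary:5432/veza_db",
		replicas: []string{"postgres://replica-1:5432/veza_db", "postgres://replica-2:5432/veza_db"},
	}
	fmt.Println(r.Write())
	fmt.Println(r.Read(), r.Read(), r.Read())
}
```

A real implementation would also have to account for replication lag (read-your-writes queries may need the primary); routing by DSN is only the skeleton.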

Connection Pooling (PgBouncer):

[databases]
veza_db = host=postgres port=5432 dbname=veza_db

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5

12. OPERATIONAL PROCEDURES

12.1 Deployment Checklist

Pre-Deployment:

  • All tests pass (unit, integration, E2E)
  • Security scan completed (no critical vulnerabilities)
  • Database migrations tested in staging
  • Rollback plan documented
  • Monitoring dashboards ready
  • On-call engineer notified
  • Deployment window scheduled (low-traffic period)

During Deployment:

  • Monitor error rates in real-time
  • Monitor response times (p95, p99)
  • Check logs for errors
  • Verify database migrations applied
  • Test critical user flows

Post-Deployment:

  • Verify all services healthy
  • Run smoke tests
  • Monitor for 30 minutes
  • Update deployment log
  • Notify stakeholders

12.2 Rollback Procedure

Immediate Rollback (< 5 min):

# Kubernetes
kubectl rollout undo deployment/veza-backend -n veza-production

# Verify
kubectl rollout status deployment/veza-backend -n veza-production

# Check logs
kubectl logs -f deployment/veza-backend -n veza-production

12.3 Incident Response

Severity Levels:

  • P0 (Critical): Production down, data breach
  • P1 (High): Major feature broken, performance degradation
  • P2 (Medium): Minor feature broken
  • P3 (Low): Cosmetic issues

Response Procedure:

  1. Acknowledge incident (< 5 min)
  2. Assess severity
  3. Notify stakeholders
  4. Mitigate (rollback, hotfix, scaling)
  5. Root cause analysis
  6. Post-mortem

VALIDATION CHECKLIST

Infrastructure

  • Infrastructure as Code (Terraform) complete
  • All resources versioned in Git
  • Secrets in Vault (no plaintext)
  • Automated provisioning tested

Deployment

  • CI/CD pipeline functional
  • Zero-downtime deployment strategy (blue-green or canary)
  • Automated rollback configured
  • Health checks implemented

Monitoring

  • Prometheus + Grafana dashboards
  • Alerting configured (PagerDuty/Slack)
  • Logging centralized (ELK Stack)
  • Tracing implemented (Jaeger)

Disaster Recovery

  • Automated backups (daily + hourly)
  • Backup restoration tested
  • Failover procedure documented
  • RTO < 4h, RPO < 1h validated

📊 SUCCESS METRICS

Deployment Metrics

  • Deployment Frequency: Multiple per day
  • Lead Time: < 1 hour (commit to production)
  • MTTR (Mean Time To Recovery): < 5 minutes
  • Change Failure Rate: < 5%

Operational Metrics

  • Uptime: > 99.9%
  • RTO: < 4 hours
  • RPO: < 1 hour
  • Deployment Success Rate: > 95%

🔄 VERSION HISTORY

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-11-02 | Initial version - complete deployment guide |

⚠️ WARNING

THIS GUIDE IS IMMUTABLE

Document created by: DevOps Team + SRE
Creation date: 2025-11-02
Next review: Quarterly (2026-02-01)
Owner: DevOps Lead

Status: APPROVED AND LOCKED