# ORIGIN_DEPLOYMENT_GUIDE.md

## 📋 EXECUTIVE SUMMARY

This document is the complete deployment guide for running the Veza platform in production. It covers Infrastructure as Code (Terraform/Ansible), containerization (Docker/Incus), orchestration (Kubernetes), CI/CD pipelines, zero-downtime strategies, disaster recovery, monitoring, and the operational procedures needed for secure, automated, and reversible deployments over a 24-month horizon.

## 🎯 OBJECTIVES

### Primary Objective
Establish an automated, secure, reproducible, zero-downtime deployment process for production, with rollback in under 5 minutes, multiple deployments per day, and an RTO under 4 hours in the event of a disaster.

### Secondary Objectives
- Full automation (Infrastructure as Code)
- Zero-downtime deployments (blue-green, canary)
- Automatic rollback on failure (< 5 min)
- Operational disaster recovery plan (RTO < 4h, RPO < 1h)
- Real-time monitoring and alerting (Prometheus + Grafana)

## 📖 TABLE OF CONTENTS

1. [Deployment Philosophy](#1-deployment-philosophy)
2. [Infrastructure as Code](#2-infrastructure-as-code)
3. [Containerization](#3-containerization)
4. [Kubernetes Orchestration](#4-kubernetes-orchestration)
5. [CI/CD Pipelines](#5-cicd-pipelines)
6. [Zero-Downtime Strategies](#6-zero-downtime-strategies)
7. [Configuration Management](#7-configuration-management)
8. [Secrets Management](#8-secrets-management)
9. [Monitoring & Observability](#9-monitoring--observability)
10. [Backup & Disaster Recovery](#10-backup--disaster-recovery)
11. [Scaling Strategy](#11-scaling-strategy)
12. [Operational Procedures](#12-operational-procedures)

## 🔒 IMMUTABLE RULES

1. **Infrastructure as Code**: 100% of the infrastructure is versioned (Terraform); no manual changes
2. **Immutable Infrastructure**: Never modify existing servers; always redeploy
3. **Zero Downtime**: No deployment may interrupt service (blue-green or canary required)
4. **Automated Rollback**: Automatic rollback if health checks fail (< 5 min)
5. **Version Control**: All configuration is versioned in Git, without exception
6. **Secrets in Vault**: No plaintext secrets (HashiCorp Vault or equivalent)
7. **Testing in Staging**: Every deployment is tested in staging first
8. **Monitoring Required**: Alerting is configured before going to production
9. **Backup Verification**: Backups are tested monthly (restore test)
10. **Documentation**: Runbooks are kept up to date for all critical procedures

## 1. DEPLOYMENT PHILOSOPHY

### 1.1 Deployment Principles

**Twelve-Factor App**:
1. **Codebase**: One codebase tracked in Git, many deploys
2. **Dependencies**: Explicitly declare and isolate (go.mod, Cargo.lock, package-lock.json)
3. **Config**: Store config in environment (never in code)
4. **Backing Services**: Treat as attached resources (DB, Redis, S3)
5. **Build, Release, Run**: Strictly separate build and run stages
6. **Processes**: Execute app as stateless processes
7. **Port Binding**: Export services via port binding
8. **Concurrency**: Scale out via process model
9. **Disposability**: Fast startup and graceful shutdown
10. **Dev/Prod Parity**: Keep development, staging, production similar
11. **Logs**: Treat logs as event streams
12. **Admin Processes**: Run admin/management tasks as one-off processes
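
As an illustration of factor 3 (Config), the sketch below shows one way a Go service could read its settings from the environment and fail fast when a required value is missing. The `Config` type and `loadConfig` function are hypothetical names introduced for this example; the variable names (`DATABASE_URL`, `REDIS_URL`, `JWT_SECRET`, `LOG_LEVEL`) are the ones used elsewhere in this guide.

```go
package main

import (
	"fmt"
	"strings"
)

// Config holds the settings the service reads from its environment.
// Hypothetical type for illustration; not taken from the Veza codebase.
type Config struct {
	DatabaseURL string
	RedisURL    string
	JWTSecret   string
	LogLevel    string // optional, defaults to "info"
}

// loadConfig builds a Config from a getenv-style lookup (os.Getenv in production).
// Taking the lookup as a parameter keeps the function testable without
// touching the real process environment.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		DatabaseURL: getenv("DATABASE_URL"),
		RedisURL:    getenv("REDIS_URL"),
		JWTSecret:   getenv("JWT_SECRET"),
		LogLevel:    getenv("LOG_LEVEL"),
	}
	if cfg.LogLevel == "" {
		cfg.LogLevel = "info"
	}

	// Collect every missing variable so the operator sees them all at once.
	var missing []string
	if cfg.DatabaseURL == "" {
		missing = append(missing, "DATABASE_URL")
	}
	if cfg.RedisURL == "" {
		missing = append(missing, "REDIS_URL")
	}
	if cfg.JWTSecret == "" {
		missing = append(missing, "JWT_SECRET")
	}
	if len(missing) > 0 {
		return Config{}, fmt.Errorf("missing required environment variables: %s",
			strings.Join(missing, ", "))
	}
	return cfg, nil
}
```

At startup the service would call `loadConfig(os.Getenv)` and exit with a non-zero status on error, so that a misconfigured release never reaches the "run" stage.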

### 1.2 Deployment Environments

| Environment | Purpose | Update Frequency | Users |
|-------------|---------|------------------|-------|
| **Development** | Local development | Continuous | Developers |
| **Staging** | Pre-production testing | Daily | QA, Product Team |
| **Production** | Live users | Multiple/day | All users |

### 1.3 Deployment Workflow

```
┌─────────────┐
│   Develop   │ ─── git push ───> CI/CD Triggered
└─────────────┘
       │
       ▼
┌─────────────┐
│  Build      │ ─── Tests, Linting, Security Scan
└─────────────┘
       │
       ▼
┌─────────────┐
│  Staging    │ ─── Deploy to staging, E2E tests
└─────────────┘
       │
       ▼
┌─────────────┐
│ Production  │ ─── Blue-Green / Canary deployment
└─────────────┘
       │
       ▼
┌─────────────┐
│  Monitor    │ ─── Health checks, metrics, logs
└─────────────┘
       │
       ▼ (if issues)
┌─────────────┐
│  Rollback   │ ─── Automatic rollback < 5 min
└─────────────┘
```

## 2. INFRASTRUCTURE AS CODE

### 2.1 Terraform Configuration

**Project Structure**:
```
terraform/
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars (encrypted)
│   │   └── outputs.tf
│   └── staging/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── outputs.tf
├── modules/
│   ├── compute/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── database/
│   ├── networking/
│   ├── storage/
│   └── kubernetes/
└── backend.tf (Terraform state in S3)
```

**Example: Compute Module**:
```hcl
# terraform/modules/compute/main.tf
resource "aws_instance" "app_server" {
  count         = var.instance_count
  ami           = var.ami_id
  instance_type = var.instance_type
  
  vpc_security_group_ids = [aws_security_group.app.id]
  subnet_id              = var.subnet_ids[count.index % length(var.subnet_ids)]
  
  user_data = templatefile("${path.module}/user_data.sh", {
    environment = var.environment
  })
  
  tags = {
    Name        = "veza-app-${var.environment}-${count.index + 1}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }
  
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_security_group" "app" {
  name        = "veza-app-${var.environment}"
  description = "Security group for Veza application servers"
  vpc_id      = var.vpc_id
  
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

**Database Module**:
```hcl
# terraform/modules/database/main.tf
resource "aws_db_instance" "postgres" {
  identifier     = "veza-db-${var.environment}"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = var.instance_class
  
  allocated_storage     = var.allocated_storage
  max_allocated_storage = var.max_allocated_storage
  storage_encrypted     = true
  kms_key_id           = var.kms_key_id
  
  db_name  = var.database_name
  username = var.master_username
  password = var.master_password # From Vault
  
  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.database.name
  
  backup_retention_period = var.backup_retention_days
  backup_window          = "03:00-04:00"
  maintenance_window     = "mon:04:00-mon:05:00"
  
  multi_az               = var.multi_az
  publicly_accessible    = false
  skip_final_snapshot    = false
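  # timestamp() changes on every run; ignore_changes below prevents a perpetual diff
  final_snapshot_identifier = "veza-db-${var.environment}-final-snapshot-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"

  lifecycle {
    ignore_changes = [final_snapshot_identifier]
  }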
  
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
  
  tags = {
    Name        = "veza-db-${var.environment}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }
}
```

**Terraform Workflow**:
```bash
# Initialize
cd terraform/environments/production
terraform init

# Plan (review changes)
terraform plan -out=tfplan

# Apply (execute changes)
terraform apply tfplan

# Destroy (staging or ephemeral environments only; never run against production)
terraform destroy
```

### 2.2 Ansible Configuration

**Playbook Structure**:
```
ansible/
├── inventory/
│   ├── production/
│   │   ├── hosts.yml
│   │   └── group_vars/
│   └── staging/
│       ├── hosts.yml
│       └── group_vars/
├── playbooks/
│   ├── deploy-backend.yml
│   ├── deploy-chat-server.yml
│   ├── deploy-stream-server.yml
│   └── deploy-frontend.yml
├── roles/
│   ├── common/
│   ├── docker/
│   ├── nginx/
│   ├── postgres/
│   └── monitoring/
└── ansible.cfg
```

**Deployment Playbook**:
```yaml
# ansible/playbooks/deploy-backend.yml
---
- name: Deploy Veza Backend API
  hosts: backend_servers
  become: yes
  
  vars:
    app_name: veza-backend-api
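    # lookup('env', ...) returns an empty string when VERSION is unset,
    # so default() needs its second argument to treat "" as missing
    app_version: "{{ lookup('env', 'VERSION') | default('latest', true) }}"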
    docker_image: "registry.veza.app/{{ app_name }}:{{ app_version }}"
    
  tasks:
    - name: Pull Docker image
      docker_image:
        name: "{{ docker_image }}"
        source: pull
        
    - name: Stop old container
      docker_container:
        name: "{{ app_name }}"
        state: stopped
      ignore_errors: yes
      
    - name: Remove old container
      docker_container:
        name: "{{ app_name }}"
        state: absent
      ignore_errors: yes
      
    - name: Start new container
      docker_container:
        name: "{{ app_name }}"
        image: "{{ docker_image }}"
        state: started
        restart_policy: unless-stopped
        ports:
          - "8080:8080"
        env:
          DATABASE_URL: "{{ database_url }}"
          REDIS_URL: "{{ redis_url }}"
          JWT_SECRET: "{{ jwt_secret }}"
        volumes:
          - "/var/log/{{ app_name }}:/var/log/app"
        healthcheck:
          # The Alpine-based image ships BusyBox wget but not curl
          test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s
          
    - name: Wait for application to be healthy
      uri:
        url: http://localhost:8080/health
        status_code: 200
      register: result
      until: result.status == 200
      retries: 10
      delay: 5
      
    - name: Verify deployment
      debug:
        msg: "{{ app_name }} version {{ app_version }} deployed successfully"
```
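
The "Wait for application to be healthy" task above is a bounded retry loop. The same idea is sketched below in Go for cases where the wait is done from a deployment tool rather than from Ansible. `waitHealthy` is a hypothetical helper, and the probe is passed in as a function so the loop stays independent of how health is actually checked.

```go
package main

import "time"

// waitHealthy calls check up to `retries` times, sleeping `delay` between
// attempts, and reports whether any attempt succeeded.
func waitHealthy(check func() bool, retries int, delay time.Duration) bool {
	for attempt := 1; attempt <= retries; attempt++ {
		if check() {
			return true
		}
		// No need to sleep after the final failed attempt.
		if attempt < retries {
			time.Sleep(delay)
		}
	}
	return false
}
```

With the values from the playbook this would be called as `waitHealthy(probe, 10, 5*time.Second)`, where `probe` performs the HTTP request against `/health`; a `false` result is the signal to trigger the rollback.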

## 3. CONTAINERIZATION

### 3.1 Docker Images

**Multi-Stage Build (Go)**:
```dockerfile
# veza-backend-api/Dockerfile
# Stage 1: Builder
FROM golang:1.21.5-alpine3.18 AS builder

WORKDIR /app

# Copy dependencies
COPY go.mod go.sum ./
RUN go mod download

# Copy source
COPY . .

# Build binary
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -ldflags="-w -s" -o main ./cmd/api

# Stage 2: Runner
FROM alpine:3.18

# Install CA certificates for HTTPS
RUN apk --no-cache add ca-certificates

# /root is not accessible to the non-root user below, so the binary lives in /app
WORKDIR /app

# Copy binary from builder
COPY --from=builder /app/main .

# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser

USER appuser

# Expose port
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
  CMD ["/app/main", "healthcheck"]

# Run
ENTRYPOINT ["./main"]
```
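
The `HEALTHCHECK` instruction above assumes the binary understands a `healthcheck` subcommand, and the Kubernetes probes later in this guide assume `/health` and `/ready` endpoints. The sketch below shows one way these could fit together; all function names are hypothetical, and the readiness dependency check (`ping`) is a placeholder for whatever the service actually needs, such as a database ping.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// healthHandler answers liveness probes: 200 as long as the process can serve HTTP.
func healthHandler(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, "ok")
}

// readyHandler answers readiness probes. ping is a hypothetical dependency
// check; a non-nil error takes the instance out of rotation with a 503.
func readyHandler(ping func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		if err := ping(); err != nil {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ready")
	}
}

// exitCodeFor maps a probe result to the exit code Docker's HEALTHCHECK
// expects: 0 means healthy, 1 means unhealthy.
func exitCodeFor(statusCode int, err error) int {
	if err != nil || statusCode != http.StatusOK {
		return 1
	}
	return 0
}

// runHealthcheck performs one probe against url and returns the exit code.
func runHealthcheck(url string) int {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return exitCodeFor(0, err)
	}
	defer resp.Body.Close()
	return exitCodeFor(resp.StatusCode, nil)
}
```

In `main`, the subcommand would be wired roughly as `if len(os.Args) > 1 && os.Args[1] == "healthcheck" { os.Exit(runHealthcheck("http://localhost:8080/health")) }`. Building the probe into the binary avoids installing curl or wget in the runtime image.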

**Multi-Stage Build (Rust)**:
```dockerfile
# veza-chat-server/Dockerfile
FROM rust:1.75-alpine AS builder

WORKDIR /app

RUN apk add --no-cache musl-dev

# Copy dependencies
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs && cargo build --release && rm -rf src

# Copy source
COPY . .

# Build binary (touch main.rs so cargo does not reuse the placeholder build)
RUN touch src/main.rs && cargo build --release

# Stage 2: Runner
FROM alpine:3.18

WORKDIR /app

# Copy binary
COPY --from=builder /app/target/release/veza-chat-server .

# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser

USER appuser

EXPOSE 8081

HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
  CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost:8081/health"]

ENTRYPOINT ["./veza-chat-server"]
```

**Frontend (React/Vite)**:
```dockerfile
# apps/web/Dockerfile
FROM node:20-alpine AS builder

WORKDIR /app

COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build

# Stage 2: Nginx
FROM nginx:1.25-alpine

COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf

EXPOSE 80

HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost/health"]

CMD ["nginx", "-g", "daemon off;"]
```

### 3.2 Docker Compose (Development)

```yaml
# docker-compose.yml
version: '3.9'

services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: veza_db
      POSTGRES_USER: veza
      POSTGRES_PASSWORD: ${DB_PASSWORD:-password}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U veza"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

  backend:
    build:
      context: ./veza-backend-api
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
      JWT_SECRET: ${JWT_SECRET}
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      # The Alpine-based image ships BusyBox wget but not curl
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  chat-server:
    build:
      context: ./veza-chat-server
      dockerfile: Dockerfile
    ports:
      - "8081:8081"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  frontend:
    build:
      context: ./apps/web
      dockerfile: Dockerfile
    ports:
      - "3000:80"
    depends_on:
      - backend

volumes:
  postgres_data:
  redis_data:
```

## 4. KUBERNETES ORCHESTRATION

### 4.1 Kubernetes Manifests

**Deployment (Backend)**:
```yaml
# k8s/backend/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veza-backend
  namespace: veza-production
  labels:
    app: veza-backend
    version: v1.0.0
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: veza-backend
  template:
    metadata:
      labels:
        app: veza-backend
        version: v1.0.0
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
      - name: backend
        image: registry.veza.app/veza-backend-api:v1.0.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: veza-secrets
              key: database-url
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: veza-secrets
              key: redis-url
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: veza-secrets
              key: jwt-secret
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
      imagePullSecrets:
      - name: registry-credentials
```

**Service**:
```yaml
# k8s/backend/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  type: ClusterIP
  selector:
    app: veza-backend
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
```

**Ingress**:
```yaml
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: veza-ingress
  namespace: veza-production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # ingress-nginx has no `rate-limit` annotation; limit-rps caps requests per second per client IP
    nginx.ingress.kubernetes.io/limit-rps: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.veza.app
    - veza.app
    secretName: veza-tls
  rules:
  - host: api.veza.app
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: veza-backend
            port:
              number: 80
  - host: veza.app
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: veza-frontend
            port:
              number: 80
```

**HorizontalPodAutoscaler**:
```yaml
# k8s/backend/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: veza-backend-hpa
  namespace: veza-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
```
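
To make the autoscaler's behavior easier to reason about, the sketch below reproduces the core HPA calculation documented by Kubernetes: `desired = ceil(current * currentUtilization / targetUtilization)`, evaluated per metric, with the largest result winning. It is a simplification: it ignores the controller's tolerance band, unready pods, and the `behavior` policies above, and the function names are hypothetical.

```go
package main

import "math"

// desiredReplicas applies the HPA formula for a single metric and clamps
// the result to [minReplicas, maxReplicas].
func desiredReplicas(current int, currentUtil, targetUtil float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(float64(current) * currentUtil / targetUtil))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

// desiredForBackend evaluates both metrics from the manifest above
// (CPU target 70%, memory target 80%, 3 to 10 replicas) and keeps the
// larger result, as the HPA does when several metrics are configured.
func desiredForBackend(current int, cpuUtil, memUtil float64) int {
	byCPU := desiredReplicas(current, cpuUtil, 70, 3, 10)
	byMem := desiredReplicas(current, memUtil, 80, 3, 10)
	if byCPU > byMem {
		return byCPU
	}
	return byMem
}
```

For example, 3 replicas averaging 90% CPU give `ceil(3 * 90 / 70) = 4` replicas, while a quiet cluster never drops below the configured minimum of 3.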

## 5. CI/CD PIPELINES

### 5.1 GitHub Actions Workflow

```yaml
# .github/workflows/deploy-production.yml
name: Deploy to Production

on:
  push:
    branches:
      - main
    tags:
      - 'v*'

env:
  REGISTRY: registry.veza.app
  KUBE_NAMESPACE: veza-production

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Run tests
        run: |
          make test-all
      
      - name: Security scan
        run: |
          make security-scan

  build-backend:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      
      - name: Login to Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
      
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/veza-backend-api
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix={{branch}}-
            type=raw,value=${{ github.sha }}
      
      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: ./veza-backend-api
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache,mode=max

  deploy-staging:
    needs: [build-backend]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v3
      
      - name: Deploy to Staging
        run: |
          kubectl set image deployment/veza-backend \
            backend=${{ env.REGISTRY }}/veza-backend-api:${{ github.sha }} \
            -n veza-staging
          kubectl rollout status deployment/veza-backend -n veza-staging --timeout=5m
      
      - name: Run E2E tests
        run: |
          npm run test:e2e -- --env=staging

  deploy-production:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'
      
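      - name: Configure kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > "$RUNNER_TEMP/kubeconfig"
          chmod 600 "$RUNNER_TEMP/kubeconfig"
          # `export` does not survive the end of this step; GITHUB_ENV persists the variable for later steps
          echo "KUBECONFIG=$RUNNER_TEMP/kubeconfig" >> "$GITHUB_ENV"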
      
      - name: Deploy to Production (Blue-Green)
        run: |
          # Deploy green environment
          kubectl apply -f k8s/backend/deployment-green.yaml
          kubectl rollout status deployment/veza-backend-green -n ${{ env.KUBE_NAMESPACE }} --timeout=10m
          
          # Run smoke tests
          make smoke-tests ENDPOINT=https://green.api.veza.app
          
          # Switch traffic to green
          kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
            -p '{"spec":{"selector":{"version":"green"}}}'
          
          # Wait for validation
          sleep 60
          
          # Monitor metrics
          if ! make verify-deployment; then
            echo "Deployment verification failed, rolling back..."
            kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
              -p '{"spec":{"selector":{"version":"blue"}}}'
            exit 1
          fi
          
          # Delete old blue deployment
          kubectl delete deployment veza-backend-blue -n ${{ env.KUBE_NAMESPACE }}
      
      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment ${{ job.status }}: ${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```

## 6. ZERO-DOWNTIME STRATEGIES

### 6.1 Blue-Green Deployment

**Process**:
1. **Blue** (current production) serves all traffic
2. Deploy **Green** (new version) in parallel
3. Test Green thoroughly (smoke tests, health checks)
4. Switch load balancer from Blue to Green (instant cutover)
5. Monitor Green for issues (5-10 min)
6. If issues: Rollback to Blue (instant)
7. If stable: Decommission Blue

**Kubernetes Implementation**:
```bash
# Deploy green
kubectl apply -f k8s/backend/deployment-green.yaml

# Wait for readiness
kubectl wait --for=condition=available --timeout=10m deployment/veza-backend-green -n veza-production

# Switch service selector
kubectl patch service veza-backend -n veza-production -p '{"spec":{"selector":{"version":"green"}}}'

# Monitor
watch kubectl get pods -l app=veza-backend -n veza-production

# Rollback if needed
kubectl patch service veza-backend -n veza-production -p '{"spec":{"selector":{"version":"blue"}}}'
```

### 6.2 Canary Deployment

**Process**:
1. Deploy new version (canary) with 5% traffic
2. Monitor metrics (error rate, latency)
3. Gradually increase traffic: 5% → 25% → 50% → 100%
4. At each stage, verify metrics are healthy
5. If issues detected: Rollback immediately
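
The progression above can be expressed as a small state function, sketched below under stated assumptions. `nextCanaryWeight` and `metricsHealthy` are hypothetical names; the health thresholds (success rate of at least 99%, request duration of at most 500 ms) are borrowed from the Flagger example in this guide rather than from a measured baseline.

```go
package main

// canarySteps is the traffic progression described above, in percent.
var canarySteps = []int{5, 25, 50, 100}

// metricsHealthy reports whether the canary's metrics allow promotion.
// Thresholds are assumptions taken from the Flagger analysis example.
func metricsHealthy(successRatePct, requestDurationMs float64) bool {
	return successRatePct >= 99 && requestDurationMs <= 500
}

// nextCanaryWeight returns the traffic weight to apply once the current
// stage has been observed. Unhealthy metrics at any stage send all
// traffic back to the stable version (weight 0).
func nextCanaryWeight(current int, healthy bool) int {
	if !healthy {
		return 0
	}
	for _, step := range canarySteps {
		if step > current {
			return step
		}
	}
	return 100 // already fully promoted
}
```

A controller would call `nextCanaryWeight(weight, metricsHealthy(rate, duration))` after each observation window and write the result into the routing layer, for instance the `weight` fields of an Istio `VirtualService`.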

**Kubernetes with Istio**:
```yaml
# k8s/canary/virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: veza-backend
spec:
  hosts:
  - veza-backend
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: veza-backend
        subset: canary
  - route:
    - destination:
        host: veza-backend
        subset: stable
      weight: 95
    - destination:
        host: veza-backend
        subset: canary
      weight: 5
```

**Automated Canary with Flagger**:
```yaml
# k8s/canary/flagger-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
  webhooks:
    - name: acceptance-test
      type: pre-rollout
      url: http://flagger-loadtester.test/
      timeout: 30s
      metadata:
        type: bash
        cmd: "curl -s http://veza-backend-canary/health | grep -q ok"
```

## 7. CONFIGURATION MANAGEMENT

### 7.1 ConfigMap (Non-Sensitive Config)

```yaml
# k8s/backend/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: veza-backend-config
  namespace: veza-production
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  API_RATE_LIMIT: "300"
  MAX_UPLOAD_SIZE: "500MB"
  CORS_ORIGINS: "https://veza.app,https://www.veza.app"
```

### 7.2 Secrets (Sensitive Data)

```yaml
# k8s/backend/secret.yaml (encrypted with SOPS or sealed-secrets)
apiVersion: v1
kind: Secret
metadata:
  name: veza-secrets
  namespace: veza-production
type: Opaque
data:
  database-url: <base64-encoded>
  redis-url: <base64-encoded>
  jwt-secret: <base64-encoded>
  stripe-api-key: <base64-encoded>
```

**Create Secret from Vault**:
```bash
# Fetch from Vault and create K8s secret
# (kubectl base64-encodes the value itself, and --from-literal does not read stdin)
kubectl create secret generic veza-secrets \
  --from-literal=database-url="$(vault kv get -field=database_url secret/veza/production)" \
  -n veza-production
```

## 8. SECRETS MANAGEMENT

### 8.1 HashiCorp Vault

**Vault Structure**:
```
secret/
├── veza/
│   ├── production/
│   │   ├── database_url
│   │   ├── redis_url
│   │   ├── jwt_secret
│   │   ├── stripe_api_key
│   │   ├── aws_access_key
│   │   └── aws_secret_key
│   └── staging/
│       └── ...
```

**Store Secret**:
```bash
# Write secret
vault kv put secret/veza/production \
  database_url="postgresql://..." \
  redis_url="redis://..." \
  jwt_secret="..."

# Read secret
vault kv get secret/veza/production

# Rotate secret (new version)
# `kv patch` updates one key; `kv put` would replace the whole secret and drop the other keys
vault kv patch secret/veza/production jwt_secret="new-secret"
```

**Vault Agent Injector (Kubernetes)**:
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "veza-backend"
    vault.hashicorp.com/agent-inject-secret-database: "secret/data/veza/production"
    vault.hashicorp.com/agent-inject-template-database: |
      {{- with secret "secret/data/veza/production" -}}
      export DATABASE_URL="{{ .Data.data.database_url }}"
      {{- end }}
```

## 9. MONITORING & OBSERVABILITY

### 9.1 Prometheus + Grafana

**Prometheus Configuration**:
```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'veza-backend'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      action: keep
      regex: veza-backend
    - source_labels: [__meta_kubernetes_pod_ip]
      target_label: __address__
      replacement: $1:8080

  - job_name: 'postgres'
    static_configs:
    - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
    - targets: ['redis-exporter:9121']
```

**Grafana Dashboard**:
- **API Latency**: p50, p95, p99 response times
- **Throughput**: Requests per second
- **Error Rate**: 4xx, 5xx errors
- **Database**: Query time, connections, slow queries
- **Cache Hit Rate**: Redis hit/miss ratio

### 9.2 Logging (ELK Stack)

**Filebeat Configuration**:
```yaml
# filebeat/filebeat.yml
filebeat.inputs:
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  processors:
    - add_kubernetes_metadata:
        host: ${NODE_NAME}
        matchers:
        - logs_path:
            logs_path: "/var/lib/docker/containers/"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "veza-logs-%{+yyyy.MM.dd}"
```

### 9.3 Tracing (Jaeger)

**OpenTelemetry Integration**:
```go
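// Go - OpenTelemetry setup
// The semconv version below is an assumption; pin it to match your otel module version.
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)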

func initTracer() (*trace.TracerProvider, error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://jaeger:14268/api/traces")))
    if err != nil {
        return nil, err
    }
    
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("veza-backend-api"),
        )),
    )
    
    otel.SetTracerProvider(tp)
    return tp, nil
}
```

## 10. BACKUP & DISASTER RECOVERY

### 10.1 Database Backups

**Automated Backup Strategy**:
- **Daily**: Full backup (3 AM UTC)
- **Hourly**: Incremental backup
- **Retention**: 30 days daily, 12 weeks weekly, 2 years monthly
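
The retention tiers can be turned into a pruning rule, sketched below. The guide does not say which daily backup counts as the weekly or the monthly one, so this example assumes Sunday backups are the weekly set and backups taken on the 1st of the month are the monthly set; both are assumptions, as are the function names. Hourly incrementals are not covered.

```go
package main

import "time"

// retain applies the retention tiers to a single full backup:
// every daily backup for 30 days, weekly backups for 12 weeks,
// monthly backups for 2 years.
func retain(ageDays int, isWeekly, isMonthly bool) bool {
	switch {
	case ageDays <= 30:
		return true
	case ageDays <= 12*7:
		// Monthly backups outlive weekly ones, so they are kept here too.
		return isWeekly || isMonthly
	case ageDays <= 2*365:
		return isMonthly
	default:
		return false
	}
}

// keepBackup classifies a backup by its timestamp and applies retain.
// Assumption: Sunday backups are "weekly", 1st-of-month backups are "monthly".
func keepBackup(taken, now time.Time) bool {
	ageDays := int(now.Sub(taken).Hours() / 24)
	return retain(ageDays, taken.Weekday() == time.Sunday, taken.Day() == 1)
}
```

A pruning job would list the objects in the backup bucket, parse the timestamp embedded in each file name, and delete those for which `keepBackup` returns `false`.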

**Backup Script**:
```bash
#!/bin/bash
# scripts/backup-database.sh
set -euo pipefail  # abort on the first failed step so a broken dump is never uploaded

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/postgres"
DATABASE="veza_db"
DUMP_FILE="$BACKUP_DIR/veza_db_$DATE.dump"

# Full backup
pg_dump -Fc -f "$DUMP_FILE" "$DATABASE"

# Encrypt (writes $DUMP_FILE.gpg), then remove the plaintext dump
gpg --encrypt --recipient backup@veza.app "$DUMP_FILE"
rm -f "$DUMP_FILE"

# Upload to S3
aws s3 cp "$DUMP_FILE.gpg" s3://veza-backups/postgres/

# Cleanup local backups > 7 days
find "$BACKUP_DIR" -name "*.dump.gpg" -mtime +7 -delete
```

**Restore Procedure**:
```bash
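#!/bin/bash
# scripts/restore-database.sh
# Usage: restore-database.sh <backup-file>
set -euo pipefail

BACKUP_FILE="${1:?usage: $0 <backup-file>}"

# Download from S3
aws s3 cp "s3://veza-backups/postgres/$BACKUP_FILE" /tmp/

# Decrypt
gpg --decrypt "/tmp/$BACKUP_FILE" > "/tmp/backup.dump"

# Restore
pg_restore -d veza_db "/tmp/backup.dump"

# Remove the temporary copies, including the decrypted dump
rm -f "/tmp/$BACKUP_FILE" "/tmp/backup.dump"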
```

### 10.2 Disaster Recovery Plan

**RTO (Recovery Time Objective)**: < 4 hours  
**RPO (Recovery Point Objective)**: < 1 hour

**Recovery Procedures**:
1. **Database Failure**: Failover to standby replica (< 5 min)
2. **Application Failure**: Rollback deployment (< 5 min)
3. **Complete Region Failure**: Failover to DR region (< 4 hours)

## 11. SCALING STRATEGY

### 11.1 Horizontal Scaling

**Auto-Scaling Rules**:
- **CPU > 70%**: Scale up
- **CPU < 30%**: Scale down (after 5 min stability)
- **Memory > 80%**: Scale up
- **Request queue > 100**: Scale up
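
Read as a decision table, the rules above could look like the following sketch. The `Load` type, the `decide` function, and the action names are hypothetical, and the request-queue length is assumed to come from a custom metric, since the HPA manifest earlier in this guide only tracks CPU and memory.

```go
package main

// ScaleAction is the outcome of one evaluation of the scaling rules.
type ScaleAction string

const (
	ScaleUp   ScaleAction = "scale-up"
	ScaleDown ScaleAction = "scale-down"
	Hold      ScaleAction = "hold"
)

// Load is a snapshot of the signals the rules look at.
type Load struct {
	CPUPercent    float64
	MemoryPercent float64
	QueueLength   int // assumed to come from a custom metric
	LowCPUMinutes int // how long CPU has stayed below 30%
}

// decide applies the rules in priority order: any scale-up condition wins,
// and scale-down requires CPU to have been low for at least 5 minutes.
func decide(l Load) ScaleAction {
	switch {
	case l.CPUPercent > 70, l.MemoryPercent > 80, l.QueueLength > 100:
		return ScaleUp
	case l.CPUPercent < 30 && l.LowCPUMinutes >= 5:
		return ScaleDown
	default:
		return Hold
	}
}
```

Giving scale-up priority over scale-down means a service with idle CPU but high memory pressure or a growing queue is never shrunk.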

### 11.2 Database Scaling

**Read Replicas**:
- 2 read replicas minimum
- Route read queries to replicas
- Write queries to primary only

**Connection Pooling** (PgBouncer):
```ini
[databases]
veza_db = host=postgres port=5432 dbname=veza_db

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
```

## 12. OPERATIONAL PROCEDURES

### 12.1 Deployment Checklist

**Pre-Deployment**:
- [ ] All tests pass (unit, integration, E2E)
- [ ] Security scan completed (no critical vulnerabilities)
- [ ] Database migrations tested in staging
- [ ] Rollback plan documented
- [ ] Monitoring dashboards ready
- [ ] On-call engineer notified
- [ ] Deployment window scheduled (low-traffic period)

**During Deployment**:
- [ ] Monitor error rates in real-time
- [ ] Monitor response times (p95, p99)
- [ ] Check logs for errors
- [ ] Verify database migrations applied
- [ ] Test critical user flows

**Post-Deployment**:
- [ ] Verify all services healthy
- [ ] Run smoke tests
- [ ] Monitor for 30 minutes
- [ ] Update deployment log
- [ ] Notify stakeholders

### 12.2 Rollback Procedure

**Immediate Rollback** (< 5 min):
```bash
# Kubernetes
kubectl rollout undo deployment/veza-backend -n veza-production

# Verify
kubectl rollout status deployment/veza-backend -n veza-production

# Check logs
kubectl logs -f deployment/veza-backend -n veza-production
```

### 12.3 Incident Response

**Severity Levels**:
- **P0 (Critical)**: Production down, data breach
- **P1 (High)**: Major feature broken, performance degradation
- **P2 (Medium)**: Minor feature broken
- **P3 (Low)**: Cosmetic issues

**Response Procedure**:
1. Acknowledge incident (< 5 min)
2. Assess severity
3. Notify stakeholders
4. Mitigate (rollback, hotfix, scaling)
5. Root cause analysis
6. Post-mortem

## ✅ VALIDATION CHECKLIST

### Infrastructure
- [ ] Infrastructure as Code (Terraform) complete
- [ ] All resources versioned in Git
- [ ] Secrets in Vault (no plaintext)
- [ ] Automated provisioning tested

### Deployment
- [ ] CI/CD pipeline functional
- [ ] Zero-downtime deployment strategy (blue-green or canary)
- [ ] Automated rollback configured
- [ ] Health checks implemented

### Monitoring
- [ ] Prometheus + Grafana dashboards
- [ ] Alerting configured (PagerDuty/Slack)
- [ ] Logging centralized (ELK Stack)
- [ ] Tracing implemented (Jaeger)

### Disaster Recovery
- [ ] Automated backups (daily + hourly)
- [ ] Backup restoration tested
- [ ] Failover procedure documented
- [ ] RTO < 4h, RPO < 1h validated

## 📊 SUCCESS METRICS

### Deployment Metrics
- **Deployment Frequency**: Multiple per day
- **Lead Time**: < 1 hour (commit to production)
- **MTTR (Mean Time To Recovery)**: < 5 minutes
- **Change Failure Rate**: < 5%

### Operational Metrics
- **Uptime**: > 99.9%
- **RTO**: < 4 hours
- **RPO**: < 1 hour
- **Deployment Success Rate**: > 95%
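
Two of these targets are easier to track once converted into concrete numbers. The sketch below, with hypothetical function names, turns an uptime target into a downtime budget and computes the change failure rate from deployment counts.

```go
package main

// allowedDowntimeMinutes converts an uptime target (in percent) into the
// downtime budget, in minutes, over a period of the given number of days.
func allowedDowntimeMinutes(uptimePercent float64, days int) float64 {
	totalMinutes := float64(days) * 24 * 60
	return totalMinutes * (100 - uptimePercent) / 100
}

// changeFailureRate returns the percentage of deployments that caused a
// failure in production. It returns 0 when there were no deployments.
func changeFailureRate(failed, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(failed) / float64(total) * 100
}
```

With a 99.9% target, a 30-day month leaves a budget of about 43.2 minutes of downtime (43,200 minutes × 0.1%), and 1 failed deployment out of 20 is a change failure rate of 5%, right at the stated limit.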

## 🔄 VERSION HISTORY

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-11-02 | Initial version: complete deployment guide |

---

## ⚠️ WARNING

**THIS GUIDE IS IMMUTABLE**

---

**Document created by**: DevOps Team + SRE  
**Creation date**: 2025-11-02  
**Next review**: Quarterly (2026-02-01)  
**Owner**: DevOps Lead

**Status**: ✅ **APPROVED AND LOCKED**