veza/veza-docs/ORIGIN/ORIGIN_DEPLOYMENT_GUIDE.md
2026-03-05 19:22:31 +01:00

# ORIGIN_DEPLOYMENT_GUIDE.md
## 📋 EXECUTIVE SUMMARY
This document defines the complete production deployment guide for the Veza platform. It covers Infrastructure as Code (Terraform/Ansible), containerization (Docker/Incus), orchestration (Kubernetes), CI/CD pipelines, zero-downtime strategies, disaster recovery, monitoring, and operational procedures for secure, automated, and reversible deployments over 24 months.
## 🎯 OBJECTIVES
### Primary Objective
Establish an automated, secure, reproducible, zero-downtime deployment process for production, with rollback in under 5 minutes, multiple deployments per day, and an RTO under 4 hours in case of disaster.
### Secondary Objectives
- Full automation (Infrastructure as Code)
- Zero-downtime deployments (blue-green, canary)
- Automatic rollback on failure (< 5 min)
- Operational disaster recovery plan (RTO < 4h, RPO < 1h)
- Real-time monitoring and alerting (Prometheus + Grafana)
## 📖 TABLE OF CONTENTS
1. [Deployment Philosophy](#1-deployment-philosophy)
2. [Infrastructure as Code](#2-infrastructure-as-code)
3. [Containerization](#3-containerization)
4. [Kubernetes Orchestration](#4-kubernetes-orchestration)
5. [CI/CD Pipelines](#5-cicd-pipelines)
6. [Zero-Downtime Strategies](#6-zero-downtime-strategies)
7. [Configuration Management](#7-configuration-management)
8. [Secrets Management](#8-secrets-management)
9. [Monitoring & Observability](#9-monitoring--observability)
10. [Backup & Disaster Recovery](#10-backup--disaster-recovery)
11. [Scaling Strategy](#11-scaling-strategy)
12. [Operational Procedures](#12-operational-procedures)
13. [Priority Security Fixes](#13-priority-security-fixes)
14. [Ethical Deployment Checklist](#14-ethical-deployment-checklist)
15. [Plan de Migration JWT HS256 → RS256](#15-plan-de-migration-jwt-hs256--rs256)
## 🔒 IMMUTABLE RULES
1. **Infrastructure as Code**: 100% of the infrastructure is versioned (Terraform) - no manual changes
2. **Immutable Infrastructure**: Never modify existing servers; always redeploy
3. **Zero Downtime**: No deployment may interrupt service (blue-green or canary mandatory)
4. **Automated Rollback**: Automatic rollback if health checks fail (< 5 min)
5. **Version Control**: All configs versioned (Git) - no exceptions
6. **Secrets in Vault**: No plaintext secrets (HashiCorp Vault or equivalent)
7. **Testing in Staging**: Every deployment is tested in staging first
8. **Monitoring Required**: Alerting configured before going to production
9. **Backup Verification**: Backups tested monthly (restore test)
10. **Documentation**: Up-to-date runbooks for all critical procedures
## 1. DEPLOYMENT PHILOSOPHY
### 1.1 Deployment Principles
**Twelve-Factor App**:
1. **Codebase**: One codebase tracked in Git, many deploys
2. **Dependencies**: Explicitly declare and isolate (go.mod, Cargo.lock, package-lock.json)
3. **Config**: Store config in environment (never in code)
4. **Backing Services**: Treat as attached resources (DB, Redis, S3)
5. **Build, Release, Run**: Strictly separate build and run stages
6. **Processes**: Execute app as stateless processes
7. **Port Binding**: Export services via port binding
8. **Concurrency**: Scale out via process model
9. **Disposability**: Fast startup and graceful shutdown
10. **Dev/Prod Parity**: Keep development, staging, production similar
11. **Logs**: Treat logs as event streams
12. **Admin Processes**: Run admin/management tasks as one-off processes
### 1.2 Deployment Environments
| Environment | Purpose | Update Frequency | Users |
|-------------|---------|------------------|-------|
| **Development** | Local development | Continuous | Developers |
| **Staging** | Pre-production testing | Daily | QA, Product Team |
| **Production** | Live users | Multiple/day | All users |
### 1.3 Deployment Workflow
```
┌─────────────┐
│   Develop   │ ─── git push ───> CI/CD Triggered
└─────────────┘
       │
       ▼
┌─────────────┐
│    Build    │ ─── Tests, Linting, Security Scan
└─────────────┘
       │
       ▼
┌─────────────┐
│   Staging   │ ─── Deploy to staging, E2E tests
└─────────────┘
       │
       ▼
┌─────────────┐
│ Production  │ ─── Blue-Green / Canary deployment
└─────────────┘
       │
       ▼
┌─────────────┐
│   Monitor   │ ─── Health checks, metrics, logs
└─────────────┘
       │
       ▼ (if issues)
┌─────────────┐
│  Rollback   │ ─── Automatic rollback < 5 min
└─────────────┘
```
## 2. INFRASTRUCTURE AS CODE
### 2.1 Terraform Configuration
**Project Structure**:
```
terraform/
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars (encrypted)
│   │   └── outputs.tf
│   └── staging/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── outputs.tf
├── modules/
│   ├── compute/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── database/
│   ├── networking/
│   ├── storage/
│   └── kubernetes/
└── backend.tf (Terraform state in S3)
```
**Example: Compute Module**:
```hcl
# terraform/modules/compute/main.tf
resource "aws_instance" "app_server" {
  count                  = var.instance_count
  ami                    = var.ami_id
  instance_type          = var.instance_type
  vpc_security_group_ids = [aws_security_group.app.id]
  subnet_id              = var.subnet_ids[count.index % length(var.subnet_ids)]

  user_data = templatefile("${path.module}/user_data.sh", {
    environment = var.environment
  })

  tags = {
    Name        = "veza-app-${var.environment}-${count.index + 1}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_security_group" "app" {
  name        = "veza-app-${var.environment}"
  description = "Security group for Veza application servers"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```
**Database Module**:
```hcl
# terraform/modules/database/main.tf
resource "aws_db_instance" "postgres" {
  identifier            = "veza-db-${var.environment}"
  engine                = "postgres"
  engine_version        = "15.4"
  instance_class        = var.instance_class
  allocated_storage     = var.allocated_storage
  max_allocated_storage = var.max_allocated_storage
  storage_encrypted     = true
  kms_key_id            = var.kms_key_id

  db_name  = var.database_name
  username = var.master_username
  password = var.master_password # From Vault

  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.database.name

  backup_retention_period = var.backup_retention_days
  backup_window           = "03:00-04:00"
  maintenance_window      = "mon:04:00-mon:05:00"

  multi_az            = var.multi_az
  publicly_accessible = false

  skip_final_snapshot       = false
  final_snapshot_identifier = "veza-db-${var.environment}-final-snapshot-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  tags = {
    Name        = "veza-db-${var.environment}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }
}
```
**Terraform Workflow**:
```bash
# Initialize
cd terraform/environments/production
terraform init
# Plan (review changes)
terraform plan -out=tfplan
# Apply (execute changes)
terraform apply tfplan
# Destroy (cleanup)
terraform destroy
```
### 2.2 Ansible Configuration
**Playbook Structure**:
```
ansible/
├── inventory/
│   ├── production/
│   │   ├── hosts.yml
│   │   └── group_vars/
│   └── staging/
│       ├── hosts.yml
│       └── group_vars/
├── playbooks/
│   ├── deploy-backend.yml
│   ├── deploy-chat-server.yml
│   ├── deploy-stream-server.yml
│   └── deploy-frontend.yml
├── roles/
│   ├── common/
│   ├── docker/
│   ├── nginx/
│   ├── postgres/
│   └── monitoring/
└── ansible.cfg
```
**Deployment Playbook**:
```yaml
# ansible/playbooks/deploy-backend.yml
---
- name: Deploy Veza Backend API
  hosts: backend_servers
  become: yes
  vars:
    app_name: veza-backend-api
    app_version: "{{ lookup('env', 'VERSION') | default('latest') }}"
    docker_image: "registry.veza.app/{{ app_name }}:{{ app_version }}"
  tasks:
    - name: Pull Docker image
      docker_image:
        name: "{{ docker_image }}"
        source: pull
    - name: Stop old container
      docker_container:
        name: "{{ app_name }}"
        state: stopped
      ignore_errors: yes
    - name: Remove old container
      docker_container:
        name: "{{ app_name }}"
        state: absent
      ignore_errors: yes
    - name: Start new container
      docker_container:
        name: "{{ app_name }}"
        image: "{{ docker_image }}"
        state: started
        restart_policy: unless-stopped
        ports:
          - "8080:8080"
        env:
          DATABASE_URL: "{{ database_url }}"
          REDIS_URL: "{{ redis_url }}"
          JWT_SECRET: "{{ jwt_secret }}"
        volumes:
          - "/var/log/{{ app_name }}:/var/log/app"
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s
    - name: Wait for application to be healthy
      uri:
        url: http://localhost:8080/health
        status_code: 200
      register: result
      until: result.status == 200
      retries: 10
      delay: 5
    - name: Verify deployment
      debug:
        msg: "{{ app_name }} version {{ app_version }} deployed successfully"
```
## 3. CONTAINERIZATION
### 3.1 Docker Images
**Multi-Stage Build (Go)**:
```dockerfile
# veza-backend-api/Dockerfile
# Stage 1: Builder
FROM golang:1.21.5-alpine3.18 AS builder
WORKDIR /app
# Copy dependency manifests first to leverage layer caching
COPY go.mod go.sum ./
RUN go mod download
# Copy source
COPY . .
# Build a static, stripped binary
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-w -s" -o main ./cmd/api

# Stage 2: Runner
FROM alpine:3.18
# Install CA certificates for HTTPS
RUN apk --no-cache add ca-certificates
# Create non-root user (the binary must live somewhere appuser can read,
# so /app is used instead of /root)
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser
WORKDIR /app
# Copy binary from builder
COPY --from=builder /app/main .
USER appuser
# Expose port
EXPOSE 8080
# Health check (the binary implements a "healthcheck" subcommand)
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD ["/app/main", "healthcheck"]
# Run
ENTRYPOINT ["./main"]
```
**Multi-Stage Build (Rust)**:
```dockerfile
# veza-chat-server/Dockerfile
# Stage 1: Builder
FROM rust:1.75-alpine AS builder
WORKDIR /app
RUN apk add --no-cache musl-dev
# Cache dependencies by compiling against a dummy main.rs
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs && cargo build --release && rm -rf src
# Copy source
COPY . .
# Touch main.rs so cargo rebuilds the real binary instead of reusing the dummy artifact
RUN touch src/main.rs && cargo build --release
# Stage 2: Runner
FROM alpine:3.18
WORKDIR /app
# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser
# Copy binary
COPY --from=builder /app/target/release/veza-chat-server .
USER appuser
EXPOSE 8081
HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
    CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost:8081/health"]
ENTRYPOINT ["./veza-chat-server"]
```
**Frontend (React/Vite)**:
```dockerfile
# apps/web/Dockerfile
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Stage 2: Nginx
FROM nginx:1.25-alpine
COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf
EXPOSE 80
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost/health"]
CMD ["nginx", "-g", "daemon off;"]
```
### 3.2 Docker Compose (Development)
```yaml
# docker-compose.yml
version: '3.9'
services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: veza_db
      POSTGRES_USER: veza
      POSTGRES_PASSWORD: ${DB_PASSWORD:-password}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U veza"]
      interval: 10s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5
  backend:
    build:
      context: ./veza-backend-api
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
      JWT_SECRET: ${JWT_SECRET}
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  chat-server:
    build:
      context: ./veza-chat-server
      dockerfile: Dockerfile
    ports:
      - "8081:8081"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
  frontend:
    build:
      context: ./apps/web
      dockerfile: Dockerfile
    ports:
      - "3000:80"
    depends_on:
      - backend
volumes:
  postgres_data:
  redis_data:
```
## 4. KUBERNETES ORCHESTRATION
### 4.1 Kubernetes Manifests
**Deployment (Backend)**:
```yaml
# k8s/backend/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veza-backend
  namespace: veza-production
  labels:
    app: veza-backend
    version: v1.0.0
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: veza-backend
  template:
    metadata:
      labels:
        app: veza-backend
        version: v1.0.0
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: backend
          image: registry.veza.app/veza-backend-api:v1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: database-url
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: redis-url
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: jwt-secret
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
      imagePullSecrets:
        - name: registry-credentials
```
**Service**:
```yaml
# k8s/backend/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  type: ClusterIP
  selector:
    app: veza-backend
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
```
**Ingress**:
```yaml
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: veza-ingress
  namespace: veza-production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.veza.app
        - veza.app
      secretName: veza-tls
  rules:
    - host: api.veza.app
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: veza-backend
                port:
                  number: 80
    - host: veza.app
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: veza-frontend
                port:
                  number: 80
```
**HorizontalPodAutoscaler**:
```yaml
# k8s/backend/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: veza-backend-hpa
  namespace: veza-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
```
## 5. CI/CD PIPELINES
### 5.1 GitHub Actions Workflow
```yaml
# .github/workflows/deploy-production.yml
name: Deploy to Production
on:
  push:
    branches:
      - main
    tags:
      - 'v*'
env:
  REGISTRY: registry.veza.app
  KUBE_NAMESPACE: veza-production
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run tests
        run: |
          make test-all
      - name: Security scan
        run: |
          make security-scan
  build-backend:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/veza-backend-api
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix={{branch}}-
      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: ./veza-backend-api
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache,mode=max
  deploy-staging:
    needs: [build-backend]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to Staging
        run: |
          kubectl set image deployment/veza-backend \
            backend=${{ env.REGISTRY }}/veza-backend-api:${{ github.sha }} \
            -n veza-staging
          kubectl rollout status deployment/veza-backend -n veza-staging --timeout=5m
      - name: Run E2E tests
        run: |
          npm run test:e2e -- --env=staging
  deploy-production:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v3
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'
      - name: Configure kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
          # Persist KUBECONFIG for later steps (a plain `export` does not survive the step boundary)
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"
      - name: Deploy to Production (Blue-Green)
        run: |
          # Deploy green environment
          kubectl apply -f k8s/backend/deployment-green.yaml
          kubectl rollout status deployment/veza-backend-green -n ${{ env.KUBE_NAMESPACE }} --timeout=10m
          # Run smoke tests
          make smoke-tests ENDPOINT=https://green.api.veza.app
          # Switch traffic to green
          kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
            -p '{"spec":{"selector":{"version":"green"}}}'
          # Wait for validation
          sleep 60
          # Monitor metrics
          if ! make verify-deployment; then
            echo "Deployment verification failed, rolling back..."
            kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
              -p '{"spec":{"selector":{"version":"blue"}}}'
            exit 1
          fi
          # Delete old blue deployment
          kubectl delete deployment veza-backend-blue -n ${{ env.KUBE_NAMESPACE }}
      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment ${{ job.status }}: ${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```
## 6. ZERO-DOWNTIME STRATEGIES
### 6.1 Blue-Green Deployment
**Process**:
1. **Blue** (current production) serves all traffic
2. Deploy **Green** (new version) in parallel
3. Test Green thoroughly (smoke tests, health checks)
4. Switch load balancer from Blue to Green (instant cutover)
5. Monitor Green for issues (5-10 min)
6. If issues: Rollback to Blue (instant)
7. If stable: Decommission Blue
**Kubernetes Implementation**:
```bash
# Deploy green
kubectl apply -f k8s/backend/deployment-green.yaml
# Wait for readiness
kubectl wait --for=condition=available --timeout=10m deployment/veza-backend-green
# Switch service selector
kubectl patch service veza-backend -p '{"spec":{"selector":{"version":"green"}}}'
# Monitor
watch kubectl get pods -l app=veza-backend
# Rollback if needed
kubectl patch service veza-backend -p '{"spec":{"selector":{"version":"blue"}}}'
```
### 6.2 Canary Deployment
**Process**:
1. Deploy new version (canary) with 5% traffic
2. Monitor metrics (error rate, latency)
3. Gradually increase traffic: 5% → 25% → 50% → 100%
4. At each stage, verify metrics are healthy
5. If issues detected: Rollback immediately
**Kubernetes with Istio**:
```yaml
# k8s/canary/virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: veza-backend
spec:
  hosts:
    - veza-backend
  http:
    - match:
        - headers:
            canary:
              exact: "true"
      route:
        - destination:
            host: veza-backend
            subset: canary
    - route:
        - destination:
            host: veza-backend
            subset: stable
          weight: 95
        - destination:
            host: veza-backend
            subset: canary
          weight: 5
```
**Automated Canary with Flagger**:
```yaml
# k8s/canary/flagger-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -s http://veza-backend-canary/health | grep -q ok"
```
## 7. CONFIGURATION MANAGEMENT
### 7.1 ConfigMap (Non-Sensitive Config)
```yaml
# k8s/backend/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: veza-backend-config
  namespace: veza-production
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  API_RATE_LIMIT: "300"
  MAX_UPLOAD_SIZE: "500MB"
  CORS_ORIGINS: "https://veza.app,https://www.veza.app"
```
### 7.2 Secrets (Sensitive Data)
```yaml
# k8s/backend/secret.yaml (encrypted with SOPS or sealed-secrets)
apiVersion: v1
kind: Secret
metadata:
  name: veza-secrets
  namespace: veza-production
type: Opaque
data:
  database-url: <base64-encoded>
  redis-url: <base64-encoded>
  jwt-secret: <base64-encoded>
  stripe-api-key: <base64-encoded>
```
**Create Secret from Vault**:
```bash
# Fetch from Vault and create the K8s secret.
# kubectl base64-encodes literal values itself, so do not pre-encode,
# and --from-literal cannot read stdin: pass the value directly.
kubectl create secret generic veza-secrets \
  --from-literal=database-url="$(vault kv get -field=database_url secret/veza/production)" \
  -n veza-production
```
## 8. SECRETS MANAGEMENT
### 8.1 HashiCorp Vault
**Vault Structure**:
```
secret/
└── veza/
    ├── production/
    │   ├── database_url
    │   ├── redis_url
    │   ├── jwt_secret
    │   ├── stripe_api_key
    │   ├── aws_access_key
    │   └── aws_secret_key
    └── staging/
        └── ...
```
**Store Secret**:
```bash
# Write secret
vault kv put secret/veza/production \
database_url="postgresql://..." \
redis_url="redis://..." \
jwt_secret="..."
# Read secret
vault kv get secret/veza/production
# Rotate secret (new version)
vault kv put secret/veza/production jwt_secret="new-secret"
```
**Vault Agent Injector (Kubernetes)**:
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "veza-backend"
    vault.hashicorp.com/agent-inject-secret-database: "secret/data/veza/production"
    vault.hashicorp.com/agent-inject-template-database: |
      {{- with secret "secret/data/veza/production" -}}
      export DATABASE_URL="{{ .Data.data.database_url }}"
      {{- end }}
```
## 9. MONITORING & OBSERVABILITY
### 9.1 Prometheus + Grafana
**Prometheus Configuration**:
```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'veza-backend'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: veza-backend
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: $1:8080
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
```
**Grafana Dashboard**:
- **API Latency**: p50, p95, p99 response times
- **Throughput**: Requests per second
- **Error Rate**: 4xx, 5xx errors
- **Database**: Query time, connections, slow queries
- **Cache Hit Rate**: Redis hit/miss ratio
### 9.2 Logging (ELK Stack)
**Filebeat Configuration**:
```yaml
# filebeat/filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/lib/docker/containers/"
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "veza-logs-%{+yyyy.MM.dd}"
```
### 9.3 Tracing (Jaeger)
**OpenTelemetry Integration**:
```go
// Go - OpenTelemetry setup with the Jaeger exporter
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func initTracer() (*trace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://jaeger:14268/api/traces")))
	if err != nil {
		return nil, err
	}
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("veza-backend-api"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```
## 10. BACKUP & DISASTER RECOVERY
### 10.1 Database Backups
**Automated Backup Strategy**:
- **Daily**: Full backup (3 AM UTC)
- **Hourly**: Incremental backup
- **Retention**: 30 days daily, 12 weeks weekly, 2 years monthly
**Backup Script**:
```bash
#!/bin/bash
# scripts/backup-database.sh
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/postgres"
DATABASE="veza_db"
# Full backup
pg_dump -Fc -f "$BACKUP_DIR/veza_db_$DATE.dump" "$DATABASE"
# Encrypt, then remove the plaintext dump
gpg --encrypt --recipient backup@veza.app "$BACKUP_DIR/veza_db_$DATE.dump"
rm "$BACKUP_DIR/veza_db_$DATE.dump"
# Upload to S3
aws s3 cp "$BACKUP_DIR/veza_db_$DATE.dump.gpg" s3://veza-backups/postgres/
# Cleanup local backups > 7 days
find "$BACKUP_DIR" -name "*.dump.gpg" -mtime +7 -delete
```
**Restore Procedure**:
```bash
#!/bin/bash
# scripts/restore-database.sh
BACKUP_FILE=$1
# Download from S3
aws s3 cp "s3://veza-backups/postgres/$BACKUP_FILE" /tmp/
# Decrypt
gpg --decrypt "/tmp/$BACKUP_FILE" > "/tmp/backup.dump"
# Restore
pg_restore -d veza_db "/tmp/backup.dump"
```
### 10.2 Disaster Recovery Plan
**RTO (Recovery Time Objective)**: < 4 hours
**RPO (Recovery Point Objective)**: < 1 hour
**Recovery Procedures**:
1. **Database Failure**: Failover to standby replica (< 5 min)
2. **Application Failure**: Rollback deployment (< 5 min)
3. **Complete Region Failure**: Failover to DR region (< 4 hours)
## 11. SCALING STRATEGY
### 11.1 Horizontal Scaling
**Auto-Scaling Rules**:
- **CPU > 70%**: Scale up
- **CPU < 30%**: Scale down (after 5 min stability)
- **Memory > 80%**: Scale up
- **Request queue > 100**: Scale up
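Given these thresholds, the scale-up decision the Kubernetes HPA makes follows the documented formula `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, clamped to the min/max bounds. A minimal sketch (the function name is illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas implements the standard HPA scaling formula:
// ceil(current * observed / target), clamped to [min, max].
func desiredReplicas(current int, observed, target float64, min, max int) int {
	d := int(math.Ceil(float64(current) * observed / target))
	if d < min {
		return min
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	// 3 replicas at 90% CPU against the 70% target: scale up to 4.
	fmt.Println(desiredReplicas(3, 90, 70, 3, 10))
}
```

The `behavior` block in the HPA manifest of section 4.1 then rate-limits how fast this target is approached (at most +100% per minute up, -1 pod per minute down).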
### 11.2 Database Scaling
**Read Replicas**:
- 2 read replicas minimum
- Route read queries to replicas
- Write queries to primary only
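The read/write split above can be sketched as a small router that always hands writes to the primary and round-robins reads across replicas. The `dbRouter` type and the DSNs are illustrative, not Veza code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// dbRouter returns the primary DSN for writes and round-robins
// read queries across the replica DSNs.
type dbRouter struct {
	primary  string
	replicas []string
	next     atomic.Uint64
}

// writer always targets the primary: it is the only node accepting writes.
func (r *dbRouter) writer() string { return r.primary }

// reader picks the next replica in round-robin order, degrading to the
// primary when no replica is configured.
func (r *dbRouter) reader() string {
	if len(r.replicas) == 0 {
		return r.primary
	}
	i := r.next.Add(1)
	return r.replicas[int(i-1)%len(r.replicas)]
}

func main() {
	r := &dbRouter{
		primary:  "postgres://primary:5432/veza_db",
		replicas: []string{"postgres://replica-1:5432/veza_db", "postgres://replica-2:5432/veza_db"},
	}
	fmt.Println(r.writer())
	fmt.Println(r.reader()) // alternates between the two replicas
}
```

Note that replicas lag the primary slightly, so read-your-own-write flows (e.g. reading a profile right after updating it) should still go to the primary.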
**Connection Pooling** (PgBouncer):
```ini
[databases]
veza_db = host=postgres port=5432 dbname=veza_db
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
```
## 12. OPERATIONAL PROCEDURES
### 12.1 Deployment Checklist
**Pre-Deployment**:
- [ ] All tests pass (unit, integration, E2E)
- [ ] Security scan completed (no critical vulnerabilities)
- [ ] Database migrations tested in staging
- [ ] Rollback plan documented
- [ ] Monitoring dashboards ready
- [ ] On-call engineer notified
- [ ] Deployment window scheduled (low-traffic period)
**During Deployment**:
- [ ] Monitor error rates in real-time
- [ ] Monitor response times (p95, p99)
- [ ] Check logs for errors
- [ ] Verify database migrations applied
- [ ] Test critical user flows
**Post-Deployment**:
- [ ] Verify all services healthy
- [ ] Run smoke tests
- [ ] Monitor for 30 minutes
- [ ] Update deployment log
- [ ] Notify stakeholders
### 12.2 Rollback Procedure
**Immediate Rollback** (< 5 min):
```bash
# Kubernetes
kubectl rollout undo deployment/veza-backend -n veza-production
# Verify
kubectl rollout status deployment/veza-backend -n veza-production
# Check logs
kubectl logs -f deployment/veza-backend -n veza-production
```
### 12.3 Incident Response
**Severity Levels**:
- **P0 (Critical)**: Production down, data breach
- **P1 (High)**: Major feature broken, performance degradation
- **P2 (Medium)**: Minor feature broken
- **P3 (Low)**: Cosmetic issues
**Response Procedure**:
1. Acknowledge incident (< 5 min)
2. Assess severity
3. Notify stakeholders
4. Mitigate (rollback, hotfix, scaling)
5. Root cause analysis
6. Post-mortem
## 13. PRIORITY SECURITY FIXES
> Identified during the security audit of 2026-03-04. These procedures are **blocking** for any production deployment.
### 13.1 JWT Secret Rotation
The JWT secret must be rotated regularly (at least quarterly) and immediately upon any suspected compromise.
**Rotation procedure**:
```bash
#!/bin/bash
# scripts/rotate-jwt-secret.sh
set -euo pipefail
NEW_SECRET=$(openssl rand -base64 64)
# 1. Store the new secret in Vault
vault kv put secret/veza/production jwt_secret="$NEW_SECRET"
# 2. Update the Kubernetes secret
kubectl create secret generic veza-jwt-secret \
  --from-literal=jwt-secret="$NEW_SECRET" \
  --dry-run=client -o yaml | kubectl apply -f - -n veza-production
# 3. Rolling restart so pods pick up the new secret
kubectl rollout restart deployment/veza-backend -n veza-production
kubectl rollout restart deployment/veza-chat-server -n veza-production
# 4. Wait for the rollouts to complete
kubectl rollout status deployment/veza-backend -n veza-production --timeout=5m
kubectl rollout status deployment/veza-chat-server -n veza-production --timeout=5m
echo "JWT secret rotation complete. Old tokens will expire naturally."
```
**Critical points**:
- During rotation, old tokens remain valid until their natural expiry (configure short lifetimes: 15 min access, 7 days refresh)
- Test in staging before every production rotation
- Log the rotation event (without the secret) in the audit log
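To keep sessions issued just before the switch valid, signature verification can accept both the old and the new secret for a short overlap window. A minimal HMAC-SHA256 sketch using only the Go standard library; the function names are illustrative, and a real deployment would apply the same idea inside its JWT library rather than verifying raw signatures by hand:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// sign computes an HMAC-SHA256 signature over the token payload.
func sign(payload string, secret []byte) []byte {
	m := hmac.New(sha256.New, secret)
	m.Write([]byte(payload))
	return m.Sum(nil)
}

// verifyWithRotation accepts signatures made with either the new or the
// old secret, so tokens issued just before rotation stay valid until
// they expire naturally.
func verifyWithRotation(payload string, sig []byte, newSecret, oldSecret []byte) bool {
	if hmac.Equal(sig, sign(payload, newSecret)) {
		return true
	}
	return hmac.Equal(sig, sign(payload, oldSecret))
}

func main() {
	oldS, newS := []byte("old-secret"), []byte("new-secret")
	tok := sign("header.payload", oldS) // token issued before rotation
	fmt.Println(verifyWithRotation("header.payload", tok, newS, oldS))
}
```

The old secret is dropped from the verification set once the longest token lifetime (here, the 7-day refresh TTL) has elapsed.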
### 13.2 JWT Issuer/Audience Alignment Go ↔ Rust
The Go backend and the Rust chat-server must use the same JWT claims (`iss`, `aud`) to avoid cross-service token rejections.
**Aligned configuration**:
```yaml
# Shared configuration - identical for Go and Rust
jwt:
  issuer: "https://api.veza.app"
  audience: "https://veza.app"
  algorithm: "HS256"  # Migrate to RS256 - see section 15
  access_token_ttl: "15m"
  refresh_token_ttl: "7d"
```
**Go check**:
```go
// veza-backend-api/internal/auth/jwt.go
claims := jwt.MapClaims{
    "iss": "https://api.veza.app",
    "aud": "https://veza.app",
    "sub": userID,
    "exp": time.Now().Add(15 * time.Minute).Unix(),
    "iat": time.Now().Unix(),
}
```
**Rust check**:
```rust
// veza-chat-server/src/auth.rs
// `mut` is required: set_issuer/set_audience mutate the Validation struct.
let mut validation = Validation::new(Algorithm::HS256);
validation.set_issuer(&["https://api.veza.app"]);
validation.set_audience(&["https://veza.app"]);
```
**Deployment procedure**:
1. Update the Rust config to accept the same `iss`/`aud`
2. Deploy the Rust chat-server first (it accepts both existing and new tokens)
3. Update the Go config
4. Deploy the Go backend
5. Check the logs of both services: zero `invalid issuer` or `invalid audience` errors
### 13.3 Protecting the /metrics Route
The `/metrics` route (Prometheus) exposes internal metrics and must **never** be publicly accessible.
**Nginx/Ingress - block external access**:
```yaml
# k8s/ingress.yaml - added annotation
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: veza-ingress
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      location /metrics {
        deny all;
        return 404;
      }
```
**NetworkPolicy - restrict to the monitoring namespace**:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metrics-from-prometheus
  namespace: veza-production
spec:
  podSelector:
    matchLabels:
      app: veza-backend
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 8080
          protocol: TCP
```
**Post-deployment check**:
```bash
# Must return 404 or connection refused from the outside
curl -s -o /dev/null -w "%{http_code}" https://api.veza.app/metrics
# Expected: 404
# Must work from the Prometheus pod
kubectl exec -n monitoring deploy/prometheus -- curl -s http://veza-backend.veza-production/metrics | head -5
# Expected: 200 with metrics
```
## 14. ETHICAL DEPLOYMENT CHECKLIST
Before every production deployment, the following points must be verified. This checklist complements the standard operational checklist (section 12.1).
### 14.1 Personal Data Protection
- [ ] **Prometheus metrics**: verify that no metric exposes personal data (emails, IPs, full user agents, user identifiers)
```bash
# Automated check: no metric may contain these patterns
kubectl exec -n monitoring deploy/prometheus -- \
  curl -s http://veza-backend.veza-production/metrics | \
  grep -iE '(email|user_agent|ip_address|@)' && \
  echo "FAIL: Personal data found in metrics" && exit 1 || \
  echo "PASS: No personal data in metrics"
```
- [ ] **Anonymized logs**: confirm that application logs contain no personal data in clear text
```bash
# Check the last 1000 lines of logs
kubectl logs deployment/veza-backend -n veza-production --tail=1000 | \
  grep -iE '([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})' && \
  echo "FAIL: Email addresses found in logs" && exit 1 || \
  echo "PASS: No emails in logs"
```
- [ ] **HTTP headers**: no tracking or fingerprinting header is emitted by the services
### 14.2 GDPR Compliance
- [ ] **Data export**: the `POST /api/v1/me/data-export` endpoint returns all data belonging to the authenticated user (profile, tracks, playlists, history)
- [ ] **Account deletion**: the `DELETE /api/v1/me` endpoint deletes the user and all associated data (cascade)
- [ ] **Cookie consent**: only strictly necessary cookies (JWT session) are sent without consent
### 14.3 Ethical Integrity
- [ ] **No AI/ML dependency**: no Docker image contains an ML framework (TensorFlow, PyTorch, scikit-learn, ONNX)
- [ ] **No third-party tracking**: no Google Analytics, Facebook Pixel, or equivalent script in the frontend bundle
- [ ] **Discovery algorithm**: the bias tests (section 14 of ORIGIN_TESTING_STRATEGY) pass in CI
## 15. JWT HS256 → RS256 MIGRATION PLAN
### 15.1 Motivation
HS256 (symmetric HMAC) requires every service to share the same secret. RS256 (asymmetric RSA) lets the Go backend sign with a private key while the other services (Rust, frontend) verify with the public key only, shrinking the attack surface.
### 15.2 Target Architecture
```
┌──────────────────┐        ┌──────────────────┐
│    Backend Go    │        │   Chat Server    │
│   (signs JWT)    │        │      Rust        │
│                  │        │  (verifies JWT)  │
│ RS256 PRIVATE key│        │ RS256 PUBLIC key │
└────────┬─────────┘        └────────┬─────────┘
         │                           │
         │      same public key      │
         └─────────────┬─────────────┘
                ┌──────▼──────┐
                │  Frontend   │
                │(verifies JWT│
                │  optional)  │
                │ PUBLIC key  │
                └─────────────┘
```
### 15.3 Migration Steps
**Phase 1 - Preparation (week 1)**:
```bash
# Generate a 4096-bit RSA key pair
openssl genrsa -out jwt-private.pem 4096
openssl rsa -in jwt-private.pem -pubout -out jwt-public.pem
# Store it in Vault
vault kv put secret/veza/production \
  jwt_private_key=@jwt-private.pem \
  jwt_public_key=@jwt-public.pem
# Clean up the local files
shred -u jwt-private.pem jwt-public.pem
```
**Phase 2 - Dual validation (week 2)**:
Modify the services so they accept **both algorithms** during the transition:
```go
// veza-backend-api - signs with RS256, validates both HS256 and RS256
func (a *Auth) ValidateToken(tokenString string) (*Claims, error) {
	token, err := jwt.Parse(tokenString, func(t *jwt.Token) (interface{}, error) {
		switch t.Method.(type) {
		case *jwt.SigningMethodRSA:
			return a.rsaPublicKey, nil
		case *jwt.SigningMethodHMAC:
			return []byte(a.hmacSecret), nil
		default:
			return nil, fmt.Errorf("unexpected signing method: %v", t.Header["alg"])
		}
	})
	// ...
}
```
```rust
// veza-chat-server - validates both HS256 and RS256
fn validate_token(token: &str, config: &AuthConfig) -> Result<Claims, AuthError> {
    // Try RS256 first, fall back to HS256
    let rs256_result = decode::<Claims>(
        token,
        &DecodingKey::from_rsa_pem(config.rsa_public_key.as_bytes())?,
        &Validation::new(Algorithm::RS256),
    );
    match rs256_result {
        Ok(data) => Ok(data.claims),
        Err(_) => {
            let hs256_result = decode::<Claims>(
                token,
                &DecodingKey::from_secret(config.hmac_secret.as_bytes()),
                &Validation::new(Algorithm::HS256),
            );
            hs256_result.map(|d| d.claims).map_err(AuthError::from)
        }
    }
}
```
**Phase 3 - Cutover (week 3)**:
1. Deploy the Go backend so that it signs exclusively with RS256
2. Wait for all HS256 tokens to expire (at most 7 days for refresh tokens)
3. Remove the HS256 fallback code from the Go and Rust services
4. Remove the HS256 secret from Vault
**Phase 4 - Cleanup (week 4)**:
```bash
# Remove the old HS256 secret from Vault
vault kv delete secret/veza/production/jwt_secret_hmac
# Verify that no service still uses HS256
kubectl logs -l app=veza-backend -n veza-production --since=24h | \
  grep -i "hs256" && echo "WARNING: HS256 still in use" || echo "CLEAN: No HS256 usage"
```
### 15.4 Rollback Plan
If problems are detected during the migration:
1. Switch the Go backend back to HS256 signing
2. The services keep accepting both formats
3. Investigate and fix before retrying
### 15.5 Success Criteria
- [ ] All services accept RS256
- [ ] No HS256 tokens in circulation (after natural expiry)
- [ ] The RSA private key is accessible only to the Go backend
- [ ] Cross-service integration tests pass with RS256
- [ ] Performance: RS256 verification < 1 ms (benchmark it)
## ✅ VALIDATION CHECKLIST
### Infrastructure
- [ ] Infrastructure as Code (Terraform) complete
- [ ] All resources versioned in Git
- [ ] Secrets in Vault (no plaintext)
- [ ] Automated provisioning tested
### Deployment
- [ ] CI/CD pipeline functional
- [ ] Zero-downtime deployment strategy (blue-green or canary)
- [ ] Automated rollback configured
- [ ] Health checks implemented
### Monitoring
- [ ] Prometheus + Grafana dashboards
- [ ] Alerting configured (PagerDuty/Slack)
- [ ] Logging centralized (ELK Stack)
- [ ] Tracing implemented (Jaeger)
### Disaster Recovery
- [ ] Automated backups (daily + hourly)
- [ ] Backup restoration tested
- [ ] Failover procedure documented
- [ ] RTO < 4h, RPO < 1h validated
## 📊 SUCCESS METRICS
### Deployment Metrics
- **Deployment Frequency**: Multiple per day
- **Lead Time**: < 1 hour (commit to production)
- **MTTR (Mean Time To Recovery)**: < 5 minutes
- **Change Failure Rate**: < 5%
### Operational Metrics
- **Uptime**: > 99.9%
- **RTO**: < 4 hours
- **RPO**: < 1 hour
- **Deployment Success Rate**: > 95%
## 🔄 VERSION HISTORY
| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-11-02 | Initial version - complete deployment guide |
| 2.0.0 | 2026-03-04 | Security audit: added priority fixes (JWT rotation, Go↔Rust issuer/audience alignment, /metrics protection), ethical deployment checklist (personal data, GDPR, integrity), and the JWT HS256→RS256 migration plan. |
---
## ⚠️ WARNING
**THIS GUIDE IS IMMUTABLE**
---
**Document created by**: DevOps Team + SRE
**Creation date**: 2025-11-02
**Last revision**: 2026-03-04 (security audit)
**Next revision**: Quarterly (2026-06-01)
**Owner**: DevOps Lead
**Status**: ✅ **APPROVED AND LOCKED**