# ORIGIN_DEPLOYMENT_GUIDE.md

## 📋 EXECUTIVE SUMMARY

This document is the complete deployment guide for running the Veza platform in production. It covers Infrastructure as Code (Terraform/Ansible), containerization (Docker/Incus), orchestration (Kubernetes), CI/CD pipelines, zero-downtime strategies, disaster recovery, monitoring, and the operational procedures needed for secure, automated, and reversible deployments over a 24-month horizon.

## 🎯 OBJECTIVES

### Primary Objective
Establish an automated, secure, reproducible, zero-downtime deployment process for production, with rollback in under 5 minutes, multiple deployments per day, and an RTO under 4 hours in the event of a disaster.

### Secondary Objectives
- Full automation (Infrastructure as Code)
- Zero-downtime deployments (blue-green, canary)
- Automatic rollback on failure (< 5 min)
- Operational disaster recovery plan (RTO < 4h, RPO < 1h)
- Real-time monitoring and alerting (Prometheus + Grafana)

## 📖 TABLE OF CONTENTS

1. [Deployment Philosophy](#1-deployment-philosophy)
2. [Infrastructure as Code](#2-infrastructure-as-code)
3. [Containerization](#3-containerization)
4. [Kubernetes Orchestration](#4-kubernetes-orchestration)
5. [CI/CD Pipelines](#5-cicd-pipelines)
6. [Zero-Downtime Strategies](#6-zero-downtime-strategies)
7. [Configuration Management](#7-configuration-management)
8. [Secrets Management](#8-secrets-management)
9. [Monitoring & Observability](#9-monitoring--observability)
10. [Backup & Disaster Recovery](#10-backup--disaster-recovery)
11. [Scaling Strategy](#11-scaling-strategy)
12. [Operational Procedures](#12-operational-procedures)

## 🔒 IMMUTABLE RULES

1. **Infrastructure as Code**: 100% of the infrastructure is versioned (Terraform); no manual changes
2. **Immutable Infrastructure**: Never modify existing servers; always redeploy
3. **Zero Downtime**: No deployment may interrupt service (blue-green or canary is mandatory)
4. **Automated Rollback**: Automatic rollback if health checks fail (< 5 min)
5. **Version Control**: All configuration is versioned in Git, with no exceptions
6. **Secrets in Vault**: No plaintext secrets (HashiCorp Vault or equivalent)
7. **Testing in Staging**: Every deployment is tested in staging first
8. **Monitoring Required**: Alerting is configured before going to production
9. **Backup Verification**: Backups are tested monthly (restore test)
10. **Documentation**: Runbooks are kept up to date for all critical procedures

## 1. DEPLOYMENT PHILOSOPHY

### 1.1 Deployment Principles

**Twelve-Factor App**:
1. **Codebase**: One codebase tracked in Git, many deploys
2. **Dependencies**: Explicitly declare and isolate (go.mod, Cargo.lock, package-lock.json)
3. **Config**: Store config in environment (never in code)
4. **Backing Services**: Treat as attached resources (DB, Redis, S3)
5. **Build, Release, Run**: Strictly separate build and run stages
6. **Processes**: Execute app as stateless processes
7. **Port Binding**: Export services via port binding
8. **Concurrency**: Scale out via process model
9. **Disposability**: Fast startup and graceful shutdown
10. **Dev/Prod Parity**: Keep development, staging, production similar
11. **Logs**: Treat logs as event streams
12. **Admin Processes**: Run admin/management tasks as one-off processes

### 1.2 Deployment Environments

| Environment | Purpose | Update Frequency | Users |
|-------------|---------|------------------|-------|
| **Development** | Local development | Continuous | Developers |
| **Staging** | Pre-production testing | Daily | QA, Product Team |
| **Production** | Live users | Multiple/day | All users |

### 1.3 Deployment Workflow

```
┌─────────────┐
│   Develop   │ ─── git push ───> CI/CD Triggered
└─────────────┘
       │
       ▼
┌─────────────┐
│    Build    │ ─── Tests, Linting, Security Scan
└─────────────┘
       │
       ▼
┌─────────────┐
│   Staging   │ ─── Deploy to staging, E2E tests
└─────────────┘
       │
       ▼
┌─────────────┐
│ Production  │ ─── Blue-Green / Canary deployment
└─────────────┘
       │
       ▼
┌─────────────┐
│   Monitor   │ ─── Health checks, metrics, logs
└─────────────┘
       │
       ▼ (if issues)
┌─────────────┐
│  Rollback   │ ─── Automatic rollback < 5 min
└─────────────┘
```

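The workflow in section 1.3 can be read as a gated sequence: each stage runs only if the previous one succeeded, and a failure once production has been touched triggers a rollback. The Python sketch below is illustrative only. The stage names mirror the diagram, while `run_pipeline`, the stage callables, and the `rollback` callback are hypothetical names introduced for this example, not part of the Veza tooling.

```python
from typing import Callable, List, Tuple

# A stage is a (name, step) pair; step returns True on success (assumed convention).
Stage = Tuple[str, Callable[[], bool]]

# Stages after which a failure requires rolling production back (assumption).
ROLLBACK_STAGES = ("production", "monitor")


def run_pipeline(stages: List[Stage], rollback: Callable[[], None]) -> List[str]:
    """Run stages in order and stop at the first failure.

    If the failing stage is one that has already touched production,
    the rollback callback is invoked before returning.
    """
    log: List[str] = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}:{'ok' if ok else 'failed'}")
        if not ok:
            if name in ROLLBACK_STAGES:
                rollback()
                log.append("rollback:done")
            break
    return log
```

A failed build stops the pipeline without any rollback, whereas a failed monitoring check after the production deploy calls `rollback` exactly once.
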
## 2. INFRASTRUCTURE AS CODE

### 2.1 Terraform Configuration

**Project Structure**:
```
terraform/
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars (encrypted)
│   │   └── outputs.tf
│   └── staging/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── outputs.tf
├── modules/
│   ├── compute/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── database/
│   ├── networking/
│   ├── storage/
│   └── kubernetes/
└── backend.tf (Terraform state in S3)
```

**Example: Compute Module**:
```hcl
# terraform/modules/compute/main.tf
resource "aws_instance" "app_server" {
  count         = var.instance_count
  ami           = var.ami_id
  instance_type = var.instance_type

  vpc_security_group_ids = [aws_security_group.app.id]
  subnet_id              = var.subnet_ids[count.index % length(var.subnet_ids)]

  user_data = templatefile("${path.module}/user_data.sh", {
    environment = var.environment
  })

  tags = {
    Name        = "veza-app-${var.environment}-${count.index + 1}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_security_group" "app" {
  name        = "veza-app-${var.environment}"
  description = "Security group for Veza application servers"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

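The `subnet_id` expression in the compute module, `var.subnet_ids[count.index % length(var.subnet_ids)]`, spreads instances across subnets in round-robin order. The following Python sketch reproduces that index arithmetic so the resulting placement is easy to inspect. The function name `assign_subnets` and the subnet IDs in the usage note are made up for illustration.

```python
from typing import List


def assign_subnets(instance_count: int, subnet_ids: List[str]) -> List[str]:
    """Return the subnet chosen for each instance index.

    Mirrors the Terraform expression
    subnet_ids[count.index % length(subnet_ids)].
    """
    if not subnet_ids:
        raise ValueError("subnet_ids must not be empty")
    return [subnet_ids[index % len(subnet_ids)] for index in range(instance_count)]
```

With five instances and three hypothetical subnets `subnet-a`, `subnet-b`, `subnet-c`, the fourth and fifth instances wrap around to `subnet-a` and `subnet-b`, so the spread is even only when the instance count is a multiple of the subnet count.
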
**Database Module**:
```hcl
# terraform/modules/database/main.tf
resource "aws_db_instance" "postgres" {
  identifier     = "veza-db-${var.environment}"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = var.instance_class

  allocated_storage     = var.allocated_storage
  max_allocated_storage = var.max_allocated_storage
  storage_encrypted     = true
  kms_key_id            = var.kms_key_id

  db_name  = var.database_name
  username = var.master_username
  password = var.master_password # From Vault

  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.database.name

  backup_retention_period = var.backup_retention_days
  backup_window           = "03:00-04:00"
  maintenance_window      = "mon:04:00-mon:05:00"

  multi_az                  = var.multi_az
  publicly_accessible       = false
  skip_final_snapshot       = false
  final_snapshot_identifier = "veza-db-${var.environment}-final-snapshot-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  tags = {
    Name        = "veza-db-${var.environment}"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }

  # timestamp() yields a new value on every plan; ignore that diff
  # so the instance is not reported as changed on each run
  lifecycle {
    ignore_changes = [final_snapshot_identifier]
  }
}
```

**Terraform Workflow**:
```bash
# Initialize
cd terraform/environments/production
terraform init

# Plan (review changes)
terraform plan -out=tfplan

# Apply (execute the reviewed plan)
terraform apply tfplan

# Destroy (cleanup; removes every resource in this environment, so use with care)
terraform destroy
```

### 2.2 Ansible Configuration

**Playbook Structure**:
```
ansible/
├── inventory/
│   ├── production/
│   │   ├── hosts.yml
│   │   └── group_vars/
│   └── staging/
│       ├── hosts.yml
│       └── group_vars/
├── playbooks/
│   ├── deploy-backend.yml
│   ├── deploy-chat-server.yml
│   ├── deploy-stream-server.yml
│   └── deploy-frontend.yml
├── roles/
│   ├── common/
│   ├── docker/
│   ├── nginx/
│   ├── postgres/
│   └── monitoring/
└── ansible.cfg
```

**Deployment Playbook**:
```yaml
# ansible/playbooks/deploy-backend.yml
---
- name: Deploy Veza Backend API
  hosts: backend_servers
  become: yes

  vars:
    app_name: veza-backend-api
    # lookup('env', ...) returns an empty string when VERSION is unset,
    # so default() needs its second argument to treat "" as missing
    app_version: "{{ lookup('env', 'VERSION') | default('latest', true) }}"
    docker_image: "registry.veza.app/{{ app_name }}:{{ app_version }}"

  tasks:
    - name: Pull Docker image
      docker_image:
        name: "{{ docker_image }}"
        source: pull

    - name: Stop old container
      docker_container:
        name: "{{ app_name }}"
        state: stopped
      ignore_errors: yes

    - name: Remove old container
      docker_container:
        name: "{{ app_name }}"
        state: absent
      ignore_errors: yes

    - name: Start new container
      docker_container:
        name: "{{ app_name }}"
        image: "{{ docker_image }}"
        state: started
        restart_policy: unless-stopped
        ports:
          - "8080:8080"
        env:
          DATABASE_URL: "{{ database_url }}"
          REDIS_URL: "{{ redis_url }}"
          JWT_SECRET: "{{ jwt_secret }}"
        volumes:
          - "/var/log/{{ app_name }}:/var/log/app"
        healthcheck:
          # The Alpine runtime image ships BusyBox wget but not curl
          test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s

    - name: Wait for application to be healthy
      uri:
        url: http://localhost:8080/health
        status_code: 200
      register: result
      until: result.status == 200
      retries: 10
      delay: 5

    - name: Verify deployment
      debug:
        msg: "{{ app_name }} version {{ app_version }} deployed successfully"
```

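The "Wait for application to be healthy" task polls the health endpoint until it returns 200, up to a fixed number of attempts with a delay between them. The Python sketch below shows roughly the same loop. It is an approximation under stated assumptions: `wait_until_healthy` and the injectable `probe` and `sleep` callables are hypothetical, and Ansible's exact attempt counting for `retries` may differ by one from this version.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], int],
    retries: int = 10,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns HTTP 200 or the attempts run out.

    probe is any callable returning a status code; sleep is injectable
    so the loop can be exercised without real waiting.
    """
    for attempt in range(1, retries + 1):
        if probe() == 200:
            return True
        if attempt < retries:
            sleep(delay)  # wait before the next attempt
    return False
```

With the playbook's values (10 attempts, 5 seconds apart) the loop gives the container a little under a minute to become healthy before the deployment is declared failed.
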
## 3. CONTAINERIZATION

### 3.1 Docker Images

**Multi-Stage Build (Go)**:
```dockerfile
# veza-backend-api/Dockerfile
# Stage 1: Builder
FROM golang:1.21.5-alpine3.18 AS builder

WORKDIR /app

# Copy dependencies
COPY go.mod go.sum ./
RUN go mod download

# Copy source
COPY . .

# Build binary
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -ldflags="-w -s" -o main ./cmd/api

# Stage 2: Runner
FROM alpine:3.18

# Install CA certificates for HTTPS
RUN apk --no-cache add ca-certificates

# Use a directory the non-root user can read (/root is not accessible to appuser)
WORKDIR /app

# Copy binary from builder
COPY --from=builder /app/main .

# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser

USER appuser

# Expose port
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD ["/app/main", "healthcheck"]

# Run
ENTRYPOINT ["./main"]
```

**Multi-Stage Build (Rust)**:
```dockerfile
# veza-chat-server/Dockerfile
FROM rust:1.75-alpine AS builder

WORKDIR /app

RUN apk add --no-cache musl-dev

# Copy dependencies and build them against a placeholder main.rs to cache the layer
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs && cargo build --release && rm -rf src

# Copy source
COPY . .

# Build binary (touch main.rs so Cargo does not reuse the placeholder build)
RUN touch src/main.rs && cargo build --release

# Stage 2: Runner
FROM alpine:3.18

WORKDIR /app

# Copy binary
COPY --from=builder /app/target/release/veza-chat-server .

# Create non-root user
RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser

USER appuser

EXPOSE 8081

HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
    CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost:8081/health"]

ENTRYPOINT ["./veza-chat-server"]
```

**Frontend (React/Vite)**:
```dockerfile
# apps/web/Dockerfile
FROM node:20-alpine AS builder

WORKDIR /app

COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build

# Stage 2: Nginx
FROM nginx:1.25-alpine

COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf

EXPOSE 80

HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD ["wget", "--quiet", "--tries=1", "--spider", "http://localhost/health"]

CMD ["nginx", "-g", "daemon off;"]
```

### 3.2 Docker Compose (Development)

```yaml
# docker-compose.yml
version: '3.9'

services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: veza_db
      POSTGRES_USER: veza
      POSTGRES_PASSWORD: ${DB_PASSWORD:-password}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U veza"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

  backend:
    build:
      context: ./veza-backend-api
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
      JWT_SECRET: ${JWT_SECRET}
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      # The Alpine runtime image ships BusyBox wget but not curl
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  chat-server:
    build:
      context: ./veza-chat-server
      dockerfile: Dockerfile
    ports:
      - "8081:8081"
    environment:
      DATABASE_URL: postgresql://veza:${DB_PASSWORD:-password}@postgres:5432/veza_db
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  frontend:
    build:
      context: ./apps/web
      dockerfile: Dockerfile
    ports:
      - "3000:80"
    depends_on:
      - backend

volumes:
  postgres_data:
  redis_data:
```

## 4. KUBERNETES ORCHESTRATION

### 4.1 Kubernetes Manifests

**Deployment (Backend)**:
```yaml
# k8s/backend/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: veza-backend
  namespace: veza-production
  labels:
    app: veza-backend
    version: v1.0.0
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: veza-backend
  template:
    metadata:
      labels:
        app: veza-backend
        version: v1.0.0
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: backend
          image: registry.veza.app/veza-backend-api:v1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: database-url
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: redis-url
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: veza-secrets
                  key: jwt-secret
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
      imagePullSecrets:
        - name: registry-credentials
```

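The `RollingUpdate` settings in the Deployment (`maxSurge: 1`, `maxUnavailable: 0`, `replicas: 3`) bound how many pods may exist and how many must stay available while a rollout is in progress. The Python sketch below computes those bounds. It is a simplified model: the helper names are hypothetical, and it only captures the documented rounding of percentage values (surge rounds up, unavailable rounds down), not the controller's scheduling behavior.

```python
import math
from typing import Tuple, Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a 'NN%' string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        fraction = int(value[:-1]) * replicas / 100
        return math.ceil(fraction) if round_up else math.floor(fraction)
    return int(value)


def rolling_update_bounds(
    replicas: int, max_surge: IntOrPercent = 1, max_unavailable: IntOrPercent = 0
) -> Tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rollout."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge
```

For the manifest above this gives at least 3 available pods and at most 4 pods in total, which is why the rollout replaces one pod at a time without reducing serving capacity.
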
**Service**:
```yaml
# k8s/backend/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  type: ClusterIP
  selector:
    app: veza-backend
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
```

**Ingress**:
```yaml
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: veza-ingress
  namespace: veza-production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # ingress-nginx rate limiting: requests per second accepted from one client IP
    nginx.ingress.kubernetes.io/limit-rps: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.veza.app
        - veza.app
      secretName: veza-tls
  rules:
    - host: api.veza.app
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: veza-backend
                port:
                  number: 80
    - host: veza.app
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: veza-frontend
                port:
                  number: 80
```

**HorizontalPodAutoscaler**:
```yaml
# k8s/backend/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: veza-backend-hpa
  namespace: veza-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
```

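The HorizontalPodAutoscaler derives its target from the ratio between observed and desired utilization: `desired = ceil(current * observed / target)`, clamped to `minReplicas` and `maxReplicas`. The Python sketch below applies that formula to the CPU target of 70% used in the manifest. It is a simplified model: `desired_replicas` is a hypothetical name, the 10% tolerance is the Kubernetes default and may be configured differently, and the `behavior` policies and stabilization windows are deliberately ignored.

```python
import math


def desired_replicas(
    current: int,
    utilization: float,
    target: float,
    min_replicas: int = 3,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Replica count suggested by one metric, before scaling policies apply."""
    ratio = utilization / target
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

With several metrics, as in the manifest (CPU and memory), the autoscaler evaluates each one and uses the largest proposal, so a caller of this sketch would take `max()` over the per-metric results.
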
## 5. CI/CD PIPELINES

### 5.1 GitHub Actions Workflow

```yaml
# .github/workflows/deploy-production.yml
name: Deploy to Production

on:
  push:
    branches:
      - main
    tags:
      - 'v*'

env:
  REGISTRY: registry.veza.app
  KUBE_NAMESPACE: veza-production

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run tests
        run: |
          make test-all

      - name: Security scan
        run: |
          make security-scan

  build-backend:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Login to Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_PASSWORD }}

      # The last tag (full commit SHA) is the one the deploy jobs reference
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/veza-backend-api
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix={{branch}}-
            type=raw,value=${{ github.sha }}

      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: ./veza-backend-api
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/veza-backend-api:buildcache,mode=max

  deploy-staging:
    needs: [build-backend]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v3

      # Assumes kubectl is already configured for the staging cluster on this runner
      - name: Deploy to Staging
        run: |
          kubectl set image deployment/veza-backend \
            backend=${{ env.REGISTRY }}/veza-backend-api:${{ github.sha }} \
            -n veza-staging
          kubectl rollout status deployment/veza-backend -n veza-staging --timeout=5m

      - name: Run E2E tests
        run: |
          npm run test:e2e -- --env=staging

  deploy-production:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v3

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'

      - name: Configure kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
          # `export` only lasts for this step; GITHUB_ENV persists to later steps
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"

      - name: Deploy to Production (Blue-Green)
        run: |
          # Deploy green environment
          kubectl apply -f k8s/backend/deployment-green.yaml
          kubectl rollout status deployment/veza-backend-green -n ${{ env.KUBE_NAMESPACE }} --timeout=10m

          # Run smoke tests
          make smoke-tests ENDPOINT=https://green.api.veza.app

          # Switch traffic to green
          kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
            -p '{"spec":{"selector":{"version":"green"}}}'

          # Wait for validation
          sleep 60

          # Monitor metrics
          if ! make verify-deployment; then
            echo "Deployment verification failed, rolling back..."
            kubectl patch service veza-backend -n ${{ env.KUBE_NAMESPACE }} \
              -p '{"spec":{"selector":{"version":"blue"}}}'
            exit 1
          fi

          # Delete old blue deployment
          kubectl delete deployment veza-backend-blue -n ${{ env.KUBE_NAMESPACE }}

      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Production deployment ${{ job.status }}: ${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```

## 6. ZERO-DOWNTIME STRATEGIES

### 6.1 Blue-Green Deployment

**Process**:
1. **Blue** (current production) serves all traffic
2. Deploy **Green** (new version) in parallel
3. Test Green thoroughly (smoke tests, health checks)
4. Switch load balancer from Blue to Green (instant cutover)
5. Monitor Green for issues (5-10 min)
6. If issues: Rollback to Blue (instant)
7. If stable: Decommission Blue

**Kubernetes Implementation**:
```bash
# Deploy green
kubectl apply -f k8s/backend/deployment-green.yaml

# Wait for readiness
kubectl wait --for=condition=available --timeout=10m \
  deployment/veza-backend-green -n veza-production

# Switch service selector
kubectl patch service veza-backend -n veza-production \
  -p '{"spec":{"selector":{"version":"green"}}}'

# Monitor
watch kubectl get pods -l app=veza-backend -n veza-production

# Rollback if needed
kubectl patch service veza-backend -n veza-production \
  -p '{"spec":{"selector":{"version":"blue"}}}'
```

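The blue-green cutover comes down to flipping one label in the Service selector and flipping it back if verification fails. The Python sketch below models that decision on a plain dictionary standing in for the selector. `blue_green_switch` and the `verify` callable are hypothetical; a real implementation would call the Kubernetes API or `kubectl patch` as shown in the commands for this section.

```python
from typing import Callable, Dict


def blue_green_switch(selector: Dict[str, str], verify: Callable[[], bool]) -> str:
    """Point the selector at green, verify, and fall back to blue on failure.

    Only the 'version' key is changed; other selector labels such as
    'app' are left untouched, as with a merge patch.
    Returns the color that ends up serving traffic.
    """
    selector["version"] = "green"
    if verify():
        return "green"
    selector["version"] = "blue"  # instant rollback: same Service, old pods
    return "blue"
```

Because both deployments keep running during verification, the rollback path is just a second selector change, which is what keeps it well inside the 5-minute rollback objective.
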
### 6.2 Canary Deployment

**Process**:
1. Deploy new version (canary) with 5% traffic
2. Monitor metrics (error rate, latency)
3. Gradually increase traffic: 5% → 25% → 50% → 100%
4. At each stage, verify metrics are healthy
5. If issues detected: Rollback immediately

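The canary process above is a loop over traffic weights with a health gate at each step. The Python sketch below walks the 5% → 25% → 50% → 100% schedule and drops the canary to 0% as soon as a check fails. The names `run_canary` and `healthy_at` are hypothetical, and the health check is an injected callable rather than a real metrics query.

```python
from typing import Callable, List, Sequence


def run_canary(
    healthy_at: Callable[[int], bool],
    steps: Sequence[int] = (5, 25, 50, 100),
) -> List[int]:
    """Return the sequence of canary traffic weights that were applied.

    healthy_at(weight) reports whether metrics look healthy at that weight.
    A trailing 0 means the canary was rolled back.
    """
    applied: List[int] = []
    for weight in steps:
        applied.append(weight)      # shift this share of traffic to the canary
        if not healthy_at(weight):  # error rate or latency out of bounds
            applied.append(0)       # rollback: all traffic back to stable
            break
    return applied
```

A run that stays healthy ends at 100, while a failure at 50% yields a final weight of 0, so at most half of the users ever saw the faulty version.
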
**Kubernetes with Istio**:
```yaml
# k8s/canary/virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: veza-backend
spec:
  hosts:
    - veza-backend
  http:
    - match:
        - headers:
            canary:
              exact: "true"
      route:
        - destination:
            host: veza-backend
            subset: canary
    - route:
        - destination:
            host: veza-backend
            subset: stable
          weight: 95
        - destination:
            host: veza-backend
            subset: canary
          weight: 5
```

**Automated Canary with Flagger**:
```yaml
# k8s/canary/flagger-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: veza-backend
  namespace: veza-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: veza-backend
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -s http://veza-backend-canary/health | grep -q ok"
```

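The `analysis` block of the Flagger resource encodes a small state machine: every `interval` the metrics are checked, a passing check raises the canary weight by `stepWeight` up to `maxWeight`, and `threshold` failed checks abort the rollout. The Python sketch below is a rough model of that loop, not Flagger's actual implementation. The function names are hypothetical, and counting failures cumulatively across the whole analysis is an assumption of this example.

```python
from typing import Iterable, List


def flagger_weights(step_weight: int = 10, max_weight: int = 50) -> List[int]:
    """Traffic weights the canary passes through when every check succeeds."""
    return list(range(step_weight, max_weight + 1, step_weight))


def analyse(
    checks: Iterable[bool],
    step_weight: int = 10,
    max_weight: int = 50,
    threshold: int = 5,
) -> str:
    """Fold a series of per-interval check results into a rollout outcome."""
    weight = 0
    failures = 0
    for passed in checks:
        if passed:
            weight += step_weight
            if weight >= max_weight:
                return "promoted"
        else:
            failures += 1  # assumption: failures accumulate, they are not reset
            if failures >= threshold:
                return "rolled_back"
    return "in_progress"
```

With the values from the manifest (`stepWeight: 10`, `maxWeight: 50`, `interval: 1m`), a fully healthy rollout passes through five weights, so the analysis phase takes on the order of five minutes before promotion.
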
## 7. CONFIGURATION MANAGEMENT

### 7.1 ConfigMap (Non-Sensitive Config)

```yaml
# k8s/backend/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: veza-backend-config
  namespace: veza-production
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  API_RATE_LIMIT: "300"
  MAX_UPLOAD_SIZE: "500MB"
  CORS_ORIGINS: "https://veza.app,https://www.veza.app"
```

### 7.2 Secrets (Sensitive Data)

```yaml
# k8s/backend/secret.yaml (encrypted with SOPS or sealed-secrets)
apiVersion: v1
kind: Secret
metadata:
  name: veza-secrets
  namespace: veza-production
type: Opaque
data:
  database-url: <base64-encoded>
  redis-url: <base64-encoded>
  jwt-secret: <base64-encoded>
  stripe-api-key: <base64-encoded>
```

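The `<base64-encoded>` placeholders in the Secret manifest are plain Base64, which is an encoding and not encryption; this is why the file itself still has to be protected with SOPS or sealed-secrets. The Python sketch below shows the round trip for the `data` field. The helper names and the sample values in the usage note are made up for illustration.

```python
import base64
from typing import Dict


def encode_secret_data(values: Dict[str, str]) -> Dict[str, str]:
    """Base64-encode plain values the way a Secret's `data` field expects."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


def decode_secret_data(data: Dict[str, str]) -> Dict[str, str]:
    """Reverse of encode_secret_data; anyone who can read the manifest can do this."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }
```

Encoding a dummy value such as `s3cr3t` and decoding the result returns the original string unchanged, which demonstrates that the encoded form offers no confidentiality on its own.
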
**Create Secret from Vault**:
```bash
# Fetch from Vault and create the K8s secret.
# kubectl base64-encodes literal values itself, so no manual encoding is needed.
kubectl create secret generic veza-secrets \
  --from-literal=database-url="$(vault kv get -field=database_url secret/veza/production)" \
  -n veza-production
```

## 8. SECRETS MANAGEMENT

### 8.1 HashiCorp Vault

**Vault Structure**:
```
secret/
├── veza/
│   ├── production/
│   │   ├── database_url
│   │   ├── redis_url
│   │   ├── jwt_secret
│   │   ├── stripe_api_key
│   │   ├── aws_access_key
│   │   └── aws_secret_key
│   └── staging/
│       └── ...
```

**Store Secret**:
```bash
# Write secret
vault kv put secret/veza/production \
  database_url="postgresql://..." \
  redis_url="redis://..." \
  jwt_secret="..."

# Read secret
vault kv get secret/veza/production

# Rotate one key (creates a new version).
# `kv patch` keeps the other keys; `kv put` would replace the whole secret.
vault kv patch secret/veza/production jwt_secret="new-secret"
```

**Vault Agent Injector (Kubernetes)**:
|
||
|
|
```yaml
|
||
|
|
apiVersion: v1
|
||
|
|
kind: Pod
|
||
|
|
metadata:
|
||
|
|
annotations:
|
||
|
|
vault.hashicorp.com/agent-inject: "true"
|
||
|
|
vault.hashicorp.com/role: "veza-backend"
|
||
|
|
vault.hashicorp.com/agent-inject-secret-database: "secret/data/veza/production"
|
||
|
|
vault.hashicorp.com/agent-inject-template-database: |
|
||
|
|
{{- with secret "secret/data/veza/production" -}}
|
||
|
|
export DATABASE_URL="{{ .Data.data.database_url }}"
|
||
|
|
{{- end }}
|
||
|
|
```

## 9. MONITORING & OBSERVABILITY

### 9.1 Prometheus + Grafana

**Prometheus Configuration**:

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'veza-backend'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: veza-backend
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: $1:8080

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
```

**Grafana Dashboard**:
- **API Latency**: p50, p95, p99 response times
- **Throughput**: Requests per second
- **Error Rate**: 4xx, 5xx errors
- **Database**: Query time, connections, slow queries
- **Cache Hit Rate**: Redis hit/miss ratio
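
To make the latency panels concrete: p95 is the value below which 95% of observed response times fall. In Grafana these figures would normally come from Prometheus histogram queries; the sketch below only illustrates the definition, using the nearest-rank method on raw samples. The `percentile` helper and the sample latencies are illustrative assumptions, not part of the Veza codebase.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (1-100) of samples using the
// nearest-rank method. It reports false when samples is empty or p is out of range.
func percentile(samples []float64, p int) (float64, bool) {
	if len(samples) == 0 || p < 1 || p > 100 {
		return 0, false
	}
	sorted := append([]float64(nil), samples...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n) in integer arithmetic
	return sorted[rank-1], true
}

func main() {
	// Made-up response times in milliseconds, including two slow outliers.
	latencies := []float64{12, 15, 11, 240, 18, 14, 13, 16, 950, 17}
	for _, p := range []int{50, 95, 99} {
		if v, ok := percentile(latencies, p); ok {
			fmt.Printf("p%d = %.0f ms\n", p, v)
		}
	}
}
```

The outliers barely move p50 but dominate p95 and p99, which is why the dashboard tracks all three.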

### 9.2 Logging (ELK Stack)

**Filebeat Configuration**:

```yaml
# filebeat/filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/lib/docker/containers/"

# Filebeat requires these two settings whenever the default index name is overridden
setup.template.name: "veza-logs"
setup.template.pattern: "veza-logs-*"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "veza-logs-%{+yyyy.MM.dd}"
```

### 9.3 Tracing (Jaeger)

**OpenTelemetry Integration**:

```go
// Go - OpenTelemetry setup
// Note: the upstream Jaeger exporter is deprecated; recent SDK releases
// export OTLP to Jaeger instead.
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0" // use the version matching your SDK release
)

// initTracer registers a global tracer provider that batches spans
// and sends them to the Jaeger collector.
func initTracer() (*trace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://jaeger:14268/api/traces")))
	if err != nil {
		return nil, err
	}

	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("veza-backend-api"),
		)),
	)

	otel.SetTracerProvider(tp)
	return tp, nil
}
```

## 10. BACKUP & DISASTER RECOVERY

### 10.1 Database Backups

**Automated Backup Strategy**:
- **Daily**: Full backup (3 AM UTC)
- **Hourly**: Incremental backup
- **Retention**: 30 days daily, 12 weeks weekly, 2 years monthly

**Backup Script**:

```bash
#!/bin/bash
# scripts/backup-database.sh
set -euo pipefail  # stop at the first failed step so a broken dump is never uploaded

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/postgres"
DATABASE="veza_db"
DUMP_FILE="$BACKUP_DIR/veza_db_$DATE.dump"

# Full backup
pg_dump -Fc -f "$DUMP_FILE" "$DATABASE"

# Encrypt (writes $DUMP_FILE.gpg), then remove the plaintext dump
gpg --encrypt --recipient backup@veza.app "$DUMP_FILE"
rm -f "$DUMP_FILE"

# Upload to S3
aws s3 cp "$DUMP_FILE.gpg" s3://veza-backups/postgres/

# Cleanup local backups > 7 days
find "$BACKUP_DIR" -name "*.dump.gpg" -mtime +7 -delete
```

**Restore Procedure**:

```bash
#!/bin/bash
# scripts/restore-database.sh
# Usage: restore-database.sh <backup-file>
set -euo pipefail

# Fail early with a usage message if no backup file is given
BACKUP_FILE="${1:?usage: restore-database.sh <backup-file>}"

# Download from S3
aws s3 cp "s3://veza-backups/postgres/$BACKUP_FILE" /tmp/

# Decrypt
gpg --decrypt "/tmp/$BACKUP_FILE" > "/tmp/backup.dump"

# Restore
pg_restore -d veza_db "/tmp/backup.dump"
```

### 10.2 Disaster Recovery Plan

**RTO (Recovery Time Objective)**: < 4 hours
**RPO (Recovery Point Objective)**: < 1 hour

**Recovery Procedures**:
1. **Database Failure**: Failover to standby replica (< 5 min)
2. **Application Failure**: Rollback deployment (< 5 min)
3. **Complete Region Failure**: Failover to DR region (< 4 hours)

## 11. SCALING STRATEGY

### 11.1 Horizontal Scaling

**Auto-Scaling Rules**:
- **CPU > 70%**: Scale up
- **CPU < 30%**: Scale down (after 5 min stability)
- **Memory > 80%**: Scale up
- **Request queue > 100**: Scale up

### 11.2 Database Scaling

**Read Replicas**:
- 2 read replicas minimum
- Route read queries to replicas
- Write queries to primary only

**Connection Pooling** (PgBouncer):

```ini
[databases]
veza_db = host=postgres port=5432 dbname=veza_db

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
```
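
The point of these numbers is the ratio between client and server connections. In PgBouncer a pool exists per database/user pair, and each pool may open `default_pool_size` connections plus `reserve_pool_size` extra ones under load. A minimal worked calculation, assuming a single database/user pair (the pool count is an assumption; the guide does not state how many users connect):

```go
package main

import "fmt"

// maxServerConns returns the upper bound on PostgreSQL connections PgBouncer
// may open: each pool can use its regular size plus the reserve.
func maxServerConns(pools, defaultPoolSize, reservePoolSize int) int {
	return pools * (defaultPoolSize + reservePoolSize)
}

func main() {
	const maxClientConn = 1000 // from the configuration above
	conns := maxServerConns(1, 25, 5)
	fmt.Printf("up to %d clients share at most %d server connections\n", maxClientConn, conns)
}
```

The resulting bound should stay comfortably below the `max_connections` setting of the PostgreSQL server, and it grows linearly with every additional database/user pair.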

## 12. OPERATIONAL PROCEDURES

### 12.1 Deployment Checklist

**Pre-Deployment**:
- [ ] All tests pass (unit, integration, E2E)
- [ ] Security scan completed (no critical vulnerabilities)
- [ ] Database migrations tested in staging
- [ ] Rollback plan documented
- [ ] Monitoring dashboards ready
- [ ] On-call engineer notified
- [ ] Deployment window scheduled (low-traffic period)

**During Deployment**:
- [ ] Monitor error rates in real-time
- [ ] Monitor response times (p95, p99)
- [ ] Check logs for errors
- [ ] Verify database migrations applied
- [ ] Test critical user flows

**Post-Deployment**:
- [ ] Verify all services healthy
- [ ] Run smoke tests
- [ ] Monitor for 30 minutes
- [ ] Update deployment log
- [ ] Notify stakeholders

### 12.2 Rollback Procedure

**Immediate Rollback** (< 5 min):

```bash
# Kubernetes
kubectl rollout undo deployment/veza-backend -n veza-production

# Verify
kubectl rollout status deployment/veza-backend -n veza-production

# Check logs
kubectl logs -f deployment/veza-backend -n veza-production
```

### 12.3 Incident Response

**Severity Levels**:
- **P0 (Critical)**: Production down, data breach
- **P1 (High)**: Major feature broken, performance degradation
- **P2 (Medium)**: Minor feature broken
- **P3 (Low)**: Cosmetic issues

**Response Procedure**:
1. Acknowledge incident (< 5 min)
2. Assess severity
3. Notify stakeholders
4. Mitigate (rollback, hotfix, scaling)
5. Root cause analysis
6. Post-mortem

## ✅ VALIDATION CHECKLIST

### Infrastructure
- [ ] Infrastructure as Code (Terraform) complete
- [ ] All resources versioned in Git
- [ ] Secrets in Vault (no plaintext)
- [ ] Automated provisioning tested

### Deployment
- [ ] CI/CD pipeline functional
- [ ] Zero-downtime deployment strategy (blue-green or canary)
- [ ] Automated rollback configured
- [ ] Health checks implemented

### Monitoring
- [ ] Prometheus + Grafana dashboards
- [ ] Alerting configured (PagerDuty/Slack)
- [ ] Logging centralized (ELK Stack)
- [ ] Tracing implemented (Jaeger)

### Disaster Recovery
- [ ] Automated backups (daily + hourly)
- [ ] Backup restoration tested
- [ ] Failover procedure documented
- [ ] RTO < 4h, RPO < 1h validated

## 📊 SUCCESS METRICS

### Deployment Metrics
- **Deployment Frequency**: Multiple per day
- **Lead Time**: < 1 hour (commit to production)
- **MTTR (Mean Time To Recovery)**: < 5 minutes
- **Change Failure Rate**: < 5%

### Operational Metrics
- **Uptime**: > 99.9%
- **RTO**: < 4 hours
- **RPO**: < 1 hour
- **Deployment Success Rate**: > 95%

## 🔄 VERSION HISTORY

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-11-02 | Initial version: complete deployment guide |

---

## ⚠️ WARNING

**THIS GUIDE IS IMMUTABLE**

---

**Document created by**: DevOps Team + SRE
**Created on**: 2025-11-02
**Next review**: Quarterly (2026-02-01)
**Owner**: DevOps Lead

**Status**: ✅ **APPROVED AND LOCKED**