senke/veza

okinrev a01d7a25ac adding initial stream server (Rust)

2025-12-03 20:36:56 +01:00


📖 PRODUCTION GUIDE - VEZA RUST MODULES

Complete guide to deploying and operating the Rust modules in production
Version: 2.0 Production-Ready
Last updated: July 1, 2025

🎯 SYSTEM OVERVIEW

Production Architecture

┌─────────────────────────────────────────┐
│                VEZA PLATFORM            │
├─────────────────┬───────────────────────┤
│   CHAT SERVER   │   STREAM SERVER       │
│   (Rust)        │   (Rust)              │
├─────────────────┼───────────────────────┤
│ • 100k+ WS      │ • 10k+ Streams        │
│ • <10ms latency │ • 100k+ Listeners     │
│ • E2E Encryption│ • Adaptive Bitrate    │
│ • AI Moderation │ • Real-time Effects   │
└─────────────────┴───────────────────────┘
            │
    ┌───────┼───────┐
    │   BACKEND GO  │
    │ (API Gateway) │
    └───────────────┘

Technical Specifications

Performance: 100k+ concurrent WebSocket connections
Latency: <10ms P99 for messages, <50ms for streaming
Throughput: 10k+ requests/second per instance
Availability: 99.99% uptime target
Scalability: horizontal, with auto-scaling
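To make the availability target concrete, the sketch below converts an availability ratio into a downtime budget for a given window. The function name `downtime_budget` is illustrative and not part of the Veza code base.

```rust
use std::time::Duration;

/// Maximum downtime allowed over `period` for a given availability target
/// (for example 0.9999 for "four nines").
fn downtime_budget(availability: f64, period: Duration) -> Duration {
    assert!(
        (0.0..=1.0).contains(&availability),
        "availability must be between 0 and 1"
    );
    // The budget is the fraction of the window that may be unavailable.
    period.mul_f64(1.0 - availability)
}
```

For the 99.99% target, a 30-day window allows about 259 seconds (roughly 4.3 minutes) of downtime, and a 365-day year about 52.6 minutes.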

🔧 PRODUCTION CONFIGURATION

Environment Variables

# === CORE CONFIG ===
RUST_LOG=info
ENVIRONMENT=production
SERVICE_NAME=veza-stream-server
VERSION=2.0.0

# === NETWORK ===
HOST=0.0.0.0
PORT=8080
WS_PORT=8081
GRPC_PORT=50051

# === DATABASE ===
DATABASE_URL=postgresql://veza:secure_pass@postgres:5432/veza_prod
DATABASE_POOL_SIZE=100
DATABASE_TIMEOUT_MS=5000

# === REDIS ===
REDIS_URL=redis://redis:6379
REDIS_POOL_SIZE=50
REDIS_TTL_DEFAULT=3600

# === MONITORING ===
PROMETHEUS_PORT=9090
JAEGER_ENDPOINT=http://jaeger:14268/api/traces
METRICS_ENABLED=true
TRACING_ENABLED=true

# === PERFORMANCE ===
MAX_CONNECTIONS=100000
WORKER_THREADS=16
BLOCKING_THREADS=32
MEMORY_LIMIT_MB=8192

# === SECURITY ===
JWT_SECRET=your_production_jwt_secret_here
ENCRYPTION_KEY=your_32_byte_encryption_key_here
RATE_LIMIT_PER_MINUTE=1000
ENABLE_CORS=true
ALLOWED_ORIGINS=https://veza.live,https://app.veza.live

# === AUDIO STREAMING ===
MAX_STREAMS=10000
MAX_LISTENERS_PER_STREAM=10000
ADAPTIVE_BITRATE=true
DEFAULT_BITRATE=128
CODECS_ENABLED=opus,aac,mp3

# === CHAT ===
MAX_MESSAGE_SIZE=8192
MESSAGE_HISTORY_LIMIT=1000
MODERATION_ENABLED=true
E2E_ENCRYPTION=optional
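One way a service might read a few of these variables at startup is sketched below. The `Settings` struct, `parse_or`, and `load_settings` are hypothetical names chosen for illustration; the defaults simply mirror the values listed above. The lookup is passed in as a closure so the parsing logic can be exercised without touching the real process environment.

```rust
use std::str::FromStr;

/// Subset of the settings listed above; field names are illustrative.
#[derive(Debug, PartialEq)]
struct Settings {
    port: u16,
    max_connections: usize,
    database_pool_size: u32,
    allowed_origins: Vec<String>,
}

/// Parses one variable, falling back to `default` when it is unset.
/// A value that is set but malformed is reported as an error.
fn parse_or<T: FromStr>(
    lookup: &dyn Fn(&str) -> Option<String>,
    key: &str,
    default: T,
) -> Result<T, String> {
    match lookup(key) {
        None => Ok(default),
        Some(raw) => raw
            .trim()
            .parse()
            .map_err(|_| format!("{}: invalid value {:?}", key, raw)),
    }
}

/// Builds `Settings` from any key/value source.
fn load_settings(lookup: &dyn Fn(&str) -> Option<String>) -> Result<Settings, String> {
    let origins = lookup("ALLOWED_ORIGINS").unwrap_or_default();
    Ok(Settings {
        port: parse_or(lookup, "PORT", 8080)?,
        max_connections: parse_or(lookup, "MAX_CONNECTIONS", 100_000)?,
        database_pool_size: parse_or(lookup, "DATABASE_POOL_SIZE", 100)?,
        // Comma-separated list; empty entries are dropped.
        allowed_origins: origins
            .split(',')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(String::from)
            .collect(),
    })
}
```

In a real binary the lookup would typically be `&|k: &str| std::env::var(k).ok()`. Failing fast on a malformed value is a design choice here: a typo in `PORT` stops the service at startup instead of silently falling back to a default.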

Recommended Limits and Quotas

production_limits:
  # CPU & Memory
  cpu_request: "2000m"      # 2 CPU cores minimum
  cpu_limit: "8000m"        # 8 CPU cores maximum
  memory_request: "4Gi"     # 4GB RAM minimum
  memory_limit: "16Gi"      # 16GB RAM maximum
  
  # Network
  max_connections: 100000
  bandwidth_limit: "1Gbps"
  
  # Storage
  ephemeral_storage: "10Gi"
  logs_retention: "30d"
  
  # Application
  max_message_rate: 100     # messages/second/user
  max_file_upload: "200MB"
  concurrent_streams: 10000
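Dividing the memory limit by the connection ceiling gives a rough per-connection budget, which is a useful sanity check on these quotas. The helper below is a sketch with hypothetical names; the `reserved_bytes` parameter is an assumption standing in for whatever the runtime, caches, and buffers need.

```rust
const GIB: u64 = 1024 * 1024 * 1024;

/// Bytes available per connection after setting aside `reserved_bytes`.
/// Returns None when `max_connections` is zero.
fn per_connection_budget(
    limit_bytes: u64,
    reserved_bytes: u64,
    max_connections: u64,
) -> Option<u64> {
    if max_connections == 0 {
        return None;
    }
    Some(limit_bytes.saturating_sub(reserved_bytes) / max_connections)
}
```

With nothing reserved, 100000 connections leave about 168 KiB each under the 16Gi limit and about 42 KiB each under the 4Gi request.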

🚀 PRODUCTION DEPLOYMENT

1. Infrastructure Prerequisites

Kubernetes 1.25+ or Docker Swarm
PostgreSQL 14+ with high availability
Redis 7.0+ in cluster mode
Load balancer with SSL termination
Monitoring: Prometheus + Grafana stack

2. Health Checks

healthcheck:
  readiness:
    path: /health/ready
    port: 8080
    timeout: 5s
    period: 10s
    
  liveness:
    path: /health/live
    port: 8080
    timeout: 5s
    period: 30s
    failure_threshold: 3
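The two probes usually answer different questions, and the sketch below shows one common way to split them. `HealthState` and `probe_status` are hypothetical; the actual handlers in the Veza services may check other things.

```rust
/// Hypothetical snapshot of what the service knows about itself.
struct HealthState {
    database_ok: bool,
    redis_ok: bool,
    shutting_down: bool,
}

/// Maps a probe path to an HTTP status code.
fn probe_status(path: &str, state: &HealthState) -> u16 {
    match path {
        // Liveness: the process is running and able to answer. Dependencies
        // are ignored so that a database outage does not trigger restarts.
        "/health/live" => 200,
        // Readiness: accept traffic only when dependencies respond and the
        // instance is not draining.
        "/health/ready" => {
            if state.database_ok && state.redis_ok && !state.shutting_down {
                200
            } else {
                503
            }
        }
        _ => 404,
    }
}
```

Returning 503 from readiness while shutting down lets the load balancer stop routing new traffic before connections are closed.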

3. Graceful Shutdown

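// Graceful shutdown with a 30s drain window.
// `server` and its methods are application-specific; `info!` and `warn!`
// come from the project's logging crate.
use std::time::Duration;
use tokio::signal::unix::{signal, SignalKind};
use tokio::time::timeout;

// Kubernetes stops pods with SIGTERM, so wait for it as well as Ctrl+C (SIGINT).
let mut sigterm = signal(SignalKind::terminate())?;
tokio::select! {
    _ = tokio::signal::ctrl_c() => {}
    _ = sigterm.recv() => {}
}
info!("🛑 Graceful shutdown initiated");

// 1. Stop accepting new connections
server.stop_accepting().await;

// 2. Wait for existing connections to finish (max 30s)
if timeout(Duration::from_secs(30), server.wait_for_connections())
    .await
    .is_err()
{
    warn!("Drain timed out after 30s");
}

// 3. Force close remaining connections
server.force_close().await;

info!("✅ Graceful shutdown completed");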
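The "wait for existing connections, up to a deadline" step can be sketched without any async runtime. The version below is a simplified, synchronous illustration using only the standard library; the `drain` function and the idea of tracking connections in an `AtomicUsize` are assumptions, not the server's actual mechanism.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Waits until `active` drops to zero or `deadline` elapses, checking
/// every `poll`. Returns true when all connections finished in time.
fn drain(active: &AtomicUsize, deadline: Duration, poll: Duration) -> bool {
    let start = Instant::now();
    loop {
        if active.load(Ordering::Acquire) == 0 {
            return true;
        }
        if start.elapsed() >= deadline {
            return false;
        }
        thread::sleep(poll);
    }
}
```

A caller would force-close whatever remains when `drain` returns false. On Kubernetes the drain window should stay below the pod's termination grace period, otherwise the process is killed before the drain completes.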

📊 MONITORING & ALERTING

Key Metrics to Monitor

Performance Metrics

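# Latency (target: P99 < 50ms), assuming the metric is exported as a histogram
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) < 0.05

# Throughput (target: > 10k req/s per instance)
sum by (instance) (rate(http_requests_total[1m])) > 10000

# Error Rate (target: < 0.1%), aggregated so both sides of the ratio match
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) < 0.001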
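For readers less familiar with percentile targets, "P99 < 50ms" means that 99% of requests finish in under 50ms. Prometheus estimates this from histogram buckets; the sketch below instead computes the exact nearest-rank percentile over raw samples, purely to illustrate the definition. The function name is illustrative.

```rust
/// Nearest-rank percentile: the smallest sample such that at least a
/// fraction `q` (0 < q <= 1) of all samples are less than or equal to it.
fn percentile(samples: &[f64], q: f64) -> Option<f64> {
    if samples.is_empty() || !(q > 0.0 && q <= 1.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // 1-based rank; always in 1..=len because 0 < q <= 1.
    let rank = (q * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank - 1])
}
```

With few samples the P99 is simply the slowest request, which is why percentile alerts are only meaningful over windows with enough traffic.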

Resource Metrics

# CPU Usage (alert: > 80%)
cpu_usage_percent > 80

# Memory Usage (alert: > 85%)
memory_usage_percent > 85

# Connection Count (alert: > 90k)
websocket_connections_active > 90000

Business Metrics

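# Active Users (alert: drop of more than 1k within 5 minutes)
# delta() is used because increase() on a counter is never negative; this assumes a gauge.
delta(active_users_total[5m]) < -1000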

# Message Success Rate (alert: < 99.9%)
message_delivery_success_rate < 0.999

# Stream Quality (alert: > 5% degraded)
stream_quality_degraded_percent > 5

Alerting Rules

groups:
  - name: veza-rust-modules
    rules:
      - alert: HighLatency
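        # histogram_quantile needs per-bucket rates; assumes a histogram metric
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.1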
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High latency detected"
          
      - alert: HighErrorRate  
        expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 1m
        labels:
          severity: critical
          
      - alert: ServiceDown
        expr: up == 0
        for: 30s
        labels:
          severity: critical
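The `for:` clause in each rule means the expression must stay true across evaluations for that long before the alert fires; a single false evaluation resets it. The state machine below is a simplified sketch of that behaviour with hypothetical names, not Prometheus's implementation.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
enum AlertState {
    Inactive,
    /// Condition has been true since `since` (time since monitoring began).
    Pending { since: Duration },
    Firing,
}

/// Advances the alert state for one rule evaluation at time `now`.
/// `hold` corresponds to the rule's `for:` duration.
fn step(state: AlertState, condition: bool, now: Duration, hold: Duration) -> AlertState {
    if !condition {
        // Any false evaluation resets the alert.
        return AlertState::Inactive;
    }
    match state {
        AlertState::Inactive => {
            if hold.is_zero() {
                AlertState::Firing
            } else {
                AlertState::Pending { since: now }
            }
        }
        AlertState::Pending { since } if now.saturating_sub(since) >= hold => AlertState::Firing,
        other => other,
    }
}
```

For `HighLatency` with `for: 2m`, a latency spike that lasts one minute moves the alert to pending and back to inactive without notifying anyone.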

🔒 PRODUCTION SECURITY

1. Network Security

TLS 1.3 required for all connections
Certificate pinning for inter-service communication
Restrictive Kubernetes network policies
DDoS protection with intelligent rate limiting
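Rate limiting of the kind listed above is often implemented as a token bucket. The guide does not say which algorithm the Veza modules use, so the sketch below is an assumption for illustration; `TokenBucket` and its methods are hypothetical names, and the per-minute constructor mirrors the `RATE_LIMIT_PER_MINUTE` setting.

```rust
use std::time::Instant;

/// Token bucket: holds up to `capacity` tokens, refilled continuously at
/// `refill_per_sec`. Each request spends one token.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Starts full, allowing bursts of up to `limit` requests.
    fn per_minute(limit: u32, now: Instant) -> Self {
        let capacity = f64::from(limit);
        Self {
            capacity,
            tokens: capacity,
            refill_per_sec: capacity / 60.0,
            last_refill: now,
        }
    }

    /// Returns true if a request arriving at `now` is allowed.
    fn try_acquire(&mut self, now: Instant) -> bool {
        if now > self.last_refill {
            let elapsed = now.duration_since(self.last_refill).as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Passing `now` explicitly keeps the limiter deterministic in tests. A production limiter would keep one bucket per client key (user, API key, or IP address).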

2. Data Protection

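// Encryption at rest
// Illustrative pseudocode: `load_key_from_vault` and `AES_256_GCM` stand in for
// the project's own key-management and AEAD wrappers. AES-GCM needs a unique
// nonce for every message encrypted under the same key.
let encryption_key = load_key_from_vault().await?;
let encrypted_data = AES_256_GCM.encrypt(&data, &encryption_key)?;

// Encryption in transit
// `TlsConfig` is likewise a placeholder builder, not a specific crate's API.
let tls_config = TlsConfig::builder()
    .cert_file("/certs/server.crt")
    .key_file("/certs/server.key")
    .min_tls_version(TlsVersion::TLSv1_3)
    .build()?;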

3. Authentication & Authorization

JWT with automatic rotation (24h)
Fine-grained RBAC per resource
API keys with limited scopes
Mandatory 2FA for privileged accounts

🔄 MAINTENANCE & OPERATIONS

1. Rolling Updates

# 1. Update image version
kubectl set image deployment/stream-server \
    stream-server=veza/stream-server:v2.1.0

# 2. Monitor rollout
kubectl rollout status deployment/stream-server

# 3. Validate health
kubectl get pods -l app=stream-server

2. Backup & Recovery

Database: Point-in-time recovery (PITR) with PostgreSQL
Configuration: GitOps with automated validation
Logs: 30-day retention with archiving to S3
Metrics: 1-year retention with downsampling

3. Scaling Operations

# Horizontal scaling
kubectl scale deployment/stream-server --replicas=10

# Automatic horizontal scaling (HPA)
kubectl autoscale deployment/stream-server \
    --cpu-percent=70 --min=5 --max=50
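The autoscaler created above picks a replica count from the ratio of observed to target CPU utilisation. The sketch below reproduces the core formula from the Kubernetes documentation, desired = ceil(current × observed / target), clamped to the configured bounds. It leaves out the tolerance band and stabilisation windows that the real controller applies, and the function name is illustrative.

```rust
/// Replica count suggested by the HPA core formula, clamped to [min, max].
fn desired_replicas(
    current: u32,
    observed_cpu_pct: f64,
    target_cpu_pct: f64,
    min: u32,
    max: u32,
) -> u32 {
    let raw = (f64::from(current) * observed_cpu_pct / target_cpu_pct).ceil();
    // Float-to-integer casts saturate, so extreme ratios cannot overflow.
    (raw as u32).clamp(min, max)
}
```

With the settings above (target 70%, min 5, max 50), ten replicas averaging 90% CPU would be scaled to 13, and no load level can push the deployment past 50 or below 5.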

🚨 INCIDENT RUNBOOKS

Incident 1: High Latency

# 1. Quick diagnosis
kubectl top pods -l app=stream-server
kubectl logs -l app=stream-server --tail=100

# 2. Scale out immediately if CPU > 80%
kubectl scale deployment/stream-server --replicas=20

# 3. Investigation
kubectl exec -it stream-server-xxx -- /bin/bash
htop  # Check CPU/memory
ss -tulpn  # Check listening sockets (use `ss -s` for connection counts)

Incident 2: Service Down

# 1. Quick restart
kubectl rollout restart deployment/stream-server

# 2. Check dependencies
kubectl get pods -l app=postgres
kubectl get pods -l app=redis

# 3. Traffic rerouting
kubectl patch service/stream-server -p '{"spec":{"selector":{"app":"stream-server-backup"}}}'

Incident 3: Memory Leak

# 1. Memory profiling (curl runs inside the pod, heap.prof is written locally)
kubectl exec stream-server-xxx -- curl -s http://localhost:9090/debug/pprof/heap > heap.prof

# 2. Graceful restart, one pod at a time
for pod in $(kubectl get pods -l app=stream-server -o name); do
    kubectl delete "$pod"
    sleep 30  # Wait for replacement
done

📈 PERFORMANCE TUNING

1. OS Level Optimizations

# Network tuning
echo 'net.core.somaxconn=65535' >> /etc/sysctl.conf
echo 'net.core.netdev_max_backlog=5000' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_max_syn_backlog=65535' >> /etc/sysctl.conf
sysctl -p  # Apply the settings without rebooting

# File descriptor limits (take effect on the next login session)
echo '* soft nofile 1048576' >> /etc/security/limits.conf
echo '* hard nofile 1048576' >> /etc/security/limits.conf

2. Application Tuning

// Tokio runtime optimization
let rt = tokio::runtime::Builder::new_multi_thread()
    .worker_threads(num_cpus::get() * 2)
    .max_blocking_threads(512)
    .thread_stack_size(2 * 1024 * 1024)  // 2MB stack
    .enable_all()
    .build()?;

3. Database Optimization

-- Server connection and memory settings
-- (max_connections and shared_buffers only take effect after a restart)
ALTER SYSTEM SET max_connections = 500;
ALTER SYSTEM SET shared_buffers = '4GB';
ALTER SYSTEM SET effective_cache_size = '12GB';
ALTER SYSTEM SET work_mem = '64MB';

-- Index optimization for chat messages
CREATE INDEX CONCURRENTLY idx_messages_room_created 
    ON messages(room_id, created_at DESC) 
    WHERE deleted_at IS NULL;
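The `max_connections` value has to cover the connection pools of every application instance. The sketch below does that arithmetic; the function name and the `reserved` parameter (connections kept free for administration and replication) are assumptions for illustration.

```rust
/// Number of application instances whose pools fit within PostgreSQL's
/// `max_connections`, keeping `reserved` connections free.
fn max_instances(max_connections: u32, reserved: u32, pool_size_per_instance: u32) -> u32 {
    if pool_size_per_instance == 0 {
        return 0;
    }
    max_connections.saturating_sub(reserved) / pool_size_per_instance
}
```

With `max_connections = 500` and `DATABASE_POOL_SIZE=100`, at most five instances can fill their pools, and only four if ten connections are reserved. Scaling toward the autoscaler's maximum of 50 replicas would therefore need smaller pools, a higher `max_connections`, or a connection pooler such as PgBouncer in front of PostgreSQL.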

🔍 TROUBLESHOOTING

Common Issues

1. WebSocket Connections Dropping

# Check load balancer timeout
kubectl describe ingress stream-server

# Verify heartbeat configuration
grep -r "ping_interval" src/

# Monitor connection metrics
curl http://localhost:9090/metrics | grep websocket

2. Audio Stream Latency

// Verify buffer configuration
let buffer_config = BufferConfig {
    target_latency: Duration::from_millis(50),
    max_buffer_size: 1024 * 8,  // 8KB
    adaptive: true,
};
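When checking these values it helps to convert between buffer size and playback time. The sketch below assumes that `DEFAULT_BITRATE=128` is expressed in kbps and that `max_buffer_size` counts bytes of encoded audio; neither unit is stated in the configuration, and the function names are illustrative.

```rust
use std::time::Duration;

/// Playback time held by `bytes` of encoded audio at `bitrate_kbps`.
fn buffered_duration(bytes: u64, bitrate_kbps: u32) -> Duration {
    assert!(bitrate_kbps > 0, "bitrate must be positive");
    let bits = bytes * 8;
    let bits_per_sec = u64::from(bitrate_kbps) * 1000;
    // Integer arithmetic in microseconds avoids floating-point rounding.
    Duration::from_micros(bits * 1_000_000 / bits_per_sec)
}

/// Bytes of encoded audio that correspond to `latency` at `bitrate_kbps`.
fn bytes_for_latency(latency: Duration, bitrate_kbps: u32) -> u64 {
    // micros / 1e6 * kbps * 1000 / 8 simplifies to micros * kbps / 8000.
    (latency.as_micros() * u128::from(bitrate_kbps) / 8_000) as u64
}
```

Under those assumptions, a full 8 KiB buffer at 128 kbps holds 512 ms of audio, while the 50 ms target corresponds to about 800 bytes. That suggests `max_buffer_size` acts as an upper bound and the adaptive logic is expected to keep the buffer far below it.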

3. Memory Usage Growth

# Check for connection leaks
ss -s | grep -i tcp
lsof -p $(pgrep stream-server) | wc -l

# Monitor memory pools
curl http://localhost:9090/debug/pprof/allocs

📋 PRODUCTION CHECKLIST

Pre-Deployment

Load testing completed (100k+ connections)
Security audit passed
Monitoring configured
Backup strategy validated
Disaster recovery tested

Post-Deployment

Health checks passing
Metrics collecting properly
Logs flowing to aggregation
Alerts configured and tested
Performance within targets

Weekly Maintenance

Check resource utilization trends
Review error logs
Update security patches
Validate backup integrity
Performance regression testing

🎯 This guide is intended to support a robust and maintainable production deployment of the Veza Rust modules.
