# Backend Status & Monitoring - Complete Documentation

**Version**: 1.0
**Date**: 2025-12-05
**Priority**: P1 - Production Monitoring

---

## 📋 Overview

This document describes the monitoring and health-check system of the Veza Go backend. The implementation includes:

- ✅ A simplified, stateless `/health` route
- ✅ A full `/status` route that checks every dependent service
- ✅ Sentry integration for error tracking
- ✅ Structured logging with zap
- ✅ Prometheus metrics for the health checks
- ✅ Integration tests

---

## 🔍 Health Check Endpoints

### 1. `/health` - Simple Health Check

**Route**: `GET /health` or `GET /api/v1/health`

**Description**: Stateless endpoint that always returns `{"status": "ok"}`. It does not check any external dependency.

**Response**:
```json
{
  "status": "ok"
}
```

**Status Code**: `200 OK`

**Usage**:
- Kubernetes liveness probe
- Load balancer health check
- Basic monitoring

**Example**:
```bash
curl http://localhost:8080/api/v1/health
```
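
To make "stateless" concrete, here is a minimal sketch of such a handler. It uses only the Go standard library rather than Gin, and the names `healthHandler` and `probeHealth` are hypothetical, chosen for illustration; the actual backend code may be organized differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthHandler always answers 200 with {"status":"ok"}.
// It deliberately touches no database, cache, or remote service.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	body, _ := json.Marshal(map[string]string{"status": "ok"})
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	w.Write(body)
}

// probeHealth calls the handler in-process (no network listener needed)
// and returns the response status code and body.
func probeHealth(path string) (int, string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", healthHandler)
	mux.HandleFunc("/api/v1/health", healthHandler)

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	mux.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probeHealth("/api/v1/health")
	fmt.Println(code, body) // 200 {"status":"ok"}
}
```

Because the handler has no dependencies, a failing liveness probe indicates that the process itself is unhealthy, not that a downstream service is unavailable.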

---

### 2. `/status` - Full Status

**Route**: `GET /api/v1/status`

**Description**: Full endpoint that checks the state of every dependent service (DB, Redis, Chat Server, Stream Server).

**Response**:
```json
{
  "status": "ok",
  "uptime_seconds": 12345,
  "services": {
    "database": {
      "status": "ok",
      "latency_ms": 3.2
    },
    "redis": {
      "status": "ok",
      "latency_ms": 1.5
    },
    "chat_server": {
      "status": "ok",
      "latency_ms": 4.8
    },
    "stream_server": {
      "status": "ok",
      "latency_ms": 6.1
    }
  },
  "version": "v1.0.0",
  "git_commit": "abc123",
  "build_time": "2025-12-05T14:33:00Z",
  "environment": "production"
}
```
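
A monitoring client can decode this payload into typed structures. The sketch below is one possible mapping: the struct and function names (`StatusResponse`, `ServiceStatus`, `decodeStatus`) are hypothetical, while the JSON field names come from the response shown in this document. The optional `message` field is the one that appears on a failing service in the degraded example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ServiceStatus describes one entry of the "services" object.
type ServiceStatus struct {
	Status    string  `json:"status"`
	LatencyMs float64 `json:"latency_ms"`
	Message   string  `json:"message,omitempty"` // only present on errors
}

// StatusResponse mirrors the top-level /status payload.
type StatusResponse struct {
	Status        string                   `json:"status"`
	UptimeSeconds int64                    `json:"uptime_seconds"`
	Services      map[string]ServiceStatus `json:"services"`
	Version       string                   `json:"version"`
	GitCommit     string                   `json:"git_commit"`
	BuildTime     string                   `json:"build_time"`
	Environment   string                   `json:"environment"`
}

// decodeStatus parses a raw /status body.
func decodeStatus(raw []byte) (StatusResponse, error) {
	var resp StatusResponse
	err := json.Unmarshal(raw, &resp)
	return resp, err
}

func main() {
	raw := []byte(`{"status":"ok","uptime_seconds":12345,` +
		`"services":{"redis":{"status":"ok","latency_ms":1.5}},"version":"v1.0.0"}`)
	resp, err := decodeStatus(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Status, resp.Services["redis"].LatencyMs) // ok 1.5
}
```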

**Status Codes**:
- `200 OK`: All services are operational
- `503 Service Unavailable`: At least one service is in error (top-level `status` is `"degraded"`)
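
The rule above can be sketched as a small aggregation function. This is an illustration under one explicit assumption: only a service in `error` degrades the response, while a `slow` service still yields `200 OK`. The document does not state how `slow` affects the overall status, and `overallStatus` is a hypothetical name.

```go
package main

import "fmt"

// overallStatus returns the top-level status and the HTTP code for /status,
// given the per-service statuses ("ok", "slow", or "error").
// Assumption: only "error" degrades the response; "slow" does not.
func overallStatus(services map[string]string) (string, int) {
	for _, s := range services {
		if s == "error" {
			return "degraded", 503
		}
	}
	return "ok", 200
}

func main() {
	status, code := overallStatus(map[string]string{
		"database": "ok",
		"redis":    "error",
	})
	fmt.Println(status, code) // degraded 503
}
```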

**Service Statuses**:
- `ok`: Service operational with normal latency
- `slow`: Service operational, but its latency is above the threshold
- `error`: Service unreachable or failing

**Latency Thresholds**:
- Database: 100 ms (above this = `slow`)
- Redis: 50 ms (above this = `slow`)
- Chat Server: 100 ms (above this = `slow`)
- Stream Server: 100 ms (above this = `slow`)
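
The thresholds translate into a simple classification step. The following sketch assumes the comparison is strict (a latency exactly equal to the threshold still counts as `ok`) and that a failed check always maps to `error`; `classify` and `thresholdsMs` are hypothetical names, not the backend's identifiers.

```go
package main

import "fmt"

// thresholdsMs mirrors the latency thresholds listed above, in milliseconds.
var thresholdsMs = map[string]float64{
	"database":      100,
	"redis":         50,
	"chat_server":   100,
	"stream_server": 100,
}

// classify maps one check result to "ok", "slow", or "error".
// failed reports whether the check itself failed (timeout, refused connection, ...).
func classify(service string, latencyMs float64, failed bool) string {
	if failed {
		return "error"
	}
	if limit, known := thresholdsMs[service]; known && latencyMs > limit {
		return "slow"
	}
	return "ok"
}

func main() {
	fmt.Println(classify("redis", 1.5, false)) // ok
	fmt.Println(classify("redis", 75, false))  // slow
	fmt.Println(classify("redis", 0, true))    // error
}
```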

**Example**:
```bash
curl http://localhost:8080/api/v1/status
```

**Example with a degraded service**:
```json
{
  "status": "degraded",
  "uptime_seconds": 12345,
  "services": {
    "database": {
      "status": "ok",
      "latency_ms": 3.2
    },
    "redis": {
      "status": "error",
      "latency_ms": 0,
      "message": "connection refused"
    },
    "chat_server": {
      "status": "ok",
      "latency_ms": 4.8
    },
    "stream_server": {
      "status": "ok",
      "latency_ms": 6.1
    }
  },
  "version": "v1.0.0",
  "git_commit": "abc123",
  "build_time": "2025-12-05T14:33:00Z",
  "environment": "production"
}
```

---

## 🔧 Configuration

### Environment Variables

#### Health Check
No variable is required for `/health` (stateless).

#### Status Endpoint
The following variables are used by `/status`:

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `CHAT_SERVER_URL` | Chat server URL | `http://localhost:8081` | No |
| `STREAM_SERVER_URL` | Streaming server URL | `http://localhost:8082` | No |
| `APP_VERSION` | Application version | `v1.0.0` | No |
| `GIT_COMMIT` | Git commit | `unknown` | No |
| `BUILD_TIME` | Build date | (empty) | No |

**Note**: If `CHAT_SERVER_URL` or `STREAM_SERVER_URL` is not configured, the corresponding service is not checked by `/status`.
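
As an illustration of how these defaults could be applied, the sketch below reads each variable through a lookup function and falls back to the default from the table. The type `StatusConfig` and the helpers `valueOr` and `loadStatusConfig` are hypothetical, and treating an empty value like an unset one is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// StatusConfig holds the settings used by /status.
type StatusConfig struct {
	ChatServerURL   string
	StreamServerURL string
	AppVersion      string
	GitCommit       string
	BuildTime       string
}

// valueOr returns the variable's value, or fallback when it is unset or empty
// (treating empty as unset is an assumption of this sketch).
func valueOr(lookup func(string) (string, bool), key, fallback string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return fallback
}

// loadStatusConfig applies the defaults from the table above.
// lookup is injected so the function can be exercised without touching the real environment.
func loadStatusConfig(lookup func(string) (string, bool)) StatusConfig {
	return StatusConfig{
		ChatServerURL:   valueOr(lookup, "CHAT_SERVER_URL", "http://localhost:8081"),
		StreamServerURL: valueOr(lookup, "STREAM_SERVER_URL", "http://localhost:8082"),
		AppVersion:      valueOr(lookup, "APP_VERSION", "v1.0.0"),
		GitCommit:       valueOr(lookup, "GIT_COMMIT", "unknown"),
		BuildTime:       valueOr(lookup, "BUILD_TIME", ""),
	}
}

func main() {
	cfg := loadStatusConfig(os.LookupEnv)
	fmt.Printf("%+v\n", cfg)
}
```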

### Sentry Configuration

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `SENTRY_DSN` | Sentry DSN for error tracking | (empty) | No |
| `SENTRY_ENV` | Sentry environment | value of `APP_ENV` | No |
| `SENTRY_SAMPLE_RATE_ERRORS` | Sample rate for errors (0.0-1.0) | `1.0` | No |
| `SENTRY_SAMPLE_RATE_TRANSACTIONS` | Sample rate for transactions (0.0-1.0) | `0.1` | No |

**Example**:
```bash
export SENTRY_DSN="https://xxx@xxx.ingest.sentry.io/xxx"
export SENTRY_ENV="production"
export SENTRY_SAMPLE_RATE_ERRORS=1.0
export SENTRY_SAMPLE_RATE_TRANSACTIONS=0.1
```
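
Sample rates arrive as strings and must end up in the range 0.0-1.0. A minimal parsing sketch follows; falling back to the default on an empty, malformed, or out-of-range value is an assumption (the backend might reject such values instead), and `parseSampleRate` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseSampleRate returns raw as a float in [0.0, 1.0], or fallback when raw
// is empty, not a number, or out of range (fallback behavior is an assumption).
func parseSampleRate(raw string, fallback float64) float64 {
	rate, err := strconv.ParseFloat(raw, 64)
	// The negated range check also rejects NaN, which compares false to everything.
	if err != nil || !(rate >= 0 && rate <= 1) {
		return fallback
	}
	return rate
}

func main() {
	fmt.Println(parseSampleRate("0.25", 1.0)) // 0.25
	fmt.Println(parseSampleRate("", 1.0))     // 1
	fmt.Println(parseSampleRate("1.5", 0.1))  // 0.1
}
```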

---

## 📊 Prometheus Metrics

### Health Check Metrics

The following metrics are exposed for the health checks:

#### `veza_health_check_duration_ms`
Histogram of the health-check duration, per service.

**Labels**:
- `service`: `database`, `redis`, `chat_server`, `stream_server`

**Buckets**: `1, 5, 10, 25, 50, 100, 250, 500, 1000` (ms)

**Example**:
```
veza_health_check_duration_ms_bucket{service="database",le="10"} 45
veza_health_check_duration_ms_bucket{service="database",le="50"} 98
veza_health_check_duration_ms_sum{service="database"} 1234.5
veza_health_check_duration_ms_count{service="database"} 100
```
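
Prometheus histogram buckets are cumulative: each `le` line counts every observation less than or equal to that bound, which is why the `le="50"` count above includes the `le="10"` count. The sketch below reproduces that counting in plain Go to make the semantics explicit. It is not how the Prometheus client library is used in practice, and `cumulativeCounts` and the sample observations are hypothetical.

```go
package main

import "fmt"

// bucketsMs are the bucket upper bounds listed above, in milliseconds.
var bucketsMs = []float64{1, 5, 10, 25, 50, 100, 250, 500, 1000}

// cumulativeCounts returns, for each upper bound, how many observations are
// less than or equal to it, which is what each `le` bucket line reports.
func cumulativeCounts(bounds, observations []float64) []int {
	counts := make([]int, len(bounds))
	for _, o := range observations {
		for i, b := range bounds {
			if o <= b {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	obs := []float64{0.8, 3.2, 3.9, 12, 48, 130} // made-up latencies in ms
	fmt.Println(cumulativeCounts(bucketsMs, obs)) // [1 3 3 4 5 5 6 6 6]
}
```

Prometheus additionally exposes an implicit `le="+Inf"` bucket whose value equals the `_count` series.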

#### `veza_health_check_status`
Gauge holding the status of each service.

**Labels**:
- `service`: `database`, `redis`, `chat_server`, `stream_server`

**Values**:
- `1.0`: Service OK
- `0.5`: Service slow (`slow`)
- `0.0`: Service in error

**Example**:
```
veza_health_check_status{service="database"} 1.0
veza_health_check_status{service="redis"} 0.5
veza_health_check_status{service="chat_server"} 0.0
```
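
The mapping from a service status to the gauge value can be written as a small function. This is a sketch only: `gaugeValue` is a hypothetical name, and mapping unknown statuses to `0.0` is an assumption.

```go
package main

import "fmt"

// gaugeValue converts a service status into the value exported by
// veza_health_check_status. Unknown statuses are treated as errors (assumption).
func gaugeValue(status string) float64 {
	switch status {
	case "ok":
		return 1.0
	case "slow":
		return 0.5
	default:
		return 0.0
	}
}

func main() {
	for _, s := range []string{"ok", "slow", "error"} {
		fmt.Println(s, gaugeValue(s))
	}
}
```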

### Accessing the Metrics

**Endpoint**: `GET /api/v1/metrics`

**Example**:
```bash
curl http://localhost:8080/api/v1/metrics | grep health_check
```

---

## 🐛 Sentry Integration

### Initialization

Sentry is initialized automatically in `cmd/api/main.go` when `SENTRY_DSN` is set.

### Middleware

The `SentryRecover` middleware automatically captures:
- Panics (with stack trace)
- HTTP 5xx errors
- Errors attached to the Gin context
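
The panic-capturing part of such a middleware follows a common recover pattern. The sketch below shows it with the standard library only: it is not the actual `SentryRecover` code, it does not use Gin or the Sentry SDK, and the `reportFunc` callback is a hypothetical stand-in for the call that would forward the event to Sentry.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// reportFunc stands in for the call that forwards an event to Sentry.
type reportFunc func(method, path string, recovered interface{})

// recoverMiddleware turns a panic in next into a 500 response and reports it
// together with basic request context.
func recoverMiddleware(report reportFunc, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				report(r.Method, r.URL.Path, rec)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

// simulatePanic runs a panicking handler behind the middleware and returns
// the response code and the number of reported events.
func simulatePanic() (int, int) {
	reported := 0
	report := func(method, path string, recovered interface{}) { reported++ }
	boom := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic("invalid memory address")
	})

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/api/v1/tracks", nil)
	recoverMiddleware(report, boom).ServeHTTP(rec, req)
	return rec.Code, reported
}

func main() {
	code, reported := simulatePanic()
	fmt.Println(code, reported) // 500 1
}
```

The deferred function runs even when the handler panics, so the client receives a 500 response instead of a dropped connection, and the event is reported exactly once.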

### Captured Context

For each error, Sentry captures:
- HTTP method
- Request path
- Query parameters
- Client IP
- Request ID (if present)
- User ID (if authenticated)

### Example of an Error in Sentry

```json
{
  "message": "Panic: runtime error: invalid memory address",
  "level": "error",
  "tags": {
    "component": "gin",
    "request_id": "req-12345"
  },
  "contexts": {
    "request": {
      "method": "POST",
      "path": "/api/v1/tracks",
      "query": "",
      "ip": "192.168.1.1"
    }
  },
  "user": {
    "id": "user-123",
    "username": "user-123"
  }
}
```

---

## 📝 Structured Logging

### Format

All logs are emitted as structured JSON through zap.

### Standard Fields

Each HTTP request logs:
- `method`: HTTP method (GET, POST, etc.)
- `path`: Request path
- `query`: Query parameters
- `ip`: Client IP
- `user_agent`: User agent
- `latency`: Request duration
- `status`: HTTP status code
- `body_size`: Response size
- `request_id`: Unique request ID (if present)
- `user_id`: User ID (if authenticated)
- `trace_id`: Trace ID (if present)
- `span_id`: Span ID (if present)

### Log Levels

- **INFO**: Successful requests (2xx, 3xx)
- **WARN**: Client errors (4xx)
- **ERROR**: Server errors (5xx)
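
The level selection above reduces to a comparison on the response status code. In this sketch `levelForStatus` is a hypothetical helper, and logging 1xx responses at `info` is an assumption, since the list does not mention them. With zap, the result would select between the logger's `Info`, `Warn`, and `Error` methods.

```go
package main

import "fmt"

// levelForStatus picks the log level for a completed request.
// Anything below 400 is logged at "info" (1xx handling is an assumption).
func levelForStatus(code int) string {
	switch {
	case code >= 500:
		return "error"
	case code >= 400:
		return "warn"
	default:
		return "info"
	}
}

func main() {
	for _, code := range []int{200, 302, 404, 503} {
		fmt.Println(code, levelForStatus(code))
	}
}
```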

### Log Example

```json
{
  "level": "info",
  "ts": 1701878400.123,
  "msg": "Request completed",
  "method": "GET",
  "path": "/api/v1/status",
  "query": "",
  "ip": "192.168.1.1",
  "user_agent": "curl/7.68.0",
  "latency": "0.012345s",
  "status": 200,
  "body_size": 456,
  "request_id": "req-12345"
}
```

---

## 🧪 Tests

### Integration Tests

The tests live in `tests/integration/api_health_test.go`:

- `TestAPIHealth`: Tests `/health`
- `TestAPIHealthV1`: Tests `/api/v1/health`
- `TestAPIStatus`: Tests `/status` against real services
- `TestAPIStatusDegraded`: Tests `/status` with a degraded service

### Running the Tests

```bash
cd veza-backend-api
# -run takes a regular expression, so this also runs TestAPIHealthV1
go test ./tests/integration -v -run TestAPIHealth
# Likewise, this also runs TestAPIStatusDegraded
go test ./tests/integration -v -run TestAPIStatus
```

### Manual HTTP Integration Tests

To test against a running server:

```bash
# Start the server
make run

# In another terminal
curl http://localhost:8080/api/v1/health
curl http://localhost:8080/api/v1/status
```

---

## 📈 Recommended Grafana Dashboard

### Suggested Panels

1. **Health Check Status**
   - Query: `veza_health_check_status`
   - Type: Gauge
   - Alert: if value < 1.0

2. **Health Check Latency**
   - Query: `rate(veza_health_check_duration_ms_sum[5m]) / rate(veza_health_check_duration_ms_count[5m])`
   - Type: Graph
   - Alert: if latency > 100 ms

3. **Service Availability**
   - Query: `avg_over_time(veza_health_check_status[5m])`
   - Type: Stat
   - Alert: if availability < 0.95

4. **Error Rate**
   - Query: `rate(veza_errors_total[5m])`
   - Type: Graph
   - Alert: if error rate > 1% (this query returns errors per second, so a percentage threshold requires dividing it by the total request rate)

### Example Dashboard JSON

```json
{
  "dashboard": {
    "title": "Veza Backend Health",
    "panels": [
      {
        "title": "Health Check Status",
        "targets": [
          {
            "expr": "veza_health_check_status"
          }
        ]
      },
      {
        "title": "Health Check Latency",
        "targets": [
          {
            "expr": "rate(veza_health_check_duration_ms_sum[5m]) / rate(veza_health_check_duration_ms_count[5m])"
          }
        ]
      }
    ]
  }
}
```

---

## 🚀 Local Test Procedure

### 1. Start the Services

```bash
# Start PostgreSQL
docker-compose up -d postgres

# Start Redis
docker-compose up -d redis

# Start the backend
cd veza-backend-api
go run cmd/api/main.go
```

### 2. Test `/health`

```bash
curl http://localhost:8080/api/v1/health
# Response: {"status":"ok"}
```

### 3. Test `/status`

```bash
curl http://localhost:8080/api/v1/status | jq
```

### 4. Check the Metrics

```bash
curl http://localhost:8080/api/v1/metrics | grep health_check
```

### 5. Test with a Degraded Service

```bash
# Stop Redis
docker-compose stop redis

# Check the status
curl http://localhost:8080/api/v1/status | jq
# The status should be "degraded" and redis should be "error"
```

---

## 🔍 Troubleshooting

### Problem: `/status` always returns "degraded"

**Possible causes**:
1. A service is unreachable (DB, Redis, Chat Server, Stream Server)
2. High latency (above the threshold)

**Solution**:
1. Check the logs: `docker-compose logs backend`
2. Check connectivity: `curl http://localhost:8081/health` (chat server)
3. Check the metrics: `curl http://localhost:8080/api/v1/metrics | grep health_check`

### Problem: Sentry does not capture errors

**Possible causes**:
1. `SENTRY_DSN` is not configured
2. The sample rate is too low

**Solution**:
1. Check `SENTRY_DSN` in the environment variables
2. Raise `SENTRY_SAMPLE_RATE_ERRORS` to 1.0 while testing

### Problem: Prometheus metrics are not visible

**Possible causes**:
1. The `/metrics` endpoint is not reachable
2. The metrics are not registered

**Solution**:
1. Check the endpoint: `curl http://localhost:8080/api/v1/metrics`
2. Check the logs for registration errors

---

## 📚 References

- [Prometheus Metrics](https://prometheus.io/docs/concepts/metric_types/)
- [Sentry Go SDK](https://docs.sentry.io/platforms/go/)
- [Zap Logger](https://github.com/uber-go/zap)
- [Gin Framework](https://gin-gonic.com/docs/)

---

## ✅ Deployment Checklist

- [ ] Environment variables configured (`SENTRY_DSN`, `CHAT_SERVER_URL`, etc.)
- [ ] `/health` endpoint reachable from the load balancer
- [ ] `/status` endpoint reachable for monitoring
- [ ] Metrics endpoint scraped by Prometheus
- [ ] Grafana dashboard configured
- [ ] Alerts configured (service down, high latency)
- [ ] Integration tests pass
- [ ] Documentation up to date

---

**Author**: Veza Backend Team
**Last updated**: 2025-12-05