# Runbook: Upload Stuck in "uploading" Status

## Signal

**Observable symptoms**:
- An upload stays in `uploading` status for more than 10 minutes (abnormal)
- The user cannot access the uploaded file
- Logs: no `uploading` → `processing` → `completed` transition
- Metrics: `veza_file_uploads_total{status="uploading"}` stays high

**Affected endpoints**:
- `POST /api/v1/upload` - Initial upload
- `GET /api/v1/uploads/:id/status` - Status check
- `GET /api/v1/tracks/:id` - Track access after upload
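
The 10-minute symptom can be written down as a small predicate, which is handy when scripting a check over many uploads. This is a minimal sketch in Python, assuming an upload record exposes a status string and a timezone-aware creation time; `is_stuck` and `STUCK_AFTER` are illustrative names, not part of the Veza codebase.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold taken from the symptom list: more than 10 minutes in `uploading`.
STUCK_AFTER = timedelta(minutes=10)


def is_stuck(status: str, created_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when an upload has been in `uploading` for longer than STUCK_AFTER."""
    now = now or datetime.now(timezone.utc)
    return status == "uploading" and now - created_at > STUCK_AFTER


if __name__ == "__main__":
    created = datetime(2025, 12, 15, 10, 0, tzinfo=timezone.utc)
    print(is_stuck("uploading", created, now=created + timedelta(minutes=11)))
    print(is_stuck("uploading", created, now=created + timedelta(minutes=5)))
```

The comparison is strict, so an upload that is exactly 10 minutes old is not yet flagged.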

## Hypotheses

1. **Job worker down** - The worker that processes uploads is no longer running
2. **Queue blocked** - RabbitMQ/job queue is saturated or blocked
3. **Storage problem** - File not accessible, wrong permissions, or disk full
4. **Processing failed silently** - Error not logged, status never updated
5. **Processing timeout** - Processing takes too long and times out before completion

## Checks

### 1. Check the status of a specific upload

```bash
# Via the API
curl -H "Authorization: Bearer <token>" \
  http://localhost:8080/api/v1/uploads/<upload_id>/status

# Expected response for a stuck upload:
# {
#   "success": true,
#   "data": {
#     "status": "uploading",  # ← stuck here
#     "progress": 100,
#     "created_at": "2025-12-15T10:00:00Z"
#   }
# }
```
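
The response body can be turned into a quick triage hint. The sketch below assumes the JSON shape shown in the example response and reads `progress: 100` combined with `status: "uploading"` as "the transfer finished but processing never started"; that reading is an interpretation, and `diagnose_status` is a hypothetical helper.

```python
import json


def diagnose_status(body: str) -> str:
    """Turn a status response body into a short triage hint."""
    payload = json.loads(body)
    if not payload.get("success"):
        return "request failed: check the token and the upload ID"
    data = payload.get("data", {})
    status = data.get("status")
    progress = data.get("progress", 0)
    if status != "uploading":
        return f"not stuck in uploading (status={status})"
    if progress >= 100:
        # Transfer finished but the status never advanced: look at the worker and the queue.
        return "transfer complete, processing never started"
    return f"transfer still in progress ({progress}%)"


if __name__ == "__main__":
    sample = '{"success": true, "data": {"status": "uploading", "progress": 100}}'
    print(diagnose_status(sample))
```

A "transfer complete" hint points toward hypotheses 1, 2 and 4; a transfer that is still in progress is more likely a slow or interrupted client upload.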

### 2. Check application logs

```bash
# Search for a specific upload
grep "<upload_id>" /var/log/veza-backend-api/*.log

# Search for processing errors
grep -i "upload.*error\|processing.*failed\|job.*failed" /var/log/veza-backend-api/*.log | tail -50

# Search for job worker activity
grep -i "job worker\|process.*upload" /var/log/veza-backend-api/*.log | tail -50
```

### 3. Check the job worker

```bash
# Check the worker process (pgrep does not match its own command line, unlike `ps aux | grep`)
pgrep -af "job.*worker|worker.*upload"

# Check worker logs (if the worker runs separately); -n is required when several files match
tail -n 100 /var/log/veza-worker/*.log
```

### 4. Check the queue (RabbitMQ)

```bash
# If RabbitMQ is enabled
rabbitmqctl list_queues name messages messages_ready messages_unacknowledged

# Check the RabbitMQ connection (if the management plugin is enabled; the API requires credentials)
curl -u <user>:<password> http://localhost:15672/api/queues
```
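
The `list_queues` output is easier to act on once parsed. This sketch assumes the four columns requested above, separated by tabs, and skips banner or header lines; the queue names in the example are made up, and the `triage` labels are only a rough reading of the counters (`messages_ready` is waiting for a consumer, `messages_unacknowledged` has been delivered but not yet acknowledged).

```python
from typing import List, NamedTuple


class QueueStats(NamedTuple):
    name: str
    messages: int
    ready: int
    unacked: int


def parse_list_queues(output: str) -> List[QueueStats]:
    """Parse tab-separated `rabbitmqctl list_queues` rows, skipping banners and headers."""
    stats = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 4 or not all(f.isdigit() for f in fields[1:]):
            continue  # banner line, column header, or blank line
        stats.append(QueueStats(fields[0], *map(int, fields[1:])))
    return stats


def triage(q: QueueStats) -> str:
    """Give a rough label for one queue based on its counters."""
    if q.ready > 0 and q.unacked == 0:
        return "backlog, nothing in flight"  # is a consumer connected at all?
    if q.unacked > 0:
        return "in flight"  # normal while working; suspicious if it never drops
    return "idle"


if __name__ == "__main__":
    sample = "name\tmessages\tmessages_ready\tmessages_unacknowledged\nuploads\t42\t42\t0\n"
    for q in parse_list_queues(sample):
        print(q.name, triage(q))
```

A backlog with nothing in flight fits the "job worker down" hypothesis better than a blocked broker.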

### 5. Check storage

```bash
# Check that the uploaded file exists
ls -lh /var/veza/uploads/<upload_id>/

# Check permissions
ls -la /var/veza/uploads/<upload_id>/

# Check disk space
df -h /var/veza/uploads

# Check inodes (if a problem is suspected)
df -i /var/veza/uploads
```
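
The same storage checks can be collected in one call, for example from a health probe. This is a sketch using only the Python standard library on a Unix host; `storage_report` is a hypothetical helper, and its percentages are computed from raw totals, so they can differ slightly from what `df` prints.

```python
import os
import shutil


def storage_report(path: str) -> dict:
    """Return disk usage, inode usage and writability for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)   # total / used / free, in bytes
    vfs = os.statvfs(path)            # inode counters (Unix only)
    disk_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    inode_pct = 0.0
    if vfs.f_files:                   # some filesystems report 0 total inodes
        inode_pct = 100.0 * (vfs.f_files - vfs.f_ffree) / vfs.f_files
    return {
        "disk_used_pct": disk_pct,
        "inode_used_pct": inode_pct,
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    print(storage_report("."))
```

Run it as the same user as the worker, since `writable` reflects the permissions of the calling process.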

### 6. Check the database

```sql
-- Check the upload status in the DB
SELECT id, status, progress, created_at, updated_at, error_message
FROM uploads
WHERE id = '<upload_id>';

-- Find stuck uploads (> 10 min in uploading)
SELECT id, status, created_at, updated_at
FROM uploads
WHERE status = 'uploading'
  AND created_at < NOW() - INTERVAL '10 minutes'
ORDER BY created_at;

-- Check pending jobs
SELECT id, type, status, created_at, started_at, completed_at
FROM job_queue
WHERE type = 'process_upload'
  AND status IN ('pending', 'processing')
ORDER BY created_at;
```

## Corrective Actions

### If the job worker is down

1. **Restart the job worker**:
   ```bash
   sudo systemctl restart veza-backend-api
   # or
   docker restart veza-backend-api
   ```

2. **Check that the worker starts**:
   ```bash
   grep "Job Worker démarré" /var/log/veza-backend-api/*.log
   ```

3. **Manually re-run processing** (if possible):
   - Via the admin API (if available)
   - Or directly in the DB (see below)

### If the queue is blocked

1. **Check RabbitMQ**:
   ```bash
   sudo systemctl status rabbitmq-server
   # or
   docker ps | grep rabbitmq
   ```

2. **Restart RabbitMQ** (if necessary):
   ```bash
   sudo systemctl restart rabbitmq-server
   ```

3. **Purge the queue** (if necessary, ⚠️ pending jobs are lost):
   ```bash
   rabbitmqctl purge_queue <queue_name>
   ```

### If the file is missing or inaccessible

1. **Check that the file exists**:
   ```bash
   find /var/veza/uploads -name "*<upload_id>*"
   ```

2. **Fix ownership and permissions**:
   ```bash
   chown -R veza:veza /var/veza/uploads/<upload_id>/
   # Capital X keeps directories traversable; a recursive 644 would strip their execute bit
   chmod -R u=rwX,go=rX /var/veza/uploads/<upload_id>/
   ```

3. **If the file is missing**:
   - Mark the upload as `failed` in the DB
   - Notify the user
   - Document the file loss

### If processing failed silently

1. **Force re-processing** (via the DB):
   ```sql
   -- Mark as pending so it is picked up again
   UPDATE uploads
   SET status = 'pending', updated_at = NOW()
   WHERE id = '<upload_id>' AND status = 'uploading';

   -- Or create a job manually
   INSERT INTO job_queue (id, type, payload, status, created_at)
   VALUES (
     gen_random_uuid(),
     'process_upload',
     jsonb_build_object('upload_id', '<upload_id>'),
     'pending',
     NOW()
   );
   ```

2. **Check the logs after re-processing**:
   ```bash
   tail -f /var/log/veza-backend-api/*.log | grep "<upload_id>"
   ```

### If processing times out

1. **Increase the timeout** (if configurable):
   - Change the timeout in `internal/jobs/upload_processor.go`
   - Restart the worker

2. **Split the processing** (long term):
   - Implement chunked processing
   - Add checkpoints
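
The long-term option of chunked processing with checkpoints could look roughly like this. It is a language-agnostic sketch written in Python, not the actual processor: `process_in_chunks`, the `checkpoint` dictionary and its `next_chunk` key are all assumptions, and a real implementation would persist the checkpoint (for example in the upload row) instead of keeping it in memory.

```python
from typing import Callable, Dict, Sequence


def process_in_chunks(
    chunks: Sequence[bytes],
    handle: Callable[[bytes], None],
    checkpoint: Dict[str, int],
) -> None:
    """Process chunks in order, recording progress so a retry resumes where it stopped."""
    start = checkpoint.get("next_chunk", 0)
    for index in range(start, len(chunks)):
        handle(chunks[index])                 # may raise; the checkpoint then stays at `index`
        checkpoint["next_chunk"] = index + 1  # persist this in real code


if __name__ == "__main__":
    progress: Dict[str, int] = {}
    process_in_chunks([b"a", b"b", b"c"], lambda chunk: None, progress)
    print(progress)
```

Because the checkpoint only advances after a chunk succeeds, a timeout or crash costs at most one chunk of repeated work instead of the whole file.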

## Preventive Actions

### Monitoring to add

1. **Stuck uploads alert**:
   ```yaml
   - alert: VezaUploadsStuck
     # PromQL cannot filter on a created_at field; this assumes the metric
     # behaves as a gauge of uploads currently in `uploading` status.
     expr: sum(veza_file_uploads_total{status="uploading"}) > 0
     for: 10m
   ```

2. **Processing time metric**:
   - Add a `veza_upload_processing_duration_seconds` metric
   - Alert if it exceeds a threshold (e.g. 5 minutes)
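
A duration metric like this is usually exposed as a Prometheus histogram. The sketch below reproduces the cumulative-bucket counting in plain Python, without any client library, to show what the alert threshold would be evaluated against; the class name and the bucket bounds are assumptions chosen so that 300 seconds (5 minutes) is one of the bounds.

```python
import math
from bisect import bisect_left


class DurationHistogram:
    """Cumulative-bucket histogram, mirroring how Prometheus histograms count observations."""

    def __init__(self, bounds=(30, 60, 120, 300, 600)):
        self.bounds = tuple(bounds) + (math.inf,)  # the last bucket is +Inf
        self.counts = [0] * len(self.bounds)       # per-bucket, non-cumulative counts
        self.total_seconds = 0.0

    def observe(self, seconds: float) -> None:
        # bisect_left picks the first bound >= seconds, so upper bounds are inclusive.
        self.counts[bisect_left(self.bounds, seconds)] += 1
        self.total_seconds += seconds

    def cumulative(self) -> dict:
        """Return {upper_bound: observations <= upper_bound}, like the `le` label."""
        out, running = {}, 0
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out[bound] = running
        return out

    def over(self, threshold: float) -> int:
        """Count observations strictly above `threshold`, which must be one of the bounds."""
        cum = self.cumulative()
        return cum[math.inf] - cum[threshold]


if __name__ == "__main__":
    hist = DurationHistogram()
    for seconds in (12, 45, 300, 301, 900):
        hist.observe(seconds)
    print(hist.cumulative(), hist.over(300))
```

With these bounds, `over(300)` is the number of uploads whose processing took longer than the 5-minute threshold.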

### Code improvements

1. **Explicit timeout**:
   - Add a timeout on processing (e.g. 10 min)
   - Mark the upload as `failed` on timeout

2. **Retry logic**:
   - Implement automatic retries (max 3 attempts)
   - Exponential backoff

3. **Job worker health check**:
   - `/health/worker` endpoint that checks the queue and jobs
   - Integrate it into `/readyz`
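
The retry item can be sketched as follows. This is an illustration in Python rather than the worker's actual code: `run_with_retry`, its parameters and the returned status strings are assumptions, and the `sleep` argument is injectable only so the backoff can be exercised without waiting.

```python
import time
from typing import Callable


def run_with_retry(
    process: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run `process`, retrying on any exception; return the final upload status."""
    for attempt in range(1, max_attempts + 1):
        try:
            process()
            return "completed"
        except Exception:
            if attempt == max_attempts:
                return "failed"  # never leave the upload in `uploading`
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    return "failed"  # only reached when max_attempts < 1


if __name__ == "__main__":
    print(run_with_retry(lambda: None))
```

If `process` raises on timeout, the explicit-timeout item is covered by the same path: the last failed attempt ends in `failed` instead of leaving the upload in `uploading`.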

## Post-Mortem Notes

### To document after resolution

- **Affected upload ID**: `<upload_id>`
- **Root cause**: Job worker down / Queue blocked / Storage / Processing / Timeout
- **Incident duration**: From [start time] to [end time]
- **Impact**: Number of stuck uploads, affected users
- **Actions taken**: List of corrective actions
- **Preventive actions**:
  - [ ] Add monitoring for stuck uploads
  - [ ] Implement an explicit timeout
  - [ ] Add retry logic
  - [ ] Improve processing logging

### Metrics to watch after the incident

- `veza_file_uploads_total{status="uploading"}` - Should decrease
- `veza_file_uploads_total{status="completed"}` - Should increase
- Average processing time - Should stay under 5 minutes

## References

- Upload handler: `internal/handlers/upload.go`
- Job processor: `internal/jobs/upload_processor.go` (if it exists)
- Async upload documentation: `docs/UPLOAD_ASYNC.md`