feat(perf): k6 mixed-scenarios load test + nightly workflow + baseline doc (W4 Day 20)
End of W4. Capacity validation gate before launch: sustain 1650
concurrent VUs (100 upload + 500 streaming + 1000 browse + 50 checkout)
on staging while keeping p95 < 500 ms and the error rate < 0.5 %.
Acceptance bar: 3 consecutive green nights.

- scripts/loadtest/k6_mixed_scenarios.js: 4 parallel scenarios via
  k6's executor=constant-vus. Per-scenario p95 thresholds layered on
  top of the global gate so a single-flow regression doesn't get
  masked. discardResponseBodies=true (reduces memory pressure; we
  assert on status codes + latency, not payload). VU counts are
  overridable via UPLOAD_VUS / STREAM_VUS / BROWSE_VUS / CHECKOUT_VUS
  env vars for local runs.
  * upload     : 100 VU, initiate + 10 × 1 MiB chunks (10 MiB tracks).
  * streaming  : 500 VU, master.m3u8 → 256k playlist → 4 .ts segments.
  * browse     : 1000 VU, mix 60% search / 30% list / 10% detail.
  * checkout   : 50 VU, list-products + POST orders (rejected at
    validation — exercises auth + rate-limit + Redis state, doesn't
    burn Hyperswitch sandbox quota).

- .github/workflows/loadtest.yml: Forgejo Actions nightly cron at
  02:30 UTC. workflow_dispatch lets the operator override duration
  + base_url for ad-hoc capacity drills. A pre-flight GET /api/v1/health
  aborts before consuming runner time when staging is already down.
  Artifacts: k6-summary.json (30d retention) + the script itself.
  The step summary annotates p95/p99 + failed rate so the Actions
  listing shows the verdict at a glance.

- docs/PERFORMANCE_BASELINE.md §v1.0.9 W4 Day 20: scenarios table,
  thresholds, local-run command, operating notes (token rotation,
  upload-scenario approximation, staging-only guard rail), Grafana
  cross-reference, acceptance gate spelled out.

Acceptance (Day 20): the workflow file is valid YAML; the k6 script
parses clean (the Node test treats k6/* imports as runtime-provided
and checks the rest of the syntax). Real green-night accumulation
requires the workflow running on staging; that's a deployment
milestone, not a code change.

W4 verification gate progress: Lighthouse PWA / HLS ABR / faceted
search / HAProxy failover / k6 nightly capacity all wired; W4 = done.
W5 (internal pentest + game day + canary + status page) up next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:44:06 +02:00

# Performance Baseline — Veza API
**Version**: v0.951
**Goal**: Document P50/P95/P99 latencies of the critical endpoints to detect regressions.
## Methodology
1. Start the API in profiling mode: `pprof` is exposed when `ENABLE_PPROF=true`
2. Run a load test (k6 or Go) against the critical endpoints
3. Measure latencies via Prometheus (`http_request_duration_seconds`) or pprof
## Critical endpoints to monitor
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/v1/auth/login` | POST | User login |
| `/api/v1/auth/register` | POST | Registration |
| `/api/v1/tracks` | GET | Track list (cursor pagination, v0.931) |
| `/api/v1/tracks/search` | GET | Search |
| `/api/v1/users/me` | GET | User profile |
| `/api/v1/marketplace/orders` | POST | Order creation |
| `/api/v1/notifications` | GET | Notifications |
| `/api/v1/conversations` | GET | Conversations |
| `/api/v1/analytics/me` | GET | Analytics |
| `/health` | GET | Health check |
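A quick smoke check over the public subset of these endpoints can be scripted directly in k6 (a sketch; the `BASE_URL` default and the `?q=` search parameter are assumptions, and the authenticated routes would additionally need a Bearer token):
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

const BASE_URL = __ENV.BASE_URL || 'http://localhost:8080';

export const options = { vus: 5, duration: '30s' };

export default function () {
  // Public endpoints only; the authenticated routes need an Authorization header.
  const paths = ['/health', '/api/v1/tracks', '/api/v1/tracks/search?q=test'];
  for (const path of paths) {
    const res = http.get(`${BASE_URL}${path}`);
    check(res, { [`${path} is 2xx`]: (r) => r.status >= 200 && r.status < 300 });
  }
  sleep(1);
}
```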
## v1.0 targets (v0.951)
- **P99 < 500ms** on all critical endpoints at 500 req/s (stress_500rps.js)
- **1000 WebSockets**: connections stable for 5 min, delivery rate > 99% (stress_1000ws.js)
- **50 concurrent uploads**: all successful, backpressure respected (uploads.js)
- **GET /tracks**: cursor-based pagination (v0.931) guarantees constant performance regardless of page depth; see the sketch below
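A minimal way to spot-check that constant-performance claim is to walk several cursor pages in a row and record per-page latency (a sketch; the `cursor`/`limit` query parameters and the `next_cursor` response field are assumptions about the v0.931 contract, not confirmed names):
```javascript
import http from 'k6/http';
import { Trend } from 'k6/metrics';

const BASE_URL = __ENV.BASE_URL || 'http://localhost:8080';
// Custom trend: with cursor pagination, page 10 should cost the same as page 1.
const pageLatency = new Trend('tracks_page_duration', true);

export default function () {
  let cursor = '';
  for (let page = 0; page < 10; page++) {
    const res = http.get(`${BASE_URL}/api/v1/tracks?limit=50&cursor=${cursor}`);
    pageLatency.add(res.timings.duration);
    cursor = res.json('next_cursor'); // assumed field name
    if (!cursor) break;
  }
}
```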
## k6 scripts (v0.951)
| Script | Command | Thresholds |
|--------|---------|------------|
| API stress 500 VUs | `k6 run loadtests/backend/stress_500rps.js` | P99 < 500ms (login, tracks, search, products) |
| WebSocket 1000 | `k6 run loadtests/chat/stress_1000ws.js` | ws_connection_failures < 1%, ws_message_failures < 1% |
| Uploads 50 | `k6 run loadtests/backend/uploads.js` | P95 < 5s (simple), P95 < 8s (chunked) |
See [loadtests/README.md](../loadtests/README.md) for the full run instructions.
## pprof command
```bash
# Profile for 30s during a load test
go tool pprof -http=:8081 http://localhost:8080/debug/pprof/profile?seconds=30
```
## Prometheus metrics
The monitoring middlewares expose `http_request_duration_seconds` with `method`, `path`, and `status` labels. Use histogram quantiles for P50/P95/P99, e.g. `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))`.
## Lighthouse v0.982 (Frontend)
**Goal**: Performance 90, Accessibility 90, Best Practices 90 on the critical pages.
### Pages to audit
| Page | Route | Performance target | Accessibility target |
|------|-------|--------------------|----------------------|
| Login | `/login` | 90 | 90 |
| Dashboard | `/dashboard` | 90 | 90 |
| Tracks | `/library` or `/tracks` | 90 | 90 |
| Marketplace | `/marketplace` | 90 | 90 |
| Search | `/search` | 90 | 90 |
| Profile | `/profile` | 90 | 90 |
### Audit procedure
```bash
# Prerequisite: frontend app running (npm run dev, or build + preview)
npx lighthouse http://localhost:4173/ --view --output=html --output-path=./lighthouse-reports/home.html
npx lighthouse http://localhost:4173/login --view --output=html --output-path=./lighthouse-reports/login.html
# Repeat for each critical page
```
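To cover all the pages in one pass, Lighthouse's documented Node API can be driven from a short script (a sketch assuming the `lighthouse` and `chrome-launcher` packages are installed; the route list mirrors the table above):
```javascript
// Run with: node audit.mjs (ESM, for top-level await)
import fs from 'node:fs';
import lighthouse from 'lighthouse';
import * as chromeLauncher from 'chrome-launcher';

const routes = ['/', '/login', '/dashboard', '/library', '/marketplace', '/search', '/profile'];
const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless'] });
fs.mkdirSync('./lighthouse-reports', { recursive: true });

for (const route of routes) {
  const result = await lighthouse(`http://localhost:4173${route}`, {
    port: chrome.port,
    output: 'html',
    onlyCategories: ['performance', 'accessibility', 'best-practices'],
  });
  const name = route === '/' ? 'home' : route.slice(1);
  fs.writeFileSync(`./lighthouse-reports/${name}.html`, result.report);
  // lhr scores are 0..1; the table targets are on the 0..100 scale.
  console.log(route, Math.round(result.lhr.categories.performance.score * 100));
}

await chrome.kill();
```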
### Latest audit
See [config/incus/LIGHTHOUSE_AUDIT_REPORT.md](../config/incus/LIGHTHOUSE_AUDIT_REPORT.md) for the most recent report (2026-01-15). Accessibility 93 and Best Practices 96: the v0.982 targets are met on those criteria. Performance must be revalidated after the NO_LCP fixes.
---
## v1.0.2 results
**Prerequisites**: `docker compose up -d` (backend + PostgreSQL + Redis).
### Load-test fixes (v0.502)
- WebSocket load test: CHAT_ORIGIN now points at the backend `ws://localhost:8080`, WS_URL = `/api/v1/ws`
- Files: `loadtests/config.js`, `loadtests/chat/stress_1000ws.js`, `loadtests/chat/websocket.js`
### Run commands
```bash
k6 run loadtests/backend/stress_500rps.js # 500 req/s, P99 < 500ms
k6 run loadtests/chat/stress_1000ws.js # 1000 WebSockets, < 1% failures
k6 run loadtests/backend/uploads.js # 50 uploads
```
### Results table (to fill in after a run on the target infra)
| Endpoint / Script | P50 | P95 | P99 | Failure rate |
|------------------|-----|-----|-----|--------------|
| stress_500rps (login, tracks, search) | | | | |
| stress_1000ws | | | | |
| uploads | | | | |
---
## v1.0.9 W4 Day 20 — Mixed-scenarios nightly k6
Capacity gate before launch: sustain **1650 concurrent VUs** for 5 minutes on staging without breaking the global thresholds. Scheduled by `.github/workflows/loadtest.yml` at 02:30 UTC; the acceptance bar is 3 consecutive green nights before the launch goes hot.
### Scenarios
Run in parallel via the k6 scenarios block in `scripts/loadtest/k6_mixed_scenarios.js`. Each one uses `executor: constant-vus` so the steady state is unambiguous; a sketch of the wiring follows the table.
| Scenario | VU | Workload | Per-scenario p95 gate |
| ---------- | ---- | ------------------------------------------------------- | --------------------- |
| upload | 100 | initiate + 10 × 1 MiB chunks (synthetic 10 MiB tracks) | global only |
| streaming | 500 | master.m3u8 → 256k playlist → 4 .ts segments loop | < 300 ms |
| browse | 1000 | search 60% / list 30% / detail 10% | < 400 ms |
| checkout | 50 | list products → POST orders (rejected at validation) | < 800 ms |
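The wiring, abridged (a sketch of the options block; the exec stubs stand in for the real flows):
```javascript
const DURATION = __ENV.DURATION || '5m';
const vus = (name, fallback) => Number(__ENV[name] || fallback);

export const options = {
  discardResponseBodies: true, // we assert on status + latency, not payload
  scenarios: {
    upload:    { executor: 'constant-vus', vus: vus('UPLOAD_VUS', 100),   duration: DURATION, exec: 'upload' },
    streaming: { executor: 'constant-vus', vus: vus('STREAM_VUS', 500),   duration: DURATION, exec: 'streaming' },
    browse:    { executor: 'constant-vus', vus: vus('BROWSE_VUS', 1000),  duration: DURATION, exec: 'browse' },
    checkout:  { executor: 'constant-vus', vus: vus('CHECKOUT_VUS', 50),  duration: DURATION, exec: 'checkout' },
  },
};

export function upload()    { /* initiate + 10 × 1 MiB chunk POSTs */ }
export function streaming() { /* master.m3u8 → 256k playlist → 4 .ts segments */ }
export function browse()    { /* 60% search / 30% list / 10% detail */ }
export function checkout()  { /* list products → POST orders */ }
```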
### Global thresholds (acceptance bar)
| Metric | Threshold | Reason |
| -------------------- | -------------------- | ------------------------------------------------- |
| `http_req_duration` | p(95) < 500 ms | Roadmap §Day 20. |
| `http_req_duration`  | p(99) < 1500 ms      | Tail-latency cap; catches one-off sync stalls.     |
| `http_req_failed` | rate < 0.5 % | Roadmap §Day 20. Looser per-scenario for upload + checkout (network + Hyperswitch). |
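Expressed as k6 thresholds, with the per-scenario gates layered on top (k6 tags every request with its scenario name automatically, so tagged sub-metric thresholds work out of the box):
```javascript
export const options = {
  // ...scenarios block as above...
  thresholds: {
    http_req_duration: ['p(95)<500', 'p(99)<1500'], // global gate (roadmap §Day 20)
    http_req_failed: ['rate<0.005'],                // < 0.5 % errors overall
    // Per-scenario p95 gates so a single-flow regression can't hide in the blend:
    'http_req_duration{scenario:streaming}': ['p(95)<300'],
    'http_req_duration{scenario:browse}':    ['p(95)<400'],
    'http_req_duration{scenario:checkout}':  ['p(95)<800'],
  },
};
```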
### How to run locally
```bash
# Against the lab haproxy (no auth required for browse/streaming):
k6 run scripts/loadtest/k6_mixed_scenarios.js \
  --env BASE_URL=http://haproxy.lxd \
  --env STREAM_TRACK_ID=<seed-uuid> \
  --env DURATION=2m \
  --env UPLOAD_VUS=10 --env STREAM_VUS=50 --env BROWSE_VUS=100 --env CHECKOUT_VUS=5

# Full nightly profile against staging:
USER_TOKEN=$(./scripts/issue-loadtest-token.sh) \
k6 run scripts/loadtest/k6_mixed_scenarios.js \
  --env BASE_URL=https://staging.veza.fr \
  --env STREAM_TRACK_ID=<seed-uuid> \
  --env USER_TOKEN="$USER_TOKEN"
```
### Operating notes
- **Override per-scenario VUs** with the `UPLOAD_VUS`, `STREAM_VUS`, `BROWSE_VUS`, `CHECKOUT_VUS` env vars to dial the load down for local runs.
- **Staging-only.** The workflow refuses to run against prod; `BASE_URL` comes from `vars.STAGING_BASE_URL` (or the workflow's `DEFAULT_BASE_URL` env) and never reads from a prod-shaped variable.
- **Token rotation.** `STAGING_LOADTEST_TOKEN` is a long-lived token bound to a dedicated `loadtest@veza.music` user with role=user (no admin powers). Rotate quarterly.
- **Upload-scenario approximation.** The chunked endpoint expects multipart bodies; for load shaping we POST raw 1 MiB chunks with the upload-id header. The costly part of the path (auth + rate-limit + Redis state) is still exercised even though the resulting upload is rejected at the multipart parser; see the sketch below.
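A sketch of that upload flow (the `/api/v1/tracks/upload/...` paths and the `X-Upload-Id` header name are illustrative, not confirmed endpoint names):
```javascript
import http from 'k6/http';
import { check } from 'k6';

const BASE_URL = __ENV.BASE_URL || 'http://localhost:8080';
const AUTH = { Authorization: `Bearer ${__ENV.USER_TOKEN || ''}` };
const CHUNK = new Uint8Array(1024 * 1024).buffer; // raw 1 MiB body, not multipart

export default function () {
  const init = http.post(`${BASE_URL}/api/v1/tracks/upload/initiate`, null, { headers: AUTH });
  const uploadId = init.headers['X-Upload-Id'] || 'synthetic';
  // 10 × 1 MiB chunks = one synthetic 10 MiB track. The multipart parser rejects
  // the body, but auth + rate-limit + Redis state are exercised all the same.
  for (let i = 0; i < 10; i++) {
    const res = http.post(`${BASE_URL}/api/v1/tracks/upload/chunk`, CHUNK, {
      headers: Object.assign({ 'X-Upload-Id': uploadId }, AUTH),
    });
    check(res, { 'chunk handled without a 5xx': (r) => r.status < 500 });
  }
}
```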
### After-run dashboard
The Grafana dashboard `Veza API Overview` (config/grafana/dashboards/api-overview.json) carries the p95/p99 panels. Set the timepicker to the k6 run window to compare. The k6 JSON summary uploaded as a workflow artifact carries the per-scenario breakdown that the dashboard can't show directly.
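That JSON artifact comes from k6's `handleSummary` hook; a minimal version that emits `k6-summary.json` looks like this:
```javascript
// Runs once after the test; each key of the returned object becomes an output
// file, so this writes the full metrics tree (incl. per-scenario sub-metrics).
export function handleSummary(data) {
  return {
    'k6-summary.json': JSON.stringify(data, null, 2),
  };
}
```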
### Acceptance gate (W4 verification)
- 3 consecutive nightly runs green (no threshold violation).
- p95 < 500 ms on the global metric.
- Per-scenario gates met for every flow.
When the gate breaks, the workflow's "Annotate thresholds in summary" step writes the failing values to the Actions step summary so the on-call can triage from a single page; a sketch of that step follows.
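One way that annotate step can be implemented (a Node sketch; it assumes the runner exposes `GITHUB_STEP_SUMMARY`, which Forgejo's runner mirrors from the GitHub Actions contract, and that the script's `summaryTrendStats` includes p(99)):
```javascript
import fs from 'node:fs';

const summary = JSON.parse(fs.readFileSync('k6-summary.json', 'utf8'));
const dur = summary.metrics.http_req_duration.values;
const failed = summary.metrics.http_req_failed.values;

const lines = [
  '### k6 nightly verdict',
  '| metric | value |',
  '| --- | --- |',
  `| p95 | ${dur['p(95)'].toFixed(1)} ms |`,
  `| p99 | ${(dur['p(99)'] ?? NaN).toFixed(1)} ms |`, // NaN if p(99) not configured
  `| failed rate | ${(failed.rate * 100).toFixed(2)} % |`,
];

fs.appendFileSync(process.env.GITHUB_STEP_SUMMARY, lines.join('\n') + '\n');
```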