veza/infra/ansible/roles/blackbox_exporter/README.md

# `blackbox_exporter` role — synthetic monitoring runner
Single Incus container running Prometheus' `blackbox_exporter`. Prometheus scrapes it every 5 minutes to probe the 6 user parcours (journeys) from v1.0.9 W5 Day 24. Alerts fire after 2 consecutive failures (`for: 10m` ÷ 5-min scrape interval = 2 cycles).
## Topology
```
                Prometheus :9090
                        │ scrape every 5m
           ┌─────────────────────────────┐
           │ blackbox-exporter.lxd:9115  │
           │         (this role)         │
           └────────────┬────────────────┘
                        │ probes (HTTP / TCP)
  ┌─────────────────────┼─────────────────────┐
  ▼                     ▼                     ▼
  staging.veza.fr/api/v1/auth/login   /api/v1/search?q=test   /api/v1/marketplace/products
  ...                                 ...
```
The exporter SHOULD run on a host **external** to the prod cluster so that probe failures reflect what an external user sees, not what an already-broken internal service hides. v1.0 lab phase-1 colocates it for simplicity; phase-2 moves the container off-box.
## Probe modules (defined in `templates/blackbox.yml.j2`)
| Module | Used by parcours | What it asserts |
| ---------------------- | ---------------------- | ------------------------------------------------------ |
| `http_2xx` | upload_init, live_streams | Status code 200 or 204, TLS valid |
| `http_status_envelope` | auth_login, status_endpoint | Body matches `"success":\s*true` |
| `http_search` | search | Body matches `"tracks"` (seed data must include hits) |
| `http_marketplace` | marketplace_list | Status code 200 (no body assertion; an empty array is valid) |
| `tcp_websocket` | chat_websocket | TLS-wrapped TCP handshake completes |

Multi-step parcours that need session state (Register → Verify → Login, Login → Search → Play first result) are **out of scope** for blackbox. The tracked follow-up is a small Go binary that runs as a CronJob, walks the steps, and writes textfile-collector metrics to `/var/lib/node_exporter/textfile_collector/veza_synthetic.prom`.
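For orientation, three of the modules in the table could be declared roughly as follows. This is a sketch written against the upstream `blackbox_exporter` configuration format, not a copy of `templates/blackbox.yml.j2`; the `timeout` values and the exact set of options are assumptions.

```yaml
# Sketch only: option values are assumptions, not the role's rendered config.
modules:
  http_2xx:
    prober: http
    timeout: 10s                  # assumed value
    http:
      valid_status_codes: [200, 204]
      fail_if_not_ssl: true       # fail the probe when the target is not served over TLS
  http_status_envelope:
    prober: http
    timeout: 10s                  # assumed value
    http:
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - '"success":\s*true'     # the body assertion from the table above
  tcp_websocket:
    prober: tcp
    tcp:
      tls: true                   # complete a TLS handshake on top of the TCP connect
```

Certificate verification is on by default in the exporter's TLS client settings, so "TLS valid" needs no extra option unless `insecure_skip_verify` has been set.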
## Defaults
| Variable | Default | Meaning |
| -------------------------- | ----------------------------- | ---------------------------------------- |
| `blackbox_version` | `0.25.0` | Prometheus blackbox_exporter release |
| `blackbox_listen_port` | `9115` | Port the exporter listens on (the upstream blackbox_exporter default) |
| `blackbox_target_base_url` | `https://staging.veza.fr` | Base URL the probes hit |
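
These defaults can be overridden where the role is applied, for example to point the probes at another environment. A minimal sketch, assuming the role is applied to a `blackbox_exporter` inventory group; the play name and the alternate URL are hypothetical and not taken from the repository's playbook.

```yaml
# Hypothetical play: the host group and the override value are illustrative.
- name: Deploy blackbox_exporter with a non-default probe target
  hosts: blackbox_exporter
  become: true
  roles:
    - role: blackbox_exporter
      vars:
        blackbox_target_base_url: https://preprod.example.org   # placeholder URL
        blackbox_listen_port: 9115                              # same as the role default
```

If the listen port is changed, the `replacement` address in the Prometheus scrape config has to change with it.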
## Prometheus scrape config
`config/prometheus/blackbox_targets.yml` carries the 7 file-SD entries (6 parcours + status-endpoint bonus). Wire it into `prometheus.yml`:
```yaml
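scrape_configs:
  - job_name: blackbox
    file_sd_configs:
      - files: [/etc/prometheus/blackbox_targets.yml]
    metrics_path: /probe
    relabel_configs:
      # The file-SD target (the URL to probe) becomes the ?target= query parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # The per-target "module" label selects the blackbox module via ?module=.
      - source_labels: [module]
        target_label: __param_module
      # Send the actual scrape to the exporter, not to the probed URL.
      - target_label: __address__
        replacement: blackbox-exporter.lxd:9115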
```
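
The relabelling above expects each file-SD entry to carry a `module` label. A single entry in `blackbox_targets.yml` might look like the sketch below; the `parcours` and `probe_kind` label names are assumptions about how the targets are tagged, and the values are illustrative.

```yaml
# Sketch of one file-SD entry; label names other than "module" are assumptions.
- targets:
    - https://staging.veza.fr/api/v1/search?q=test
  labels:
    module: http_search       # consumed by the relabel rule that sets __param_module
    parcours: search          # assumed label for grouping by parcours
    probe_kind: synthetic     # assumed label for alert rules to filter on
```

After relabelling, Prometheus would request `/probe?target=https://staging.veza.fr/api/v1/search?q=test&module=http_search` from `blackbox-exporter.lxd:9115`.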
## Alert rules
`config/prometheus/alert_rules.yml` group `veza_synthetic`:
- `SyntheticParcoursDown`: any parcours fails for 10 min → warning.
- `SyntheticAuthLoginDown`: auth_login fails for 10 min → critical (page).
- `SyntheticProbeSlow`: probe duration > 8 s for 15 min → warning.
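
The three rules could be expressed along the following lines. This is a sketch, not the contents of `alert_rules.yml`: it assumes the targets carry `parcours` and `probe_kind="synthetic"` labels and that severity is conveyed through a `severity` label. `probe_success` and `probe_duration_seconds` are the standard metrics returned by the exporter.

```yaml
# Sketch only: label selectors and severity labels are assumptions.
groups:
  - name: veza_synthetic
    rules:
      - alert: SyntheticParcoursDown
        expr: probe_success{probe_kind="synthetic"} == 0
        for: 10m
        labels:
          severity: warning
      - alert: SyntheticAuthLoginDown
        expr: probe_success{parcours="auth_login"} == 0
        for: 10m
        labels:
          severity: critical
      - alert: SyntheticProbeSlow
        expr: probe_duration_seconds{probe_kind="synthetic"} > 8
        for: 15m
        labels:
          severity: warning
```

A file like this can be validated with `promtool check rules` before Prometheus is reloaded.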
## Operations
```bash
# Service status:
sudo systemctl status blackbox_exporter
# One-off probe (dev / debug):
curl 'http://blackbox-exporter.lxd:9115/probe?target=https://staging.veza.fr/api/v1/health&module=http_status_envelope'
# Probe latency (probe_* metrics are returned by /probe, not by the exporter's own /metrics):
curl -s 'http://blackbox-exporter.lxd:9115/probe?target=https://staging.veza.fr/api/v1/health&module=http_status_envelope' | grep probe_duration
# Tail the exporter log:
sudo journalctl -u blackbox_exporter -f
```
## What this role does NOT cover
- **Multi-step parcours.** Blackbox can't carry session cookies across probes; the Register-then-Verify-then-Login flow needs a custom synthetic client. Tracked for v1.0.10.
- **Status page.** Choosing between Cachet and statuspage.io is a separate operator decision per the roadmap. The `/api/v1/status` endpoint is consumable by both.
- **Off-box deploy.** Lab phase-1 runs the container on the same Incus host as the services it probes. Phase-2 moves it off-cluster.