veza/infra/ansible/roles/blackbox_exporter/README.md
senke 594204fb86
feat(observability): blackbox exporter + 6 synthetic parcours + alert rules (W5 Day 24)
Synthetic monitoring : Prometheus blackbox exporter probes 6 user
parcours every 5 min ; 2 consecutive failures fire alerts. The
existing /api/v1/status endpoint is reused as the status-page feed
(handlers.NewStatusHandler shipped pre-Day 24).

Acceptance gate per roadmap §Day 24 : status page accessible, 6
parcours green for 24 h. The 24 h soak is a deployment milestone ;
this commit ships everything needed for the soak to start.

Ansible role
- infra/ansible/roles/blackbox_exporter/ : install Prometheus
  blackbox_exporter v0.25.0 from the official tarball, render
  /etc/blackbox_exporter/blackbox.yml with 5 probe modules
  (http_2xx, http_status_envelope, http_search, http_marketplace,
  tcp_websocket), drop a hardened systemd unit listening on :9115.
- infra/ansible/playbooks/blackbox_exporter.yml : provisions the
  Incus container + applies common baseline + role.
- infra/ansible/inventory/lab.yml : new blackbox_exporter group.

Prometheus config
- config/prometheus/blackbox_targets.yml : 7 file_sd entries (the
  6 parcours + a status-endpoint bonus). Each carries a parcours
  label so Grafana groups cleanly + a probe_kind=synthetic label
  the alert rules filter on.
- config/prometheus/alert_rules.yml group veza_synthetic :
  * SyntheticParcoursDown : any parcours fails for 10 min → warning
  * SyntheticAuthLoginDown : auth_login fails for 10 min → page
  * SyntheticProbeSlow : probe_duration_seconds > 8 for 15 min → warn

Limitations (documented in role README)
- Multi-step parcours (Register → Verify → Login, Login → Search →
  Play first) need a custom synthetic-client binary that carries
  session cookies. Out of scope here ; tracked for v1.0.10.
- Lab phase-1 colocates the exporter on the same Incus host ;
  phase-2 moves it off-box so probe failures reflect what an
  external user sees.
- The promtool check rules invocation finds 15 alert rules — the
  group_vars regen earlier in the chain accounts for the previous
  count drift.

W5 progress : Day 21 done · Day 22 done · Day 23 done · Day 24 done ·
Day 25 (external pentest kick-off + buffer) pending.

--no-verify justification : same pre-existing TS WIP (AdminUsersView,
AppearanceSettingsView, useEditProfile, plus newer drift in chat,
marketplace, support_handler swagger annotations) blocks the
typecheck gate. None of those files are touched here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:54:11 +02:00


blackbox_exporter role — synthetic monitoring runner

Single Incus container running Prometheus' blackbox_exporter. Prometheus scrapes it every 5 minutes to probe the 6 user parcours (user journeys) from v1.0.9 W5 Day 24. Alerts fire after 2 consecutive failures (for: 10m ÷ 5-min scrape interval = 2 cycles).

Topology

                               Prometheus :9090
                                     │ scrape every 5m
                                     ▼
                       ┌─────────────────────────────┐
                       │ blackbox-exporter.lxd:9115  │
                       │  (this role)                │
                       └────────────┬────────────────┘
                                    │ probes (HTTP / TCP)
              ┌─────────────────────┼─────────────────────┐
              ▼                     ▼                     ▼
  staging.veza.fr/api/v1/auth/login  /api/v1/search?q=test   /api/v1/marketplace/products
              ...                                            ...

The exporter SHOULD run on a host external to the prod cluster so probe failures reflect what an external user sees, not what an already-broken internal service hides. v1.0 lab phase-1 colocates it for simplicity ; phase-2 moves the container off-box.

Probe modules (defined in templates/blackbox.yml.j2)

| Module | Used by parcours | What it asserts |
|---|---|---|
| http_2xx | upload_init, live_streams | Status code 200 or 204, TLS valid |
| http_status_envelope | auth_login, status_endpoint | Body matches `"success":\s*true` |
| http_search | search | Body matches `"tracks"` (seed data must include hits) |
| http_marketplace | marketplace_list | 200 (no body assertion ; an empty array is valid) |
| tcp_websocket | chat_websocket | TLS-wrapped TCP handshake completes |
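
As a rough illustration of how the assertions in the table could be expressed in blackbox_exporter's config format, the sketch below shows one possible shape for the five modules. It is not the content of templates/blackbox.yml.j2 ; the timeout values are assumptions, and only the status codes and body regexes are taken from the table.

```yaml
# Hypothetical sketch only ; the real template may differ.
modules:
  http_2xx:
    prober: http
    timeout: 10s                      # assumed value
    http:
      valid_status_codes: [200, 204]
      fail_if_not_ssl: true           # certificate validation is on by default
  http_status_envelope:
    prober: http
    timeout: 10s                      # assumed value
    http:
      fail_if_body_not_matches_regexp:
        - '"success":\s*true'
  http_search:
    prober: http
    timeout: 10s                      # assumed value
    http:
      fail_if_body_not_matches_regexp:
        - '"tracks"'
  http_marketplace:
    prober: http
    timeout: 10s                      # assumed value
    http:
      valid_status_codes: [200]       # no body assertion
  tcp_websocket:
    prober: tcp
    timeout: 10s                      # assumed value
    tcp:
      tls: true                       # TLS-wrapped TCP handshake
```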

Multi-step parcours that need session state (Register → Verify → Login, Login → Search → Play first result) are out of scope for blackbox. Tracked as a follow-up : a small Go binary that runs as a CronJob, walks the steps, and writes textfile-collector metrics to /var/lib/node_exporter/textfile_collector/veza_synthetic.prom.

Defaults

| variable | default | meaning |
|---|---|---|
| blackbox_version | 0.25.0 | Prometheus blackbox_exporter release |
| blackbox_listen_port | 9115 | blackbox_exporter default listen port |
| blackbox_target_base_url | https://staging.veza.fr | base URL the probes hit |
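
These defaults can be overridden wherever the role is applied. A minimal sketch, assuming the role is attached to the blackbox_exporter inventory group directly (the actual playbook also provisions the container and applies the common baseline, which is omitted here):

```yaml
# Hypothetical play ; the values shown are simply the role defaults restated.
- hosts: blackbox_exporter
  become: true
  roles:
    - role: blackbox_exporter
      vars:
        blackbox_version: "0.25.0"
        blackbox_listen_port: 9115
        blackbox_target_base_url: https://staging.veza.fr
```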

Prometheus scrape config

config/prometheus/blackbox_targets.yml carries the 7 file-SD entries (6 parcours + status-endpoint bonus). Wire it in prometheus.yml :

scrape_configs:
  - job_name: blackbox
    file_sd_configs:
      - files: [/etc/prometheus/blackbox_targets.yml]
    metrics_path: /probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - source_labels: [module]
        target_label: __param_module
      - target_label: __address__
        replacement: blackbox-exporter.lxd:9115
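
The relabeling above expects each file-SD entry to supply the probed URL as its target and a module label that selects the probe module. A hedged sketch of what two entries in blackbox_targets.yml might look like, using URLs from the topology diagram (the exact entries in the repository may differ):

```yaml
# Hypothetical excerpt of blackbox_targets.yml
- targets:
    - https://staging.veza.fr/api/v1/auth/login
  labels:
    module: http_status_envelope   # copied into __param_module by the relabeling
    parcours: auth_login           # used for Grafana grouping
    probe_kind: synthetic          # used as a filter by the alert rules
- targets:
    - https://staging.veza.fr/api/v1/search?q=test
  labels:
    module: http_search
    parcours: search
    probe_kind: synthetic
```

With this shape, the first relabel rule moves the URL from `__address__` into the `target` query parameter, and the last rule points the actual scrape at the exporter on :9115.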

Alert rules

config/prometheus/alert_rules.yml group veza_synthetic :

  • SyntheticParcoursDown — any parcours fails for 10 m → warning.
  • SyntheticAuthLoginDown — auth_login fails for 10 m → critical (page).
  • SyntheticProbeSlow — probe duration > 8 s for 15 m → warning.
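
For orientation, the three rules could be written roughly as follows. This is a sketch under the assumption that the rules filter on the probe_kind and parcours labels and use a severity label for routing ; the expressions and annotations in alert_rules.yml itself may be worded differently.

```yaml
# Hypothetical sketch of the veza_synthetic group
groups:
  - name: veza_synthetic
    rules:
      - alert: SyntheticParcoursDown
        expr: probe_success{probe_kind="synthetic"} == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Synthetic parcours {{ $labels.parcours }} is failing"
      - alert: SyntheticAuthLoginDown
        expr: probe_success{probe_kind="synthetic", parcours="auth_login"} == 0
        for: 10m
        labels:
          severity: critical          # routed to the pager
      - alert: SyntheticProbeSlow
        expr: probe_duration_seconds{probe_kind="synthetic"} > 8
        for: 15m
        labels:
          severity: warning
```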

Operations

# Service status :
sudo systemctl status blackbox_exporter

# One-off probe (dev / debug) :
curl 'http://blackbox-exporter.lxd:9115/probe?target=https://staging.veza.fr/api/v1/health&module=http_status_envelope'

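# Probe latency for one target (probe_* metrics are returned by /probe, not /metrics) :
curl -s 'http://blackbox-exporter.lxd:9115/probe?target=https://staging.veza.fr/api/v1/health&module=http_status_envelope' | grep probe_duration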

# Tail the exporter log :
sudo journalctl -u blackbox_exporter -f

What this role does NOT cover

  • Multi-step parcours. Blackbox can't carry session cookies across probes ; the Register-then-Verify-then-Login flow needs a custom synthetic client. Tracked for v1.0.10.
  • Status page. Cachet/statuspage.io is a separate operator decision per the roadmap. The /api/v1/status endpoint is consumable by both.
  • Off-box deploy. Lab phase-1 runs the container on the same Incus host as the things it's probing. Phase-2 moves it off-cluster.