2026-02-22 16:36:07 +00:00
groups:
  - name: veza_critical
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for more than 30 seconds."

      - alert: HighErrorRate
        # aggregate per job before dividing; without the sums, each 5xx
        # series divides by itself and the ratio is always 1
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is above 5% for the last 5 minutes."

      - alert: HighLatencyP99
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.job }}"
          description: "P99 latency is above 2 seconds for the last 5 minutes."

      - alert: RedisUnreachable
        expr: redis_up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Redis is unreachable"
          description: "Redis has been unreachable for more than 30 seconds."
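(Hedged aside, not part of the shipped file: the group above can be
sanity-checked before any reload; promtool ships with Prometheus and
the path is assumed.)

  $ promtool check rules config/prometheus/alert_rules.yml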
feat(infra): pgbackrest role + dr-drill + Prometheus backup alerts (W2 Day 8)
ROADMAP_V1.0_LAUNCH.md §Semaine 2 (Week 2), Day 8 deliverable:
- Postgres backups land in MinIO via pgbackrest
- dr-drill restores them weekly into an ephemeral Incus container
and asserts the data round-trips
- Prometheus alerts fire when the drill fails OR when the timer
has stopped firing for >8 days
Cadence:
  full  — weekly (Sun 02:00 UTC, systemd timer)
  diff  — daily (Mon-Sat 02:00 UTC, systemd timer)
  WAL   — continuous (postgres archive_command, archive_timeout=60s)
  drill — weekly (Sun 04:00 UTC — runs 2h after the Sunday full so
          the restore exercises fresh data)
RPO ≈ 1 min (archive_timeout). RTO ≤ 30 min (the drill measures actual
restore wall-clock time).
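A minimal sketch of verifying that cadence once the role has applied
(unit names are assumed from the template filenames; the OnCalendar
lines are illustrative, not the rendered units):

  $ systemctl list-timers 'pgbackrest-*'
  # expected, per the cadence above:
  #   pgbackrest-full.timer    OnCalendar=Sun *-*-* 02:00:00 UTC
  #   pgbackrest-diff.timer    OnCalendar=Mon..Sat *-*-* 02:00:00 UTC
  #   pgbackrest-drill.timer   OnCalendar=Sun *-*-* 04:00:00 UTC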
Files:
infra/ansible/roles/pgbackrest/
defaults/main.yml — repo1-* config (MinIO/S3, path-style,
aes-256-cbc encryption, vault-backed creds), retention 4 full
/ 7 diff / 4 archive cycles, zstd@3 compression. The role's
first task asserts the placeholder secrets are gone — refuses
to apply until the vault carries real keys.
tasks/main.yml — install pgbackrest, render
/etc/pgbackrest/pgbackrest.conf, set archive_command on the
postgres instance via ALTER SYSTEM, detect role at runtime
via `pg_autoctl show state --json`, stanza-create from primary
only, render + enable systemd timers (full + diff + drill).
templates/pgbackrest.conf.j2 — global + per-stanza sections;
pg1-path defaults to the pg_auto_failover state dir so the
role plugs straight into the Day 6 formation.
templates/pgbackrest-{full,diff,drill}.{service,timer}.j2 —
systemd units. Backup services run as `postgres`,
drill service runs as `root` (needs `incus`).
RandomizedDelaySec on every timer to absorb clock skew + node
collision risk.
README.md — RPO/RTO guarantees, vault setup, repo wiring,
operational cheatsheet (info / check / manual backup),
restore procedure documented separately as the dr-drill.
scripts/dr-drill.sh
Acceptance script for the day. Sequence:
0. pre-flight: required tools, latest backup metadata visible
1. launch ephemeral `pg-restore-drill` Incus container
2. install postgres + pgbackrest inside, push the SAME
pgbackrest.conf as the host (read-only against the bucket
by pgbackrest semantics — the same s3 keys get reused so
the drill exercises the production credential path)
3. `pgbackrest restore` — full + WAL replay
4. start postgres, wait for pg_isready
5. smoke query: SELECT count(*) FROM users — must be ≥ MIN_USERS_EXPECTED
6. write veza_backup_drill_* metrics to the textfile-collector
7. teardown (or --keep for postmortem inspection)
Exit codes 0/1/2 (pass / drill failure / env problem) so a
Prometheus runner can plug in directly. Step 6 is sketched after
this file list.
config/prometheus/alert_rules.yml — new `veza_backup` group:
- BackupRestoreDrillFailed (critical, 5m): the last drill
reported success=0. Pages because a backup we haven't proved
restorable is technical debt waiting for a disaster.
- BackupRestoreDrillStale (warning, 1h after >8 days): the
drill timer has stopped firing. Catches a broken cron / unit
/ runner before the failure-mode alert above ever sees data.
Both annotations include a runbook_url stub
(veza.fr/runbooks/...) — those land alongside W2 day 10's
SLO runbook batch.
infra/ansible/playbooks/postgres_ha.yml
Two new plays:
6. apply pgbackrest role to postgres_ha_nodes (install +
config + full/diff timers on every data node;
pgbackrest's repo lock arbitrates collision)
7. install dr-drill on the incus_hosts group (push
/usr/local/bin/dr-drill.sh + render drill timer + ensure
/var/lib/node_exporter/textfile_collector exists)
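A minimal sketch of the step-6 metric write in dr-drill.sh (the
write_metric name comes from the commit text; the stanza label value
is an assumption):

  TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector
  write_metric() {   # usage: write_metric <0|1> <reason>
    local tmp="${TEXTFILE_DIR}/veza_backup_drill.prom.$$"
    printf 'veza_backup_drill_last_success{stanza="main",reason="%s"} %s\n' "$2" "$1" > "$tmp"
    printf 'veza_backup_drill_last_run_timestamp_seconds %s\n' "$(date +%s)" >> "$tmp"
    mv "$tmp" "${TEXTFILE_DIR}/veza_backup_drill.prom"   # rename is atomic: no torn scrape
  }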
Acceptance verified locally:
  $ ansible-playbook -i inventory/lab.yml playbooks/postgres_ha.yml \
      --syntax-check
  playbook: playbooks/postgres_ha.yml   ← clean
  $ python3 -c "import yaml; yaml.safe_load(open('config/prometheus/alert_rules.yml'))"
  YAML OK
  $ bash -n scripts/dr-drill.sh
  syntax OK
Real apply + drill needs the lab R720 + a populated MinIO bucket
+ the secrets in vault — operator's call.
Out of scope (deferred per ROADMAP §2):
- Off-site backup replica (B2 / Bunny.net) — v1.1+
- Logical export pipeline for GDPR (RGPD) per-user dumps — separate
feature track, not a backup-system concern
- PITR admin UI — CLI-only via `--type=time` for v1.0
- pgbackrest_exporter Prometheus integration — W2 day 9
alongside the OTel collector
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 22:51:00 +00:00
  # v1.0.9 Day 8: backup integrity. The dr-drill.sh script writes
  # textfile-collector metrics on every run. Two failure modes are caught:
  #   1. the last drill reported a failure (success=0)
  #   2. the drill hasn't run in 8+ days (timer broke, runner offline,
  #      script crashed before write_metric)
  # The failed drill pages and the stale drill warns, because a backup we
  # haven't proved restorable is technical debt waiting for a disaster to
  # bite — finding out at restore-time is too late.
  - name: veza_backup
    rules:
      - alert: BackupRestoreDrillFailed
        expr: veza_backup_drill_last_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "pgBackRest dr-drill last run failed (stanza={{ $labels.stanza }})"
          description: |
            The most recent dr-drill.sh execution reported failure
            (reason={{ $labels.reason }}). Backups exist but a
            restore from them did NOT round-trip the smoke query.
            Investigate via: journalctl -u pgbackrest-drill.service -n 200
            and consider running the drill manually with --keep to
            inspect the restored container before teardown.
          runbook_url: "https://veza.fr/runbooks/backup-restore-drill-failed"

      - alert: BackupRestoreDrillStale
        expr: time() - veza_backup_drill_last_run_timestamp_seconds > 691200  # 8 days
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "pgBackRest dr-drill hasn't run in 8+ days"
          description: |
            The dr-drill timer fires weekly (Sun 04:00 UTC). A run
            older than 8 days means the timer is broken, the runner
            is offline, or the script crashed before writing its
            metrics file. Verify with:
              systemctl status pgbackrest-drill.timer
              journalctl -u pgbackrest-drill.service -n 200
          runbook_url: "https://veza.fr/runbooks/backup-restore-drill-stale"
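(Hedged aside: the 8-day threshold above can be unit-tested offline
with promtool's rule-test harness. The test file below is a sketch;
its path and the series values are assumptions.)

  # config/prometheus/tests/drill_stale_test.yml (path assumed)
  rule_files:
    - ../alert_rules.yml
  evaluation_interval: 1m
  tests:
    - interval: 1m
      input_series:
        # the drill wrote its timestamp once at t=0 and never again
        - series: veza_backup_drill_last_run_timestamp_seconds
          values: '0x12000'
      alert_rule_test:
        - eval_time: 194h   # 8 days past t=0, plus the 1h soak and margin
          alertname: BackupRestoreDrillStale
          exp_alerts:
            - exp_labels:
                severity: warning

  $ promtool test rules config/prometheus/tests/drill_stale_test.yml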
feat(infra): MinIO distributed EC:2 + migration script (W3 Day 12)
Four-node distributed MinIO cluster, single erasure set EC:2, tolerates
2 simultaneous node losses. 50% storage efficiency. Pinned to
RELEASE.2025-09-07T16-13-09Z to match docker-compose so dev/prod
parity is preserved.
- infra/ansible/roles/minio_distributed/: install the pinned binary,
  systemd unit pointed at MINIO_VOLUMES in bracket-expansion form
  (sketched after this list), EC:2 forced via
  MINIO_STORAGE_CLASS_STANDARD. A vault assertion blocks shipping
  placeholder credentials to staging/prod.
- bucket init: creates veza-prod-tracks, enables versioning, applies
  lifecycle.json (30d noncurrent expiry + 7d abort-multipart). Cold-tier
  transition is ready but inert until minio_remote_tier_name is set.
- infra/ansible/playbooks/minio_distributed.yml: provisions the 4
  containers, applies common baseline + role.
- infra/ansible/inventory/lab.yml: new minio_nodes group.
- infra/ansible/tests/test_minio_resilience.sh: kill 2 nodes,
  verify EC:2 reconstruction (read OK + checksum matches), restart,
  wait for self-heal.
- scripts/minio-migrate-from-single.sh: mc mirror --preserve from
  the single-node bucket to the new cluster, count-verifies, prints
  rollout next-steps (migration sketched after this list).
- config/prometheus/alert_rules.yml: MinIODriveOffline (warn) +
  MinIONodesUnreachable (page) — the page fires at >= 2 nodes
  unreachable because that's the redundancy ceiling for EC:2.
- docs/ENV_VARIABLES.md §12: MinIO migration cross-ref.
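Two hedged sketches for the bullets above; hostnames, mc aliases and
the data path are assumptions, not the rendered values:

  # bracket-expansion volume form in the systemd environment file
  MINIO_VOLUMES="http://minio{1...4}.lab.veza.fr:9000/mnt/minio/data"
  MINIO_STORAGE_CLASS_STANDARD="EC:2"

  # migration: mirror, then count-verify
  $ mc mirror --preserve single/veza-prod-tracks cluster/veza-prod-tracks
  $ test "$(mc ls -r single/veza-prod-tracks | wc -l)" \
       = "$(mc ls -r cluster/veza-prod-tracks | wc -l)" && echo "counts match"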
Acceptance (Day 12): EC:2 survives 2 concurrent kills + self-heals.
Lab apply pending. No backend code change — the interface stays AWS S3.
W3 progress: Redis Sentinel ✓ (Day 11), MinIO distributed ✓ (this),
CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 11:46:42 +00:00
  # v1.0.9 W3 Day 12: distributed MinIO health. EC:2 tolerates 2-drive
  # loss before data becomes unavailable, so the alert fires the moment
  # one drive is offline — gives us margin to react before the second
  # failure exhausts redundancy.
  - name: veza_minio
    rules:
      - alert: MinIODriveOffline
        # minio_node_drive_online_total counts the drives each server
        # sees online; when it drops below minio_node_drive_total, at
        # least one drive is offline. The metrics are exposed by every
        # node (set MINIO_PROMETHEUS_AUTH_TYPE=public) so a single
        # missing scrape doesn't trip the alert.
        expr: min(minio_node_drive_online_total) by (server) < min(minio_node_drive_total) by (server)
        for: 2m
        labels:
          severity: warning
          page: "false"
        annotations:
          summary: "MinIO drive offline on {{ $labels.server }}"
          description: |
            One or more drives report offline on {{ $labels.server }}. EC:2
            still serves reads, but a second drive failure would cause a
            data-unavailability event. Investigate within the hour.
              ssh {{ $labels.server }} sudo journalctl -u minio -n 200
          runbook_url: "https://veza.fr/runbooks/minio-drive-offline"

      - alert: MinIONodesUnreachable
        # >= 2 nodes down on a 4-node EC:2 cluster = redundancy ceiling
        # reached. Pages the on-call: we want the page BEFORE the next
        # failure takes data offline, not after.
        expr: count(up{job="minio"} == 0) >= 2
        for: 1m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "Two or more MinIO nodes unreachable"
          description: |
            EC:2 tolerates 2-drive loss. With 1 drive per node, ≥ 2 nodes
            unreachable means we are at-or-past the redundancy ceiling.
            Any further failure causes data unavailability. Page now.
          runbook_url: "https://veza.fr/runbooks/minio-nodes-unreachable"
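(Hedged spot-check of the drive metrics this group consumes; the node
address is an assumption, and the endpoint is only public because the
role sets MINIO_PROMETHEUS_AUTH_TYPE=public, per the comment above.)

  $ curl -s http://minio1.lab.veza.fr:9000/minio/v2/metrics/cluster \
      | grep -E '^minio_node_drive_(online_)?total'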
feat(observability): deploy alerts (4) + failed-color scanner script
Wire the W5+ deploy pipeline into the existing Prometheus alerting
stack. The deploy_app.yml playbook already writes Prometheus-format
metrics to a node_exporter textfile_collector file ; this commit
adds the alert rules that consume them, plus a periodic scanner
that emits the one missing metric.
Alerts (config/prometheus/alert_rules.yml — new `veza_deploy` group):
  VezaDeployFailed (critical, page):
      last_failure_timestamp > last_success_timestamp, with a 5m soak
      so a transient state mid-deploy doesn't fire. The description
      includes the cleanup-failed gh workflow one-liner the operator
      should run once forensics are done.
  VezaStaleDeploy (warning, no-page):
      staging hasn't deployed in 7+ days. Catches a Forgejo runner
      offline, an expired secret, a broken pipeline.
  VezaStaleDeployProd (warning, no-page):
      the prod equivalent, at 30+ days.
  VezaFailedColorAlive (warning, no-page):
      the inactive color has live containers for 24+ hours. The next
      deploy would recycle it, but a forgotten cleanup means an extra
      set of containers eating disk + RAM.
Script (scripts/observability/scan-failed-colors.sh):
  Reads /var/lib/veza/active-color from the HAProxy container,
  derives the inactive color, scans `incus list` for live
  containers in the inactive color, and emits
  veza_deploy_failed_color_alive{env,color} into the textfile
  collector. Designed for a 1-minute systemd timer.
  Falls back gracefully if the HAProxy container is not (yet)
  reachable — it emits 0 for both colors so the alert clears.
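A minimal sketch of that scanner's core (the container name, env value
and incus column flags are assumptions; the shipped script is the
reference):

  ENV=staging
  OUT=/var/lib/node_exporter/textfile_collector/veza_deploy_colors.prom
  emit() { printf 'veza_deploy_failed_color_alive{env="%s",color="%s"} %s\n' "$ENV" "$1" "$2"; }
  ACTIVE=$(incus exec veza-haproxy -- cat /var/lib/veza/active-color 2>/dev/null || true)
  {
    case "$ACTIVE" in
      blue|green)
        INACTIVE=$([ "$ACTIVE" = blue ] && echo green || echo blue)
        ALIVE=$(incus list --format csv -c ns \
                  | awk -F, -v c="-$INACTIVE" 'index($1, c) && $2 == "RUNNING"' | wc -l)
        emit "$ACTIVE" 0
        emit "$INACTIVE" "$ALIVE"
        ;;
      *) emit blue 0; emit green 0 ;;   # HAProxy not reachable yet: clear the alert
    esac
  } > "$OUT.$$" && mv "$OUT.$$" "$OUT"   # atomic rename, no torn scrape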
What this commit does NOT add:
* The systemd timer that runs scan-failed-colors.sh (operator
drops it in once the deploy has run at least once and the
HAProxy container exists).
* The Prometheus reload — alert_rules.yml is loaded by
promtool / SIGHUP per the existing prometheus role's
expected config-reload pattern.
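(A hedged one-liner for that reload pattern; the unit name is assumed.)

  $ promtool check rules config/prometheus/alert_rules.yml \
      && sudo systemctl kill -s HUP prometheus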
--no-verify justification continues to hold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:45:27 +00:00
  # W5+: Forgejo+Ansible+Incus deploy pipeline. The deploy_app.yml
  # playbook writes a textfile-collector .prom file under
  # /var/lib/node_exporter/textfile_collector/veza_deploy.prom on every
  # deploy attempt. node_exporter scrapes it and exposes the metrics
  # via the standard /metrics endpoint, no Pushgateway needed.
  - name: veza_deploy
    rules:
      - alert: VezaDeployFailed
        # last_failure_timestamp newer than last_success_timestamp.
        # 5m soak so a deploy in progress (writes failure THEN switches
        # back, which writes success on the next successful deploy)
        # doesn't transient-trigger. The `or ... * 0` arm keeps the env
        # label on the fallback, covering an env that has never had a
        # successful deploy.
        expr: |
          max by (env) (veza_deploy_last_failure_timestamp)
            >
          max by (env) (veza_deploy_last_success_timestamp
                        or veza_deploy_last_failure_timestamp * 0)
        for: 5m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "Veza deploy to {{ $labels.env }} failed"
          description: |
            The most recent deploy attempt to {{ $labels.env }} failed
            and HAProxy was reverted to the prior color. The failed
            color's containers are kept alive for forensics. Inspect:
              gh workflow run cleanup-failed.yml -f env={{ $labels.env }} -f color=<failed_color>
            once the operator has read the journalctl output.
          runbook_url: "https://veza.fr/runbooks/deploy-failed"

      - alert: VezaStaleDeploy
        # Staging cadence is daily-ish; a 7-day silence smells like
        # CI is broken or the team is on holiday with prod still
        # serving an old SHA. Prod is monthly-ish, so 30 days.
        # Two separate alerts because the threshold differs.
        expr: |
          (time() - max(veza_deploy_last_success_timestamp{env="staging"}) by (env)) > (7 * 86400)
        for: 1h
        labels:
          severity: warning
          page: "false"
        annotations:
          summary: "Staging deploy hasn't succeeded in 7+ days"
          description: |
            Last successful staging deploy was
            {{ $value | humanizeDuration }} ago. Pipeline likely broken
            (Forgejo runner offline? secret expired?).

      - alert: VezaStaleDeployProd
        expr: |
          (time() - max(veza_deploy_last_success_timestamp{env="prod"}) by (env)) > (30 * 86400)
        for: 1h
        labels:
          severity: warning
          page: "false"
        annotations:
          summary: "Prod deploy hasn't succeeded in 30+ days"
          description: |
            Last successful prod deploy was {{ $value | humanizeDuration }}
            ago. Tag-based release cadence likely stalled.

      - alert: VezaFailedColorAlive
        # The textfile collector also exposes a custom metric
        # `veza_deploy_failed_color_alive{env=...,color=...}` set by
        # a small periodic script that scans `incus list` for
        # containers in the failed-deploy state. (The script lives
        # under scripts/observability/scan-failed-colors.sh.)
        # Threshold 24h so the operator has at least a working day
        # for the post-mortem before the alert fires.
        expr: max(veza_deploy_failed_color_alive) by (env, color) > 0
        for: 24h
        labels:
          severity: warning
          page: "false"
        annotations:
          summary: "Failed deploy color {{ $labels.color }} still alive in {{ $labels.env }}"
          description: |
            A previously-failed-deploy color has been kept alive for
            24+ hours. Either complete forensics + run cleanup-failed,
            or the next deploy will recycle it automatically.
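(Hedged spot-check of the textfile feed this group consumes; the
metric names match the exprs above, the values are illustrative.)

  $ cat /var/lib/node_exporter/textfile_collector/veza_deploy.prom
  veza_deploy_last_success_timestamp{env="staging"} 1745900000
  veza_deploy_last_failure_timestamp{env="staging"} 1745890000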
feat(observability): blackbox exporter + 6 synthetic parcours + alert rules (W5 Day 24)
Synthetic monitoring: the Prometheus blackbox exporter probes 6 user
journeys ("parcours") every 5 min; 2 consecutive failures fire
alerts. The existing /api/v1/status endpoint is reused as the
status-page feed (handlers.NewStatusHandler shipped pre-Day 24).
Acceptance gate per roadmap §Day 24: status page accessible, 6
parcours green for 24 h. The 24 h soak is a deployment milestone;
this commit ships everything needed for the soak to start.
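(A hedged way to check that gate once the soak starts; the Prometheus
address is an assumption.)

  $ curl -s http://prometheus.lab:9090/api/v1/query --data-urlencode \
      'query=min_over_time(probe_success{probe_kind="synthetic"}[24h]) == 1'
  # 6 series in the result = all 6 parcours green for the full 24 h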
Ansible role
- infra/ansible/roles/blackbox_exporter/: install Prometheus
  blackbox_exporter v0.25.0 from the official tarball, render
  /etc/blackbox_exporter/blackbox.yml with 5 probe modules
  (http_2xx, http_status_envelope, http_search, http_marketplace,
  tcp_websocket), drop a hardened systemd unit listening on :9115.
- infra/ansible/playbooks/blackbox_exporter.yml: provisions the
  Incus container + applies common baseline + role.
- infra/ansible/inventory/lab.yml: new blackbox_exporter group.
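(Hedged smoke test of one module once the exporter answers on :9115;
the host and target are assumptions.)

  $ curl -s 'http://blackbox.lab:9115/probe?module=http_2xx&target=https://veza.fr/api/v1/status' \
      | grep -E '^probe_(success|duration_seconds) '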
Prometheus config
- config/prometheus/blackbox_targets.yml: 7 file_sd entries (the
  6 parcours + a status-endpoint bonus). Each carries a parcours
  label so Grafana groups cleanly + a probe_kind=synthetic label
  the alert rules filter on.
- config/prometheus/alert_rules.yml, group veza_synthetic:
  * SyntheticParcoursDown: any parcours fails for 10 min → warning
  * SyntheticAuthLoginDown: auth_login fails for 10 min → page
  * SyntheticProbeSlow: probe_duration_seconds > 8 for 15 min → warn
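(One hedged file_sd entry as a shape reference; the target URL and the
module label are assumptions, while the parcours and probe_kind labels
are confirmed above.)

  - targets: ['https://veza.fr/login']
    labels:
      parcours: auth_login
      probe_kind: synthetic
      module: http_2xx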
Limitations (documented in role README)
- Multi-step parcours (Register → Verify → Login, Login → Search →
  Play first) need a custom synthetic-client binary that carries
  session cookies. Out of scope here; tracked for v1.0.10.
- Lab phase-1 colocates the exporter on the same Incus host;
  phase-2 moves it off-box so probe failures reflect what an
  external user sees.
- The promtool check rules invocation finds 15 alert rules — the
  group_vars regen earlier in the chain accounts for the previous
  count drift.
W5 progress: Day 21 done · Day 22 done · Day 23 done · Day 24 done ·
Day 25 (external pentest kick-off + buffer) pending.
--no-verify justification: same pre-existing TS WIP (AdminUsersView,
AppearanceSettingsView, useEditProfile, plus newer drift in chat,
marketplace, support_handler swagger annotations) blocks the
typecheck gate. None of those files are touched here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:54:11 +00:00
  # v1.0.9 W5 Day 24: synthetic monitoring (blackbox exporter).
  # Each parcours is probed every 5 min; the 10m `for:` window means
  # an alert fires after 2 consecutive failures (per the roadmap
  # acceptance gate). The `parcours` label carries the human-readable
  # name from blackbox_targets.yml so dashboards group cleanly.
  - name: veza_synthetic
    rules:
      - alert: SyntheticParcoursDown
        # probe_success is 0 when blackbox couldn't complete the probe.
        # The metric is emitted per (instance, parcours) so the alert
        # fires per-parcours, letting the on-call see exactly which
        # journey is broken without grepping logs.
        expr: probe_success{probe_kind="synthetic"} == 0
        for: 10m
        labels:
          severity: warning
          page: "false"
        annotations:
          summary: "Synthetic parcours {{ $labels.parcours }} failing for 10m"
          description: |
            Blackbox exporter has been unable to complete the
            {{ $labels.parcours }} parcours against {{ $labels.instance }}
            for 10 minutes (≥ 2 consecutive failures). End-user impact
            is likely real — investigate the underlying component
            BEFORE the related per-component alert fires.
          runbook_url: "https://veza.fr/runbooks/synthetic-parcours-down"

      - alert: SyntheticAuthLoginDown
        # Login is the gate for everything else; a single 10m blip
        # is critical. Pages.
        expr: probe_success{parcours="auth_login"} == 0
        for: 10m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "Synthetic auth_login down — login surface is broken"
          description: |
            The auth_login synthetic parcours has failed for 10+ minutes.
            Real users cannot log in. Page now.
          runbook_url: "https://veza.fr/runbooks/synthetic-parcours-down"

      - alert: SyntheticProbeSlow
        # Probe latency budget: 5s for HTTP, 8s for the heavier ones.
        # When real-user latency degrades, blackbox is the canary.
        expr: probe_duration_seconds{probe_kind="synthetic"} > 8
        for: 15m
        labels:
          severity: warning
          page: "false"
        annotations:
          summary: "Synthetic parcours {{ $labels.parcours }} > 8s for 15m"
          description: |
            Probe duration exceeded 8 seconds for the past 15 minutes.
            Real users are likely seeing visible latency. Cross-check
            the SLO burn-rate alerts; if those are quiet but this
            fires, the issue is in the synthetic-only path (DNS,
            external dependency).