veza/config/prometheus/alert_rules.yml
senke 199a8efbfe feat(infra): MinIO cross-region replication + DR runbook (v1.0.10 ops item 8)
Closes the "single-region MinIO" gap. The 4-node EC:2 cluster
tolerates 2 simultaneous drive losses, but the region itself
remains a single point of failure: a regional outage (network
partition, DC fire, operator error wiping the cluster) means
losing the data.

New Ansible role minio_replication :
- Wrapper script veza-minio-replicate.sh runs `mc mirror --preserve`
  from the local cluster to a remote S3-compatible target every 6h
  (configurable via OnCalendar).
- Writes textfile-collector metrics on each run (sketched below) :
    veza_minio_replication_last_run_timestamp_seconds
    veza_minio_replication_last_success_timestamp_seconds
    veza_minio_replication_last_duration_seconds
    veza_minio_replication_last_status (1/0)
    veza_minio_replication_target_bytes
- systemd timer with Persistent=true catches up missed runs after
  reboot (this is the disaster-recovery surface, can't afford to
  silently skip ticks).
- Idempotent : `mc alias set` re-applies cleanly, `mc mb
  --ignore-existing` for the target bucket.
- Refuses to run while the vault variables still hold placeholder
  values, so the role can't be applied to prod with bogus
  credentials by accident.
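
The metric-writing tail of the wrapper, as a rough sketch. Variable
names and the .prom filename are illustrative ; $duration and
$target_bytes are computed earlier in the script, and the shipped
version also has to attach the bucket label the alerts group on and
carry the previous success timestamp over on a failed run (both
omitted here) :

    status=$?                    # exit code of `mc mirror --preserve`
    now=$(date +%s)
    tmp=$(mktemp)
    {
      echo "veza_minio_replication_last_run_timestamp_seconds $now"
      [ "$status" -eq 0 ] && \
        echo "veza_minio_replication_last_success_timestamp_seconds $now"
      echo "veza_minio_replication_last_duration_seconds $duration"
      echo "veza_minio_replication_last_status $((status == 0 ? 1 : 0))"
      echo "veza_minio_replication_target_bytes $target_bytes"
    } > "$tmp"
    # atomic rename so node_exporter never scrapes a half-written file
    mv "$tmp" /var/lib/node_exporter/textfile_collector/veza_minio_replication.prom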

Why mc mirror, not MinIO native bucket replication : it works against
any S3-compatible target (Wasabi, Backblaze B2, AWS S3) with just
an access key, whereas MinIO BR/SR (bucket / site replication)
requires the target to be MinIO-managed and bidirectionally
reachable. mc is the lowest common denominator that lets us
decouple from the choice of target operator.
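
Boiled down, the replication path is the one-time alias/bucket setup
plus the mirror itself ; alias and bucket names below are placeholders,
not the values the role templates :

    mc alias set veza-local  https://minio.internal:9000 "$LOCAL_KEY"  "$LOCAL_SECRET"
    mc alias set veza-remote https://s3.example.net      "$REMOTE_KEY" "$REMOTE_SECRET"
    mc mb --ignore-existing veza-remote/veza-backup
    mc mirror --preserve veza-local/veza-prod veza-remote/veza-backup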

Alerts in alert_rules.yml veza_minio_backup group :
- MinioReplicationLastFailed (warning, single failed run)
- MinioReplicationStale (CRITICAL, no success in 12h — past RPO)
- MinioReplicationNeverSucceeded (warning, fresh deploy stuck)
- MinioReplicationTargetShrunk (CRITICAL, > 20% drop in 1h —
  operator-error guard rail)

Runbook docs/runbooks/minio-replication.md covers triage by alert,
common ops tasks (manual sync, pause, credential rotation), and
the manual restore procedure for DR.
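
For orientation only, the restore direction is the same tooling
pointed the other way (aliases as in the sketch above, and only after
the runbook's pre-checks) :

    mc mirror --preserve veza-remote/veza-backup veza-local/veza-prod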

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 00:04:25 +02:00

groups:
- name: veza_critical
rules:
- alert: ServiceDown
expr: up == 0
for: 30s
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is down"
description: "{{ $labels.instance }} has been down for more than 30 seconds."
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate on {{ $labels.job }}"
description: "Error rate is above 5% for the last 5 minutes."
- alert: HighLatencyP99
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency on {{ $labels.job }}"
description: "P99 latency is above 2 seconds for the last 5 minutes."
- alert: RedisUnreachable
expr: redis_up == 0
for: 30s
labels:
severity: critical
annotations:
summary: "Redis is unreachable"
description: "Redis has been unreachable for more than 30 seconds."
# v1.0.9 Day 8: backup integrity. The dr-drill.sh script writes
# textfile-collector metrics on every run. Two failure modes are
# caught:
# 1. last drill reported a failure (success=0)
# 2. drill hasn't run in 8+ days (timer broke, runner offline,
# script crashed before write_metric)
# Both are pages because a backup we haven't proved restorable is
# technical debt waiting for a disaster to bite — finding out at
# restore-time is too late.
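# For orientation, a drill run leaves a textfile snippet shaped roughly
# like this (values and exact label set are illustrative ; dr-drill.sh
# owns the real format) :
#   veza_backup_drill_last_success{stanza="main"} 1
#   veza_backup_drill_last_run_timestamp_seconds{stanza="main"} 1.746e+09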
- name: veza_backup
rules:
- alert: BackupRestoreDrillFailed
expr: veza_backup_drill_last_success == 0
for: 5m
labels:
severity: critical
annotations:
summary: "pgBackRest dr-drill last run failed (stanza={{ $labels.stanza }})"
description: |
The most recent dr-drill.sh execution reported failure
(reason={{ $labels.reason }}). Backups exist but a
restore from them did NOT round-trip the smoke query.
Investigate via: journalctl -u pgbackrest-drill.service -n 200
and consider running the drill manually with --keep to
inspect the restored container before teardown.
runbook_url: "https://veza.fr/runbooks/backup-restore-drill-failed"
- alert: BackupRestoreDrillStale
expr: time() - veza_backup_drill_last_run_timestamp_seconds > 691200 # 8 days
for: 1h
labels:
severity: warning
annotations:
summary: "pgBackRest dr-drill hasn't run in 8+ days"
description: |
The dr-drill timer fires weekly (Sun 04:00 UTC). A run
older than 8 days means the timer is broken, the runner
is offline, or the script crashed before writing its
metrics file. Verify with:
systemctl status pgbackrest-drill.timer
journalctl -u pgbackrest-drill.service -n 200
runbook_url: "https://veza.fr/runbooks/backup-restore-drill-stale"
# v1.0.9 W3 Day 12: distributed MinIO health. EC:2 tolerates 2-drive
# loss before data becomes unavailable, so the alert fires the moment
# one drive is offline — gives us margin to react before the second
# failure exhausts redundancy.
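# Quick manual cross-check from the ops host (assuming an mc alias for
# the local cluster is configured, e.g. veza-local) :
#   mc admin info veza-local    # per-server drive online/offline counts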
- name: veza_minio
rules:
- alert: MinIODriveOffline
# minio_node_drive_online_total counts the drives each server sees as
# online ; it drops below minio_node_drive_total when a drive is offline.
# Both metrics are exposed by every node (set MINIO_PROMETHEUS_AUTH_TYPE=public)
# so a single missing scrape doesn't trip the alert.
expr: min(minio_node_drive_online_total) by (server) < min(minio_node_drive_total) by (server)
for: 2m
labels:
severity: warning
page: "false"
annotations:
summary: "MinIO drive offline on {{ $labels.server }}"
description: |
One or more drives report offline on {{ $labels.server }}. EC:2
still serves reads, but a second drive failure would cause a
data-unavailability event. Investigate within the hour.
ssh {{ $labels.server }} sudo journalctl -u minio -n 200
runbook_url: "https://veza.fr/runbooks/minio-drive-offline"
- alert: MinIONodesUnreachable
# ≥ 2 nodes down on a 4-node EC:2 cluster = redundancy fully consumed.
# Pages the on-call. (The threshold sits AT the 2-drive tolerance, not
# past it : we want the page while the data is still readable, because
# one more failure means it isn't.)
expr: count(up{job="minio"} == 0) >= 2
for: 1m
labels:
severity: critical
page: "true"
annotations:
summary: "Two or more MinIO nodes unreachable"
description: |
EC:2 tolerates 2-drive loss. With 1 drive per node, ≥ 2 nodes
unreachable means we are at-or-past the redundancy ceiling.
Any further failure causes data unavailability. Page now.
runbook_url: "https://veza.fr/runbooks/minio-nodes-unreachable"
# W5+ : Forgejo+Ansible+Incus deploy pipeline. The deploy_app.yml
# playbook writes a textfile-collector .prom file under
# /var/lib/node_exporter/textfile_collector/veza_deploy.prom on every
# deploy attempt. node_exporter scrapes it and exposes the metrics
# via the standard /metrics endpoint, no Pushgateway needed.
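# For orientation, an attempt leaves something like this in the .prom
# file (illustrative values ; deploy_app.yml owns the real label set) :
#   veza_deploy_last_success_timestamp{env="staging"} 1.746e+09
#   veza_deploy_last_failure_timestamp{env="staging"} 1.745e+09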
- name: veza_deploy
rules:
- alert: VezaDeployFailed
# last_failure_timestamp newer than last_success_timestamp.
# 5m soak so a deploy in progress (writes failure THEN switches
# back, which writes success on the next successful deploy)
# doesn't transient-trigger.
expr: |
max(veza_deploy_last_failure_timestamp) by (env) >
max(veza_deploy_last_success_timestamp or vector(0)) by (env)
for: 5m
labels:
severity: critical
page: "true"
annotations:
summary: "Veza deploy to {{ $labels.env }} failed"
description: |
The most recent deploy attempt to {{ $labels.env }} failed
and HAProxy was reverted to the prior color. The failed
color's containers are kept alive for forensics. Inspect:
gh workflow run cleanup-failed.yml -f env={{ $labels.env }} -f color=<failed_color>
once the operator has read the journalctl output.
runbook_url: "https://veza.fr/runbooks/deploy-failed"
- alert: VezaStaleDeploy
# Staging cadence is daily-ish; a 7-day silence smells like
# CI is broken or the team is on holiday with prod still
# serving an old SHA. Prod is monthly-ish so 30 days.
# Two separate alerts because the threshold differs.
expr: |
(time() - max(veza_deploy_last_success_timestamp{env="staging"}) by (env)) > (7 * 86400)
for: 1h
labels:
severity: warning
page: "false"
annotations:
summary: "Staging deploy hasn't succeeded in 7+ days"
description: |
Last successful staging deploy was
{{ $value | humanizeDuration }} ago. Pipeline likely broken
(Forgejo runner offline ? secret expired ?).
- alert: VezaStaleDeployProd
expr: |
(time() - max(veza_deploy_last_success_timestamp{env="prod"}) by (env)) > (30 * 86400)
for: 1h
labels:
severity: warning
page: "false"
annotations:
summary: "Prod deploy hasn't succeeded in 30+ days"
description: |
Last successful prod deploy was {{ $value | humanizeDuration }}
ago. Tag-based release cadence likely stalled.
- alert: VezaFailedColorAlive
# The textfile collector also exposes a custom metric
# `veza_deploy_failed_color_alive{env=...,color=...}` set by
# a small periodic script that scans `incus list` for
# containers in the failed-deploy state. (Stub script lives
# under scripts/observability/scan-failed-colors.sh.)
# Threshold 24h so the operator has at least a working day
# to do post-mortem before the alert fires.
expr: max(veza_deploy_failed_color_alive) by (env, color) > 0
for: 24h
labels:
severity: warning
page: "false"
annotations:
summary: "Failed deploy color {{ $labels.color }} still alive in {{ $labels.env }}"
description: |
A previously-failed-deploy color has been kept alive for
24+ hours. Either complete forensics + run cleanup-failed,
or the next deploy will recycle it automatically.
# v1.0.9 W5 Day 24 : synthetic monitoring (blackbox exporter).
# Each parcours is probed every 5 min ; the 10m `for:` window means
# an alert fires after 2 consecutive failures (per the roadmap
# acceptance gate). `parcours` label carries the human-readable
# name from blackbox_targets.yml so dashboards group cleanly.
- name: veza_synthetic
rules:
- alert: SyntheticParcoursDown
# probe_success is 0 when blackbox couldn't complete the probe.
# The metric is emitted per (instance, parcours) so the alert
# fires per-parcours, letting the on-call see exactly which
# journey is broken without grepping logs.
expr: probe_success{probe_kind="synthetic"} == 0
for: 10m
labels:
severity: warning
page: "false"
annotations:
summary: "Synthetic parcours {{ $labels.parcours }} failing for 10m"
description: |
Blackbox exporter has been unable to complete the
{{ $labels.parcours }} parcours against {{ $labels.instance }}
for 10 minutes (≥ 2 consecutive failures). End-user impact
is likely real — investigate the underlying component
BEFORE the related per-component alert fires.
runbook_url: "https://veza.fr/runbooks/synthetic-parcours-down"
- alert: SyntheticAuthLoginDown
# Login is the gate for everything else ; a single 10m blip
# is critical. Pages.
expr: probe_success{parcours="auth_login"} == 0
for: 10m
labels:
severity: critical
page: "true"
annotations:
summary: "Synthetic auth_login down — login surface is broken"
description: |
The auth_login synthetic parcours has failed for 10+ minutes.
Real users cannot log in. Page now.
runbook_url: "https://veza.fr/runbooks/synthetic-parcours-down"
- alert: SyntheticProbeSlow
# Probe latency budget : 5s for HTTP, 8s for the heavier ones.
# When real-user latency degrades, blackbox is the canary.
expr: probe_duration_seconds{probe_kind="synthetic"} > 8
for: 15m
labels:
severity: warning
page: "false"
annotations:
summary: "Synthetic parcours {{ $labels.parcours }} > 8s for 15m"
description: |
Probe duration exceeded 8 seconds for the past 15 minutes.
Real users are likely seeing visible latency. Cross-check
the SLO burn-rate alerts ; if those are quiet but this
fires, the issue is in the synthetic-only path (DNS,
external dependency).
# v1.0.10 ops item 10 — Business KPI alerts. Infra alerts catch tech
# failures (5xx, latency, queue depth). These catch business failures :
# the platform is technically healthy but users can't sign up, sellers
# don't get paid, revenue trends down. Source counters live in
# internal/monitoring/business_metrics.go ; signups + tracks reuse the
# pre-existing per-feature counters in metrics.go.
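# Counters this group reads, for orientation (label sets as used in the
# expressions below) :
#   veza_users_registered_total
#   veza_business_logins_total{outcome=...}
#   veza_business_orders_total{status=...}
#   veza_business_revenue_cents_total
#   veza_business_account_deletions_total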
- name: veza_business
rules:
- alert: SignupsDropAlarm
# Compares the last hour's signup rate to the same hour last
# week. A signup-flow break (frontend bug, captcha provider
# outage, email-sender broken so the verify link never lands)
# is invisible on the 5xx dashboard but catastrophic for
# growth. The 50% threshold is a heuristic — tune up if the
# weekly seasonality is noisy. Suppressed on weekends because
# the weekend signup baseline is already noisy enough that
# paging here would be all false positives.
expr: |
(
sum(rate(veza_users_registered_total[1h]))
/
sum(rate(veza_users_registered_total[1h] offset 7d))
) < 0.5
and
sum(rate(veza_users_registered_total[1h] offset 7d)) > 0.001
and
(day_of_week() != 0 and day_of_week() != 6)
for: 30m
labels:
severity: warning
page: "false"
annotations:
summary: "Signups dropped >50% vs same hour last week"
description: |
Hourly signup rate is below 50% of the same hour last week.
Likely causes : signup-flow regression on web, captcha
provider outage, email-sender broken (verify link never
arrives), age-gate validation too strict. Check the
signup funnel dashboard and the auth.register span on
the OpenTelemetry collector.
runbook_url: "https://veza.fr/runbooks/signups-drop"
- alert: LoginsFailureSpike
# Sudden spike in failed logins is either a real attack
# (credential stuffing) or an internal bug (auth service
# broken, password-hash mismatch after a migration).
# Triggers on >50 failures/min sustained for 10m. The
# account-takeover signal is the success/failure ratio,
# but the absolute rate is a better page-trigger because
# ratio-based alerts flap during low-traffic hours.
expr: |
sum(rate(veza_business_logins_total{outcome=~"failure_.*"}[5m])) > 50/60
for: 10m
labels:
severity: warning
page: "false"
annotations:
summary: "Login failure rate >50/min for 10m"
description: |
Failed logins spiking. Either a credential-stuffing attack
(expected source IPs concentrated, check the rate-limit
audit log) or the auth service is broken (check the
auth.login span errors and the password-verify code path).
runbook_url: "https://veza.fr/runbooks/login-failures-spike"
- alert: PaymentFailuresSpike
# >20% of orders failing in the last 30 minutes. Real
# threshold here is "Hyperswitch is sick" or "our webhook
# signature verification is broken" — both block revenue
# immediately. The 30m window dampens isolated card declines
# which are normal background noise.
expr: |
(
sum(rate(veza_business_orders_total{status="failed"}[30m]))
/
sum(rate(veza_business_orders_total{status=~"created|completed|failed"}[30m]))
) > 0.2
and
sum(rate(veza_business_orders_total{status=~"created|completed|failed"}[30m])) > 0.01
for: 15m
labels:
severity: critical
page: "true"
annotations:
summary: "Payment failure rate >20% for 15m"
description: |
More than one in five payment attempts failed in the last
30 minutes. Check Hyperswitch dashboard, the
payment.webhook span on the OTEL collector, and verify
the webhook signature secret hasn't been rotated without
updating ours.
runbook_url: "https://veza.fr/runbooks/payment-failures"
- alert: RevenueDropAlarm
# Same shape as SignupsDropAlarm but on revenue cents.
# Catches the case where signups are flat but conversion to
# purchase tanks (broken checkout flow, broken pricing
# display, exclusive-license duplication blocking sales).
expr: |
(
sum(rate(veza_business_revenue_cents_total[1h]))
/
sum(rate(veza_business_revenue_cents_total[1h] offset 7d))
) < 0.4
and
sum(rate(veza_business_revenue_cents_total[1h] offset 7d)) > 0.001
for: 1h
labels:
severity: warning
page: "false"
annotations:
summary: "Revenue dropped >60% vs same hour last week"
description: |
Hourly revenue rate is below 40% of the same hour last
week. Cross-check with PaymentFailuresSpike : if that's
quiet, the issue is upstream of payment (checkout flow,
pricing display, an exclusive-license guard blocking
otherwise-good orders).
runbook_url: "https://veza.fr/runbooks/revenue-drop"
- alert: AccountDeletionEndpointBroken
# No deletions in 24h is suspicious if the platform has
# any meaningful churn. The actual signal we want is "the
# endpoint isn't reachable" — RGPD requires it to stay
# reachable, and if it's silently broken we're non-compliant.
# The threshold is loose : as long as ONE deletion lands
# in 24h, we don't page. If it stays at zero for 48h, page.
# Skip when the platform has fewer than ~50 active users
# (early launch) — the rate is genuinely zero.
expr: |
increase(veza_business_account_deletions_total[48h]) == 0
and
sum(rate(veza_users_registered_total[7d])) > 0.0001
for: 6h
labels:
severity: warning
page: "false"
annotations:
summary: "No account deletions in 48h — endpoint may be broken"
description: |
The /users/me DELETE endpoint hasn't recorded a single
deletion in 48 hours despite ongoing signup activity.
Likely the endpoint is broken (RGPD non-compliance risk)
or the metric instrumentation regressed. Test the
endpoint manually and check the RecordAccountDeletion
call site in account_deletion_handler.go.
runbook_url: "https://veza.fr/runbooks/deletion-endpoint-broken"
# v1.0.10 ops item 9 — Real User Monitoring alerts. Synthetic probes
# already alert on server-side latency ; these alerts catch the
# "users in the wild are seeing it slow even though our infra
# dashboards are green" gap (slow CDN edges, third-party scripts,
# bloated bundle on a route, mobile-CPU regressions). The alerts
# fire when the p75 user experience crosses Google's published
# Web Vitals thresholds for a sustained window.
- name: veza_rum
rules:
- alert: WebVitalsLCPP75Poor
# p75 LCP > 4s for 30m on the same route+device. 4s is the
# Google "poor" threshold ; we page only on the worst
# category because "needs improvement" (2.5s–4s) is a
# backlog signal, not an incident.
expr: |
histogram_quantile(0.75, sum(rate(veza_web_vitals_lcp_seconds_bucket[15m])) by (route, device, le)) > 4.0
and
sum(rate(veza_web_vitals_lcp_seconds_count[15m])) by (route, device) > 0.05
for: 30m
labels:
severity: warning
page: "false"
annotations:
summary: "LCP p75 > 4s on {{ $labels.route }}/{{ $labels.device }} for 30m"
description: |
Real users on route={{ $labels.route }} device={{ $labels.device }}
are seeing Largest Contentful Paint above the Google
"poor" threshold. Usually : a heavy hero image / late-
loading font / large bundle on this route. Check the
bundle-size CI artifact, the CDN cache hit rate for
this route's HTML, and recent third-party script
additions.
runbook_url: "https://veza.fr/runbooks/web-vitals-lcp"
- alert: WebVitalsCLSP75Poor
# p75 CLS > 0.25 for 30m on the same route+device. Layout
# shift > 0.25 is the "poor" category — usually caused by
# late-loading images without dimensions, ad slots inserted
# post-paint, or font swaps.
expr: |
histogram_quantile(0.75, sum(rate(veza_web_vitals_cls_bucket[15m])) by (route, device, le)) > 0.25
and
sum(rate(veza_web_vitals_cls_count[15m])) by (route, device) > 0.05
for: 30m
labels:
severity: warning
page: "false"
annotations:
summary: "CLS p75 > 0.25 on {{ $labels.route }}/{{ $labels.device }} for 30m"
description: |
Real users are seeing Cumulative Layout Shift above the
"poor" threshold on route={{ $labels.route }}. Usually :
an image / iframe / ad without explicit width/height,
a font swap shifting paragraphs down, or content
injected post-paint. Inspect the route's React tree
for new dynamic elements.
runbook_url: "https://veza.fr/runbooks/web-vitals-cls"
- alert: WebVitalsINPP75Poor
# p75 INP > 0.5s for 30m. INP measures interaction
# responsiveness — > 500ms is genuinely sluggish UI.
# Often caused by a heavy event handler or main-thread
# blocking from a third-party script.
expr: |
histogram_quantile(0.75, sum(rate(veza_web_vitals_inp_seconds_bucket[15m])) by (route, device, le)) > 0.5
and
sum(rate(veza_web_vitals_inp_seconds_count[15m])) by (route, device) > 0.05
for: 30m
labels:
severity: warning
page: "false"
annotations:
summary: "INP p75 > 500ms on {{ $labels.route }}/{{ $labels.device }} for 30m"
description: |
Real users see > 500ms response after interaction. Look
at recent commits to the React tree for this route
(heavy onClick handlers, synchronous state updates that
re-render large subtrees), and the third-party scripts
loaded on this route (analytics, chat widgets).
runbook_url: "https://veza.fr/runbooks/web-vitals-inp"
- alert: WebVitalsBeaconsStopped
# No beacons in 30m on a window where we expect them.
# Usually means : frontend instrumentation broke, the
# endpoint is rejecting (CORS / 4xx), or the CDN is
# blocking the POST. Compares to the same-time-of-day
# baseline 24h ago to avoid pages during low-traffic
# nights.
expr: |
sum(rate(veza_web_vitals_beacons_total[15m])) == 0
and
sum(rate(veza_web_vitals_beacons_total[15m] offset 24h)) > 0.05
for: 30m
labels:
severity: warning
page: "false"
annotations:
summary: "RUM beacons stopped flowing for 30m"
description: |
No Web Vitals beacons received in 30 minutes despite
yesterday's same-hour baseline showing traffic. Likely :
frontend webVitals.ts module crashed, the
/api/v1/observability/web-vitals endpoint is rejecting
(check WebVitalsRejectedTotal), or a CDN / WAF rule
is blocking the POST.
runbook_url: "https://veza.fr/runbooks/web-vitals-beacons-stopped"
# v1.0.10 ops item 11 — DB pool monitoring + N+1 detection alerts.
# The pool gauges are fed by StartPoolStatsExporter at 15s ticks.
# The query-count histogram is fed by the N1QueryCounter
# middleware on every HTTP request. Together these alerts catch :
# - Pool exhaustion under load (more requests want a connection
# than the pool has, requests pile up waiting)
# - N+1 regressions on a specific route (the ListXxx handler that
# used to do 1 query now does 200)
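# Series involved, for orientation (route/table values are illustrative ;
# the exporter and middleware own the real label sets) :
#   veza_db_connections{state="in_use"}                      pool gauge, 15s ticks
#   veza_db_request_query_count_bucket{route="...",le="50"}  per-request query count
#   veza_db_n1_suspicions_total{route="..."}                  requests over the N+1 threshold
#   veza_db_slow_queries_total{operation="...",table="..."}   queries over the 1s threshold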
- name: veza_db_pool_n1
rules:
- alert: PoolExhaustionImminent
# in_use ≥ 90% of MaxOpenConns sustained for 5m. The 90%
# is a head-room buffer ; once we hit 100% the next request
# blocks waiting for a free connection and request latency
# spikes. Acting at 90% gives ops a chance to scale out
# before the visible incident.
# MaxOpenConns is configurable via DB_MAX_OPEN_CONNS ; the
# default is 50 in pool.go. The expression hardcodes 50 in
# the threshold because Prometheus can't read app env vars
# — bump this if DB_MAX_OPEN_CONNS is raised.
expr: |
veza_db_connections{state="in_use"} >= 45
for: 5m
labels:
severity: warning
page: "false"
annotations:
summary: "DB pool 90%+ saturated for 5m"
description: |
Pool in_use ≥ 45 / 50. Either the DB is slow (queries
taking longer to release connections) or request volume
spiked. Check the slow-query log + the request-rate
dashboard, and scale the pool via DB_MAX_OPEN_CONNS
if this is a sustained baseline shift.
runbook_url: "https://veza.fr/runbooks/pool-exhaustion"
- alert: PoolStatsExporterStuck
# Pool gauges haven't moved in 5m. Either the exporter
# goroutine died or sql.DB.Stats() is consistently
# returning the same value (a real but rare scenario,
# usually a sign of a hung connection somewhere).
expr: |
changes(veza_db_connections{state="open"}[5m]) == 0
and
rate(http_requests_total[5m]) > 1
for: 10m
labels:
severity: warning
page: "false"
annotations:
summary: "DB pool gauges frozen for 10m despite traffic"
description: |
The pool-stats exporter goroutine appears stuck (no
change in open-connection count for 10m even though
HTTP traffic is flowing). Restart the API process or
inspect the journal for "pool stats sample failed"
warnings.
runbook_url: "https://veza.fr/runbooks/pool-exhaustion"
- alert: N1QuerySpike
# > 3% of recent requests on a route exceed the N+1
# threshold. The histogram bucket boundaries are 1, 5,
# 10, 20, 50, 100, 200, 500, 1000 — anything in the
# >100 range is firmly N+1 territory regardless of route.
# The 3% noise floor avoids paging on the occasional
# admin batch operation that legitimately sweeps tables.
expr: |
(
sum(rate(veza_db_n1_suspicions_total[10m])) by (route)
/
sum(rate(veza_db_request_query_count_count[10m])) by (route)
) > 0.03
and
sum(rate(veza_db_request_query_count_count[10m])) by (route) > 0.05
for: 15m
labels:
severity: warning
page: "false"
annotations:
summary: "N+1 suspicion rate >3% on {{ $labels.route }} for 15m"
description: |
More than 3% of recent requests to {{ $labels.route }}
ran more queries than the N+1 threshold (default 50).
Check the handler for missing Preload calls or
loop-of-singles patterns. The slow-query log will
show the per-row queries.
runbook_url: "https://veza.fr/runbooks/n-plus-one"
- alert: SlowQuerySustained
# > 2 slow queries / minute sustained for 15m on the
# same operation+table. Slow query is anything over the
# PerformanceMonitor threshold (default 1s). One-off
# slow queries are noise ; sustained slow queries on the
# same table point at a missing index or a regressed plan.
expr: |
sum(rate(veza_db_slow_queries_total[10m])) by (operation, table) > 0.033
for: 15m
labels:
severity: warning
page: "false"
annotations:
summary: "Slow {{ $labels.operation }} on {{ $labels.table }} >2/min for 15m"
description: |
Sustained slow queries on table={{ $labels.table }}
operation={{ $labels.operation }}. Likely : missing
index, plan regression after a recent schema change,
or a suddenly-large table hitting a sequential scan.
Check pg_stat_statements + the slow-query log.
runbook_url: "https://veza.fr/runbooks/slow-query-sustained"
# v1.0.10 ops item 8 — MinIO cross-region replication alerts.
# The mc mirror script writes textfile metrics on every run ;
# node_exporter exposes them as veza_minio_replication_*. Two
# failure modes are caught :
# 1. Last run failed (status=0) — actionable now.
# 2. No successful run in > 12h — the timer broke, the host
# is offline, or the wrapper died before writing. Either
# way the backup we have is older than the RPO (6h cadence
# doubled = 12h).
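# The wrapper's textfile output looks roughly like this (values and the
# bucket label value are illustrative ; the script owns the real format) :
#   veza_minio_replication_last_run_timestamp_seconds{bucket="veza-prod"} 1.7465e+09
#   veza_minio_replication_last_success_timestamp_seconds{bucket="veza-prod"} 1.7465e+09
#   veza_minio_replication_last_duration_seconds{bucket="veza-prod"} 42
#   veza_minio_replication_last_status{bucket="veza-prod"} 1
#   veza_minio_replication_target_bytes{bucket="veza-prod"} 5.2e+10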
- name: veza_minio_backup
rules:
- alert: MinioReplicationLastFailed
# The last replication attempt returned non-zero. Warning only,
# no page : a single failed run is recoverable (the next 6h tick
# retries), but the operator should see it surfaced rather than
# discover it at restore time.
expr: |
veza_minio_replication_last_status == 0
for: 5m
labels:
severity: warning
page: "false"
annotations:
summary: "Last MinIO replication run for {{ $labels.bucket }} failed"
description: |
veza-minio-replicate.service last completed with a
non-zero exit. Check `journalctl -u veza-minio-replicate`
for the mc-mirror error (network ? permissions ?
target full ?). The next timer tick will retry — if
this fires repeatedly the issue is configuration,
not transient.
runbook_url: "https://veza.fr/runbooks/minio-replication"
- alert: MinioReplicationStale
# > 12h since the last *successful* replication. Pages at
# critical because once we cross the RPO the backup we hold
# is no longer fit for purpose — restoring from it would
# lose more user data than the documented SLA permits.
# The underlying timer fires every 6h, so any value above
# ~14h indicates either the timer is broken or the script
# has been failing for at least 2 ticks.
expr: |
(time() - veza_minio_replication_last_success_timestamp_seconds) > 12 * 60 * 60
and
veza_minio_replication_last_success_timestamp_seconds > 0
for: 30m
labels:
severity: critical
page: "true"
annotations:
summary: "MinIO replication for {{ $labels.bucket }} hasn't succeeded in 12+ hours"
description: |
No successful replication in 12+ hours. RPO is 6h, so
we're past it. Likely : timer disabled / unit failed
to load, host offline, mc remote alias broken, target
cluster down. Check `systemctl status veza-minio-replicate.timer`
and the wrapper script's last journal output.
runbook_url: "https://veza.fr/runbooks/minio-replication"
- alert: MinioReplicationNeverSucceeded
# Never-succeeded gauge stuck at 0 for 24h+ on a bucket
# that the prod cluster IS writing to. Different from
# Stale above : Stale assumes there was at least one
# success at some point ; this catches the freshly-deployed
# role that hasn't gone through a single green run.
expr: |
veza_minio_replication_last_success_timestamp_seconds == 0
and
veza_minio_replication_last_run_timestamp_seconds > 0
for: 24h
labels:
severity: warning
page: "false"
annotations:
summary: "MinIO replication for {{ $labels.bucket }} has never succeeded"
description: |
The replication script has run at least once but
never returned success. Almost certainly a config
error : wrong remote endpoint, bad credentials, or
a target bucket that doesn't exist and can't be
created (permission). Inspect the most recent
journalctl output for veza-minio-replicate and fix
before the first DR scenario lands.
runbook_url: "https://veza.fr/runbooks/minio-replication"
- alert: MinioReplicationTargetShrunk
# The target bucket size dropped > 20% in 1h. A legitimate
# cause is a lifecycle policy aging-out old data, but a
# 20% drop in an hour is implausible from lifecycle alone.
# More commonly : someone fat-fingered a bucket policy or
# ran `mc rb --force` against the wrong alias.
expr: |
(
(veza_minio_replication_target_bytes offset 1h)
-
veza_minio_replication_target_bytes
)
/
(veza_minio_replication_target_bytes offset 1h)
> 0.2
and
(veza_minio_replication_target_bytes offset 1h) > 1e9
for: 5m
labels:
severity: critical
page: "true"
annotations:
summary: "MinIO replication target bucket {{ $labels.bucket }} shrunk >20% in 1h"
description: |
The replication target bucket lost more than 20% of
its size in the past hour. Almost always operator
error (wrong bucket targeted by mc rb, lifecycle
misconfiguration, or `--remove` flag accidentally
enabled when the source had a mass deletion). Pause
the timer immediately :
systemctl stop veza-minio-replicate.timer
and investigate before the next tick propagates the
damage further.
runbook_url: "https://veza.fr/runbooks/minio-replication"