veza/config
senke f4eb4732dd feat(observability): deploy alerts (4) + failed-color scanner script
Wire the W5+ deploy pipeline into the existing Prometheus alerting
stack. The deploy_app.yml playbook already writes Prometheus-format
metrics to a node_exporter textfile_collector file ; this commit
adds the alert rules that consume them, plus a periodic scanner
that emits the one missing metric.

Alerts (config/prometheus/alert_rules.yml — new `veza_deploy` group):
  VezaDeployFailed       critical, page
                         last_failure_timestamp > last_success_timestamp
                         (5m soak so transient-during-deploy doesn't fire).
                         Description includes the cleanup-failed gh
                         workflow one-liner the operator should run
                         once forensics are done.
  VezaStaleDeploy        warning, no-page
                         staging hasn't deployed in 7+ days.
                         Catches Forgejo runner offline, expired
                         secret, broken pipeline.
  VezaStaleDeployProd    warning, no-page
                         prod equivalent at 30+ days.
  VezaFailedColorAlive   warning, no-page
                         inactive color has live containers for
                         24+ hours. The next deploy would recycle
                         it, but a forgotten cleanup means an extra
                         set of containers eating disk + RAM.

Script (scripts/observability/scan-failed-colors.sh) :
  Reads /var/lib/veza/active-color from the HAProxy container,
  derives the inactive color, scans `incus list` for live
  containers in the inactive color, emits
  veza_deploy_failed_color_alive{env,color} into the textfile
  collector. Designed for a 1-minute systemd timer.
  Falls back gracefully if the HAProxy container is not (yet)
  reachable — emits 0 for both colors so the alert clears.

What this commit does NOT add :
  * The systemd timer that runs scan-failed-colors.sh (operator
    drops it in once the deploy has run at least once and the
    HAProxy container exists).
  * The Prometheus reload — alert_rules.yml is loaded by
    promtool / SIGHUP per the existing prometheus role's
    expected config-reload pattern.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:45:27 +02:00
..
alertmanager feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10) 2026-04-28 01:30:34 +02:00
baremetal/apache state-ownership: delete unused optimisticStoreUpdates.ts file 2026-01-15 19:26:53 +01:00
caddy chore(cleanup): remove veza-chat-server directory and all operational references 2026-02-22 21:13:00 +01:00
docker chore(infra): J6 — mark 3 dormant docker-compose files as deprecated 2026-04-15 12:58:39 +02:00
grafana feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11) 2026-04-28 13:36:55 +02:00
haproxy feat(infra): blue-green deployment via HAProxy 2026-02-23 19:52:19 +01:00
incus chore(cleanup): remove veza-chat-server directory and all operational references 2026-02-22 21:13:00 +01:00
prometheus feat(observability): deploy alerts (4) + failed-color scanner script 2026-04-29 14:45:27 +02:00
ssl fix(infra): HAProxy HTTPS and stats security 2026-02-15 15:58:51 +01:00
env.example v0.9.5 2026-03-06 10:02:53 +01:00
logging.toml docs: add project documentation, logging config, status script 2026-03-18 11:36:36 +01:00
metrics.yaml BASE: completing the initial repo state 2025-12-03 22:56:50 +01:00
prometheus.yml feat(monitoring): add Alertmanager with Slack notifications 2026-02-23 19:54:55 +01:00