veza/config
senke 199a8efbfe feat(infra): MinIO cross-region replication + DR runbook (v1.0.10 ops item 8)
Closes the "single-region MinIO" gap. The 4-node EC:2 cluster
tolerates 2 simultaneous drive losses but a regional outage
(network partition, DC fire, operator error wiping the cluster)
remains a single point of data loss.

New Ansible role minio_replication :
- Wrapper script veza-minio-replicate.sh runs `mc mirror --preserve`
  from the local cluster to a remote S3-compatible target every 6h
  (configurable via OnCalendar).
- Writes textfile-collector metrics on each run :
    veza_minio_replication_last_run_timestamp_seconds
    veza_minio_replication_last_success_timestamp_seconds
    veza_minio_replication_last_duration_seconds
    veza_minio_replication_last_status (1/0)
    veza_minio_replication_target_bytes
- systemd timer with Persistent=true catches up missed runs after
  reboot (this is the disaster-recovery surface, can't afford to
  silently skip ticks).
- Idempotent : `mc alias set` re-applies cleanly, `mc mb
  --ignore-existing` for the target bucket.
- Refuses to run with vault placeholders to avoid accidental
  prod application against bogus credentials.

Why mc mirror, not MinIO native bucket replication : works against
any S3-compatible target (Wasabi, Backblaze B2, AWS S3) with just
an access key, where MinIO BR/SR requires the target to be
MinIO-managed and bidirectionally reachable. mc is the lowest-
common-denominator that lets us decouple from the choice of
target operator.

Alerts in alert_rules.yml veza_minio_backup group :
- MinioReplicationLastFailed (warning, single failed run)
- MinioReplicationStale (CRITICAL, no success in 12h — past RPO)
- MinioReplicationNeverSucceeded (warning, fresh deploy stuck)
- MinioReplicationTargetShrunk (CRITICAL, > 20% drop in 1h —
  operator-error guard rail)

Runbook docs/runbooks/minio-replication.md covers triage by alert,
common ops tasks (manual sync, pause, credential rotation), and
the manual restore procedure for DR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 00:04:25 +02:00
..
alertmanager feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10) 2026-04-28 01:30:34 +02:00
baremetal/apache state-ownership: delete unused optimisticStoreUpdates.ts file 2026-01-15 19:26:53 +01:00
caddy chore(cleanup): remove veza-chat-server directory and all operational references 2026-02-22 21:13:00 +01:00
docker chore(infra): J6 — mark 3 dormant docker-compose files as deprecated 2026-04-15 12:58:39 +02:00
grafana feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11) 2026-04-28 13:36:55 +02:00
haproxy feat(infra): blue-green deployment via HAProxy 2026-02-23 19:52:19 +01:00
incus chore(cleanup): remove veza-chat-server directory and all operational references 2026-02-22 21:13:00 +01:00
prometheus feat(infra): MinIO cross-region replication + DR runbook (v1.0.10 ops item 8) 2026-05-05 00:04:25 +02:00
ssl fix(infra): HAProxy HTTPS and stats security 2026-02-15 15:58:51 +01:00
env.example v0.9.5 2026-03-06 10:02:53 +01:00
logging.toml docs: add project documentation, logging config, status script 2026-03-18 11:36:36 +01:00
metrics.yaml BASE: completing the initial repo state 2025-12-03 22:56:50 +01:00
prometheus.yml feat(monitoring): add Alertmanager with Slack notifications 2026-02-23 19:54:55 +01:00