Closes the "single-region MinIO" gap. The 4-node EC:2 cluster
tolerates 2 simultaneous drive losses, but a regional outage
(network partition, DC fire, operator error wiping the cluster)
remains a single point of failure for the data.
New Ansible role minio_replication:
- Wrapper script veza-minio-replicate.sh runs `mc mirror --preserve`
from the local cluster to a remote S3-compatible target every 6h
(configurable via OnCalendar).
- Writes textfile-collector metrics on each run:
veza_minio_replication_last_run_timestamp_seconds
veza_minio_replication_last_success_timestamp_seconds
veza_minio_replication_last_duration_seconds
veza_minio_replication_last_status (1 = success, 0 = failure)
veza_minio_replication_target_bytes
- systemd timer with Persistent=true catches up missed runs after
reboot (this is the disaster-recovery surface; we can't afford to
silently skip ticks).
- Idempotent: `mc alias set` re-applies cleanly; `mc mb
--ignore-existing` creates the target bucket only when missing.
- Refuses to run while vault placeholders are still present, so the
role can't be accidentally applied to prod with bogus credentials.
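The timer side is plain systemd; a sketch of the unit (the file name
and exact OnCalendar value are illustrative), where Persistent=true
is what makes systemd fire a missed tick on the next boot:

```ini
# veza-minio-replicate.timer (illustrative)
[Unit]
Description=Periodic MinIO offsite replication

[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true

[Install]
WantedBy=timers.target
```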
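The mirror step in the wrapper reduces to a single mc invocation; a
minimal sketch, where the alias names `local`/`offsite` and bucket
`veza-data` are illustrative, not the role's real variables:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative names; the real role templates these from vars.
SRC="local/veza-data"
DST="offsite/veza-data"

replicate() {
  # --preserve keeps object metadata and timestamps on the target
  mc mirror --preserve "$SRC" "$DST"
}

# Guarded so the sketch can be sourced on a machine without mc
if command -v mc >/dev/null 2>&1; then
  replicate
fi
```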
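The textfile-collector write can be sketched as below; the directory
default and the temp-file + `mv` rename (so node_exporter never
scrapes a half-written file) are assumptions about the script, not
quoted from it:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed default; the role templates the real path.
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"

write_metrics() {
  local status="$1" duration="$2" target_bytes="$3"
  local now tmp
  now="$(date +%s)"
  tmp="$(mktemp "${TEXTFILE_DIR}/.veza_minio.XXXXXX")"
  {
    echo "veza_minio_replication_last_run_timestamp_seconds ${now}"
    if [ "${status}" = "1" ]; then
      echo "veza_minio_replication_last_success_timestamp_seconds ${now}"
    fi
    echo "veza_minio_replication_last_duration_seconds ${duration}"
    echo "veza_minio_replication_last_status ${status}"
    echo "veza_minio_replication_target_bytes ${target_bytes}"
  } > "${tmp}"
  # Atomic rename: the collector sees either the old file or the new one
  mv "${tmp}" "${TEXTFILE_DIR}/veza_minio_replication.prom"
}
```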
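The idempotent setup, sketched with the same illustrative names (the
env var names are assumptions):

```shell
#!/usr/bin/env bash
set -euo pipefail

setup_target() {
  # `mc alias set` overwrites an alias of the same name, so re-runs
  # are clean; --ignore-existing makes mb a no-op when the bucket
  # already exists.
  mc alias set offsite "${OFFSITE_ENDPOINT}" "${OFFSITE_ACCESS_KEY}" "${OFFSITE_SECRET_KEY}"
  mc mb --ignore-existing offsite/veza-data
}
```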
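And the placeholder guard; the exact sentinel patterns matched here
are assumptions (whatever an unfilled vault template leaves behind):

```shell
#!/usr/bin/env bash
set -euo pipefail

refuse_placeholders() {
  # Abort if any credential still looks like an unfilled vault value.
  for v in "$@"; do
    case "${v}" in
      *CHANGEME*|*PLACEHOLDER*)
        echo "refusing to run: credential looks like a vault placeholder" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```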
Why mc mirror rather than MinIO native bucket replication: mc works
against any S3-compatible target (Wasabi, Backblaze B2, AWS S3) with
just an access key, whereas MinIO bucket/site replication requires
the target to be MinIO-managed and bidirectionally reachable. mc is
the lowest-common-denominator tool that decouples us from the choice
of target operator.
Alerts in alert_rules.yml, veza_minio_backup group:
- MinioReplicationLastFailed (warning, single failed run)
- MinioReplicationStale (CRITICAL, no success in 12h — past RPO)
- MinioReplicationNeverSucceeded (warning, fresh deploy stuck)
- MinioReplicationTargetShrunk (CRITICAL, > 20% drop in 1h —
operator-error guard rail)
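The two critical rules reduce to simple PromQL over the textfile
metrics; a sketch with the thresholds from above (expression shapes
are my reconstruction, not the shipped file):

```yaml
groups:
  - name: veza_minio_backup
    rules:
      - alert: MinioReplicationStale
        # No successful run in 12h: we are past the RPO.
        expr: time() - veza_minio_replication_last_success_timestamp_seconds > 12 * 3600
        labels:
          severity: critical
      - alert: MinioReplicationTargetShrunk
        # Target lost more than 20% of its bytes within an hour.
        expr: |
          (veza_minio_replication_target_bytes offset 1h
            - veza_minio_replication_target_bytes)
          / veza_minio_replication_target_bytes offset 1h > 0.20
        labels:
          severity: critical
```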
Runbook docs/runbooks/minio-replication.md covers triage by alert,
common ops tasks (manual sync, pause, credential rotation), and
the manual restore procedure for DR.
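The core of the DR restore is the same mirror run in reverse against
a rebuilt local cluster; a sketch with the same illustrative alias
and bucket names:

```shell
#!/usr/bin/env bash
set -euo pipefail

restore() {
  # Reverse direction: offsite copy back onto the (empty) local cluster
  mc mirror --preserve offsite/veza-data local/veza-data
}

# Guarded so the sketch can be sourced on a machine without mc
if command -v mc >/dev/null 2>&1; then
  restore
fi
```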
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>