veza/docs/runbooks/minio-replication.md
senke 199a8efbfe feat(infra): MinIO cross-region replication + DR runbook (v1.0.10 ops item 8)
Closes the "single-region MinIO" gap. The 4-node EC:2 cluster
tolerates 2 simultaneous drive losses but a regional outage
(network partition, DC fire, operator error wiping the cluster)
remains a single point of data loss.

New Ansible role minio_replication (wrapper sketched below):
- Wrapper script veza-minio-replicate.sh runs `mc mirror --preserve`
  from the local cluster to a remote S3-compatible target every 6h
  (configurable via OnCalendar).
- Writes textfile-collector metrics on each run:
    veza_minio_replication_last_run_timestamp_seconds
    veza_minio_replication_last_success_timestamp_seconds
    veza_minio_replication_last_duration_seconds
    veza_minio_replication_last_status (1/0)
    veza_minio_replication_target_bytes
- systemd timer with Persistent=true catches up missed runs after
  reboot (this is the disaster-recovery surface; we can't afford to
  silently skip ticks).
- Idempotent: `mc alias set` re-applies cleanly, `mc mb
  --ignore-existing` for the target bucket.
- Refuses to run while vault placeholders are still in place, to
  avoid accidentally applying against prod with bogus credentials.
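
The shape of the wrapper, as a sketch only (the deployed script is
templated by the role; the source alias, env-file path, and
placeholder marker below are assumptions):

  #!/usr/bin/env bash
  # veza-minio-replicate.sh: sketch; see the role for the real template.
  set -euo pipefail

  PROM=/var/lib/node_exporter/textfile/veza_minio_replication.prom
  SRC=veza-local/veza-prod-tracks     # source alias name assumed
  DST=veza-backup/veza-prod-tracks    # target alias/bucket

  # Refuse to run against undecrypted vault placeholders.
  if grep -q 'VAULT_PLACEHOLDER' /etc/veza/minio-replicate.env 2>/dev/null; then
      echo "placeholder credentials detected, refusing to run" >&2
      exit 1
  fi

  start=$(date +%s); rc=0
  mc mirror --preserve "$SRC" "$DST" || rc=$?
  end=$(date +%s)

  # Keep the previous success timestamp on failure so the staleness
  # alert still has a series to compare against time().
  success="veza_minio_replication_last_success_timestamp_seconds $end"
  if [ "$rc" -ne 0 ]; then
      success=$(grep '^veza_minio_replication_last_success' "$PROM" 2>/dev/null || true)
  fi

  # Total bytes on the target (the "size" field of mc ls --json is assumed).
  bytes=$(mc ls --recursive --json "$DST" | jq -s 'map(.size // 0) | add // 0' || echo 0)

  # Write atomically so node_exporter never scrapes a half-written file.
  {
      echo "veza_minio_replication_last_run_timestamp_seconds $end"
      if [ -n "$success" ]; then echo "$success"; fi
      echo "veza_minio_replication_last_duration_seconds $((end - start))"
      echo "veza_minio_replication_last_status $(( rc == 0 ? 1 : 0 ))"
      echo "veza_minio_replication_target_bytes $bytes"
  } > "$PROM.tmp"
  mv "$PROM.tmp" "$PROM"
  exit "$rc"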

Why mc mirror and not MinIO native bucket replication: it works
against any S3-compatible target (Wasabi, Backblaze B2, AWS S3) with
just an access key, whereas MinIO BR/SR requires the target to be
MinIO-managed and bidirectionally reachable. mc is the lowest common
denominator that lets us decouple from the choice of target
operator.
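
Concretely, onboarding a new target is two idempotent commands; the
endpoint and key variables here are illustrative:

  # Anything that speaks S3 and hands out an access key can be the target.
  mc alias set veza-backup https://s3.eu-central-1.wasabisys.com "$ACCESS_KEY" "$SECRET_KEY"
  mc mb --ignore-existing veza-backup/veza-prod-tracks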

Alerts in alert_rules.yml, group veza_minio_backup (staleness
expression sketched below):
- MinioReplicationLastFailed (warning, single failed run)
- MinioReplicationStale (CRITICAL, no success in 12h: past RPO)
- MinioReplicationNeverSucceeded (warning, fresh deploy stuck)
- MinioReplicationTargetShrunk (CRITICAL, >20% drop in 1h;
  operator-error guard rail)
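
The staleness rule presumably compares the last-success metric
against now; a plausible shape, checkable ad hoc (the Prometheus URL
and exact expression are assumptions, the committed rules live in
alert_rules.yml):

  # expr: time() - veza_minio_replication_last_success_timestamp_seconds > 12 * 3600
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=time() - veza_minio_replication_last_success_timestamp_seconds > 12 * 3600'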

Runbook docs/runbooks/minio-replication.md covers triage by alert,
common ops tasks (manual sync, pause, credential rotation), and
the manual restore procedure for DR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 00:04:25 +02:00


Runbook — MinIO cross-region replication

v1.0.10 ops item 8. Owner: ops on-call. Severity routing: MinioReplicationStale and MinioReplicationTargetShrunk page; the others are warnings.

Architecture (1 minute version)

  • Source: 4-node MinIO distributed cluster (EC:2), veza-prod-tracks bucket.
  • Target: remote S3-compatible bucket ({{ minio_remote_bucket }}, set in vault).
  • Mechanism: mc mirror --preserve driven by a systemd timer firing every 6h (veza-minio-replicate.timer; unit check below).
  • Telemetry: textfile-collector metrics (veza_minio_replication_*).
  • RPO target: 6h. RTO target: 2h for ≤500 GB.
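
To confirm the deployed schedule matches this description, inspect
the units; the values in the comments are what we'd expect to see,
not verified output:

systemctl cat veza-minio-replicate.timer
#   [Timer]
#   OnCalendar=*-*-* 00/6:00:00    # every 6h, role-configurable
#   Persistent=true                # catch up missed ticks after reboot
systemctl list-timers veza-minio-replicate.timer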

Triage by alert

MinioReplicationLastFailed

The last run returned non-zero. Single failures are usually transient (target endpoint blip, DNS hiccup); the next 6h tick retries automatically.

# 1. Inspect the last journal output (full mc-mirror error).
journalctl -u veza-minio-replicate --since "12 hours ago" --no-pager

# 2. Cross-check the timer is still scheduled.
systemctl list-timers veza-minio-replicate.timer

# 3. If the failure looks transient (network), trigger a manual run.
sudo systemctl start veza-minio-replicate.service
journalctl -fu veza-minio-replicate

# 4. Verify the post-run metric is back to 1.
grep last_status /var/lib/node_exporter/textfile/veza_minio_replication.prom

MinioReplicationStale (PAGES)

12h with no successful run = past RPO. This is a real incident.

  1. Confirm scope. Open the dashboard panel MinIO replication freshness; check the raw metric and confirm the last-success timestamp matches what the alert claims.
  2. Check the timer + service status.
    systemctl status veza-minio-replicate.timer
    systemctl status veza-minio-replicate.service
    
    If the timer is inactive (dead) or failed, see "Timer broke" below. If the service is failed, see "Script crashes" below.
  3. Check the host. If the MinIO host is offline (e.g. taken down for maintenance), the timer is correctly idle; this isn't a replication bug, it's a host-availability bug. Page the on-call for the host first.
  4. Check the remote target. From the source host :
    mc alias list                       # is the veza-backup alias configured?
    mc ls veza-backup/                  # target reachable, credentials accepted?

    (mc admin info veza-backup also works, but only when the target is itself MinIO; generic S3 targets like Wasabi or B2 won't answer it.)

    If mc ls fails with an auth error → credentials rotated, update vault and re-apply minio_replication. If it fails with a network error → target endpoint down, escalate to the target operator.
  5. Manual restore of replication health:
    sudo systemctl restart veza-minio-replicate.timer
    sudo systemctl start   veza-minio-replicate.service
    journalctl -fu veza-minio-replicate
    
  6. If the manual run succeeds, the alert clears in ≤ 15min (next scrape + next rule evaluation; the alert's 30m for: window delays firing, not clearing).

MinioReplicationNeverSucceeded

Fresh deploy of the role that has run at least once but never landed a green run. Almost always a config error.

  1. Read the last journal output for veza-minio-replicate; the error is usually a clean message from mc.
  2. Common causes (by-hand check after this list):
    • Wrong remote endpoint: typo, https:// vs http://, wrong port.
    • Bad credentials: key rotated post-deploy, vault not updated.
    • Target bucket can't be created: the remote IAM policy denies mc mb. Pre-create the bucket on the remote side and re-apply the role.
  3. After fixing, re-run the role with ansible-playbook -i inventory/prod <site playbook>.yml -t minio_replication; the playbook is idempotent.
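
Before touching Ansible, the first two causes can be separated by
hand under a scratch alias. A sketch; the variable names stand in
for the vault values:

mc alias set veza-backup-check "$MINIO_REMOTE_ENDPOINT" \
    "$MINIO_REMOTE_ACCESS_KEY" "$MINIO_REMOTE_SECRET_KEY"
mc ls veza-backup-check/    # auth error → bad credentials; timeout/refused → endpoint
mc alias rm veza-backup-check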

MinioReplicationTargetShrunk (PAGES)

Critical — the backup we hold may have just lost data.

  1. STOP THE TIMER FIRST. Don't let the next tick propagate the damage:
    sudo systemctl stop veza-minio-replicate.timer
    
  2. Investigate the target side:
    mc ls --recursive veza-backup/<bucket> | wc -l         # object count vs yesterday
    mc du veza-backup/<bucket>                              # current size
    
  3. Cross-check the source size (side-by-side sketch after this list). If the source is intact and the target shrunk, the next run will re-mirror the missing objects (this is the recovery path), but only after you've confirmed the source is the source of truth.
  4. If the source is also empty, the data is gone (or nearly so). Escalate to the data-recovery runbook; the latest pgbackrest + MinIO snapshots are the next layer.
  5. Once the root cause is identified and the source data verified, re-enable the timer:
    sudo systemctl start veza-minio-replicate.timer
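
For step 3, a side-by-side comparison (veza-local as the source
alias is an assumption; substitute the real one):

mc du veza-local/<bucket>                               # source size
mc du veza-backup/<bucket>                              # target size
mc ls --recursive veza-local/<bucket> | wc -l           # source object count
mc ls --recursive veza-backup/<bucket> | wc -l          # target object count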
    

Common operational tasks

Trigger a one-off sync (manual catch-up)

sudo systemctl start veza-minio-replicate.service
journalctl -fu veza-minio-replicate

Pause replication (e.g. during a planned migration)

sudo systemctl stop    veza-minio-replicate.timer
# … do the work …
sudo systemctl start   veza-minio-replicate.timer

Rotate the remote credentials

  1. Update the vault entry (minio_remote_access_key, minio_remote_secret_key).
  2. Re-run the role: ansible-playbook -i inventory/prod <site playbook>.yml -t minio_replication. The role re-applies the mc alias set with the new key (idempotent).
  3. Trigger one manual run and verify the success metric (check sketched below).
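
A minimal freshness check, using the textfile path from the
telemetry section (GNU date assumed):

prom=/var/lib/node_exporter/textfile/veza_minio_replication.prom
grep last_status "$prom"                                # expect ... 1
date -d @"$(awk '/last_success/ {print $2}' "$prom")"   # should be just now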

Quarterly DR drill

Once per quarter, exercise the restore path on a throwaway MinIO cluster:

  1. Provision a temporary single-node MinIO somewhere (Incus container, throwaway VM; container sketch after this list).
  2. Run the restore commands from the role README's "Manual restore" section against this throwaway target.
  3. Spot-check 5 random track playbacks via the API pointed at the new MinIO.
  4. Document the observed RTO in docs/dr-drill-log.md.
  5. Tear down the throwaway.
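
For step 1, the container route is quickest (image, port, and
credentials here are illustrative; the restore itself is simply the
mirror reversed, pulling from the backup target):

podman run -d --name minio-drill -p 9000:9000 \
    -e MINIO_ROOT_USER=drill -e MINIO_ROOT_PASSWORD=drill-secret-123 \
    quay.io/minio/minio server /data
mc alias set drill http://localhost:9000 drill drill-secret-123
mc mb --ignore-existing drill/veza-prod-tracks
mc mirror --preserve veza-backup/veza-prod-tracks drill/veza-prod-tracks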

Related paths

  • Role: infra/ansible/roles/minio_replication/
  • Source role: infra/ansible/roles/minio_distributed/
  • Alert rules: config/prometheus/alert_rules.yml, group veza_minio_backup
  • pgbackrest equivalent runbook: docs/runbooks/db-failover.md + infra/ansible/roles/pgbackrest/