# Runbook — MinIO cross-region replication
> v1.0.10 ops item 8.
> Owner : ops on-call.
> Severity routing : `MinioReplicationStale` and `MinioReplicationTargetShrunk` page ; the others are warnings.

## Architecture (1 minute version)
- Source : 4-node MinIO distributed cluster (EC:2), `veza-prod-tracks` bucket.
- Target : remote S3-compatible bucket (`{{ minio_remote_bucket }}` — set in vault).
- Mechanism : `mc mirror --preserve` driven by a systemd timer firing every 6h (`veza-minio-replicate.timer`).
- Telemetry : textfile-collector metrics (`veza_minio_replication_*`).
- RPO target : 6h. RTO target : 2h for ≤500 GB.
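
The moving parts can be inspected directly on the source host. A minimal orientation sketch ; only `last_status` is confirmed by this runbook, the timestamp metric name is an assumption based on the freshness alert :

```bash
# Show the schedule baked into the timer unit.
systemctl cat veza-minio-replicate.timer

# Shape of the textfile metrics. last_status is referenced below ;
# the timestamp metric name is illustrative, check the real file.
cat /var/lib/node_exporter/textfile/veza_minio_replication.prom
# veza_minio_replication_last_status 1
# veza_minio_replication_last_success_timestamp_seconds 1.7e+09
```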
## Triage by alert

### `MinioReplicationLastFailed`

The last run returned non-zero. Single failures are usually transient (target endpoint blip, DNS hiccup) ; the next 6h tick retries automatically.

```bash
# 1. Inspect the last journal output (full mc-mirror error).
journalctl -u veza-minio-replicate --since "12 hours ago" --no-pager

# 2. Cross-check the timer is still scheduled.
systemctl list-timers veza-minio-replicate.timer

# 3. If the failure looks transient (network), trigger a manual run.
sudo systemctl start veza-minio-replicate.service
journalctl -fu veza-minio-replicate

# 4. Verify the post-run metric is back to 1.
grep last_status /var/lib/node_exporter/textfile/veza_minio_replication.prom
```
### `MinioReplicationStale` (PAGES)

> 12h with no successful run = past RPO. This is a real incident.

1. **Confirm scope.** Open the dashboard panel `MinIO replication freshness` ; confirm the metric value and that the last-success timestamp matches what the alert reports.
2. **Check the timer + service status.**

```bash
systemctl status veza-minio-replicate.timer
systemctl status veza-minio-replicate.service
```

If the timer is `inactive (dead)` or `failed`, see "Timer broke" below. If the service is `failed`, see "Script crashes" below.

3. **Check the host.** If the MinIO host is offline (e.g. taken down for maintenance), the timer is correctly idle ; this isn't a replication bug, it's a host-availability problem. Page the on-call for the host first.
4. **Check the remote target.** From the source host :

```bash
mc ls veza-backup/                # list buckets under the target alias — does it resolve ?
mc admin info veza-backup         # ping the target — auth works ? (MinIO targets only)
```

If `mc admin info` fails with an auth error → credentials were rotated ; update vault and re-apply `minio_replication`.
If it fails with a network error → target endpoint is down ; escalate to the target operator.

5. **Manual restore of replication health** :

```bash
sudo systemctl restart veza-minio-replicate.timer
sudo systemctl start veza-minio-replicate.service
journalctl -fu veza-minio-replicate
```

6. If the manual run succeeds, the alert clears in ≤ 15 min (next scrape + rule evaluation ; the alert's `for: 30m` window delays firing, not clearing).
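
To read the freshness directly from Prometheus instead of the dashboard, a hypothetical query sketch (the metric name and Prometheus address are assumptions ; the authoritative expression lives in `config/prometheus/alert_rules.yml`, group `veza_minio_backup`) :

```bash
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=time() - veza_minio_replication_last_success_timestamp_seconds' \
  | jq -r '.data.result[0].value[1]'   # seconds since last green run ; > 43200 (12h) = stale
```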
### `MinioReplicationNeverSucceeded`

Fresh deploy of the role that has run at least once but never landed a green run. Almost always a config error.

1. Read the last journal output for `veza-minio-replicate` ; the error is usually a clean message from `mc`.
2. Common causes (hand-test sketch after this list) :
   - **Wrong remote endpoint** : typo, `https://` vs `http://`, wrong port.
   - **Bad credentials** : key rotated post-deploy, vault not updated.
   - **Target bucket can't be created** : IAM policy on the remote denies `mc mb`. Pre-create the bucket on the remote side and re-apply the role.
3. After fixing, re-run the role with `ansible-playbook -i inventory/prod -t minio_replication` ; the playbook is idempotent.
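
To hand-test the endpoint and credentials outside the role, a sketch (the scratch alias name and env var names are illustrative ; use the values from vault) :

```bash
# Point a scratch alias at the remote.
mc alias set veza-backup-test https://<remote-endpoint> "$MINIO_REMOTE_ACCESS_KEY" "$MINIO_REMOTE_SECRET_KEY"

# A clean listing means endpoint + credentials are good ; an auth or TLS error
# here reproduces what the mirror run sees.
mc ls veza-backup-test/

# Remove the scratch alias afterwards.
mc alias rm veza-backup-test
```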
### `MinioReplicationTargetShrunk` (PAGES)

> Critical — the backup we hold may have just lost data.

1. **STOP THE TIMER FIRST.** Don't let the next tick propagate the damage :

```bash
sudo systemctl stop veza-minio-replicate.timer
```

2. Investigate the target side :

```bash
mc ls --recursive veza-backup/<bucket> | wc -l   # object count vs yesterday
mc du veza-backup/<bucket>                       # current size
```

3. Cross-check the source size. If the source is intact and the target shrunk, the next run will re-mirror the missing objects (this is the recovery path) — but only after you've confirmed the source is the source of truth.
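
A sketch for the source-side cross-check (the local alias name is an assumption ; `mc alias list` on the host shows the real one) :

```bash
mc ls --recursive veza-local/veza-prod-tracks | wc -l   # compare with the target count above
mc du veza-local/veza-prod-tracks
```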
4. If the source is also empty, the data is gone (or nearly so). Escalate to the data-recovery runbook ; the latest pgbackrest + MinIO snapshots are the next layer.
5. Once root cause is identified and source data is verified, re-enable the timer :

```bash
sudo systemctl start veza-minio-replicate.timer
```
## Common operational tasks

### Trigger a one-off sync (manual catch-up)

```bash
sudo systemctl start veza-minio-replicate.service
journalctl -fu veza-minio-replicate
```

### Pause replication (e.g. during a planned migration)

```bash
sudo systemctl stop veza-minio-replicate.timer
# … do the work …
sudo systemctl start veza-minio-replicate.timer
```
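
Note : stopping the timer doesn't interrupt a sync already in flight ; check before starting the work :

```bash
systemctl is-active veza-minio-replicate.service   # "inactive" = no run in flight
```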
### Rotate the remote credentials

1. Update the vault entry (`minio_remote_access_key`, `minio_remote_secret_key`).
2. Re-run the role : `ansible-playbook -i inventory/prod -t minio_replication`. The role re-applies the `mc alias set` with the new key (idempotent).
3. Trigger one manual run + verify the success metric :
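
These are the same commands as in the `MinioReplicationLastFailed` triage :

```bash
sudo systemctl start veza-minio-replicate.service
journalctl -fu veza-minio-replicate
grep last_status /var/lib/node_exporter/textfile/veza_minio_replication.prom   # expect 1
```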
### Quarterly DR drill (manual — recommended cadence)

Once per quarter, exercise the restore path on a throwaway MinIO cluster :

1. Provision a temporary single-node MinIO somewhere (Incus container, throwaway VM).
2. Run the restore commands from the role README's "Manual restore" section against this throwaway target (conceptual sketch after this list).
3. Spot-check 5 random track playbacks via the API pointed at the new MinIO.
4. Document the observed RTO in `docs/dr-drill-log.md`.
5. Tear down the throwaway.
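
The authoritative restore commands live in the role README ; conceptually the restore is a reverse `mc mirror` from the backup target (alias, address and credentials below are illustrative) :

```bash
# Point an alias at the throwaway and pull the backup into it.
mc alias set drill http://<throwaway-ip>:9000 <root-user> <root-password>
mc mb drill/veza-prod-tracks
mc mirror --preserve veza-backup/<remote-bucket> drill/veza-prod-tracks
```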
## Related

- Role : `infra/ansible/roles/minio_replication/`
- Source role : `infra/ansible/roles/minio_distributed/`
- Alert rules : `config/prometheus/alert_rules.yml`, group `veza_minio_backup`
- pgbackrest equivalent runbook : `docs/runbooks/db-failover.md` + `infra/ansible/roles/pgbackrest/`