# Runbook — MinIO cross-region replication
> v1.0.10 ops item 8.
> Owner : ops on-call.
> Severity routing : `MinioReplicationStale` and `MinioReplicationTargetShrunk` page ; the others are warnings.

## Architecture (1 minute version)
- Source : 4-node MinIO distributed cluster (EC:2), `veza-prod-tracks` bucket.
- Target : remote S3-compatible bucket (`{{ minio_remote_bucket }}` — set in vault).
- Mechanism : `mc mirror --preserve` driven by a systemd timer firing every 6h (`veza-minio-replicate.timer`).
- Telemetry : textfile-collector metrics (`veza_minio_replication_*`).
- RPO target : 6h. RTO target : 2h for ≤500 GB.
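
The moving parts can be inspected directly on the source host. A minimal orientation sketch ; only `last_status` is confirmed by this runbook, the timestamp metric name is an assumption based on the freshness alert :

```bash
# Show the schedule baked into the timer unit.
systemctl cat veza-minio-replicate.timer

# Shape of the textfile metrics. last_status is referenced below ;
# the timestamp metric name is illustrative, check the real file.
cat /var/lib/node_exporter/textfile/veza_minio_replication.prom
# veza_minio_replication_last_status 1
# veza_minio_replication_last_success_timestamp_seconds 1.7e+09
```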
## Triage by alert

### `MinioReplicationLastFailed`

The last run returned non-zero. Single failures are usually transient (target endpoint blip, DNS hiccup) ; the next 6h tick retries automatically.

```bash
# 1. Inspect the last journal output (full mc-mirror error).
journalctl -u veza-minio-replicate --since "12 hours ago" --no-pager

# 2. Cross-check the timer is still scheduled.
systemctl list-timers veza-minio-replicate.timer

# 3. If the failure looks transient (network), trigger a manual run.
sudo systemctl start veza-minio-replicate.service
journalctl -fu veza-minio-replicate

# 4. Verify the post-run metric is back to 1.
grep last_status /var/lib/node_exporter/textfile/veza_minio_replication.prom
```
### `MinioReplicationStale` (PAGES)

> 12h with no successful run = past RPO. This is a real incident.

1. **Confirm scope.** Open the dashboard panel `MinIO replication freshness` ; confirm the metric value and that the last-success timestamp matches what the alert reports.
2. **Check the timer + service status.**

```bash
systemctl status veza-minio-replicate.timer
systemctl status veza-minio-replicate.service
```

If the timer is `inactive (dead)` or `failed`, see "Timer broke" below. If the service is `failed`, see "Script crashes" below.

3. **Check the host.** If the MinIO host is offline (e.g. taken down for maintenance), the timer is correctly idle ; this isn't a replication bug, it's a host-availability problem. Page the on-call for the host first.
4. **Check the remote target.** From the source host :

```bash
mc ls veza-backup/                # list buckets under the target alias — does it resolve ?
mc admin info veza-backup         # ping the target — auth works ? (MinIO targets only)
```

If `mc admin info` fails with an auth error → credentials were rotated ; update vault and re-apply `minio_replication`.
If it fails with a network error → target endpoint is down ; escalate to the target operator.

5. **Manual restore of replication health** :

```bash
sudo systemctl restart veza-minio-replicate.timer
sudo systemctl start veza-minio-replicate.service
journalctl -fu veza-minio-replicate
```

6. If the manual run succeeds, the alert clears in ≤ 15 min (next scrape + rule evaluation ; the alert's `for: 30m` window delays firing, not clearing).
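
To read the freshness directly from Prometheus instead of the dashboard, a hypothetical query sketch (the metric name and Prometheus address are assumptions ; the authoritative expression lives in `config/prometheus/alert_rules.yml`, group `veza_minio_backup`) :

```bash
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=time() - veza_minio_replication_last_success_timestamp_seconds' \
  | jq -r '.data.result[0].value[1]'   # seconds since last green run ; > 43200 (12h) = stale
```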
### `MinioReplicationNeverSucceeded`

Fresh deploy of the role that has run at least once but never landed a green run. Almost always a config error.

1. Read the last journal output for `veza-minio-replicate` ; the error is usually a clean message from `mc`.
2. Common causes (hand-test sketch after this list) :
   - **Wrong remote endpoint** : typo, `https://` vs `http://`, wrong port.
   - **Bad credentials** : key rotated post-deploy, vault not updated.
   - **Target bucket can't be created** : IAM policy on the remote denies `mc mb`. Pre-create the bucket on the remote side and re-apply the role.
3. After fixing, re-run the role with `ansible-playbook -i inventory/prod -t minio_replication` ; the playbook is idempotent.
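
To hand-test the endpoint and credentials outside the role, a sketch (the scratch alias name and env var names are illustrative ; use the values from vault) :

```bash
# Point a scratch alias at the remote.
mc alias set veza-backup-test https://<remote-endpoint> "$MINIO_REMOTE_ACCESS_KEY" "$MINIO_REMOTE_SECRET_KEY"

# A clean listing means endpoint + credentials are good ; an auth or TLS error
# here reproduces what the mirror run sees.
mc ls veza-backup-test/

# Remove the scratch alias afterwards.
mc alias rm veza-backup-test
```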
### `MinioReplicationTargetShrunk` (PAGES)

> Critical — the backup we hold may have just lost data.

1. **STOP THE TIMER FIRST.** Don't let the next tick propagate the damage :

```bash
sudo systemctl stop veza-minio-replicate.timer
```

2. Investigate the target side :

```bash
mc ls --recursive veza-backup/<bucket> | wc -l   # object count vs yesterday
mc du veza-backup/<bucket>                       # current size
```

3. Cross-check the source size. If the source is intact and the target shrunk, the next run will re-mirror the missing objects (this is the recovery path) — but only after you've confirmed the source is the source of truth.
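
A sketch for the source-side cross-check (the local alias name is an assumption ; `mc alias list` on the host shows the real one) :

```bash
mc ls --recursive veza-local/veza-prod-tracks | wc -l   # compare with the target count above
mc du veza-local/veza-prod-tracks
```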
4. If the source is also empty, the data is gone (or nearly so). Escalate to the data-recovery runbook ; the latest pgbackrest + MinIO snapshots are the next layer.
5. Once root cause is identified and source data is verified, re-enable the timer :

```bash
sudo systemctl start veza-minio-replicate.timer
```
## Common operational tasks

### Trigger a one-off sync (manual catch-up)

```bash
sudo systemctl start veza-minio-replicate.service
journalctl -fu veza-minio-replicate
```

### Pause replication (e.g. during a planned migration)

```bash
sudo systemctl stop veza-minio-replicate.timer
# … do the work …
sudo systemctl start veza-minio-replicate.timer
```
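
Note : stopping the timer doesn't interrupt a sync already in flight ; check before starting the work :

```bash
systemctl is-active veza-minio-replicate.service   # "inactive" = no run in flight
```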
### Rotate the remote credentials

1. Update the vault entry (`minio_remote_access_key`, `minio_remote_secret_key`).
2. Re-run the role : `ansible-playbook -i inventory/prod -t minio_replication`. The role re-applies the `mc alias set` with the new key (idempotent).
3. Trigger one manual run + verify the success metric :
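
These are the same commands as in the `MinioReplicationLastFailed` triage :

```bash
sudo systemctl start veza-minio-replicate.service
journalctl -fu veza-minio-replicate
grep last_status /var/lib/node_exporter/textfile/veza_minio_replication.prom   # expect 1
```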
### Quarterly DR drill (manual — recommended cadence)

Once per quarter, exercise the restore path on a throwaway MinIO cluster :

1. Provision a temporary single-node MinIO somewhere (Incus container, throwaway VM).
2. Run the restore commands from the role README's "Manual restore" section against this throwaway target (conceptual sketch after this list).
3. Spot-check 5 random track playbacks via the API pointed at the new MinIO.
4. Document the observed RTO in `docs/dr-drill-log.md`.
5. Tear down the throwaway.
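
The authoritative restore commands live in the role README ; conceptually the restore is a reverse `mc mirror` from the backup target (alias, address and credentials below are illustrative) :

```bash
# Point an alias at the throwaway and pull the backup into it.
mc alias set drill http://<throwaway-ip>:9000 <root-user> <root-password>
mc mb drill/veza-prod-tracks
mc mirror --preserve veza-backup/<remote-bucket> drill/veza-prod-tracks
```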
## Related

- Role : `infra/ansible/roles/minio_replication/`
- Source role : `infra/ansible/roles/minio_distributed/`
- Alert rules : `config/prometheus/alert_rules.yml`, group `veza_minio_backup`
- pgbackrest equivalent runbook : `docs/runbooks/db-failover.md` + `infra/ansible/roles/pgbackrest/`