Closes the "single-region MinIO" gap. The 4-node EC:2 cluster
tolerates 2 simultaneous drive losses, but a regional outage
(network partition, DC fire, operator error wiping the cluster)
remains a single point of failure for the data.
New Ansible role minio_replication:
- Wrapper script veza-minio-replicate.sh runs `mc mirror --preserve`
from the local cluster to a remote S3-compatible target every 6h
(configurable via OnCalendar).
- Writes textfile-collector metrics on each run:
veza_minio_replication_last_run_timestamp_seconds
veza_minio_replication_last_success_timestamp_seconds
veza_minio_replication_last_duration_seconds
veza_minio_replication_last_status (1 = success, 0 = failure)
veza_minio_replication_target_bytes
- systemd timer with Persistent=true catches up missed runs after
reboot (this is the disaster-recovery surface; we can't afford to
silently skip ticks).
- Idempotent: `mc alias set` re-applies cleanly; `mc mb
--ignore-existing` creates the target bucket only when missing.
- Refuses to run while vault placeholders are still present, so the
role can't be accidentally applied to prod with bogus credentials.
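The timer side is plain systemd; a sketch of the unit (the file name
and exact OnCalendar value are illustrative), where Persistent=true
is what makes systemd fire a missed tick on the next boot:

```ini
# veza-minio-replicate.timer (illustrative)
[Unit]
Description=Periodic MinIO offsite replication

[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true

[Install]
WantedBy=timers.target
```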
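The mirror step in the wrapper reduces to a single mc invocation; a
minimal sketch, where the alias names `local`/`offsite` and bucket
`veza-data` are illustrative, not the role's real variables:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative names; the real role templates these from vars.
SRC="local/veza-data"
DST="offsite/veza-data"

replicate() {
  # --preserve keeps object metadata and timestamps on the target
  mc mirror --preserve "$SRC" "$DST"
}

# Guarded so the sketch can be sourced on a machine without mc
if command -v mc >/dev/null 2>&1; then
  replicate
fi
```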
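The textfile-collector write can be sketched as below; the directory
default and the temp-file + `mv` rename (so node_exporter never
scrapes a half-written file) are assumptions about the script, not
quoted from it:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed default; the role templates the real path.
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"

write_metrics() {
  local status="$1" duration="$2" target_bytes="$3"
  local now tmp
  now="$(date +%s)"
  tmp="$(mktemp "${TEXTFILE_DIR}/.veza_minio.XXXXXX")"
  {
    echo "veza_minio_replication_last_run_timestamp_seconds ${now}"
    if [ "${status}" = "1" ]; then
      echo "veza_minio_replication_last_success_timestamp_seconds ${now}"
    fi
    echo "veza_minio_replication_last_duration_seconds ${duration}"
    echo "veza_minio_replication_last_status ${status}"
    echo "veza_minio_replication_target_bytes ${target_bytes}"
  } > "${tmp}"
  # Atomic rename: the collector sees either the old file or the new one
  mv "${tmp}" "${TEXTFILE_DIR}/veza_minio_replication.prom"
}
```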
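The idempotent setup, sketched with the same illustrative names (the
env var names are assumptions):

```shell
#!/usr/bin/env bash
set -euo pipefail

setup_target() {
  # `mc alias set` overwrites an alias of the same name, so re-runs
  # are clean; --ignore-existing makes mb a no-op when the bucket
  # already exists.
  mc alias set offsite "${OFFSITE_ENDPOINT}" "${OFFSITE_ACCESS_KEY}" "${OFFSITE_SECRET_KEY}"
  mc mb --ignore-existing offsite/veza-data
}
```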
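And the placeholder guard; the exact sentinel patterns matched here
are assumptions (whatever an unfilled vault template leaves behind):

```shell
#!/usr/bin/env bash
set -euo pipefail

refuse_placeholders() {
  # Abort if any credential still looks like an unfilled vault value.
  for v in "$@"; do
    case "${v}" in
      *CHANGEME*|*PLACEHOLDER*)
        echo "refusing to run: credential looks like a vault placeholder" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```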
Why mc mirror rather than MinIO native bucket replication: mc works
against any S3-compatible target (Wasabi, Backblaze B2, AWS S3) with
just an access key, whereas MinIO bucket/site replication requires
the target to be MinIO-managed and bidirectionally reachable. mc is
the lowest-common-denominator tool that decouples us from the choice
of target operator.
Alerts in alert_rules.yml, veza_minio_backup group:
- MinioReplicationLastFailed (warning, single failed run)
- MinioReplicationStale (CRITICAL, no success in 12h — past RPO)
- MinioReplicationNeverSucceeded (warning, fresh deploy stuck)
- MinioReplicationTargetShrunk (CRITICAL, > 20% drop in 1h —
operator-error guard rail)
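The two critical rules reduce to simple PromQL over the textfile
metrics; a sketch with the thresholds from above (expression shapes
are my reconstruction, not the shipped file):

```yaml
groups:
  - name: veza_minio_backup
    rules:
      - alert: MinioReplicationStale
        # No successful run in 12h: we are past the RPO.
        expr: time() - veza_minio_replication_last_success_timestamp_seconds > 12 * 3600
        labels:
          severity: critical
      - alert: MinioReplicationTargetShrunk
        # Target lost more than 20% of its bytes within an hour.
        expr: |
          (veza_minio_replication_target_bytes offset 1h
            - veza_minio_replication_target_bytes)
          / veza_minio_replication_target_bytes offset 1h > 0.20
        labels:
          severity: critical
```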
Runbook docs/runbooks/minio-replication.md covers triage by alert,
common ops tasks (manual sync, pause, credential rotation), and
the manual restore procedure for DR.
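The core of the DR restore is the same mirror run in reverse against
a rebuilt local cluster; a sketch with the same illustrative alias
and bucket names:

```shell
#!/usr/bin/env bash
set -euo pipefail

restore() {
  # Reverse direction: offsite copy back onto the (empty) local cluster
  mc mirror --preserve offsite/veza-data local/veza-data
}

# Guarded so the sketch can be sourced on a machine without mc
if command -v mc >/dev/null 2>&1; then
  restore
fi
```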
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>