# `minio_distributed` role — distributed MinIO with EC:2
Four Incus containers, each running one MinIO server. Single erasure set of 4 drives = 2 data + 2 parity. The cluster tolerates **2 simultaneous node failures** without data loss; storage efficiency is 50% (1 GB raw → 500 MB usable).
## Topology
```
                S3 API on :9000
     ┌───────────┬─────┴─────┬───────────┐
     │           │           │           │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ minio-1 │ │ minio-2 │ │ minio-3 │ │ minio-4 │
│  /data  │ │  /data  │ │  /data  │ │  /data  │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
     └──── single erasure set, EC:2 ────┘
```
Each node also runs the web console on `:9001`.
## Why EC:2 (not 4 or larger)
- **Recoverability ceiling.** EC:N tolerates N simultaneous drive losses, and MinIO caps parity at half the erasure-set size, so EC:2 is already the maximum on 4 drives. Pushing parity any higher would amount to 4-way mirroring: 25% efficiency with no functional gain over EC:2 in the failure mode we care about (concurrent node losses).
- **Write amplification.** With EC:2, each object is striped as 2 data + 2 parity shards across the 4 nodes, so every write costs 2× the object size on the wire and on disk. Full 4-way replication would cost 4×. Doubling the write cost for durability headroom a 4-node cluster can't exploit isn't worth it.
- **Future-proofing.** When we go to 6+ nodes (W3+), the natural upgrade is EC:3 across a 6-drive set, NOT growing EC on the same 4 drives.
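The efficiency trade-off above is plain arithmetic: usable = raw × (drives − parity) / drives. A quick sketch, assuming hypothetical 1000 GB drives:

```shell
# Usable-capacity arithmetic for a MinIO erasure set.
# Assumption: 4 drives of 1000 GB each (hypothetical sizes).
drives=4; drive_gb=1000

ec2_usable=$(( drives * drive_gb * (drives - 2) / drives ))   # EC:2 -> 50%
mirror_usable=$(( drives * drive_gb * 1 / drives ))           # 4-way mirror -> 25%

echo "EC:2 usable:   ${ec2_usable} GB"     # 2000 GB of 4000 GB raw
echo "Mirror usable: ${mirror_usable} GB"  # 1000 GB of 4000 GB raw
```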
## Defaults
| variable | default | meaning |
| --------------------------------------- | ---------------------------------- | ---------------------------------------------------- |
| `minio_version` | `RELEASE.2025-09-07T16-13-09Z` | matches docker-compose.yml — keep them locked together |
| `minio_port` | `9000` | S3 API |
| `minio_console_port` | `9001` | web console |
| `minio_data_path` | `/var/lib/minio` | drive root on each node |
| `minio_storage_class_standard` | `EC:2` | parity count for STANDARD storage class |
| `minio_bucket_tracks` | `veza-prod-tracks` | prod bucket created on first apply |
| `minio_noncurrent_version_expiry_days` | `30` | delete old object versions after N days |
| `minio_cold_tier_after_days` | `90` | only effective if `minio_remote_tier_name` is set |
| `minio_remote_tier_name` | `""` (none) | future remote tier (Glacier / B2). v1.1 territory. |
| `minio_root_user` / `minio_root_password` | (vault) | root credentials |
## Vault setup
```yaml
# group_vars/minio_ha.vault.yml — encrypt with `ansible-vault encrypt`
minio_root_user: "<random 32-char access key>"
minio_root_password: "<random 32-char secret>"
```
The role asserts the placeholder values are gone before applying to anything other than `lab`.
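One way to generate credentials of the expected shape (a sketch; any 32-character random strings work):

```shell
# openssl rand -hex 16 yields exactly 32 hex characters.
MINIO_ROOT_USER=$(openssl rand -hex 16)
MINIO_ROOT_PASSWORD=$(openssl rand -hex 16)
echo "generated a ${#MINIO_ROOT_USER}-char access key"

# Then drop them into the vault file and encrypt it:
#   printf 'minio_root_user: "%s"\nminio_root_password: "%s"\n' \
#     "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD" > group_vars/minio_ha.vault.yml
#   ansible-vault encrypt group_vars/minio_ha.vault.yml
```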
## Backend integration
**No code change.** The backend's `internal/services/storage/s3*` already speaks the AWS SDK v2; pointing it at the new cluster is a config flip:
```env
AWS_S3_ENABLED=true
AWS_S3_BUCKET=veza-prod-tracks
AWS_S3_ENDPOINT=http://minio-1.lxd:9000 # or behind HAProxy
AWS_S3_REGION=us-east-1 # MinIO default region
AWS_ACCESS_KEY_ID=<minio_root_user>
AWS_SECRET_ACCESS_KEY=<minio_root_password>
```
For prod, front the 4 nodes with HAProxy (round-robin, health-checked) so the backend sees a single endpoint and tolerates any 1-node loss without DNS edits. HAProxy config lives in `infra/haproxy/` (W4 day 19 ties this in).
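A minimal sketch of that HAProxy frontend (hypothetical; the authoritative config lives in `infra/haproxy/`). `/minio/health/live` is MinIO's liveness endpoint, so a dead node drops out of rotation automatically:

```
frontend minio_s3
    mode http
    bind *:9000
    default_backend minio_nodes

backend minio_nodes
    mode http
    balance roundrobin
    option httpchk GET /minio/health/live
    server minio-1 minio-1.lxd:9000 check
    server minio-2 minio-2.lxd:9000 check
    server minio-3 minio-3.lxd:9000 check
    server minio-4 minio-4.lxd:9000 check
```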
## Migration from single-node
```bash
# On the old single-node host (or via mc on a workstation):
mc alias set veza-current http://veza.fr:19000 <ACCESS> <SECRET>
mc alias set veza-distributed http://minio-1.lxd:9000 <NEW_ACCESS> <NEW_SECRET>
# Mirror: --preserve keeps file attributes and bucket policy; content-types
# carry over, but only CURRENT object versions are copied.
mc mirror --preserve veza-current/veza-files veza-distributed/veza-prod-tracks
# Verify count + bytes match before flipping AWS_S3_ENDPOINT in the backend env:
mc ls --recursive veza-current/veza-files | wc -l
mc ls --recursive veza-distributed/veza-prod-tracks | wc -l
```
The old bucket can be kept hot for ~1 week after the flip in case a rollback is needed; then `mc rm --recursive --force --dangerous` drops it.
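The `mc ls | wc -l` check above compares counts only; a sketch of a combined count-and-bytes gate (the `mc du --json` field name `.size` is an assumption — verify it against your mc version):

```shell
# Pure comparison helper: refuses the flip unless counts AND bytes agree.
verify_match() {  # $1/$2: object counts, $3/$4: byte totals
  [ "$1" -eq "$2" ] && [ "$3" -eq "$4" ]
}

# Against the live clusters (aliases from the migration step above):
# src_n=$(mc ls --recursive veza-current/veza-files | wc -l)
# dst_n=$(mc ls --recursive veza-distributed/veza-prod-tracks | wc -l)
# src_b=$(mc du --json veza-current/veza-files | jq -r '.size')
# dst_b=$(mc du --json veza-distributed/veza-prod-tracks | jq -r '.size')
# verify_match "$src_n" "$dst_n" "$src_b" "$dst_b" || echo "MISMATCH: do not flip"
```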
## Operations
```bash
# Cluster health (reports the state of every node and drive):
mc admin info veza-distributed
# Per-node verbose state:
ssh minio-1 sudo journalctl -u minio -n 100 --no-pager
# Watch heal progress (after a node was offline / a drive was replaced):
mc admin heal veza-distributed --recursive
# Check the lifecycle policy:
mc ilm ls veza-distributed/veza-prod-tracks
# Console UI (per-node; pick any):
open http://minio-1.lxd:9001
```
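After a node comes back, you usually want to block until every node reports online. A hypothetical polling sketch; the JSON key (`"state": "online"`) emitted by `mc admin info --json` is an assumption — verify it against your mc version:

```shell
# Count nodes reporting online in `mc admin info --json` output.
# The "state" key name is an assumption about the JSON shape.
count_online() { grep -o '"state": *"online"' | wc -l; }

# Poll until all 4 nodes are back (uses the alias set up above):
# until [ "$(mc admin info --json veza-distributed | count_online)" -eq 4 ]; do
#   echo "waiting for all 4 nodes..."; sleep 5
# done
```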
## Failover smoke test
```bash
MINIO_ROOT_USER=... MINIO_ROOT_PASSWORD=... \
bash infra/ansible/tests/test_minio_resilience.sh
```
Sequence: upload a 100 MB random file, kill 2 nodes, assert the read still works, restart the nodes, wait for self-heal, assert all 4 nodes report healthy.
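A condensed sketch of those steps (container names `minio-1..4` assumed; the authoritative logic lives in `infra/ansible/tests/test_minio_resilience.sh`):

```shell
# Requires mc, incus, and a reachable cluster; bail out cleanly otherwise.
command -v mc >/dev/null 2>&1 && command -v incus >/dev/null 2>&1 || exit 0
mc admin info veza-distributed >/dev/null 2>&1 || exit 0

dd if=/dev/urandom of=/tmp/blob bs=1M count=100 2>/dev/null
sum_before=$(sha256sum /tmp/blob | cut -d' ' -f1)
mc cp /tmp/blob veza-distributed/veza-prod-tracks/smoke/blob

incus stop minio-3 minio-4            # kill 2 of 4 = the EC:2 ceiling

mc cp veza-distributed/veza-prod-tracks/smoke/blob /tmp/blob.back
sum_after=$(sha256sum /tmp/blob.back | cut -d' ' -f1)
[ "$sum_before" = "$sum_after" ] || { echo "checksum mismatch"; exit 1; }

incus start minio-3 minio-4
mc admin heal -r veza-distributed     # then poll until all 4 nodes are healthy
```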
## What this role does NOT cover
- **Cross-DC replication.** Single-host (lab) or single-region in v1.0. v1.1+ adds bucket replication to a remote cluster.
- **Site replication / federation.** Multi-tenant federation is out of scope.
- **Cold tier transitions.** `minio_remote_tier_name` is empty by default — no Glacier / B2 / second-cluster behind the lifecycle yet. Wire when needed.
- **mTLS.** `--tls-cert/key` is W4. The Incus bridge is the security boundary today.