Three Incus containers, each running redis-server + redis-sentinel (co-located). redis-1 = master at first boot, redis-2/3 = replicas. Sentinel quorum = 2 of 3; failover-timeout = 30s satisfies the W3 acceptance criterion.

- internal/config/redis_init.go: initRedis branches on REDIS_SENTINEL_ADDRS; non-empty -> redis.NewFailoverClient with MasterName + SentinelAddrs + SentinelPassword; empty -> existing single-instance NewClient (dev/local keeps the single-instance path).
- internal/config/config.go: 3 new fields (RedisSentinelAddrs, RedisSentinelMasterName, RedisSentinelPassword) read from env. parseRedisSentinelAddrs trims and filters the CSV.
- internal/metrics/cache_hit_rate.go: new RecordCacheHit / RecordCacheMiss counters, labelled by subsystem. Cardinality bounded.
- internal/middleware/rate_limiter.go: instrument the 3 Eval call sites (DDoS, frontend log throttle, upload throttle). Hit = Redis answered, miss = error -> in-memory fallback.
- internal/services/chat_pubsub.go: instrument Publish + PublishPresence.
- internal/websocket/chat/presence_service.go: instrument SetOnline / SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit (legitimate empty result).
- infra/ansible/roles/redis_sentinel/: install Redis 7 + Sentinel, render redis.conf + sentinel.conf, systemd units. A vault assertion prevents shipping placeholder passwords to staging/prod.
- infra/ansible/playbooks/redis_sentinel.yml: provisions the 3 containers, applies the common baseline + this role.
- infra/ansible/inventory/lab.yml: new groups redis_ha + redis_ha_master.
- infra/ansible/tests/test_redis_failover.sh: kills the master container, polls Sentinel for the new master, asserts elapsed < 30s.
- config/grafana/dashboards/redis-cache-overview.json: 3 hit-rate stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown.
- docs/ENV_VARIABLES.md §3: 3 new REDIS_SENTINEL_* env vars.
- veza-backend-api/.env.template: 3 placeholders (empty default).

Acceptance (Day 11): Sentinel failover < 30s; cache hit-rate dashboard populated. Lab test pending Sentinel deployment.

W3 verification gate progress: Redis Sentinel ✓ (this commit), MinIO EC4+2 ⏳ Day 12, CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
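For orientation, a minimal sketch of the hit/miss counters mentioned in the change list, assuming prometheus/client_golang; the metric names and file layout are illustrative, only RecordCacheHit / RecordCacheMiss and the subsystem label come from the list above.

```go
// Illustrative shape of internal/metrics/cache_hit_rate.go (metric names are assumptions).
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	cacheHits = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "redis_cache_hits_total",
		Help: "Redis calls that returned a usable result (redis.Nil counts as a hit).",
	}, []string{"subsystem"})

	cacheMisses = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "redis_cache_misses_total",
		Help: "Redis calls that errored and fell back to in-memory state.",
	}, []string{"subsystem"})
)

// Callers pass a bounded set of subsystems (rate_limiter, chat_pubsub, presence),
// which keeps label cardinality small.
func RecordCacheHit(subsystem string)  { cacheHits.WithLabelValues(subsystem).Inc() }
func RecordCacheMiss(subsystem string) { cacheMisses.WithLabelValues(subsystem).Inc() }
```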
redis_sentinel role — Redis 7 + Sentinel HA formation
Three Incus containers, one Redis + one Sentinel co-located per container. At first boot redis-1 is master, redis-2 and redis-3 are replicas. The 3 sentinels (quorum 2) handle failover when the master dies — promotion is bounded at 30s by failover-timeout.
Topology
┌─────────────┐
│ redis-1 │ master at first boot
│ • redis │
│ • sentinel │
└──────┬──────┘
│ replication
┌────────────┴────────────┐
▼ ▼
┌─────────────┐ ┌─────────────┐
│ redis-2 │ │ redis-3 │
│ • replica │ │ • replica │
│ • sentinel │ │ • sentinel │
└─────────────┘ └─────────────┘
The 3 sentinels gossip on port 26379 and elect a leader to drive each failover. Quorum = 2, so we tolerate one Sentinel crash without losing failover capability.
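As an illustration, the sentinel.conf rendered on each node boils down to the directives below (values taken from the defaults table further down). This is a sketch, not the role's exact template; the hostname-resolution lines assume Redis ≥ 6.2.

```conf
port 26379
requirepass <redis_sentinel_password>            # also used for sentinel-to-sentinel auth
sentinel resolve-hostnames yes                   # lets Sentinel monitor redis-1.lxd by name
sentinel announce-hostnames yes
sentinel monitor veza-master redis-1.lxd 6379 2  # last field = quorum
sentinel auth-pass veza-master <redis_password>  # data-plane auth towards the monitored Redis
sentinel down-after-milliseconds veza-master 5000
sentinel failover-timeout veza-master 30000
```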
Why Sentinel and not Cluster
- We don't need sharding at v1.0 — total Redis dataset fits in 1 GB.
- Sentinel is dramatically simpler (no slot management, no resharding).
- The backend's `redis.NewFailoverClient` speaks Sentinel natively; switching to Cluster would mean rewriting every `Get`/`Set`/`Eval` call site.
When Veza traffic forces sharding (probably v2+), we revisit.
Defaults
| variable | default | meaning |
|---|---|---|
| `redis_master_name` | `veza-master` | Sentinel name. Backend uses this. |
| `redis_port` | `6379` | Redis port |
| `redis_sentinel_port` | `26379` | Sentinel port |
| `redis_sentinel_quorum` | `2` | sentinels that must agree to fail over |
| `redis_sentinel_down_after_ms` | `5000` | ms before "subjectively down" |
| `redis_sentinel_failover_timeout_ms` | `30000` | upper bound on a failover |
| `redis_password` | (vault) | data-plane auth |
| `redis_sentinel_password` | (vault) | sentinel-to-sentinel auth |
| `redis_maxmemory` | `1gb` | hard cap |
| `redis_maxmemory_policy` | `allkeys-lru` | eviction policy |
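These are ordinary role defaults, so an inventory group can override any of them. For example (illustrative values, not what lab.yml ships):

```yaml
# group_vars/redis_ha.yml — hypothetical overrides of the role defaults above
redis_maxmemory: "2gb"
redis_sentinel_down_after_ms: 10000   # slower SDOWN detection on a noisy link
```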
Vault setup
# group_vars/redis_ha.vault.yml — encrypt with `ansible-vault encrypt`
redis_password: "<random 32-char>"
redis_sentinel_password: "<random 32-char, distinct>"
The role asserts the placeholder values are gone before applying to anything other than lab.
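A sketch of what that guard looks like, assuming an `ansible.builtin.assert` task; the `deploy_env` variable and the placeholder-detection strings are illustrative, not the role's exact wording.

```yaml
- name: Refuse vault placeholders outside the lab
  ansible.builtin.assert:
    that:
      - redis_password is defined
      - "'<random' not in redis_password"
      - redis_sentinel_password is defined
      - "'<random' not in redis_sentinel_password"
    fail_msg: "group_vars/redis_ha.vault.yml still contains placeholder passwords"
  when: deploy_env | default('lab') != 'lab'
```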
Backend integration
The backend reads three new env vars at boot (handled by `internal/config/redis_init.go`):
REDIS_SENTINEL_ADDRS=redis-1.lxd:26379,redis-2.lxd:26379,redis-3.lxd:26379
REDIS_SENTINEL_MASTER_NAME=veza-master
REDIS_SENTINEL_PASSWORD=<sentinel password>
REDIS_URL=redis://:<password>@dummy:6379/0 # password + DB still parsed off the URL
When REDIS_SENTINEL_ADDRS is empty, the backend falls back to a single-instance client (the dev/local pattern).
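A condensed sketch of that branching, assuming go-redis v9; error handling is elided and the Config field names beyond the three env vars above are illustrative.

```go
package config

import "github.com/redis/go-redis/v9"

// Only the fields relevant to the Sentinel branch; the real Config has more.
type Config struct {
	RedisSentinelAddrs      []string
	RedisSentinelMasterName string
	RedisSentinelPassword   string
	RedisAddr               string // host:port parsed from REDIS_URL
	RedisPassword           string
	RedisDB                 int
}

func initRedis(cfg *Config) redis.UniversalClient {
	if len(cfg.RedisSentinelAddrs) > 0 {
		// HA path: discover the master through Sentinel and follow failovers.
		return redis.NewFailoverClient(&redis.FailoverOptions{
			MasterName:       cfg.RedisSentinelMasterName, // "veza-master"
			SentinelAddrs:    cfg.RedisSentinelAddrs,      // redis-1.lxd:26379, ...
			SentinelPassword: cfg.RedisSentinelPassword,
			Password:         cfg.RedisPassword, // still parsed off REDIS_URL
			DB:               cfg.RedisDB,
		})
	}
	// Dev/local path: plain single-instance client, exactly as before.
	return redis.NewClient(&redis.Options{
		Addr:     cfg.RedisAddr,
		Password: cfg.RedisPassword,
		DB:       cfg.RedisDB,
	})
}
```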
Operations
# Identify the current master:
redis-cli -h redis-1.lxd -p 26379 -a "$SENTINEL_PASS" SENTINEL get-master-addr-by-name veza-master
# Force a failover (manual; for game-day drills):
redis-cli -h redis-1.lxd -p 26379 -a "$SENTINEL_PASS" SENTINEL failover veza-master
# Check replication state from any node:
redis-cli -h redis-1.lxd -a "$REDIS_PASS" INFO replication
# Tail sentinel logs across all 3:
for n in redis-1 redis-2 redis-3; do
echo "=== $n ==="
ssh "$n" sudo tail -50 /var/log/redis/redis-sentinel.log
done
Failover smoke test
bash infra/ansible/tests/test_redis_failover.sh
Sequence: kills the current master container, polls the sentinels until a new master is elected, asserts elapsed time < 30s, and verifies INFO replication on the survivor shows it is now master. Suitable for the W2 verification gate and the day-24 game day.
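In outline, the test does roughly the following. This is a condensed sketch, not the real script: it assumes redis-1 is the current master and that `$SENTINEL_PASS` is exported, and it omits the final INFO replication check.

```bash
#!/usr/bin/env bash
set -euo pipefail

SENTINEL=redis-2.lxd   # ask a surviving sentinel, not the node we are about to kill
old_master=$(redis-cli -h "$SENTINEL" -p 26379 -a "$SENTINEL_PASS" \
  SENTINEL get-master-addr-by-name veza-master | head -1)

incus stop --force redis-1          # simulate a master crash
start=$(date +%s)

while true; do
  new_master=$(redis-cli -h "$SENTINEL" -p 26379 -a "$SENTINEL_PASS" \
    SENTINEL get-master-addr-by-name veza-master | head -1)
  [[ -n "$new_master" && "$new_master" != "$old_master" ]] && break
  sleep 1
done

elapsed=$(( $(date +%s) - start ))
echo "failover took ${elapsed}s"
(( elapsed < 30 )) || { echo "FAIL: failover exceeded 30s"; exit 1; }
```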
What this role does NOT cover
- TLS between client ↔ Redis: `tls-port` is W4 territory. Today the Incus bridge is the security boundary.
- Persistent data backups: RDB snapshots stay on the data node only. Redis state is reconstructible (sessions get re-issued, presence is ephemeral), so this is intentional.
- Cluster mode (sharding): see "Why Sentinel and not Cluster" above. v2+.
- Cross-host replication: all three containers sit on the same lab host today. Day 7 of W2 already moved Postgres to dedicated hosts; the same host split applies here once the Hetzner standby is provisioned (see the W2 day 7+ note in `postgres_ha.yml`).