veza/infra/ansible/roles/redis_sentinel/defaults/main.yml


feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11)

Three Incus containers, each running redis-server + redis-sentinel (co-located). redis-1 is the master at first boot; redis-2/3 are replicas. Sentinel quorum = 2 of 3; failover-timeout = 30s satisfies the W3 acceptance criterion.

- internal/config/redis_init.go: initRedis branches on REDIS_SENTINEL_ADDRS; non-empty -> redis.NewFailoverClient with MasterName + SentinelAddrs + SentinelPassword; empty -> existing single-instance NewClient (dev/local stays parametric).
- internal/config/config.go: 3 new fields (RedisSentinelAddrs, RedisSentinelMasterName, RedisSentinelPassword) read from env. parseRedisSentinelAddrs trims and filters the CSV.
- internal/metrics/cache_hit_rate.go: new RecordCacheHit / RecordCacheMiss counters, labelled by subsystem. Cardinality bounded.
- internal/middleware/rate_limiter.go: instruments 3 Eval call sites (DDoS, frontend log throttle, upload throttle). Hit = Redis answered; miss = error -> in-memory fallback.
- internal/services/chat_pubsub.go: instruments Publish + PublishPresence.
- internal/websocket/chat/presence_service.go: instruments SetOnline / SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit (legitimate empty result).
- infra/ansible/roles/redis_sentinel/: installs Redis 7 + Sentinel, renders redis.conf + sentinel.conf, systemd units. A Vault assertion prevents shipping placeholder passwords to staging/prod.
- infra/ansible/playbooks/redis_sentinel.yml: provisions the 3 containers and applies the common baseline + role.
- infra/ansible/inventory/lab.yml: new groups redis_ha + redis_ha_master.
- infra/ansible/tests/test_redis_failover.sh: kills the master container, polls Sentinel for the new master, asserts elapsed < 30s.
- config/grafana/dashboards/redis-cache-overview.json: 3 hit-rate stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown.
- docs/ENV_VARIABLES.md §3: 3 new REDIS_SENTINEL_* env vars.
- veza-backend-api/.env.template: 3 placeholders (empty default).

Acceptance (Day 11): Sentinel failover < 30s; cache hit-rate dashboard populated. Lab test pending Sentinel deployment. W3 verification gate progress: Redis Sentinel ✓ (this commit), MinIO EC4+2 ⏳ Day 12, CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 11:36:55 +00:00
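The commit describes parseRedisSentinelAddrs in internal/config/config.go as trimming and filtering a CSV of Sentinel addresses, with an empty result sending initRedis down the single-instance path. A minimal sketch of that helper (the real signature and package layout may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// parseRedisSentinelAddrs splits a comma-separated REDIS_SENTINEL_ADDRS
// value, trimming whitespace and dropping empty entries. An empty result
// means "no Sentinel configured": initRedis then falls back to the
// existing single-instance NewClient instead of NewFailoverClient.
func parseRedisSentinelAddrs(csv string) []string {
	var addrs []string
	for _, part := range strings.Split(csv, ",") {
		if addr := strings.TrimSpace(part); addr != "" {
			addrs = append(addrs, addr)
		}
	}
	return addrs
}

func main() {
	// Stray spaces and empty fields are tolerated.
	fmt.Println(parseRedisSentinelAddrs(" redis-1:26379, redis-2:26379,, redis-3:26379 "))
	// → [redis-1:26379 redis-2:26379 redis-3:26379]
}
```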
# redis_sentinel defaults — Redis 7 + Sentinel co-located across 3
# Incus containers (redis-1 master at first boot, redis-2/redis-3
# replicas; one Sentinel per container = quorum 2 out of 3).
---
redis_version: "7" # Redis 7.x — Ubuntu 22.04's stock apt ships 6.x, so the packages.redis.io repo is needed for 7
redis_master_name: "veza-master"
redis_port: 6379
redis_sentinel_port: 26379
# Replication / persistence — sane prod defaults. AOF on for durability,
# RDB snapshot still kept for fast restore.
redis_aof_enabled: true
redis_save_config: "3600 1 300 100 60 10000"
# Sentinel quorum — number of sentinels that must agree before declaring
# the master down. With 3 sentinels, quorum=2 tolerates one sentinel
# crash. Don't lower below 2 in prod, ever.
redis_sentinel_quorum: 2
# Failover thresholds — match Day 11 acceptance criterion (< 30s).
# down-after-milliseconds: how long a master must be unreachable before
# a sentinel marks it as subjectively down.
# failover-timeout: max time to wait for replica promotion + reconfig
# before another failover can be triggered.
redis_sentinel_down_after_ms: 5000 # 5s = sentinel quorum decision in ~6-7s
redis_sentinel_failover_timeout_ms: 30000 # 30s budget for the whole flip
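# For reference, the values above render into sentinel.conf roughly as
# follows (master IP is a placeholder substituted by the template at
# first boot; directive names are standard Redis Sentinel config):
#
#   sentinel monitor veza-master <master-ip> 6379 2
#   sentinel down-after-milliseconds veza-master 5000
#   sentinel failover-timeout veza-master 30000
#   sentinel auth-pass veza-master <redis_password>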
# Auth — required in prod (the Sentinel API can re-route traffic, so
# unauth'd Sentinel = security hole). Override via Vault.
redis_password: "CHANGE_ME_VAULT"
redis_sentinel_password: "CHANGE_ME_VAULT_SENTINEL"
# bind / protected-mode — each instance should be bound to its Incus
# bridge IP (10.0.x.y); 0.0.0.0 here is only the catch-all default and
# must be overridden per host. protected-mode is OFF solely because
# requirepass is enforced; never combine bind 0.0.0.0 with
# protected-mode "no" and no auth.
redis_bind: "0.0.0.0"
redis_protected_mode: "no"
# Resource caps — overall memory limit + eviction policy. `allkeys-lru`
# deliberately accepts data loss: presence keys, sessions, and
# rate-limit counters are all safe to evict under pressure. If we add
# cache lines that MUST persist, we'll need a second instance with
# `noeviction` (maxmemory-policy is per-server, not per-DB).
redis_maxmemory: "1gb"
redis_maxmemory_policy: "allkeys-lru"