veza/infra/ansible/roles/redis_sentinel/README.md

feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11)
Three Incus containers, each running redis-server + redis-sentinel
(co-located). redis-1 = master at first boot, redis-2/3 = replicas.
Sentinel quorum=2 of 3; failover-timeout=30s satisfies the W3
acceptance criterion.

- internal/config/redis_init.go: initRedis branches on
  REDIS_SENTINEL_ADDRS; non-empty -> redis.NewFailoverClient with
  MasterName + SentinelAddrs + SentinelPassword. Empty -> existing
  single-instance NewClient (dev/local unchanged).
- internal/config/config.go: 3 new fields (RedisSentinelAddrs,
  RedisSentinelMasterName, RedisSentinelPassword) read from env.
  parseRedisSentinelAddrs splits the CSV, trimming whitespace and
  dropping empty entries.
- internal/metrics/cache_hit_rate.go: new RecordCacheHit /
  RecordCacheMiss counters, labelled by subsystem. Cardinality
  bounded (see the sketch after this list).
- internal/middleware/rate_limiter.go: instrument 3 Eval call sites
  (DDoS, frontend log throttle, upload throttle). Hit = Redis answered,
  Miss = error -> in-memory fallback.
- internal/services/chat_pubsub.go: instrument Publish + PublishPresence.
- internal/websocket/chat/presence_service.go: instrument SetOnline /
  SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit
  (a legitimate empty result).
- infra/ansible/roles/redis_sentinel/: install Redis 7 + Sentinel,
  render redis.conf + sentinel.conf, ship systemd units. A Vault
  assertion prevents shipping placeholder passwords to staging/prod.
- infra/ansible/playbooks/redis_sentinel.yml: provisions the 3
  containers and applies the common baseline + the role.
- infra/ansible/inventory/lab.yml: new groups redis_ha + redis_ha_master.
- infra/ansible/tests/test_redis_failover.sh: kills the master
  container, polls Sentinel for the new master, asserts elapsed < 30s.
- config/grafana/dashboards/redis-cache-overview.json: 3 hit-rate
  stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown.
- docs/ENV_VARIABLES.md §3: 3 new REDIS_SENTINEL_* env vars.
- veza-backend-api/.env.template: 3 placeholders (empty default).
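
A minimal sketch of the counter pair and the hit/miss convention described above, assuming go-redis v9 and the standard Prometheus client; metric names and the RecordCacheResult helper are illustrative, only RecordCacheHit / RecordCacheMiss come from the commit:

// internal/metrics/cache_hit_rate.go (sketch)
package metrics

import (
	"errors"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/redis/go-redis/v9"
)

var (
	cacheHits = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "veza_cache_hits_total", // illustrative metric name
		Help: "Redis calls that got an answer (redis.Nil included).",
	}, []string{"subsystem"}) // bounded: rate_limiter | chat_pubsub | presence

	cacheMisses = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "veza_cache_misses_total", // illustrative metric name
		Help: "Redis calls that errored and fell back to in-memory.",
	}, []string{"subsystem"})
)

func init() { prometheus.MustRegister(cacheHits, cacheMisses) }

func RecordCacheHit(subsystem string)  { cacheHits.WithLabelValues(subsystem).Inc() }
func RecordCacheMiss(subsystem string) { cacheMisses.WithLabelValues(subsystem).Inc() }

// RecordCacheResult (illustrative helper) applies the convention above:
// redis.Nil is a legitimate empty result, so it counts as a hit; any
// other error means the caller fell back, so it counts as a miss.
func RecordCacheResult(subsystem string, err error) {
	if err == nil || errors.Is(err, redis.Nil) {
		RecordCacheHit(subsystem)
	} else {
		RecordCacheMiss(subsystem)
	}
}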

Acceptance (Day 11): Sentinel failover < 30s; cache hit-rate
dashboard populated. Lab test pending Sentinel deployment.

W3 verification gate progress: Redis Sentinel ✓ (this commit),
MinIO EC4+2 on Day 12, CDN on Day 13, DMCA on Day 14, embed on Day 15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

redis_sentinel role — Redis 7 + Sentinel HA formation

Three Incus containers, one Redis + one Sentinel co-located per container. At first boot redis-1 is master, redis-2 and redis-3 are replicas. The 3 sentinels (quorum 2) handle failover when the master dies — promotion is bounded at 30s by failover-timeout.

Topology

                    ┌─────────────┐
                    │ redis-1     │  master at first boot
                    │ • redis     │
                    │ • sentinel  │
                    └──────┬──────┘
                           │ replication
              ┌────────────┴────────────┐
              ▼                         ▼
       ┌─────────────┐           ┌─────────────┐
       │ redis-2     │           │ redis-3     │
       │ • replica   │           │ • replica   │
       │ • sentinel  │           │ • sentinel  │
       └─────────────┘           └─────────────┘

The 3 sentinels gossip on port 26379 and elect a leader to drive each failover. Quorum = 2, so we tolerate one Sentinel crash without losing failover capability.

Why Sentinel and not Cluster

  • We don't need sharding at v1.0 — total Redis dataset fits in 1 GB.
  • Sentinel is dramatically simpler (no slot management, no resharding).
  • The backend's redis.NewFailoverClient speaks Sentinel natively; switching to Cluster would mean rewriting every Get/Set/Eval call site.

When Veza traffic forces sharding (probably v2+), we'll revisit.

Defaults

variable                             default       meaning
redis_master_name                    veza-master   Sentinel master name; the backend uses it too
redis_port                           6379          Redis data port
redis_sentinel_port                  26379         Sentinel port
redis_sentinel_quorum                2             sentinels that must agree to fail over
redis_sentinel_down_after_ms         5000          ms before a node is "subjectively down"
redis_sentinel_failover_timeout_ms   30000         upper bound on a failover
redis_password                       (vault)       data-plane auth
redis_sentinel_password              (vault)       sentinel-to-sentinel auth
redis_maxmemory                      1gb           hard memory cap
redis_maxmemory_policy               allkeys-lru   eviction policy
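
For orientation, a sketch of how these variables could land in the rendered sentinel.conf; the directives are stock Sentinel configuration, but the actual template lives in the role and the master address is resolved at render time:

# sentinel.conf as rendered (sketch)
port 26379
requirepass "<redis_sentinel_password>"
sentinel monitor veza-master <redis-1 address> 6379 2
sentinel auth-pass veza-master "<redis_password>"
sentinel down-after-milliseconds veza-master 5000
sentinel failover-timeout veza-master 30000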

Vault setup

# group_vars/redis_ha.vault.yml — encrypt with `ansible-vault encrypt`
redis_password: "<random 32-char>"
redis_sentinel_password: "<random 32-char, distinct>"

The role asserts the placeholder values are gone before applying to anything other than lab.
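
A sketch of what that guard can look like with ansible.builtin.assert, assuming a CHANGEME placeholder marker and an env_name variable (both illustrative):

- name: Refuse placeholder Redis secrets outside lab
  ansible.builtin.assert:
    that:
      - redis_password is defined
      - redis_sentinel_password is defined
      - "'CHANGEME' not in redis_password"            # illustrative marker
      - "'CHANGEME' not in redis_sentinel_password"
    fail_msg: "Vault still holds placeholder passwords; encrypt real secrets first."
  when: env_name != 'lab'                             # illustrative variable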

Backend integration

The backend reads three new env vars at boot (handled by internal/config/redis_init.go):

REDIS_SENTINEL_ADDRS=redis-1.lxd:26379,redis-2.lxd:26379,redis-3.lxd:26379
REDIS_SENTINEL_MASTER_NAME=veza-master
REDIS_SENTINEL_PASSWORD=<sentinel password>
REDIS_URL=redis://:<password>@dummy:6379/0   # password + DB still parsed off the URL

When REDIS_SENTINEL_ADDRS is empty, the backend falls back to a single-instance client (the dev/local pattern).
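
A minimal sketch of that branch, assuming go-redis v9; the Config fields mirror internal/config/config.go, while RedisAddr and the inlined layout are illustrative:

// internal/config/redis_init.go (sketch)
package config

import (
	"strings"

	"github.com/redis/go-redis/v9"
)

// Fields mirror config.go; RedisAddr stands in for the address parsed
// off REDIS_URL (illustrative).
type Config struct {
	RedisAddr               string
	RedisSentinelAddrs      string
	RedisSentinelMasterName string
	RedisSentinelPassword   string
}

// parseRedisSentinelAddrs splits the CSV, trimming whitespace and
// dropping empty entries.
func parseRedisSentinelAddrs(csv string) []string {
	var addrs []string
	for _, a := range strings.Split(csv, ",") {
		if a = strings.TrimSpace(a); a != "" {
			addrs = append(addrs, a)
		}
	}
	return addrs
}

// initRedis returns a Sentinel-aware failover client when
// REDIS_SENTINEL_ADDRS is set, else the plain single-instance client.
// password and db still come from REDIS_URL.
func initRedis(cfg *Config, password string, db int) *redis.Client {
	if addrs := parseRedisSentinelAddrs(cfg.RedisSentinelAddrs); len(addrs) > 0 {
		return redis.NewFailoverClient(&redis.FailoverOptions{
			MasterName:       cfg.RedisSentinelMasterName, // "veza-master"
			SentinelAddrs:    addrs,
			SentinelPassword: cfg.RedisSentinelPassword,
			Password:         password, // data-plane auth
			DB:               db,
		})
	}
	return redis.NewClient(&redis.Options{Addr: cfg.RedisAddr, Password: password, DB: db})
}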

Operations

# Identify the current master:
redis-cli -h redis-1.lxd -p 26379 -a "$SENTINEL_PASS" SENTINEL get-master-addr-by-name veza-master

# Force a failover (manual; for game-day drills):
redis-cli -h redis-1.lxd -p 26379 -a "$SENTINEL_PASS" SENTINEL failover veza-master

# Check replication state from any node:
redis-cli -h redis-1.lxd -a "$REDIS_PASS" INFO replication

# Tail sentinel logs across all 3:
for n in redis-1 redis-2 redis-3; do
  echo "=== $n ==="
  ssh "$n" sudo tail -50 /var/log/redis/redis-sentinel.log
done

Failover smoke test

bash infra/ansible/tests/test_redis_failover.sh

Sequence: kill the current master container, poll the sentinels until a new master is elected, assert elapsed time < 30s, verify that INFO replication on a survivor shows it is now master. Suitable for the W3 verification gate + the day-24 game-day.
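
For the polling half of that sequence, a sketch in Go against go-redis's Sentinel API (the script itself is bash; the address and the SENTINEL_PASS env var here are illustrative):

// Poll half of the failover test (sketch); the container kill stays in bash.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	sc := redis.NewSentinelClient(&redis.Options{
		Addr:     "redis-2.lxd:26379", // a surviving sentinel (illustrative)
		Password: os.Getenv("SENTINEL_PASS"),
	})

	// Record who is master before the kill...
	before, err := sc.GetMasterAddrByName(ctx, "veza-master").Result()
	if err != nil {
		log.Fatal(err)
	}

	// ...then poll until the answer changes, bounded by failover-timeout.
	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		cur, err := sc.GetMasterAddrByName(ctx, "veza-master").Result()
		if err == nil && len(cur) == 2 && cur[0] != before[0] {
			fmt.Printf("new master %s:%s elected in time\n", cur[0], cur[1])
			return
		}
		time.Sleep(time.Second)
	}
	log.Fatal("no failover within 30s")
}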

What this role does NOT cover

  • TLS between client ↔ Redis (tls-port) is W4 territory. Today the Incus bridge is the security boundary.
  • Persistent data backups — RDB snapshots stay on the data node only. Redis state is reconstructible (sessions get re-issued, presence is ephemeral) so this is intentional.
  • Cluster mode (sharding) — see "Why Sentinel and not Cluster" above. v2+.
  • Cross-host replication — three containers on the same lab host today. Day 7 of W2 already moved Postgres to dedicated hosts; the same host-split applies here when the Hetzner standby is provisioned (W2 day 7+ note in postgres_ha.yml).