Some checks failed
Veza CI / Backend (Go) (push) Failing after 8m56s
Veza CI / Frontend (Web) (push) Has been cancelled
E2E Playwright / e2e (full) (push) Has been cancelled
Veza CI / Notify on failure (push) Blocked by required conditions
Veza CI / Rust (Stream Server) (push) Successful in 5m3s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 53s
Three Incus containers, each running redis-server + redis-sentinel (co-located). redis-1 = master at first boot, redis-2/3 = replicas. Sentinel quorum=2 of 3 ; failover-timeout=30s satisfies the W3 acceptance criterion. - internal/config/redis_init.go : initRedis branches on REDIS_SENTINEL_ADDRS ; non-empty -> redis.NewFailoverClient with MasterName + SentinelAddrs + SentinelPassword. Empty -> existing single-instance NewClient (dev/local stays parametric). - internal/config/config.go : 3 new fields (RedisSentinelAddrs, RedisSentinelMasterName, RedisSentinelPassword) read from env. parseRedisSentinelAddrs trims+filters CSV. - internal/metrics/cache_hit_rate.go : new RecordCacheHit / Miss counters, labelled by subsystem. Cardinality bounded. - internal/middleware/rate_limiter.go : instrument 3 Eval call sites (DDoS, frontend log throttle, upload throttle). Hit = Redis answered, Miss = error -> in-memory fallback. - internal/services/chat_pubsub.go : instrument Publish + PublishPresence. - internal/websocket/chat/presence_service.go : instrument SetOnline / SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit (legitimate empty result). - infra/ansible/roles/redis_sentinel/ : install Redis 7 + Sentinel, render redis.conf + sentinel.conf, systemd units. Vault assertion prevents shipping placeholder passwords to staging/prod. - infra/ansible/playbooks/redis_sentinel.yml : provisions the 3 containers + applies common baseline + role. - infra/ansible/inventory/lab.yml : new groups redis_ha + redis_ha_master. - infra/ansible/tests/test_redis_failover.sh : kills the master container, polls Sentinel for the new master, asserts elapsed < 30s. - config/grafana/dashboards/redis-cache-overview.json : 3 hit-rate stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown. - docs/ENV_VARIABLES.md §3 : 3 new REDIS_SENTINEL_* env vars. - veza-backend-api/.env.template : 3 placeholders (empty default). Acceptance (Day 11) : Sentinel failover < 30s ; cache hit-rate dashboard populated. Lab test pending Sentinel deployment. W3 verification gate progress : Redis Sentinel ✓ (this commit), MinIO EC4+2 ⏳ Day 12, CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
53 lines
1.9 KiB
Go
53 lines
1.9 KiB
Go
package metrics
|
|
|
|
// Cache hit/miss counters per subsystem (v1.0.9 W3 Day 11).
|
|
//
|
|
// Three call-sites instrumented in v1.0.9:
|
|
// - rate_limiter — Redis INCR result classified as "hit" if the key
|
|
// already existed in the window (in-window request),
|
|
// "miss" if it was a new window (key just created).
|
|
// - chat_pubsub — "hit" on a successful Publish/Subscribe round-trip,
|
|
// "miss" on connection error (Redis unreachable).
|
|
// - presence — "hit" on a successful Get/Set/Del, "miss" on a key
|
|
// that didn't exist (presence stale or never set) or
|
|
// on an underlying Redis error.
|
|
//
|
|
// Subsystems are passed as labels rather than baked into separate metrics
|
|
// so dashboards can pivot. Cardinality is fixed at the three values above
|
|
// (plus future additions in W3+); never label by user_id / room_id /
|
|
// per-key — that would explode cardinality.
|
|
|
|
import (
|
|
"github.com/prometheus/client_golang/prometheus"
|
|
"github.com/prometheus/client_golang/prometheus/promauto"
|
|
)
|
|
|
|
var (
|
|
cacheHits = promauto.NewCounterVec(
|
|
prometheus.CounterOpts{
|
|
Name: "veza_cache_hits_total",
|
|
Help: "Total cache hits per subsystem",
|
|
},
|
|
[]string{"subsystem"},
|
|
)
|
|
|
|
cacheMisses = promauto.NewCounterVec(
|
|
prometheus.CounterOpts{
|
|
Name: "veza_cache_misses_total",
|
|
Help: "Total cache misses per subsystem",
|
|
},
|
|
[]string{"subsystem"},
|
|
)
|
|
)
|
|
|
|
// RecordCacheHit increments the hit counter for a subsystem. Subsystem
|
|
// must be one of the bounded set documented at file-level — adding a
|
|
// new value is a deliberate choice that should also update Grafana.
|
|
func RecordCacheHit(subsystem string) {
|
|
cacheHits.WithLabelValues(subsystem).Inc()
|
|
}
|
|
|
|
// RecordCacheMiss increments the miss counter for a subsystem.
|
|
func RecordCacheMiss(subsystem string) {
|
|
cacheMisses.WithLabelValues(subsystem).Inc()
|
|
}
|