veza/docs/runbooks/redis-down.md


Runbook — Redis unavailable

Alert: RedisUnreachable (existing, in alert_rules.yml). Owner: infra on-call.

What breaks when Redis is down

Veza uses Redis for several distinct concerns; the impact differs by callsite.

| Subsystem | Effect when Redis is gone | Severity |
| --- | --- | --- |
| Session storage / refresh tokens | Login/refresh fail — users log out on next request | HIGH — most users notice within minutes |
| Rate limiter (UserRateLimiter) | Fails open — requests stop being rate-limited | MEDIUM — capacity risk if Redis stays down for hours |
| JWT revocation | Revoked tokens accepted again | SECURITY — silent failure, no user-visible signal |
| Cache (track lookups, feed pages) | Slow but works — falls back to Postgres | LOW — surfaces as elevated p95 |
| Queue (RabbitMQ-fronted jobs) | Independent — RabbitMQ is the queue, Redis only holds its metrics | NONE |

First moves

  1. Confirm Redis is actually down, not "just unreachable from one host" (a multi-host check is sketched after this list):
    redis-cli -h redis.lxd ping

  2. If it's a single-host issue, skip ahead to "Backend can't reach Redis" below.
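
A quick way to tell an instance outage from a network partition is to run the same ping from more than one vantage point. A minimal sketch, assuming SSH access to the backend hosts; the host names api-1.lxd and api-2.lxd are illustrative, not from this runbook:

# If every host times out, the Redis instance is down; if only one does,
# it's a network/DNS problem on that host. Host names below are examples.
for h in api-1.lxd api-2.lxd; do
  echo "--- from $h ---"
  ssh "$h" 'timeout 2 redis-cli -h redis.lxd ping || echo UNREACHABLE'
done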

Redis instance is down

# Check the systemd state on whichever host owns Redis:
sudo systemctl status redis

# If "failed", inspect logs:
sudo journalctl -u redis -n 200 --no-pager

# Disk full? Dump dir is /var/lib/redis:
df -h /var/lib/redis

Common causes:

  • OOM-killed during an RDB snapshot. maxmemory reached, no eviction policy, and the snapshot fork doubled the RSS. Set maxmemory-policy allkeys-lru and bump maxmemory (see the sketch after this list).
  • Disk full. AOF or RDB filling /var/lib/redis. Compact the AOF (BGREWRITEAOF) or move the dump dir to a larger volume.
  • Process crashed. Bring it back up: sudo systemctl restart redis.
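
To confirm the memory hypothesis before changing anything, the standard redis-cli introspection commands are enough. A minimal sketch; the 2gb value is an example, not a sizing recommendation:

# How much memory is Redis using, what is its ceiling, and what happens at the ceiling?
redis-cli -h redis.lxd INFO memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'

# Apply an eviction policy at runtime, then persist both settings in redis.conf:
redis-cli -h redis.lxd CONFIG SET maxmemory-policy allkeys-lru
redis-cli -h redis.lxd CONFIG SET maxmemory 2gb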

Backend can't reach Redis

Network/DNS issue, not a Redis fault. Check:

# From the API container:
nc -zv redis.lxd 6379

# DNS resolution:
getent hosts redis.lxd

Likely culprits: Incus bridge restart, security group change on the API host, stale DNS cache.
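
If working from the host rather than a shell inside the container, the same checks can be run through docker exec, assuming the image ships nc and getent (the container name is taken from the Recovery section below; fall back to redis-cli -u "$REDIS_URL" ping if those tools are missing):

# Run the reachability checks from inside the running API container:
docker exec veza-backend-api nc -zv redis.lxd 6379
docker exec veza-backend-api getent hosts redis.lxd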

Mitigation while Redis is down

The backend's internal/cache/redis_cache.go already has fallback logic for the cache path. The session path fails loud, and the rate limiter fails open (see the table above). If recovery is going to take > 5 min:

  1. Drain new logins by surfacing a maintenance banner on the frontend: flip MAINTENANCE_MODE=true in the API env and restart (existing — set in internal/middleware/maintenance.go).
  2. Do NOT drop the rate limiter to "always allow" — temporarily switch it to "always deny" via env (RATELIMIT_FAIL_CLOSED=true) so abuse can't ride the outage. One way to flip both flags is sketched after this list.
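
How the flags actually get flipped depends on how the API is deployed, which this runbook does not pin down. A sketch assuming a docker-compose deployment with a flat .env file; the directory, file name, and service name are illustrative:

# Illustrative only: adjust the path, env file, and service name to the real deployment.
# (Assumes both keys already exist in .env; append them if not.)
cd /opt/veza
sed -i 's/^MAINTENANCE_MODE=.*/MAINTENANCE_MODE=true/' .env
sed -i 's/^RATELIMIT_FAIL_CLOSED=.*/RATELIMIT_FAIL_CLOSED=true/' .env

# Recreate the API container so it picks up the new environment:
docker compose up -d --force-recreate backend-api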

Recovery

Once Redis is back:

  1. Verify connectivity from each backend instance :
    docker exec veza-backend-api redis-cli -u "$REDIS_URL" ping
    
  2. Existing access tokens stay valid until expiry — refresh tokens were lost, so users are prompted to log in again as their access tokens (5 min lifetime) roll over.
  3. Cache is cold — the next 5-15 min of traffic hits Postgres harder. Monitor the "Veza API Overview" → "p95 latency" panel; a cache warm-up check is sketched below.
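
One way to watch the cache warm back up is the hit/miss counters in INFO stats: misses dominate right after the restart and taper off as hot keys repopulate. The 10-second interval is arbitrary:

# Watch hits vs. misses; the hit ratio climbing back is the signal that Postgres load is easing.
watch -n 10 'redis-cli -h redis.lxd INFO stats | grep -E "keyspace_hits|keyspace_misses"'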

Postmortem trigger

Any Redis outage > 10 min triggers a postmortem. The session loss UX is bad; we want to know the time-to-detect and time-to-recover.

Future-proofing

Redis Sentinel HA is W3 day 11 on the launch roadmap. Once that's in, this runbook's "instance is down" section reduces to "the failover happened, verify the new master."
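
For reference, the verification step with Sentinel in place is a single command. The port below is the Sentinel default and the master name is a placeholder, since that deployment doesn't exist yet:

# Ask any Sentinel which node it currently considers the master
# (26379 is the default Sentinel port; "veza-redis" is a placeholder master name).
redis-cli -h redis.lxd -p 26379 SENTINEL get-master-addr-by-name veza-redis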