veza/docs/runbooks/redis-down.md
feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)
Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology):
  * SLO_API_AVAILABILITY  : 99.5% on read (GET) endpoints
  * SLO_API_LATENCY       : 99% writes p95 < 500ms
  * SLO_PAYMENT_SUCCESS   : 99.5% on POST /api/v1/orders -> 2xx

Each SLO has two alerts:
  * <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
  * <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)

- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (slack
  + PagerDuty) from ticket-oncall (slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists first moves under 5 min + common causes.

Acceptance (Day 10): promtool check rules is green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 01:30:34 +02:00


# Runbook — Redis unavailable
> **Alert**: `RedisUnreachable` (existing, in `alert_rules.yml`).
> **Owner**: infra on-call.
## What breaks when Redis is down
Veza uses Redis for several distinct concerns; the impact differs by call site.
| Subsystem | Effect when Redis is gone | Severity |
| --------------------------------- | --------------------------------------------------- | -------- |
| Session storage / refresh tokens | Login/refresh fail — users log out on next request | **HIGH** — most users notice within minutes |
| Rate limiter (`UserRateLimiter`) | Fails open — requests stop being rate-limited | MEDIUM — capacity risk if Redis stays down for hours |
| JWT revocation | Revoked tokens accepted again | **SECURITY** — silent failure, no user-visible signal |
| Cache (track lookups, feed pages) | Slow but works — falls back to Postgres | LOW — surfaces as elevated p95 |
| Queue (RabbitMQ-fronted jobs) | Independent — RabbitMQ carries the jobs; Redis only stores queue metrics | NONE |
## First moves
1. **Confirm Redis is actually down**, not "just unreachable from one host":
```bash
redis-cli -h redis.lxd ping
```
2. If it's a single-host issue, skip ahead to "Backend can't reach Redis" below.
## Redis instance is down
```bash
# Check the systemd state on whichever host owns Redis:
sudo systemctl status redis
# If "failed", inspect logs:
sudo journalctl -u redis -n 200 --no-pager
# Disk full? The dump dir is /var/lib/redis:
df -h /var/lib/redis
```
Common causes:
- **OOM-killed by RDB snapshot.** `maxmemory` reached, no eviction policy, and the snapshot fork doubled the RSS. Set `maxmemory-policy allkeys-lru` and bump `maxmemory`.
- **Disk full.** AOF or RDB filling `/var/lib/redis`. Compact the AOF (`BGREWRITEAOF`) or move the dir to a bigger volume.
- **Process crashed.** Bring it back up: `sudo systemctl restart redis`.
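For the OOM case, the fix should also land in `redis.conf` so it survives a restart; a sketch with illustrative values (size `maxmemory` to the host's RAM, not these numbers):

```conf
# /etc/redis/redis.conf — illustrative values, not tuned for this host
maxmemory 2gb
maxmemory-policy allkeys-lru
```

`redis-cli CONFIG SET maxmemory-policy allkeys-lru` applies the same change live without a restart, but edit the file too or the setting is lost the next time Redis comes up.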
## Backend can't reach Redis
Network/DNS issue, not a Redis fault. Check:
```bash
# From the API container:
nc -zv redis.lxd 6379
# DNS resolution:
getent hosts redis.lxd
```
Likely culprits: an Incus bridge restart, a security-group change on the API host, or a stale DNS cache.
## Mitigation while Redis is down
The backend's `internal/cache/redis_cache.go` already has fallback logic for the cache path; the session and rate-limiter paths fail loudly. If recovery is going to take > 5 min:
1. **Drain new logins** by surfacing a maintenance banner on the frontend: flip `MAINTENANCE_MODE=true` in the API env and restart (existing behaviour — handled in `internal/middleware/maintenance.go`).
2. **Do NOT drop the rate limiter to "always allow"** — temporarily switch it to "always deny" via env (`RATELIMIT_FAIL_CLOSED=true`) so abuse can't ride the outage.
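A sketch of flipping both flags, hedged: the env-file format and the compose service name are assumptions about this deployment, and the demo below edits a scratch copy rather than the real file.

```bash
# In prod, point ENV_FILE at the real API env file instead of a scratch copy.
ENV_FILE=$(mktemp)
printf 'MAINTENANCE_MODE=false\nLOG_LEVEL=info\n' > "$ENV_FILE"

# Set each flag: replace the line if the key exists, append otherwise.
for kv in MAINTENANCE_MODE=true RATELIMIT_FAIL_CLOSED=true; do
  key=${kv%%=*}
  if grep -q "^${key}=" "$ENV_FILE"; then
    sed -i "s/^${key}=.*/${kv}/" "$ENV_FILE"
  else
    echo "$kv" >> "$ENV_FILE"
  fi
done

cat "$ENV_FILE"
# docker compose up -d api    # restart so the API rereads its env
```

Remember to revert both flags (and restart again) once Redis is healthy.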
## Recovery
Once Redis is back:
1. Verify connectivity from each backend instance:
```bash
docker exec veza-backend-api redis-cli -u "$REDIS_URL" ping
```
2. Existing access tokens (5 min lifetime) stay valid until expiry, but refresh tokens were lost, so users are prompted to log in again as their access tokens roll over.
3. Cache is cold — the next 5-15 min of traffic hits Postgres harder. Monitor the "p95 latency" panel on the "Veza API Overview" dashboard.
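If Grafana itself is unreachable, the panel's signal can be queried straight from Prometheus. A hedged guess at the PromQL — the metric and label names below are assumptions, not confirmed from the dashboard JSON:

```promql
# p95 over the last 5 min, assuming a standard request-duration histogram
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```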
## Postmortem trigger
Any Redis outage > 10 min triggers a postmortem. The session loss UX is bad; we want to know the time-to-detect and time-to-recover.
## Future-proofing
Redis Sentinel HA is **W3 Day 11** on the launch roadmap. Once that's in, this runbook's "instance is down" section reduces to "the failover happened; verify the new master."