veza/docs/runbooks/redis-down.md
feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)
Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology):
  * SLO_API_AVAILABILITY  : 99.5% on read (GET) endpoints
  * SLO_API_LATENCY       : 99% writes p95 < 500ms
  * SLO_PAYMENT_SUCCESS   : 99.5% on POST /api/v1/orders -> 2xx

Each SLO has two alerts:
  * <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
  * <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)

- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (slack
  + PagerDuty) from ticket-oncall (slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists first moves under 5 min + common causes.

Acceptance (Day 10): promtool check rules is green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 01:30:34 +02:00


# Runbook — Redis unavailable
> **Alert**: `RedisUnreachable` (existing, in `alert_rules.yml`).
> **Owner**: infra on-call.
## What breaks when Redis is down
Veza uses Redis for several distinct concerns; the impact differs by call site.
| Subsystem | Effect when Redis is gone | Severity |
| --------------------------------- | --------------------------------------------------- | -------- |
| Session storage / refresh tokens | Login/refresh fail — users log out on next request | **HIGH** — most users notice within minutes |
| Rate limiter (`UserRateLimiter`) | Fails open — requests stop being rate-limited | MEDIUM — capacity risk if Redis stays down for hours |
| JWT revocation | Revoked tokens accepted again | **SECURITY** — silent failure, no user-visible signal |
| Cache (track lookups, feed pages) | Slow but works — falls back to Postgres | LOW — surfaces as elevated p95 |
| Queue (RabbitMQ-fronted jobs) | Independent — RabbitMQ carries the jobs; Redis only stores queue metrics | NONE |
## First moves
1. **Confirm Redis is actually down**, not "just unreachable from one host":
```bash
redis-cli -h redis.lxd ping
```
2. If it's a single-host issue, skip ahead to "Backend can't reach Redis" below.
## Redis instance is down
```bash
# Check the systemd state on whichever host owns Redis:
sudo systemctl status redis
# If "failed", inspect logs:
sudo journalctl -u redis -n 200 --no-pager
# Disk full? The dump dir is /var/lib/redis:
df -h /var/lib/redis
```
Common causes:
- **OOM-killed by RDB snapshot.** `maxmemory` reached, no eviction policy, and the snapshot fork doubled the RSS. Set `maxmemory-policy allkeys-lru` and bump `maxmemory`.
- **Disk full.** AOF or RDB filling `/var/lib/redis`. Compact the AOF (`BGREWRITEAOF`) or move the dir to a bigger volume.
- **Process crashed.** Bring it back up: `sudo systemctl restart redis`.
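For the OOM case, the fix should also land in `redis.conf` so it survives a restart; a sketch with illustrative values (size `maxmemory` to the host's RAM, not these numbers):

```conf
# /etc/redis/redis.conf — illustrative values, not tuned for this host
maxmemory 2gb
maxmemory-policy allkeys-lru
```

`redis-cli CONFIG SET maxmemory-policy allkeys-lru` applies the same change live without a restart, but edit the file too or the setting is lost the next time Redis comes up.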
## Backend can't reach Redis
Network/DNS issue, not a Redis fault. Check:
```bash
# From the API container:
nc -zv redis.lxd 6379
# DNS resolution:
getent hosts redis.lxd
```
Likely culprits: an Incus bridge restart, a security-group change on the API host, or a stale DNS cache.
## Mitigation while Redis is down
The backend's `internal/cache/redis_cache.go` already has fallback logic for the cache path; the session and rate-limiter paths fail loudly. If recovery is going to take > 5 min:
1. **Drain new logins** by surfacing a maintenance banner on the frontend: flip `MAINTENANCE_MODE=true` in the API env and restart (existing behaviour — handled in `internal/middleware/maintenance.go`).
2. **Do NOT drop the rate limiter to "always allow"** — temporarily switch it to "always deny" via env (`RATELIMIT_FAIL_CLOSED=true`) so abuse can't ride the outage.
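A sketch of flipping both flags, hedged: the env-file format and the compose service name are assumptions about this deployment, and the demo below edits a scratch copy rather than the real file.

```bash
# In prod, point ENV_FILE at the real API env file instead of a scratch copy.
ENV_FILE=$(mktemp)
printf 'MAINTENANCE_MODE=false\nLOG_LEVEL=info\n' > "$ENV_FILE"

# Set each flag: replace the line if the key exists, append otherwise.
for kv in MAINTENANCE_MODE=true RATELIMIT_FAIL_CLOSED=true; do
  key=${kv%%=*}
  if grep -q "^${key}=" "$ENV_FILE"; then
    sed -i "s/^${key}=.*/${kv}/" "$ENV_FILE"
  else
    echo "$kv" >> "$ENV_FILE"
  fi
done

cat "$ENV_FILE"
# docker compose up -d api    # restart so the API rereads its env
```

Remember to revert both flags (and restart again) once Redis is healthy.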
## Recovery
Once Redis is back:
1. Verify connectivity from each backend instance:
```bash
docker exec veza-backend-api redis-cli -u "$REDIS_URL" ping
```
2. Existing access tokens (5 min lifetime) stay valid until expiry, but refresh tokens were lost, so users are prompted to log in again as their access tokens roll over.
3. Cache is cold — the next 5-15 min of traffic hits Postgres harder. Monitor the "p95 latency" panel on the "Veza API Overview" dashboard.
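If Grafana itself is unreachable, the panel's signal can be queried straight from Prometheus. A hedged guess at the PromQL — the metric and label names below are assumptions, not confirmed from the dashboard JSON:

```promql
# p95 over the last 5 min, assuming a standard request-duration histogram
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```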
## Postmortem trigger
Any Redis outage > 10 min triggers a postmortem. The session loss UX is bad; we want to know the time-to-detect and time-to-recover.
## Future-proofing
Redis Sentinel HA is **W3 Day 11** on the launch roadmap. Once that's in, this runbook's "instance is down" section reduces to "the failover happened; verify the new master."