Some checks failed
Veza CI / Rust (Stream Server) (push) Successful in 5m4s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 42s
Veza CI / Backend (Go) (push) Failing after 15m45s
Veza CI / Frontend (Web) (push) Successful in 18m7s
Veza CI / Notify on failure (push) Successful in 6s
E2E Playwright / e2e (full) (push) Successful in 24m9s
Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology):
* SLO_API_AVAILABILITY: 99.5% availability on read (GET) endpoints
* SLO_API_LATENCY: 99% of writes with p95 < 500ms
* SLO_PAYMENT_SUCCESS: 99.5% of POST /api/v1/orders -> 2xx
Each SLO has two alerts:
* <name>SLOFastBurn — page-grade, 2% of budget burned in 1h (1h+5m windows)
* <name>SLOSlowBurn — ticket-grade, 5% of budget burned in 6h (6h+30m windows)
- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack
  + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists first moves doable in under 5 min plus common causes.
Acceptance (Day 10): promtool check rules is green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
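The acceptance gate can be re-run locally before pushing; a minimal sketch, assuming promtool is installed and the rule file sits at the path listed above:

```bash
# Validate the SLO recording + alerting rules; per the commit above the
# expected output ends with "SUCCESS: 18 rules found".
promtool check rules config/prometheus/slo.yml
```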
# Runbook — Redis unavailable

> **Alert**: `RedisUnreachable` (existing, in `alert_rules.yml`).
> **Owner**: infra on-call.

## What breaks when Redis is down

Veza uses Redis for several distinct concerns; the impact differs by callsite.

| Subsystem | Effect when Redis is gone | Severity |
| --- | --- | --- |
| Session storage / refresh tokens | Login/refresh fail — users log out on next request | **HIGH** — most users notice within minutes |
| Rate limiter (`UserRateLimiter`) | Fails open — requests stop being rate-limited | MEDIUM — capacity risk if Redis stays down for hours |
| JWT revocation | Revoked tokens are accepted again | **SECURITY** — silent failure, no user-visible signal |
| Cache (track lookups, feed pages) | Slow but works — falls back to Postgres | LOW — surfaces as elevated p95 |
| Queue (RabbitMQ-fronted jobs) | Independent — RabbitMQ is the queue, Redis is just metrics | NONE |

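A quick probe from the API host shows which of these paths is actually hurting. A minimal sketch; the API port and login route below are assumptions, not confirmed values:

```bash
# Is Redis answering at all?
redis-cli -h redis.lxd ping || echo "redis unreachable"

# Does the session path still work? (hypothetical port/route; adjust to the real API)
# A 5xx here points at the session/Redis path; a 400/401 means the API itself is healthy.
curl -s -o /dev/null -w "login endpoint -> HTTP %{http_code}\n" \
  -X POST http://localhost:8080/api/v1/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"email":"probe@example.com","password":"wrong-on-purpose"}'
```
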
## First moves

1. **Confirm Redis is actually down**, not "just unreachable from one host":

   ```bash
   redis-cli -h redis.lxd ping
   ```

2. If it's a single-host issue, skip ahead to "Backend can't reach Redis" below (a multi-host probe sketch follows this list).
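
The multi-host probe referenced in step 2: a sketch, assuming SSH access to the API hosts; the `api-1.lxd`/`api-2.lxd` names are placeholders.

```bash
# redis.lxd is the Redis host from this runbook; the api-* names are placeholders.
for h in api-1.lxd api-2.lxd; do
  if ssh "$h" 'nc -zv -w 2 redis.lxd 6379' >/dev/null 2>&1; then
    echo "$h -> redis.lxd:6379 reachable"
  else
    echo "$h -> redis.lxd:6379 UNREACHABLE"
  fi
done
```
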
## Redis instance is down

```bash
# Check the systemd state on whichever host owns Redis:
sudo systemctl status redis

# If "failed", inspect logs:
sudo journalctl -u redis -n 200 --no-pager

# Disk full? Dump dir is /var/lib/redis:
df -h /var/lib/redis
```

Common causes (a confirmation sketch follows this list):

- **OOM-killed during the RDB snapshot.** `maxmemory` reached, no eviction policy, and the snapshot fork doubled the RSS. Set `maxmemory-policy allkeys-lru` and bump `maxmemory`.
- **Disk full.** AOF or RDB files filling `/var/lib/redis`. Compact the AOF (`BGREWRITEAOF`) or move the directory.
- **Process crashed.** Bring it back up: `sudo systemctl restart redis`.
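
To confirm which cause it was, these read-only checks usually suffice; the redis-cli ones assume the instance answers again (e.g. right after a restart), while the disk and dmesg checks work regardless:

```bash
# Memory pressure: how close was Redis to maxmemory, and is eviction enabled?
redis-cli -h redis.lxd config get maxmemory
redis-cli -h redis.lxd config get maxmemory-policy
redis-cli -h redis.lxd info memory | grep -E 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'

# Persistence footprint: how big are the RDB/AOF files?
ls -lh /var/lib/redis

# Did the kernel OOM-kill the process?
sudo dmesg -T | grep -iE 'killed process.*redis' | tail -n 5
```
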
## Backend can't reach Redis

Network/DNS issue, not a Redis fault. Check:

```bash
# From the API container:
nc -zv redis.lxd 6379

# DNS resolution:
getent hosts redis.lxd
```

Likely culprits: Incus bridge restart, security group change on the API host, stale DNS cache.
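
To narrow down which culprit it is, a sketch of the usual follow-up checks; the bridge name and the use of systemd-resolved are assumptions about this environment:

```bash
# Is the Incus bridge up on the API host? (incusbr0 is the Incus default name; adjust if different)
incus network list
ip -br link show incusbr0

# DNS path: what does the resolver return, and does flushing the cache help?
resolvectl query redis.lxd
sudo resolvectl flush-caches
```
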
## Mitigation while Redis is down

The backend's `internal/cache/redis_cache.go` already has fallback logic for the cache path; the session and rate-limiter paths fail loudly. If recovery is going to take > 5 min (an env-flip sketch follows this list):

1. **Drain new logins** by surfacing a maintenance banner on the frontend: flip `MAINTENANCE_MODE=true` in the API env and restart (existing — set in `internal/middleware/maintenance.go`).
2. **Do NOT drop the rate limiter to "always allow"** — temporarily switch it to "always deny" via env (`RATELIMIT_FAIL_CLOSED=true`) so abuse can't ride the outage.
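
What the flip looks like depends on how the API is deployed; a minimal sketch assuming a docker-compose deployment with an `.env` file (the file path and service name are assumptions):

```bash
# Enable maintenance mode and fail the rate limiter closed.
# (Assumes both variables already exist in .env; append them otherwise.)
sed -i 's/^MAINTENANCE_MODE=.*/MAINTENANCE_MODE=true/' .env
sed -i 's/^RATELIMIT_FAIL_CLOSED=.*/RATELIMIT_FAIL_CLOSED=true/' .env

# Recreate the API container so it picks up the new env.
docker compose up -d --force-recreate veza-backend-api

# Sanity check: the running container should see both flags.
docker exec veza-backend-api env | grep -E 'MAINTENANCE_MODE|RATELIMIT_FAIL_CLOSED'
```
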
## Recovery

Once Redis is back:

1. Verify connectivity from each backend instance:

   ```bash
   docker exec veza-backend-api redis-cli -u "$REDIS_URL" ping
   ```

2. Existing sessions wind down gracefully rather than surviving intact — refresh tokens were lost with Redis, but access tokens (5 min lifetime) keep working until expiry, so users are prompted to log in again as their access tokens roll over.
3. The cache is cold — the next 5-15 min of traffic hits Postgres harder. Monitor the "Veza API Overview" → "p95 latency" panel (a warm-up check follows this list).
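
Two follow-ups once things look healthy: watch the cache warm up, then revert the mitigation flags. The revert assumes the same deployment shape as in the mitigation section above.

```bash
# Cache warm-up: the hit/miss ratio should climb back over the next few minutes.
redis-cli -h redis.lxd info stats | grep -E 'keyspace_hits|keyspace_misses'

# Revert the mitigation flags and recreate the API container.
sed -i 's/^MAINTENANCE_MODE=.*/MAINTENANCE_MODE=false/' .env
sed -i 's/^RATELIMIT_FAIL_CLOSED=.*/RATELIMIT_FAIL_CLOSED=false/' .env
docker compose up -d --force-recreate veza-backend-api
```
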
## Postmortem trigger

Any Redis outage > 10 min triggers a postmortem. The session-loss UX is bad; we want to know the time-to-detect and the time-to-recover.

## Future-proofing
Redis Sentinel HA is **W3 day 11** on the launch roadmap. Once that's in, this runbook's "instance is down" section reduces to "the failover happened, verify the new master."
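
For reference, the post-failover verification would look roughly like this once Sentinel is in; the Sentinel port and the master name `veza-master` are assumptions:

```bash
# Ask Sentinel which node it currently considers the master.
redis-cli -h redis.lxd -p 26379 sentinel get-master-addr-by-name veza-master

# Confirm that node really is the master and accepts writes.
redis-cli -h <new-master-ip> ping
redis-cli -h <new-master-ip> info replication | grep role
```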