Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology):
* SLO_API_AVAILABILITY: 99.5% on read (GET) endpoints
* SLO_API_LATENCY: 99% on writes, p95 < 500ms
* SLO_PAYMENT_SUCCESS: 99.5% on POST /api/v1/orders -> 2xx
Each SLO has two alerts:
* <name>SLOFastBurn — page-grade, 2% of budget burned in 1h (1h + 5m windows)
* <name>SLOSlowBurn — ticket-grade, 5% of budget burned in 6h (6h + 30m windows)
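For reference, the burn rates these thresholds imply, assuming the workbook's
standard 30-day SLO window (the window length is an assumption, not stated here):
* fast burn: 2% of budget in 1h => 0.02 * 720h / 1h = 14.4x the allowed error rate
* slow burn: 5% of budget in 6h => 0.05 * 720h / 6h = 6x
For the 99.5% SLOs that corresponds to short-window error rates of roughly
14.4 * 0.5% = 7.2% (fast) and 6 * 0.5% = 3% (slow).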
- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack
+ PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
+ db-failover, redis-down, disk-full, cert-expiring-soon: one stub
per likely page. Each lists the first moves (under 5 min) + common causes.
Acceptance (Day 10): promtool check rules green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runbook — Redis unavailable
Alert:
RedisUnreachable (existing, in alert_rules.yml). Owner: infra on-call.
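To check whether the alert is still firing, query Alertmanager (the URL below is
an assumption; point it at whichever Alertmanager the routes.yml above targets):
# Is RedisUnreachable still firing?
amtool --alertmanager.url=http://alertmanager:9093 alert query alertname=RedisUnreachable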
What breaks when Redis is down
Veza uses Redis for several distinct concerns; the impact differs by callsite.
| Subsystem | Effect when Redis is gone | Severity |
|---|---|---|
| Session storage / refresh tokens | Login/refresh fail — users log out on next request | HIGH — most users notice within minutes |
| Rate limiter (UserRateLimiter) | Fails open — requests stop being rate-limited | MEDIUM — capacity risk if Redis stays down for hours |
| JWT revocation | Revoked tokens accepted again | SECURITY — silent failure, no user-visible signal |
| Cache (track lookups, feed pages) | Slow but works — falls back to Postgres | LOW — surfaces as elevated p95 |
| Queue (RabbitMQ-fronted jobs) | Independent — RabbitMQ is the queue, Redis is just metrics | NONE |
First moves
- Confirm Redis is actually down, not "just unreachable from one host":
  redis-cli -h redis.lxd ping
- If it's a single-host issue, skip ahead to "Backend can't reach Redis" below.
Redis instance is down
# Check the systemd state on whichever host owns Redis:
sudo systemctl status redis
# If "failed", inspect logs :
sudo journalctl -u redis -n 200 --no-pager
# Disk full? Dump dir is /var/lib/redis:
df -h /var/lib/redis
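If the log shows no clean error before the exit, the kernel OOM killer may have
taken the process out (ties into "Common causes" below); one way to confirm:
# Look for an OOM kill of the redis-server process:
sudo dmesg -T | grep -iE 'out of memory|killed process' | tail -20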
Common causes:
- OOM-killed during the RDB snapshot. maxmemory reached, no eviction policy, and
the snapshot fork doubled the RSS. Set maxmemory-policy allkeys-lru and bump
maxmemory (see the sketch after this list).
- Disk full. AOF or RDB filling /var/lib/redis. Compact the AOF (BGREWRITEAOF) or
move the dir.
- Process crashed. Bring it back up: sudo systemctl restart redis.
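A sketch of the maxmemory mitigation, once Redis answers again (the 2gb value is
illustrative; size it to the host):
# Confirm memory pressure was actually the cause:
redis-cli -h redis.lxd info memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'
# Apply an eviction policy and a higher ceiling, then persist them to redis.conf:
redis-cli -h redis.lxd config set maxmemory-policy allkeys-lru
redis-cli -h redis.lxd config set maxmemory 2gb
redis-cli -h redis.lxd config rewrite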
Backend can't reach Redis
Network/DNS issue, not a Redis fault. Check:
# From the API container :
nc -zv redis.lxd 6379
# DNS resolution :
getent hosts redis.lxd
Likely culprits: Incus bridge restart, security group change on the API host, stale DNS cache.
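A couple of quick checks for those (the bridge name incusbr0 is an assumption; use
whatever bridge redis.lxd actually sits on):
# Bridge still up and carrying the expected subnet?
incus network show incusbr0
# Flush a stale negative DNS entry on the host resolver:
sudo resolvectl flush-caches
# Last resort: restart the API container so its connection pool re-resolves redis.lxd
docker restart veza-backend-api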
Mitigation while Redis is down
The backend's internal/cache/redis_cache.go already has fallback logic for the
cache path. The session path fails loud and the rate limiter fails open (see the
table above). If recovery is going to take > 5 min:
- Drain new logins by surfacing a maintenance banner on the frontend: flip
MAINTENANCE_MODE=true in the API env and restart (existing — set in
internal/middleware/maintenance.go).
- Do NOT drop the rate limiter to "always allow" — temporarily switch it to
"always deny" via env (RATELIMIT_FAIL_CLOSED=true) so abuse can't ride the
outage. A sketch of both env flips follows below.
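A sketch of the env flips, assuming the API container is managed by docker compose
with a backend-api service (adjust to however the deployment actually injects env):
# 1. Set MAINTENANCE_MODE=true and RATELIMIT_FAIL_CLOSED=true in the API env file.
# 2. Recreate the container so the new env is picked up:
docker compose up -d --force-recreate backend-api
# 3. Confirm the flags landed:
docker exec veza-backend-api env | grep -E 'MAINTENANCE_MODE|RATELIMIT_FAIL_CLOSED'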
Recovery
Once Redis is back:
- Verify connectivity from each backend instance:
  docker exec veza-backend-api redis-cli -u "$REDIS_URL" ping
- Existing sessions stay valid only until access tokens expire: refresh tokens
were lost, but access tokens (5 min lifetime) keep working until expiry, so users
will be prompted to log in again as their access tokens roll over.
- Cache is cold — the next 5-15 min of traffic hits Postgres harder. Monitor the
"Veza API Overview" → "p95 latency" panel, and watch the keyspace refill (see the
sketch below).
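One way to watch the cache warm back up (dbsize should climb over the first few
minutes; this assumes REDIS_URL is set in the invoking shell, as in the ping
check above):
watch -n 30 'docker exec veza-backend-api redis-cli -u "$REDIS_URL" dbsize'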
Postmortem trigger
Any Redis outage > 10 min triggers a postmortem. The session loss UX is bad; we want to know the time-to-detect and time-to-recover.
Future-proofing
Redis Sentinel HA is W3 day 11 on the launch roadmap. Once that's in, this runbook's "instance is down" section reduces to "the failover happened, verify the new master."
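Once Sentinel is in place, that verification will look roughly like this (26379 is
Sentinel's default port; the master name veza-master is a placeholder):
# Ask any Sentinel which instance it currently considers the master:
redis-cli -h redis.lxd -p 26379 sentinel get-master-addr-by-name veza-master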