veza/docs/runbooks/redis-down.md


Runbook — Redis unavailable

Alert: RedisUnreachable (existing, in alert_rules.yml). Owner: infra on-call.

What breaks when Redis is down

Veza uses Redis for several distinct concerns; the impact differs by callsite.

| Subsystem | Effect when Redis is gone | Severity |
| --- | --- | --- |
| Session storage / refresh tokens | Login/refresh fail — users log out on next request | HIGH — most users notice within minutes |
| Rate limiter (UserRateLimiter) | Fails open — requests stop being rate-limited | MEDIUM — capacity risk if Redis stays down for hours |
| JWT revocation | Revoked tokens accepted again | SECURITY — silent failure, no user-visible signal |
| Cache (track lookups, feed pages) | Slow but works — falls back to Postgres | LOW — surfaces as elevated p95 |
| Queue (RabbitMQ-fronted jobs) | Independent — RabbitMQ is the queue, Redis only holds its metrics | NONE |

First moves

  1. Confirm Redis is actually down, not "just unreachable from one host" (a multi-host check is sketched after this list):
    redis-cli -h redis.lxd ping

  2. If it's a single-host issue, skip ahead to "Backend can't reach Redis" below.
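
A quick way to tell an instance outage from a network partition is to run the same ping from more than one vantage point. A minimal sketch, assuming SSH access to the backend hosts; the host names api-1.lxd and api-2.lxd are illustrative, not from this runbook:

# If every host times out, the Redis instance is down; if only one does,
# it's a network/DNS problem on that host. Host names below are examples.
for h in api-1.lxd api-2.lxd; do
  echo "--- from $h ---"
  ssh "$h" 'timeout 2 redis-cli -h redis.lxd ping || echo UNREACHABLE'
done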

Redis instance is down

# Check the systemd state on whichever host owns Redis:
sudo systemctl status redis

# If "failed", inspect logs:
sudo journalctl -u redis -n 200 --no-pager

# Disk full? Dump dir is /var/lib/redis:
df -h /var/lib/redis

Common causes:

  • OOM-killed during an RDB snapshot. maxmemory reached, no eviction policy, and the snapshot fork doubled the RSS. Set maxmemory-policy allkeys-lru and bump maxmemory (see the sketch after this list).
  • Disk full. AOF or RDB filling /var/lib/redis. Compact the AOF (BGREWRITEAOF) or move the dump dir to a larger volume.
  • Process crashed. Bring it back up: sudo systemctl restart redis.
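
To confirm the memory hypothesis before changing anything, the standard redis-cli introspection commands are enough. A minimal sketch; the 2gb value is an example, not a sizing recommendation:

# How much memory is Redis using, what is its ceiling, and what happens at the ceiling?
redis-cli -h redis.lxd INFO memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'

# Apply an eviction policy at runtime, then persist both settings in redis.conf:
redis-cli -h redis.lxd CONFIG SET maxmemory-policy allkeys-lru
redis-cli -h redis.lxd CONFIG SET maxmemory 2gb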

Backend can't reach Redis

Network/DNS issue, not a Redis fault. Check:

# From the API container:
nc -zv redis.lxd 6379

# DNS resolution:
getent hosts redis.lxd

Likely culprits: Incus bridge restart, security group change on the API host, stale DNS cache.
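
If working from the host rather than a shell inside the container, the same checks can be run through docker exec, assuming the image ships nc and getent (the container name is taken from the Recovery section below; fall back to redis-cli -u "$REDIS_URL" ping if those tools are missing):

# Run the reachability checks from inside the running API container:
docker exec veza-backend-api nc -zv redis.lxd 6379
docker exec veza-backend-api getent hosts redis.lxd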

Mitigation while Redis is down

The backend's internal/cache/redis_cache.go already has fallback logic for the cache path. The session path fails loud, and the rate limiter fails open (see the table above). If recovery is going to take > 5 min:

  1. Drain new logins by surfacing a maintenance banner on the frontend: flip MAINTENANCE_MODE=true in the API env and restart (existing — set in internal/middleware/maintenance.go).
  2. Do NOT drop the rate limiter to "always allow" — temporarily switch it to "always deny" via env (RATELIMIT_FAIL_CLOSED=true) so abuse can't ride the outage. One way to flip both flags is sketched after this list.
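
How the flags actually get flipped depends on how the API is deployed, which this runbook does not pin down. A sketch assuming a docker-compose deployment with a flat .env file; the directory, file name, and service name are illustrative:

# Illustrative only: adjust the path, env file, and service name to the real deployment.
# (Assumes both keys already exist in .env; append them if not.)
cd /opt/veza
sed -i 's/^MAINTENANCE_MODE=.*/MAINTENANCE_MODE=true/' .env
sed -i 's/^RATELIMIT_FAIL_CLOSED=.*/RATELIMIT_FAIL_CLOSED=true/' .env

# Recreate the API container so it picks up the new environment:
docker compose up -d --force-recreate backend-api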

Recovery

Once Redis is back:

  1. Verify connectivity from each backend instance :
    docker exec veza-backend-api redis-cli -u "$REDIS_URL" ping
    
  2. Existing access tokens stay valid until expiry — refresh tokens were lost, so users are prompted to log in again as their access tokens (5 min lifetime) roll over.
  3. Cache is cold — the next 5-15 min of traffic hits Postgres harder. Monitor the "Veza API Overview" → "p95 latency" panel; a cache warm-up check is sketched below.
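
One way to watch the cache warm back up is the hit/miss counters in INFO stats: misses dominate right after the restart and taper off as hot keys repopulate. The 10-second interval is arbitrary:

# Watch hits vs. misses; the hit ratio climbing back is the signal that Postgres load is easing.
watch -n 10 'redis-cli -h redis.lxd INFO stats | grep -E "keyspace_hits|keyspace_misses"'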

Postmortem trigger

Any Redis outage > 10 min triggers a postmortem. The session loss UX is bad; we want to know the time-to-detect and time-to-recover.

Future-proofing

Redis Sentinel HA is W3 day 11 on the launch roadmap. Once that's in, this runbook's "instance is down" section reduces to "the failover happened, verify the new master."
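
For reference, the verification step with Sentinel in place is a single command. The port below is the Sentinel default and the master name is a placeholder, since that deployment doesn't exist yet:

# Ask any Sentinel which node it currently considers the master
# (26379 is the default Sentinel port; "veza-redis" is a placeholder master name).
redis-cli -h redis.lxd -p 26379 SENTINEL get-master-addr-by-name veza-redis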