Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology) :
* SLO_API_AVAILABILITY : 99.5% on read (GET) endpoints
* SLO_API_LATENCY : 99% writes p95 < 500ms
* SLO_PAYMENT_SUCCESS : 99.5% on POST /api/v1/orders -> 2xx
Each SLO has two alerts :
* <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
* <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)
- config/prometheus/slo.yml : 12 recording rules + 6 alerts ; promtool
check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml : routing tree splits page-oncall (slack
+ PagerDuty) from ticket-oncall (slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
+ db-failover, redis-down, disk-full, cert-expiring-soon : one stub
per likely page. Each lists first moves under 5min + common causes.
Acceptance (Day 10) : promtool check rules green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
123 lines
3.8 KiB
Markdown
# Runbook — Disk full / `/` filesystem at threshold

> **Alerts** : `DiskSpaceLow` (warning at 85%) · `DiskSpaceCritical` (page at 95%).
>
> **Owner** : infra on-call.

## Hosts to watch

| Host                  | What fills the disk                                                   |
| --------------------- | --------------------------------------------------------------------- |
| `pgaf-primary`        | WAL + autovacuum bloat. WAL fills if pgBackRest archive falls behind. |
| `pgaf-replica`        | Replication lag → WAL not replayed; same WAL accumulation.            |
| `pgaf-pgbouncer`      | Logs in `/var/log/postgresql/pgbouncer.log` if log_disconnections=on. |
| `tempo`               | Trace blocks under `/var/lib/tempo`. Default retention 14d.           |
| `otel-collector`      | Almost never: no on-disk state by default.                            |
| API/web hosts (k8s)   | Container images, log rotation, build caches.                         |
| `minio-*`             | Object data; the lifecycle policy is supposed to manage this.         |

## First moves (under 2 minutes)

```bash
df -h
# Identify the mount that's tight, then :
sudo du -h --max-depth=2 -x /var/lib | sort -hr | head -20
sudo du -h --max-depth=2 -x /var/log | sort -hr | head -20
```
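
Rather than eyeballing the `df` output, the tight mounts can be filtered mechanically. The following is a minimal sketch: the function name `mounts_over_threshold` is hypothetical, the default of 85 simply mirrors the `DiskSpaceLow` warning level, and mount points containing spaces are not handled.

```bash
# Hypothetical helper: read POSIX-format `df -P` output on stdin and print
# "<mount> <use%>" for every filesystem at or above the given threshold.
mounts_over_threshold() {
  local threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)        # strip the trailing percent sign
    if (use + 0 >= t) print $6, use "%"
  }'
}

# Usage: df -P | mounts_over_threshold 85
```

Reading from stdin keeps the filter independent of the host, so the same function works on output captured over SSH.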

## Postgres data nodes (`pgaf-*`)

### A. WAL piling up

If `/var/lib/postgresql/16/main/pg_wal` is the offender :

```bash
# Is pgBackRest shipping ?
sudo -u postgres pgbackrest --stanza=veza info | tail -20

# Last WAL push time should be < 1 minute ago.
```
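
PostgreSQL marks each WAL segment that still awaits archiving with a `.ready` file under `pg_wal/archive_status`, so counting those files gives a quick measure of the backlog alongside the pgBackRest output. A minimal sketch; the function name is hypothetical and the path in the usage line assumes the data directory named above.

```bash
# Hypothetical helper: count WAL segments still waiting to be archived.
# Takes the archive_status directory as its only argument.
count_pending_wal() {
  local status_dir="$1"
  find "$status_dir" -maxdepth 1 -type f -name '*.ready' | wc -l | tr -d ' '
}

# Usage (needs read access to the data directory):
#   count_pending_wal /var/lib/postgresql/16/main/pg_wal/archive_status
```

A count that keeps growing between two runs suggests the archive command is still failing; a count that shrinks suggests the backlog is draining on its own.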

If pgBackRest is stuck (S3 unreachable, credentials rotated) :

1. **Don't** force `pg_resetwal`; that's data loss.
2. Fix the upstream (network, credentials), then push pending WAL :

   ```bash
   sudo -u postgres pgbackrest --stanza=veza archive-push <wal_file>
   ```

### B. Autovacuum bloat

```bash
sudo -u postgres psql -c "
  SELECT relname, n_live_tup, n_dead_tup,
         pg_size_pretty(pg_total_relation_size(relid)) AS size
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC LIMIT 10;
"
```
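
Raw dead-tuple counts favour large tables, so the share of dead tuples is often a better guide to which table to vacuum first. A sketch under assumptions: the helper name and the 20% cut-off are illustrative, and the input is expected as `relname|n_live_tup|n_dead_tup` rows, which is the shape `psql -At` produces with its default field separator.

```bash
# Hypothetical helper: read "relname|n_live_tup|n_dead_tup" rows on stdin
# and print tables whose dead-tuple share is at or above a percentage.
bloated_tables() {
  local min_pct="${1:-20}"
  awk -F'|' -v m="$min_pct" '{
    total = $2 + $3
    if (total > 0) {
      pct = 100 * $3 / total
      if (pct >= m) printf "%s %.0f%%\n", $1, pct
    }
  }'
}

# Usage:
#   sudo -u postgres psql -At -c \
#     "SELECT relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables" \
#     | bloated_tables 20
```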

Manual vacuum on the worst offender :

```bash
sudo -u postgres psql -c "VACUUM (VERBOSE, ANALYZE) <table>;"
# Plain VACUUM frees space for reuse but rarely shrinks the files on disk.
# VACUUM FULL rewrites the table and returns space to the OS, but it takes
# an exclusive lock (downtime) and needs free disk for the new copy.
```

## Tempo host

Trace blocks default to 14d retention. If the host is full anyway, the lifecycle compactor isn't keeping up :

```bash
sudo systemctl status tempo
sudo journalctl -u tempo -n 200 --no-pager | grep -i compact
```

Emergency recovery, dropping the oldest blocks manually :

```bash
sudo -u tempo find /var/lib/tempo/blocks -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +
```

(This is safe because the blocks are write-once, append-only ; the index in `wal/` is rebuilt at restart.)
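
Even so, `rm -rf` is irreversible, so it may be worth listing what the `find` expression matches before deleting anything. A minimal sketch that reuses the same `find` predicates without the `-exec`; the function name is hypothetical.

```bash
# Hypothetical helper: list (without deleting) the top-level block
# directories last modified more than the given number of days ago.
list_old_blocks() {
  local blocks_dir="$1" days="${2:-14}"
  find "$blocks_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" | sort
}

# Usage: list_old_blocks /var/lib/tempo/blocks 14
```

If the list looks right, the deletion command can be run with the same directory and age.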

## API/web hosts (Kubernetes)

```bash
# Images :
kubectl describe node <node> | grep -A 5 "Allocated resources"

# Container logs (rotation should be handling this; check) :
sudo du -h --max-depth=1 /var/log/pods | sort -hr | head -10

# If a single pod is logging GB/min, that's a regression. Restart it
# and grep its previous logs for the loop signature.
```

## MinIO

If the storage bucket is full :

```bash
mc admin info veza-minio
mc du veza-minio/ --depth=2
```

Check that the lifecycle policy is applied :

```bash
mc ilm rule list veza-minio/veza-tracks
```

## Recovery verification

Once free space is back :

- Postgres : confirm `pg_wal` size is bounded (should be < `wal_keep_size` + ~64 MB).
- Tempo : `df -h /var/lib/tempo` is below 70%.
- The disk-space alert clears within one Prometheus scrape interval (~30s).
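
The size check on `pg_wal` can be scripted. A minimal sketch with a hypothetical function name; the limit is supplied by the caller (for example `wal_keep_size` plus the margin mentioned above, converted to MiB) rather than read from Postgres.

```bash
# Hypothetical helper: print a directory's size and succeed only if it is
# at or below a limit given in MiB (du rounds the size up to whole MiB).
dir_within_mb() {
  local dir="$1" limit_mb="$2" size_mb
  size_mb="$(du -sm "$dir" | awk '{print $1}')"
  echo "${dir}: ${size_mb} MiB (limit ${limit_mb} MiB)"
  [ "$size_mb" -le "$limit_mb" ]
}

# Usage (needs read access to the data directory):
#   dir_within_mb /var/lib/postgresql/16/main/pg_wal <limit_mb>
```

The exit status makes the function usable in a post-incident checklist script: a non-zero status means the directory is still above the bound.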

## Long-term prevention

- pgBackRest archive lag → add an alert for it (the only existing one, `BackupRestoreDrillStale`, doesn't catch this) ; W3 backlog.
- Tempo retention spilling → migrate Tempo to S3-backed storage (`tempo_storage_backend: s3`). W3 day 12 covers this.
- API log volume → tighten log levels in prod (`LOG_LEVEL=INFO`).