Runbook — Disk full / filesystem at threshold

Alerts: DiskSpaceLow (warning at 85%) · DiskSpaceCritical (page at 95%). Owner: infra on-call.
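
To see the same numbers the alerts see, you can query node_exporter's filesystem metrics straight from Prometheus. A minimal sketch; the Prometheus URL is an assumption, adjust it to this environment:

# Prometheus URL is an assumption; point it at the real endpoint.
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes)' \
  | jq '.data.result[] | {instance: .metric.instance, mount: .metric.mountpoint, pct_used: .value[1]}'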

Hosts to watch

| Host | What fills the disk |
| --- | --- |
| pgaf-primary | WAL + autovacuum bloat. WAL fills if the pgBackRest archive falls behind. |
| pgaf-replica | Replication lag → WAL not replayed; same WAL accumulation. |
| pgaf-pgbouncer | Logs in /var/log/postgresql/pgbouncer.log if log_disconnections=on. |
| tempo | Trace blocks under /var/lib/tempo. Default retention 14d. |
| otel-collector | Almost never — no on-disk state by default. |
| API/web hosts (k8s) | Container images, log rotation, build caches. |
| minio-* | Object data — lifecycle policy supposed to manage this. |
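
A quick sweep across the non-k8s hosts above. Sketch only: the hostnames assume the inventory names resolve as SSH targets, adjust if they differ:

# Hostnames are the inventory names from the table; adjust to real SSH aliases if needed.
for h in pgaf-primary pgaf-replica pgaf-pgbouncer tempo; do
  echo "== $h"
  ssh "$h" df -h -x tmpfs -x overlay
done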

First moves (under 2 minutes)

df -h
# Identify the mount that's tight, then:
sudo du -h --max-depth=2 -x /var/lib | sort -hr | head -20
sudo du -h --max-depth=2 -x /var/log | sort -hr | head -20
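
If du doesn't make the culprit obvious, listing the largest recently-touched files usually does. A sketch; /var/log and /var/lib are the usual suspects, point it at whatever mount df flagged:

# Large files modified in the last 2 days; adjust paths to the tight mount.
sudo find /var/log /var/lib -xdev -type f -size +500M -mtime -2 -exec ls -lh {} + 2>/dev/null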

Postgres data nodes (pgaf-*)

A. WAL piling up

If /var/lib/postgresql/16/main/pg_wal is the offender:

# Is pgBackRest shipping?
sudo -u postgres pgbackrest --stanza=veza info | tail -20

# Last WAL push time should be < 1 minute ago.
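
Postgres's own view of archiving is worth a cross-check: pg_stat_archiver shows whether archive_command calls are failing. A sketch, run on the primary:

sudo -u postgres psql -c "
  SELECT archived_count, last_archived_wal, last_archived_time,
         failed_count, last_failed_wal, last_failed_time
  FROM pg_stat_archiver;
"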

If pgBackRest is stuck (S3 unreachable, credentials rotated):

  1. Don't force pg_resetwal — that's data loss.
  2. Fix the upstream (network, credentials), then push pending WAL:
    sudo -u postgres pgbackrest --stanza=veza archive-push <wal_file>
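
If archiving looks healthy but pg_wal keeps growing, an inactive replication slot can pin WAL indefinitely. A quick check (sketch):

sudo -u postgres psql -c "
  SELECT slot_name, active,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;
"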
    

B. Autovacuum bloat

sudo -u postgres psql -c "
  SELECT relname, n_live_tup, n_dead_tup,
         pg_size_pretty(pg_total_relation_size(relid)) AS size
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC LIMIT 10;
"

Manual vacuum on the worst offender:

sudo -u postgres psql -c "VACUUM (VERBOSE, ANALYZE) <table>;"
# Or VACUUM FULL if you have the downtime — it rewrites the table.
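
If dead tuples keep accumulating after the manual pass, check how autovacuum is configured before tuning per-table thresholds. A sketch:

sudo -u postgres psql -c "
  SELECT name, setting
  FROM pg_settings
  WHERE name IN ('autovacuum', 'autovacuum_vacuum_scale_factor',
                 'autovacuum_naptime', 'autovacuum_max_workers');
"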

Tempo host

Trace blocks default to 14d retention. If the host is full anyway, the lifecycle compactor isn't keeping up:

sudo systemctl status tempo
sudo journalctl -u tempo -n 200 --no-pager | grep -i compact

Emergency recovery — drop oldest blocks manually:

sudo -u tempo find /var/lib/tempo/blocks -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +

(This is safe because the blocks are write-once, append-only; the index in wal/ is rebuilt at restart.)
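
To see what retention the compactor is actually running with, check the config on the host and compare blocks/ vs wal/ usage. The config path below is an assumption, adjust to this host's layout:

# Config path is an assumption; adjust to wherever tempo is configured on this host.
grep -n -A3 -i 'retention' /etc/tempo/tempo.yaml
sudo du -sh /var/lib/tempo/blocks /var/lib/tempo/wal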

API/web hosts (Kubernetes)

# Images:
kubectl describe node <node> | grep -A 5 "Allocated resources"

# Container logs (rotation should be handling this — check):
sudo du -h --max-depth=1 /var/log/pods | sort -hr | head -10

# If a single pod is logging GB/min, that's a regression. Restart it
# and grep its previous logs for the loop signature.
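
If images rather than logs are what's eating the node, and the runtime is containerd, unused images can be pruned on the node itself. A sketch; assumes crictl is installed on the node, and kubelet will re-pull anything it still needs:

# On the affected node. Assumes a containerd runtime with crictl available.
sudo crictl imagefsinfo     # current image filesystem usage
sudo crictl rmi --prune     # remove images not referenced by any container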

MinIO

If the storage bucket is full:

mc admin info veza-minio
mc du veza-minio/ --depth=2

Check that the lifecycle policy is applied:

mc ilm rule list veza-minio/veza-tracks
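
If no rule comes back, the policy was never applied; re-adding it looks roughly like this. The 30-day expiry is an illustrative value, not the agreed policy:

# Example expiry only; confirm the intended retention before applying.
mc ilm rule add --expire-days 30 veza-minio/veza-tracks
mc ilm rule list veza-minio/veza-tracks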

Recovery verification

Once free space is back:

  • Postgres: confirm pg_wal size is bounded (should be < wal_keep_size + ~64MB).
  • Tempo: df -h /var/lib/tempo is below 70%.
  • The disk-space alert clears within one Prometheus scrape interval (~30s).
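
For the pg_wal bound, pg_ls_waldir() gives an exact number from inside Postgres (sketch):

sudo -u postgres psql -c "
  SELECT pg_size_pretty(sum(size)) AS pg_wal_size, count(*) AS wal_segments
  FROM pg_ls_waldir();
"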

Long-term prevention

  • pgBackRest archive lag → needs a dedicated alert (the only existing one, BackupRestoreDrillStale, doesn't catch this); W3 backlog.
  • Tempo retention spilling → migrate Tempo to S3-backed storage (tempo_storage_backend: s3). W3 day 12 covers this.
  • API log volume → tighten log levels in prod (LOG_LEVEL=INFO).