# Runbook — Disk full / `/` filesystem at threshold

> **Alerts**: `DiskSpaceLow` (warning at 85%) · `DiskSpaceCritical` (page at 95%).
> **Owner**: infra on-call.

## Hosts to watch

| Host                | What fills the disk                                                        |
| ------------------- | -------------------------------------------------------------------------- |
| `pgaf-primary`      | WAL + autovacuum bloat. WAL fills if pgBackRest archiving falls behind.    |
| `pgaf-replica`      | Replication lag → WAL not replayed; same WAL accumulation.                 |
| `pgaf-pgbouncer`    | Logs in `/var/log/postgresql/pgbouncer.log` if `log_disconnections` is on. |
| `tempo`             | Trace blocks under `/var/lib/tempo`. Default retention 14d.                |
| `otel-collector`    | Almost never — no on-disk state by default.                                |
| API/web hosts (k8s) | Container images, log rotation, build caches.                              |
| `minio-*`           | Object data — the lifecycle policy is supposed to manage this.             |

## First moves (under 2 minutes)

```bash
df -h
# Identify the mount that's tight, then:
sudo du -h --max-depth=2 -x /var/lib | sort -hr | head -20
sudo du -h --max-depth=2 -x /var/log | sort -hr | head -20
```

## Postgres data nodes (`pgaf-*`)

### A. WAL piling up

If `/var/lib/postgresql/16/main/pg_wal` is the offender:

```bash
# Is pgBackRest shipping?
sudo -u postgres pgbackrest --stanza=veza info | tail -20
# Last WAL push time should be < 1 minute ago.
```

If pgBackRest is stuck (S3 unreachable, credentials rotated):

1. **Don't** force `pg_resetwal` — that's data loss.
2. Fix the upstream issue (network, credentials), then push pending WAL:

```bash
# archive-push is normally invoked by archive_command with a WAL path;
# once upstream is fixed, the Postgres archiver retries pending segments
# on its own. To push a single segment by hand:
sudo -u postgres pgbackrest --stanza=veza archive-push /var/lib/postgresql/16/main/pg_wal/<segment>
```

### B. Autovacuum bloat

```bash
sudo -u postgres psql -c "
SELECT relname, n_live_tup, n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
"
```

Manual vacuum on the worst offender (substitute the `relname` from the query above):

```bash
sudo -u postgres psql -c "VACUUM (VERBOSE, ANALYZE) <relname>;"
# Or VACUUM FULL if you have the downtime — it rewrites the table
# and takes an ACCESS EXCLUSIVE lock.
```

## Tempo host

Trace blocks default to 14d retention. If the host is full anyway, the compactor (which enforces retention) isn't keeping up:

```bash
sudo systemctl status tempo
sudo journalctl -u tempo -n 200 --no-pager | grep -i compact
```

Emergency recovery — drop the oldest blocks manually:

```bash
# Depending on layout there may be a tenant directory between blocks/
# and the block IDs — ls the tree first so you delete blocks, not a tenant.
sudo -u tempo find /var/lib/tempo/blocks -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +
```

(This is safe because completed blocks are write-once; in-flight data in `wal/` is replayed at restart.)

## API/web hosts (Kubernetes)

```bash
# Node-level view (DiskPressure shows up under the node's Conditions):
kubectl describe node | grep -A 5 "Allocated resources"

# Container logs (rotation should be handling this — check):
sudo du -h --max-depth=1 /var/log/pods | sort -hr | head -10

# If a single pod is logging GB/min, that's a regression. Restart it
# and grep its previous logs for the loop signature.
```

## MinIO

If the storage bucket is full:

```bash
mc admin info veza-minio
mc du veza-minio/ --depth=2
```

Check that the lifecycle policy is applied:

```bash
mc ilm rule list veza-minio/veza-tracks
```

## Recovery verification

Once free space is back (a combined shell check follows this list):

- Postgres: confirm `pg_wal` size is bounded (should be < `wal_keep_size` + ~64MB).
- Tempo: `df -h /var/lib/tempo` is below 70%.
- The disk-space alert clears within one Prometheus scrape interval (~30s).
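A minimal shell pass over all three checks, using the paths from earlier in this runbook (`pg_ls_waldir()` ships with PostgreSQL 10+; the Prometheus host is a placeholder, not confirmed here):

```bash
# pg_wal on-disk size, from the filesystem and from SQL (run on pgaf-*):
sudo du -sh /var/lib/postgresql/16/main/pg_wal
sudo -u postgres psql -c "SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();"

# Tempo volume usage:
df -h /var/lib/tempo

# Still-firing disk alerts, via the standard Prometheus HTTP API
# (<prometheus-host> is a placeholder — substitute the real one):
curl -s http://<prometheus-host>:9090/api/v1/alerts | grep -o 'DiskSpace[A-Za-z]*' | sort -u
```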
## Long-term prevention

- pgBackRest archive lag → fix the alerting (currently only `BackupRestoreDrillStale`, which doesn't catch this); W3 backlog.
- Tempo retention spilling → migrate Tempo to S3-backed storage (`tempo_storage_backend: s3`). W3 day 12 covers this.
- API log volume → tighten log levels in prod (`LOG_LEVEL=INFO`); a sketch follows below.
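A minimal sketch of the log-level change, assuming the API runs as a Deployment that reads `LOG_LEVEL` from its environment — the namespace and deployment name are placeholders, not confirmed by this runbook:

```bash
# Placeholders — substitute the real namespace and deployment name.
kubectl -n <namespace> set env deployment/<api-deployment> LOG_LEVEL=INFO
# set env triggers a rolling restart; wait for it to settle:
kubectl -n <namespace> rollout status deployment/<api-deployment>
```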