# Runbook — Disk full / `/` filesystem at threshold
> **Alerts**: `DiskSpaceLow` (warning at 85%) · `DiskSpaceCritical` (page at 95%).
> **Owner**: infra on-call.
## Hosts to watch
| Host | What fills the disk |
| --------------------- | -------------------------------------------------------------------- |
| `pgaf-primary` | WAL + autovacuum bloat. WAL fills if pgBackRest archive falls behind. |
| `pgaf-replica` | Replication lag → WAL not replayed; same WAL accumulation. |
| `pgaf-pgbouncer` | Logs in `/var/log/postgresql/pgbouncer.log` if log_disconnections=on. |
| `tempo` | Trace blocks under `/var/lib/tempo`. Default retention 14d. |
| `otel-collector` | Almost never — no on-disk state by default. |
| API/web hosts (k8s) | Container images, log rotation, build caches. |
| `minio-*` | Object data — lifecycle policy supposed to manage this. |
## First moves (under 2 minutes)
```bash
df -h
# Identify the mount that's tight, then:
sudo du -h --max-depth=2 -x /var/lib | sort -hr | head -20
sudo du -h --max-depth=2 -x /var/log | sort -hr | head -20
```
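If `du` can't account for the usage, a common culprit is a deleted log or data file still held open by a process: the space stays allocated but never shows up in `du`. A quick check (column 7 of `lsof` output is the size in bytes):
```bash
# Deleted-but-still-open files (link count 0), largest first.
# The space is only released once the holding process closes the file or restarts.
sudo lsof +L1 | sort -nr -k 7 | head -15
```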
## Postgres data nodes (`pgaf-*`)
### A. WAL piling up
If `/var/lib/postgresql/16/main/pg_wal` is the offender:
```bash
# Is pgBackRest shipping?
sudo -u postgres pgbackrest --stanza=veza info | tail -20
# Last WAL push time should be < 1 minute ago.
```
If pgBackRest is stuck (S3 unreachable, credentials rotated):
1. **Don't** force `pg_resetwal` — that's data loss.
2. Fix the upstream (network, credentials), then push pending WAL:
```bash
sudo -u postgres pgbackrest --stanza=veza archive-push <wal_file>
```
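To see what is still waiting to ship, Postgres keeps one `.ready` marker per pending segment under `pg_wal/archive_status/`. A quick count and oldest-first listing (paths assume the same PG 16 data directory as above):
```bash
# How many WAL segments are queued for archiving (one .ready marker each)?
sudo -u postgres ls /var/lib/postgresql/16/main/pg_wal/archive_status/ | grep -c '\.ready$'
# Oldest pending segments first; these are what archive-push should clear first.
sudo -u postgres ls -tr /var/lib/postgresql/16/main/pg_wal/archive_status/ | grep '\.ready$' | head -5
```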
### B. Autovacuum bloat
```bash
sudo -u postgres psql -c "
SELECT relname, n_live_tup, n_dead_tup,
pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
"
```
Manual vacuum on the worst offender:
```bash
sudo -u postgres psql -c "VACUUM (VERBOSE, ANALYZE) <table>;"
# Or VACUUM FULL if you have the downtime — it rewrites the table.
```
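Note that a plain `VACUUM` only makes dead space reusable inside the table; `VACUUM FULL` is what actually returns blocks to the filesystem, at the cost of an exclusive lock while the table is rewritten. Progress of a long-running vacuum can be watched from another session via the standard `pg_stat_progress_vacuum` view:
```bash
sudo -u postgres psql -c "
SELECT p.pid, c.relname, p.phase,
       p.heap_blks_scanned, p.heap_blks_total
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
"
```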
## Tempo host
Trace blocks default to 14d retention. If the host is full anyway, the compactor isn't keeping up with retention:
```bash
sudo systemctl status tempo
sudo journalctl -u tempo -n 200 --no-pager | grep -i compact
```
Emergency recovery is to drop the oldest blocks manually:
```bash
sudo -u tempo find /var/lib/tempo/blocks -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +
```
(This is safe because the blocks are write-once, append-only; the index in `wal/` is rebuilt at restart.)
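After the cleanup, confirm Tempo comes back healthy and the space is actually recovered. A minimal check, assuming Tempo's default HTTP port 3200:
```bash
sudo systemctl restart tempo
# The readiness endpoint reports "ready" once WAL replay / index rebuild is done.
curl -s http://localhost:3200/ready
df -h /var/lib/tempo
```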
## API/web hosts (Kubernetes)
```bash
# Node view: check the DiskPressure condition and allocations.
kubectl describe node <node> | grep -A 6 -E "Conditions:|Allocated resources"
# Container logs (rotation should be handling this — check):
sudo du -h --max-depth=1 /var/log/pods | sort -hr | head -10
# If a single pod is logging GB/min, that's a regression. Restart it
# and grep its previous logs for the loop signature.
```
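A minimal sketch of that restart step; `<namespace>` and `<pod>` are placeholders, and it assumes the pod is managed by a Deployment so deleting it triggers a clean replacement:
```bash
# Save the tail of the noisy pod's logs (the loop signature) before restarting it.
kubectl -n <namespace> logs <pod> --tail=500 > /tmp/loop-signature.log
# Restart by deletion; the Deployment/ReplicaSet recreates the pod.
kubectl -n <namespace> delete pod <pod>
```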
## MinIO
If the storage bucket is full:
```bash
mc admin info veza-minio
mc du veza-minio/ --depth=2
```
Check the lifecycle policy is applied:
```bash
mc ilm rule ls veza-minio/veza-tracks
```
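If no rule comes back, expiration was never applied and the bucket keeps growing. A hedged example of re-adding one; the 30-day window is an assumption, match it to whatever retention the service actually requires:
```bash
# Expire objects older than 30 days in the tracks bucket (adjust to the real policy).
mc ilm rule add --expire-days 30 veza-minio/veza-tracks
```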
## Recovery verification
Once free space is back:
- Postgres: confirm `pg_wal` size is bounded (should be < `wal_keep_size` + ~64 MB); see the query sketch after this list.
- Tempo: `df -h /var/lib/tempo` is below 70%.
- The disk-space alert clears within one Prometheus scrape interval (~30s).
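A quick way to check the `pg_wal` bound from the first bullet without shelling into the data directory (`pg_ls_waldir()` needs superuser or `pg_monitor`):
```bash
sudo -u postgres psql -c "
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS total
FROM pg_ls_waldir();
"
```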
## Long-term prevention
- pgBackRest archive lag → fix the alert (currently only `BackupRestoreDrillStale`, which doesn't catch this); W3 backlog. A starting-point check is sketched after this list.
- Tempo retention spilling → migrate Tempo to S3-backed storage (`tempo_storage_backend: s3`). W3 day 12 covers this.
- API log volume → tighten log levels in prod (`LOG_LEVEL=INFO`).
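Until a dedicated archive-lag alert exists, `pg_stat_archiver` already exposes what it would need to watch: the last successful archive time and the running failure count. A starting-point check (the 5-minute threshold is an assumption, not an agreed SLO):
```bash
sudo -u postgres psql -c "
SELECT last_archived_wal,
       last_archived_time,
       failed_count,
       now() - last_archived_time > interval '5 minutes' AS archive_lagging
FROM pg_stat_archiver;
"
```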