veza/veza-backend-api/internal/database/pool_stats_exporter.go
senke ccf3e64d9a feat(observability): DB pool monitoring + N+1 detection (v1.0.10 ops item 11)
Two complementary signals: pool-side (do we have enough connections
for the load?) and per-request side (does any single handler quietly
run hundreds of queries?). Both feed Prometheus + Grafana + alert
rules.

Pool stats exporter (internal/database/pool_stats_exporter.go):
- Background goroutine ticks every 15s and feeds the existing
  veza_db_connections{state} gauges. Before this, the gauges only
  refreshed when /health/deep was hit, so PoolExhaustionImminent
  evaluated against stale data.
- Wired into cmd/api/main.go alongside the ledger sampler with a
  shutdown hook for clean cancellation.

N+1 detector (internal/database/n1_detector.go +
internal/middleware/n1_query_counter.go):
- Per-request *int64 counter attached to ctx by the gin
  middleware; GORM after-callback (Query/Create/Update/Delete/
  Row/Raw) atomic-adds.
- Cost: one pointer load + one atomic add per query.
- Cardinality bounded by c.FullPath() (templated route, not URL).
- Threshold default 50, override via VEZA_N1_THRESHOLD.
- Histogram veza_db_request_query_count + counter
  veza_db_n1_suspicions_total.

Alerts in alert_rules.yml veza_db_pool_n1 group:
- PoolExhaustionImminent (in_use ≥ 90% for 5m)
- PoolStatsExporterStuck (gauges frozen for 10m despite traffic)
- N1QuerySpike (> 3% of requests over threshold for 15m)
- SlowQuerySustained (slow query rate > 2/min for 15m on same op+table)
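
A sketch of what the first rule might look like in alert_rules.yml. The expression and label names here are assumptions inferred from the gauge described above, not the actual rule:

```yaml
groups:
  - name: veza_db_pool_n1
    rules:
      - alert: PoolExhaustionImminent
        # Assumed metric shape: in_use and max_open exposed as states
        # of the veza_db_connections gauge.
        expr: |
          veza_db_connections{state="in_use"}
            / ignoring(state) veza_db_connections{state="max_open"} >= 0.9
        for: 5m
        labels:
          severity: warning
```

The 5m `for:` clause is why the exporter's 15s tick matters: the alert only fires if every evaluation in that window sees fresh, saturated numbers.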

Tests: 8 detector tests + 4 middleware tests, all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:53:37 +02:00

package database

// Periodic DB pool stats exporter — v1.0.10 ops item 11.
//
// Before this file the pool gauges (veza_db_connections{state}) were
// only updated when the /health/deep endpoint was hit. That meant
// Prometheus scrapes between health checks saw stale numbers. The
// existing gauges + UpdateDBConnections() were already there; this
// just feeds them on a schedule so PoolExhaustionImminent and the
// "in_use ≥ MaxOpen × 0.9" alert have fresh data to evaluate.
//
// Why a goroutine and not a Prometheus collector callback: sql.DB
// doesn't expose a cheap polling hook, and the callback model would
// also force us to hold the DB handle in the metrics package, which
// is the wrong layering.

import (
	"context"
	"time"

	"go.uber.org/zap"
	"gorm.io/gorm"
)

// StartPoolStatsExporter launches a background goroutine that calls
// GetPoolStats(db) on the given interval. The goroutine exits cleanly
// when ctx is cancelled (typically on server shutdown). An interval ≤ 0
// falls back to 15s — short enough that pool exhaustion shows up
// before the alert's 5m `for:` window, long enough that the overhead
// is negligible (a single sql.DB.Stats() call costs nanoseconds).
//
// Returns immediately; the caller is responsible for cancelling
// ctx on shutdown.
func StartPoolStatsExporter(ctx context.Context, db *gorm.DB, interval time.Duration, logger *zap.Logger) {
	if db == nil {
		if logger != nil {
			logger.Warn("pool stats exporter not started: db is nil")
		}
		return
	}
	if interval <= 0 {
		interval = 15 * time.Second
	}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		// Emit once immediately so the first scrape after startup
		// has a non-stale value (otherwise the gauge sits at 0
		// for up to `interval` seconds and the alert's
		// availability-vs-saturation calc reads wrong).
		if _, err := GetPoolStats(db); err != nil && logger != nil {
			logger.Debug("pool stats initial sample failed", zap.Error(err))
		}
		for {
			select {
			case <-ctx.Done():
				if logger != nil {
					logger.Debug("pool stats exporter stopped")
				}
				return
			case <-ticker.C:
				if _, err := GetPoolStats(db); err != nil && logger != nil {
					logger.Debug("pool stats sample failed", zap.Error(err))
				}
			}
		}
	}()
	if logger != nil {
		logger.Info("pool stats exporter started", zap.Duration("interval", interval))
	}
}