Two complementary signals: pool-side (do we have enough connections
for the load?) and request-side (does any single handler quietly
run hundreds of queries?). Both feed Prometheus + Grafana + alert
rules.
Pool stats exporter (internal/database/pool_stats_exporter.go):
- A background goroutine ticks every 15s and feeds the existing
veza_db_connections{state} gauges. Before this, the gauges only
refreshed when /health/deep was hit, so PoolExhaustionImminent
evaluated against stale data.
- Wired into cmd/api/main.go alongside the ledger sampler, with a
shutdown hook for clean cancellation.
N+1 detector (internal/database/n1_detector.go +
internal/middleware/n1_query_counter.go):
- A per-request *int64 counter is attached to ctx by the Gin
middleware; a GORM after-callback (Query/Create/Update/Delete/
Row/Raw) atomic-adds to it.
- Cost: one pointer load + one atomic add per query.
- Cardinality is bounded by c.FullPath() (the templated route, not the raw URL).
- Threshold defaults to 50; override via VEZA_N1_THRESHOLD.
- Histogram veza_db_request_query_count + counter
veza_db_n1_suspicions_total.
Alerts in the alert_rules.yml veza_db_pool_n1 group:
- PoolExhaustionImminent (in_use ≥ 90% for 5m)
- PoolStatsExporterStuck (gauges frozen for 10m despite traffic)
- N1QuerySpike (> 3% of requests over threshold for 15m)
- SlowQuerySustained (slow query rate > 2/min for 15m on same op+table)
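As a rough sketch, the first rule in that group could look like the following (the metric and group names come from this change; the state label values, severity, and exact PromQL shape are assumptions):

```yaml
groups:
  - name: veza_db_pool_n1
    rules:
      - alert: PoolExhaustionImminent
        # state label values are an assumption; match the exporter's actual states
        expr: >
          veza_db_connections{state="in_use"}
            >= 0.9 * veza_db_connections{state="max_open"}
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DB pool >= 90% in use for 5m"
```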
Tests: 8 detector tests + 4 middleware tests, all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
75 lines
2.3 KiB
Go
package database

// Periodic DB pool stats exporter — v1.0.10 ops item 11.
//
// Before this file the pool gauges (veza_db_connections{state}) were
// only updated when the /health/deep endpoint was hit. That meant
// Prometheus scrapes between health checks saw stale numbers. The
// existing gauges + UpdateDBConnections() were already there; this
// just feeds them on a schedule so PoolExhaustionImminent and the
// "in_use ≥ MaxOpen × 0.9" alert have fresh data to evaluate.
//
// Why a goroutine and not a Prometheus collector callback: sql.DB
// doesn't expose a cheap polling hook, and the callback model would
// also force us to hold the DB handle in the metrics package, which
// is the wrong layering.

import (
	"context"
	"time"

	"go.uber.org/zap"
	"gorm.io/gorm"
)

// StartPoolStatsExporter launches a background goroutine that calls
// GetPoolStats(db) on the given interval. The goroutine exits cleanly
// when ctx is cancelled (typically on server shutdown). interval ≤ 0
// falls back to 15s — short enough that pool exhaustion shows up
// before the alert's 5m "for:" window, long enough that the overhead
// is negligible (a single sql.DB.Stats() call costs nanoseconds).
//
// Returns immediately; the caller is responsible for cancelling
// ctx on shutdown.
func StartPoolStatsExporter(ctx context.Context, db *gorm.DB, interval time.Duration, logger *zap.Logger) {
	if db == nil {
		if logger != nil {
			logger.Warn("pool stats exporter not started: db is nil")
		}
		return
	}
	if interval <= 0 {
		interval = 15 * time.Second
	}

	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()

		// Emit once immediately so the first scrape after startup
		// has a non-stale value (otherwise the gauge sits at 0
		// for up to `interval` seconds and the alert's
		// availability-vs-saturation calc reads wrong).
		if _, err := GetPoolStats(db); err != nil && logger != nil {
			logger.Debug("pool stats initial sample failed", zap.Error(err))
		}

		for {
			select {
			case <-ctx.Done():
				if logger != nil {
					logger.Debug("pool stats exporter stopped")
				}
				return
			case <-ticker.C:
				if _, err := GetPoolStats(db); err != nil && logger != nil {
					logger.Debug("pool stats sample failed", zap.Error(err))
				}
			}
		}
	}()

	if logger != nil {
		logger.Info("pool stats exporter started", zap.Duration("interval", interval))
	}
}