veza/veza-backend-api/internal/database/n1_detector.go
senke ccf3e64d9a feat(observability): DB pool monitoring + N+1 detection (v1.0.10 ops item 11)
Two complementary signals: pool-side (do we have enough connections
for the load?) and per-request side (does any single handler quietly
run hundreds of queries?). Both feed Prometheus + Grafana + alert
rules.

Pool stats exporter (internal/database/pool_stats_exporter.go):
- Background goroutine ticks every 15s and feeds the existing
  veza_db_connections{state} gauges (sketch below). Before this, the
  gauges were only refreshed when /health/deep was hit, so
  PoolExhaustionImminent evaluated against stale data.
- Wired into cmd/api/main.go alongside the ledger sampler with a
  shutdown hook for clean cancellation.
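
A minimal sketch of what the exporter loop might look like (the 15s
cadence, the veza_db_connections{state} gauges and the shutdown-hook
cancellation are from this commit; StartPoolStatsExporter,
metrics.SetDBConnections and the exact state labels are illustrative
assumptions, not the committed implementation):

	// Hypothetical sketch, not the committed pool_stats_exporter.go.
	package database

	import (
		"context"
		"database/sql"
		"time"

		"veza-backend-api/internal/metrics"
	)

	// StartPoolStatsExporter refreshes the pool gauges every 15s until
	// ctx is cancelled (the shutdown hook in cmd/api/main.go cancels it).
	func StartPoolStatsExporter(ctx context.Context, sqlDB *sql.DB) {
		go func() {
			ticker := time.NewTicker(15 * time.Second)
			defer ticker.Stop()
			for {
				select {
				case <-ctx.Done():
					return
				case <-ticker.C:
					s := sqlDB.Stats()
					metrics.SetDBConnections("open", float64(s.OpenConnections))
					metrics.SetDBConnections("in_use", float64(s.InUse))
					metrics.SetDBConnections("idle", float64(s.Idle))
				}
			}
		}()
	}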

N+1 detector (internal/database/n1_detector.go +
internal/middleware/n1_query_counter.go):
- Per-request *int64 counter attached to ctx by the gin middleware
  (see the sketch after this list); the GORM after-callback
  (Query/Create/Update/Delete/Row/Raw) atomic-adds to it.
- Cost: one pointer load + one atomic add per query.
- Cardinality bounded by c.FullPath() (templated route, not raw URL).
- Threshold default 50, override via VEZA_N1_THRESHOLD.
- Histogram veza_db_request_query_count + counter
  veza_db_n1_suspicions_total.
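
A minimal sketch of the middleware half, built only from the exported
API in n1_detector.go below (AttachCounter, ReportRequestQueryCount,
N1DetectorConfig) plus gin; the env parsing and the exact handler
shape are assumptions, so the real n1_query_counter.go may differ:

	// Hypothetical sketch, not the committed n1_query_counter.go.
	package middleware

	import (
		"os"
		"strconv"
		"sync/atomic"

		"github.com/gin-gonic/gin"
		"go.uber.org/zap"

		"veza-backend-api/internal/database"
	)

	func N1QueryCounter(logger *zap.Logger) gin.HandlerFunc {
		threshold := int64(50) // default from this commit
		if v, err := strconv.ParseInt(os.Getenv("VEZA_N1_THRESHOLD"), 10, 64); err == nil {
			threshold = v
		}
		cfg := database.N1DetectorConfig{Logger: logger, Threshold: threshold, Enabled: true}
		return func(c *gin.Context) {
			// Fresh counter per request; handlers must propagate this ctx
			// into GORM via db.WithContext for the callback to see it.
			ctx, counter := database.AttachCounter(c.Request.Context())
			c.Request = c.Request.WithContext(ctx)
			defer func() {
				// Runs even if the handler panics; c.FullPath() keeps the
				// metric cardinality bounded to the templated route.
				database.ReportRequestQueryCount(cfg, c.FullPath(), atomic.LoadInt64(counter))
			}()
			c.Next()
		}
	}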

Alerts in alert_rules.yml, veza_db_pool_n1 group:
- PoolExhaustionImminent (in_use ≥ 90% for 5m)
- PoolStatsExporterStuck (gauges frozen for 10m despite traffic)
- N1QuerySpike (> 3% of requests over threshold for 15m)
- SlowQuerySustained (slow query rate > 2/min for 15m on same op+table)

Tests: 8 detector tests + 4 middleware tests, all pass.
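
For illustration, one of the detector tests could look roughly like
the sketch below; the actual test names and structure are assumptions,
only the AttachCounter / CounterFromContext contract comes from the
file that follows:

	// Hypothetical example; the committed tests may differ.
	package database_test

	import (
		"context"
		"sync/atomic"
		"testing"

		"veza-backend-api/internal/database"
	)

	func TestAttachCounterCountsPerRequest(t *testing.T) {
		ctx, counter := database.AttachCounter(context.Background())

		// Simulate what the GORM after-callback does for three queries.
		for i := 0; i < 3; i++ {
			if c := database.CounterFromContext(ctx); c != nil {
				atomic.AddInt64(c, 1)
			}
		}
		if got := atomic.LoadInt64(counter); got != 3 {
			t.Fatalf("want 3 queries counted, got %d", got)
		}

		// Contexts that never went through AttachCounter yield nil, so
		// background jobs and WebSocket handlers are not counted.
		if database.CounterFromContext(context.Background()) != nil {
			t.Fatal("want nil counter for a context without AttachCounter")
		}
	}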

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:53:37 +02:00

package database

// N+1 query detector — v1.0.10 ops item 11.
//
// N+1 is the classic GORM trap: load N parent rows, then loop and
// load child rows individually instead of using Preload / a single
// JOIN. The handler runs fine in dev (small table) and quietly
// destroys prod once N is in the thousands.
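//
// For illustration only (hypothetical Post/Comment models, not types
// from this repo), the trap and the fix look roughly like:
//
//	var posts []Post
//	db.WithContext(ctx).Find(&posts) // 1 query for the parents
//	for i := range posts {
//		// N more queries, one per parent: the N+1.
//		db.WithContext(ctx).Where("post_id = ?", posts[i].ID).Find(&posts[i].Comments)
//	}
//
//	// Fix: Preload issues 2 queries total, regardless of N.
//	db.WithContext(ctx).Preload("Comments").Find(&posts)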
//
// This file plugs into GORM's callback API to count queries per
// request and compare to a threshold. The counter lives in the
// request context (a simple int pointer) so it doesn't leak across
// goroutines or requests. When a request finishes with a count
// above threshold we log a warning + bump a Prometheus counter,
// which feeds an alert rule and a Grafana panel.
//
// Threshold rationale: 50 queries per request is generous —
// legitimate flows (admin dashboards aggregating multiple tables)
// can exceed it, but the canonical "fetch list, loop, fetch each"
// N+1 pattern with N ≥ 50 is rare-but-real. Tunable via
// VEZA_N1_THRESHOLD; set to 0 to disable detection without
// removing the callback (which still feeds RecordDBQuery).

import (
	"context"
	"sync/atomic"

	"go.uber.org/zap"
	"gorm.io/gorm"

	"veza-backend-api/internal/metrics"
)

// queryCounterKeyType is a private type so context-key collisions
// with other packages are impossible (Go's context value lookup
// uses the key's identity, not its string).
type queryCounterKeyType struct{}

// QueryCounterKey is the context key used to thread the per-request
// query counter through GORM's callback. Exported because the
// middleware (in internal/middleware) needs to put a fresh counter
// in each request's ctx.
var QueryCounterKey = queryCounterKeyType{}

// AttachCounter returns a new context with a fresh int64 query
// counter. Call from the gin middleware on every request; the
// returned ctx must be propagated into all DB calls (via
// db.WithContext(ctx)) for the counter to fire.
func AttachCounter(ctx context.Context) (context.Context, *int64) {
	var counter int64
	return context.WithValue(ctx, QueryCounterKey, &counter), &counter
}

// CounterFromContext returns the *int64 query counter, or nil if
// the request did not pass through AttachCounter (background
// jobs, WebSocket handlers, etc.). Returning nil instead of
// allocating-on-miss is intentional: we don't want background
// flows quietly accumulating into a counter no one reads.
func CounterFromContext(ctx context.Context) *int64 {
	v := ctx.Value(QueryCounterKey)
	if v == nil {
		return nil
	}
	if c, ok := v.(*int64); ok {
		return c
	}
	return nil
}

// N1DetectorConfig configures the GORM-callback side of the
// detector. Threshold = 0 disables the warn-log path but keeps
// counting (so the metric is always available for ad-hoc
// investigation).
type N1DetectorConfig struct {
	Logger    *zap.Logger
	Threshold int64 // queries per request that triggers a warn log
	Enabled   bool
}

// RegisterN1Callbacks attaches the GORM callback that increments
// the per-request counter. It mirrors PerformanceMonitor's
// after-callback hook layout so both can coexist.
//
// The callback itself does almost nothing: grab the counter from
// the statement context, atomic.AddInt64. Cost is one pointer
// load + one atomic add per query. At 1k QPS that's ~5µs total
// CPU, negligible.
func RegisterN1Callbacks(db *gorm.DB) {
	if db == nil {
		return
	}
	bump := func(d *gorm.DB) {
		if d.Statement == nil || d.Statement.Context == nil {
			return
		}
		if c := CounterFromContext(d.Statement.Context); c != nil {
			atomic.AddInt64(c, 1)
		}
	}
	// Hook every query type; SELECT-heavy patterns are the most
	// common cause but UPDATE-in-loop is also a real pattern.
	_ = db.Callback().Query().After("gorm:query").Register("n1_detector:query", bump)
	_ = db.Callback().Create().After("gorm:create").Register("n1_detector:create", bump)
	_ = db.Callback().Update().After("gorm:update").Register("n1_detector:update", bump)
	_ = db.Callback().Delete().After("gorm:delete").Register("n1_detector:delete", bump)
	_ = db.Callback().Row().After("gorm:row").Register("n1_detector:row", bump)
	_ = db.Callback().Raw().After("gorm:raw").Register("n1_detector:raw", bump)
}

// ReportRequestQueryCount is the second half: called from the gin
// middleware after the handler finishes. It reads the counter, emits
// the histogram, logs a warning if over threshold, and bumps the
// "n+1 suspicious" counter that drives the alert rule. Idempotent —
// safe to call from a deferred middleware that fires on panic too.
func ReportRequestQueryCount(cfg N1DetectorConfig, route string, count int64) {
	if !cfg.Enabled {
		return
	}
	metrics.RecordRequestQueryCount(route, count)
	if cfg.Threshold > 0 && count > cfg.Threshold && cfg.Logger != nil {
		cfg.Logger.Warn("possible N+1 query pattern",
			zap.String("route", route),
			zap.Int64("query_count", count),
			zap.Int64("threshold", cfg.Threshold))
		metrics.RecordN1Suspicion(route)
	}
}