main.go's config-load failure path silently os.Exit(1)s, which means lumberjack's file-rotation buffer never flushes before exit and the journal only sees "started → exited 1" with zero diagnostic. Last deploy run's app log had only the "Logger initialized" line; the actual NewConfig error never made it to disk because os.Exit doesn't run defers. A plain fmt.Fprintf to stderr → goes to systemd journal synchronously → the next probe rescue dump will show what's actually failing. The original "don't write to stderr to avoid broken pipe with journald" comment cited a concern that doesn't apply at this point in startup: there's no parent to break the pipe to, and journald accepts arbitrary bytes on stderr. Keep the os.Exit but print first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
137 lines
5.7 KiB
Go
package monitoring

// Business KPI metrics (v1.0.10 ops item 10).
//
// The platform already has technical metrics (request rate, latency,
// error rate, queue depth, etc.) wired through the HTTP middleware,
// the SLO recording rules in config/prometheus/slo.yml, and a few
// per-feature counters (TracksUploadedTotal, UsersRegisteredTotal,
// PlaylistsCreatedTotal in metrics.go). Those answer "is the
// platform healthy?" but not "is the business healthy?".
//
// This file adds the business-side counters that drive the alert
// rules in the `veza_business` group of alert_rules.yml: login
// pass/fail rate (account-takeover signal), order lifecycle,
// revenue. For example, a 50% signup drop in 1h versus the same
// hour last week is a real product / signup-flow incident that
// the existing infra alerts don't catch.
//
// Naming convention: `veza_business_*_total` for counters,
// matching the existing `veza_*_total` style. Labels are bounded
// (status enum) so cardinality stays low: Prometheus tolerates
// thousands of distinct label value combinations, but each one is
// a separate time series that costs memory and CPU.

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// BusinessLoginsTotal counts every login attempt by outcome.
	// Outcomes: "success", "failure_credentials" (wrong password),
	// "failure_unverified" (email not verified), "failure_locked"
	// (account locked), "failure_2fa" (2FA failed), "failure_other".
	// The granularity matches the auth handler's branching; a new
	// outcome needs a new label value AND a new alert if its rate
	// matters.
	BusinessLoginsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "veza_business_logins_total",
			Help: "Logins broken down by outcome. Drives the auth-failure-spike alert.",
		},
		[]string{"outcome"},
	)

	// BusinessOrdersTotal counts every marketplace order state
	// transition. Statuses: "created" (CreateOrder), "completed"
	// (payment webhook), "refunded" (refund webhook), "failed"
	// (payment provider rejected). The gap between "created" and
	// "completed" over time is the funnel; alerts page when it widens.
	BusinessOrdersTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "veza_business_orders_total",
			Help: "Marketplace orders by status transition.",
		},
		[]string{"status"},
	)

	// BusinessRevenueCentsTotal accumulates platform revenue in
	// minor currency units (cents/centimes). Labelled by currency
	// because EUR and USD orders go to the same counter and Grafana
	// needs to split them for display. Counter values are float64,
	// which represents integers exactly up to 2^53, so whole cents
	// accumulate without the rounding drift that fractional amounts
	// would pick up across millions of orders.
	BusinessRevenueCentsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "veza_business_revenue_cents_total",
			Help: "Cumulative platform revenue in minor units (e.g. EUR cents).",
		},
		[]string{"currency"},
	)

	// BusinessAccountDeletionsTotal counts hard-delete and
	// GDPR-erasure account closures. Spikes are a churn signal; a
	// sustained drop to zero suggests the deletion endpoint is
	// broken (which is also a problem, since GDPR requires the
	// surface to be reachable).
	BusinessAccountDeletionsTotal = promauto.NewCounter(
		prometheus.CounterOpts{
			Name: "veza_business_account_deletions_total",
			Help: "Hard-delete + RGPD account erasures.",
		},
	)
)

// RecordLogin increments the appropriate login outcome counter.
// outcome must be one of the pre-declared label values; passing an
// unknown string explodes the cardinality and is a bug. Helpers
// below cover the common cases.
func RecordLogin(outcome string) {
	BusinessLoginsTotal.WithLabelValues(outcome).Inc()
}

// RecordLoginSuccess and the failure-* helpers are provided so
// call sites don't have to remember the exact label string.
func RecordLoginSuccess()            { RecordLogin("success") }
func RecordLoginFailureCredentials() { RecordLogin("failure_credentials") }
func RecordLoginFailureUnverified()  { RecordLogin("failure_unverified") }
func RecordLoginFailureLocked()      { RecordLogin("failure_locked") }
func RecordLoginFailure2FA()         { RecordLogin("failure_2fa") }
func RecordLoginFailureOther()       { RecordLogin("failure_other") }

// RecordOrderEvent records a marketplace order state transition.
// status must be one of "created", "completed", "refunded",
// "failed". The handler / webhook flow is responsible for
// idempotency: Prometheus counters are not transactional, so a
// duplicate webhook produces a duplicate count. That is a known
// approximation; it doesn't affect the trend, which is what the
// alert reads.
func RecordOrderEvent(status string) {
	BusinessOrdersTotal.WithLabelValues(status).Inc()
}

// RecordRevenue adds amountCents to the running revenue total for
// the given currency. Non-positive amounts and empty currencies
// are ignored: Prometheus counter semantics require monotonic
// increase (Counter.Add panics on a negative value), so refunds
// cannot be subtracted here. Refunds are tracked as a separate
// `BusinessOrdersTotal{status="refunded"}` event; net revenue is
// computed in PromQL by subtracting the refund-amount counter
// (TODO if needed; for now `currency=EUR-refund` is the
// pragmatic shortcut).
//
// In practice: call this from the payment webhook on the
// `completed` transition with the amount actually settled by the
// PSP, not the order's nominal price (the two can differ on
// partial refunds, currency conversion fees, etc.).
func RecordRevenue(amountCents int64, currency string) {
	// Counters only go up: drop zero/negative amounts and unlabelled calls.
	if amountCents <= 0 || currency == "" {
		return
	}
	BusinessRevenueCentsTotal.WithLabelValues(currency).Add(float64(amountCents))
}

// RecordAccountDeletion increments the deletion counter. Called
// from the GDPR erasure handler and the user-initiated
// account-delete endpoint.
func RecordAccountDeletion() {
	BusinessAccountDeletionsTotal.Inc()
}
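
The comment on RecordLogin warns that an unknown outcome string inflates label cardinality. One possible guard is to normalize the label before it reaches the counter. The sketch below is a standalone illustration and not part of this file: `knownLoginOutcomes` and `normalizeLoginOutcome` are hypothetical names, and folding unknown values into "failure_other" is an assumed policy rather than something the source prescribes.

```go
package main

import "fmt"

// knownLoginOutcomes mirrors the label values documented on
// BusinessLoginsTotal. Hypothetical helper, not part of the original file.
var knownLoginOutcomes = map[string]struct{}{
	"success":             {},
	"failure_credentials": {},
	"failure_unverified":  {},
	"failure_locked":      {},
	"failure_2fa":         {},
	"failure_other":       {},
}

// normalizeLoginOutcome returns outcome unchanged when it is a documented
// label value and folds anything else into "failure_other" (an assumed
// policy), so a typo at a call site cannot create a new time series.
func normalizeLoginOutcome(outcome string) string {
	if _, ok := knownLoginOutcomes[outcome]; ok {
		return outcome
	}
	return "failure_other"
}

func main() {
	// The last value is a deliberate typo to show the fallback.
	for _, o := range []string{"success", "failure_2fa", "faliure_locked"} {
		fmt.Printf("%q -> %q\n", o, normalizeLoginOutcome(o))
	}
}
```

With a guard like this, RecordLogin could pass `normalizeLoginOutcome(outcome)` to `WithLabelValues`. The trade-off is that a mislabelled call site is then hidden inside "failure_other" instead of showing up as an unexpected series, so it may be worth logging the fallback.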
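
The RecordRevenue comment mentions `currency=EUR-refund` as a pragmatic shortcut for refunds, with net revenue computed at query time. The sketch below models that idea with a plain map standing in for the counter vector, so it runs without the Prometheus client. The `revenue` type and its `record`, `recordRefund` and `net` methods are hypothetical names introduced for illustration, and the exact shape of the refund label is an assumption based on that comment.

```go
package main

import "fmt"

// revenue is an in-memory stand-in for BusinessRevenueCentsTotal, keyed by
// the value of the "currency" label. Illustration only.
type revenue map[string]int64

// record mirrors RecordRevenue's guard: non-positive amounts and empty
// currencies are dropped, so every series only ever increases.
func (r revenue) record(amountCents int64, currency string) {
	if amountCents <= 0 || currency == "" {
		return
	}
	r[currency] += amountCents
}

// recordRefund is hypothetical: it books a refund as a positive amount on
// the "<currency>-refund" series, following the shortcut in the comment.
func (r revenue) recordRefund(amountCents int64, currency string) {
	if currency == "" {
		return
	}
	r.record(amountCents, currency+"-refund")
}

// net is the subtraction a dashboard query would perform: gross revenue
// minus the refund series for the same currency.
func (r revenue) net(currency string) int64 {
	return r[currency] - r[currency+"-refund"]
}

func main() {
	r := revenue{}
	r.record(1999, "EUR")
	r.record(500, "EUR")
	r.recordRefund(500, "EUR")
	r.record(-100, "EUR") // ignored: counters cannot decrease
	fmt.Println(r["EUR"], r["EUR-refund"], r.net("EUR")) // prints "2499 500 1999"
}
```

Both series stay monotonic, which is what counter semantics require. The cost is one extra label value per currency, and any query that sums over `currency` has to exclude the refund series.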