senke
|
a36d9b2d59
|
feat(redis): Sentinel HA + cache hit rate metrics (W3 Day 11)
Three Incus containers, each running redis-server + redis-sentinel
(co-located). redis-1 = master at first boot, redis-2/3 = replicas.
Sentinel quorum=2 of 3 ; failover-timeout=30s satisfies the W3
acceptance criterion.
- internal/config/redis_init.go : initRedis branches on
REDIS_SENTINEL_ADDRS ; non-empty -> redis.NewFailoverClient with
MasterName + SentinelAddrs + SentinelPassword. Empty -> existing
single-instance NewClient (dev/local stays parametric).
- internal/config/config.go : 3 new fields (RedisSentinelAddrs,
RedisSentinelMasterName, RedisSentinelPassword) read from env.
parseRedisSentinelAddrs trims+filters CSV.
- internal/metrics/cache_hit_rate.go : new RecordCacheHit / Miss
counters, labelled by subsystem. Cardinality bounded.
- internal/middleware/rate_limiter.go : instrument 3 Eval call sites
(DDoS, frontend log throttle, upload throttle). Hit = Redis answered,
Miss = error -> in-memory fallback.
- internal/services/chat_pubsub.go : instrument Publish + PublishPresence.
- internal/websocket/chat/presence_service.go : instrument SetOnline /
SetOffline / Heartbeat / GetPresence. redis.Nil counts as a hit
(legitimate empty result).
- infra/ansible/roles/redis_sentinel/ : install Redis 7 + Sentinel,
render redis.conf + sentinel.conf, systemd units. Vault assertion
prevents shipping placeholder passwords to staging/prod.
- infra/ansible/playbooks/redis_sentinel.yml : provisions the 3
containers + applies common baseline + role.
- infra/ansible/inventory/lab.yml : new groups redis_ha + redis_ha_master.
- infra/ansible/tests/test_redis_failover.sh : kills the master
container, polls Sentinel for the new master, asserts elapsed < 30s.
- config/grafana/dashboards/redis-cache-overview.json : 3 hit-rate
stats (rate_limiter / chat_pubsub / presence) + ops/s breakdown.
- docs/ENV_VARIABLES.md §3 : 3 new REDIS_SENTINEL_* env vars.
- veza-backend-api/.env.template : 3 placeholders (empty default).
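
A minimal sketch of the CSV handling and client selection described in the list above, assuming only what the message states (trim, filter empties, branch on a non-empty list). The function names `parseSentinelAddrs` and `useSentinel` are hypothetical stand-ins, not the repository's code, and 26379 is simply Redis Sentinel's default port:

```go
package main

import (
	"fmt"
	"strings"
)

// parseSentinelAddrs is a hypothetical stand-in for parseRedisSentinelAddrs:
// it splits a comma-separated list, trims whitespace around each entry,
// and drops entries that are empty after trimming.
func parseSentinelAddrs(raw string) []string {
	var addrs []string
	for _, part := range strings.Split(raw, ",") {
		if addr := strings.TrimSpace(part); addr != "" {
			addrs = append(addrs, addr)
		}
	}
	return addrs
}

// useSentinel mirrors the branch in initRedis as described in the message:
// a non-empty address list selects the failover client, an empty list keeps
// the existing single-instance client.
func useSentinel(addrs []string) bool {
	return len(addrs) > 0
}

func main() {
	addrs := parseSentinelAddrs(" redis-1:26379, ,redis-2:26379,redis-3:26379 ")
	fmt.Println(addrs, useSentinel(addrs))

	// An unset REDIS_SENTINEL_ADDRS yields no addresses, so dev/local
	// setups stay on the single-instance path.
	fmt.Println(useSentinel(parseSentinelAddrs("")))
}
```

Filtering empty entries means a trailing comma or stray whitespace in the env var cannot accidentally switch a local setup onto the Sentinel path.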
Acceptance (Day 11) : Sentinel failover < 30s ; cache hit-rate
dashboard populated. Lab test pending Sentinel deployment.
W3 verification gate progress : Redis Sentinel ✓ (this commit),
MinIO EC4+2 ⏳ Day 12, CDN ⏳ Day 13, DMCA ⏳ Day 14, embed ⏳ Day 15.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-28 13:36:55 +02:00 |
|
senke
|
84e92a75e2
|
feat(observability): OTel SDK + collector + Tempo + 4 hot path spans (W2 Day 9)
Wires distributed tracing end-to-end. Backend exports OTLP/gRPC to a
collector, which tail-samples (errors + slow always, 10% rest) and
ships to Tempo. Grafana service-map dashboard pivots on the 4
instrumented hot paths.
- internal/tracing/otlp_exporter.go : InitOTLPTracer + Provider.Shutdown,
BatchSpanProcessor (5s/512 batch), ParentBased(TraceIDRatio) sampler,
W3C trace-context + baggage propagators. OTEL_SDK_DISABLED=true
short-circuits to a no-op. Failure to dial collector is non-fatal.
- cmd/api/main.go : init at boot, defer Shutdown(5s) on exit. appVersion
ldflag-overridable for resource attributes.
- 4 hot paths instrumented :
* handlers/auth.go::Login → "auth.login"
* core/track/track_upload_handler.go::InitiateChunkedUpload → "track.upload.initiate"
* core/marketplace/service.go::ProcessPaymentWebhook → "payment.webhook"
* handlers/search_handlers.go::Search → "search.query"
PII guarded — email masked, query content not recorded (length only).
- infra/ansible/roles/otel_collector : pin v0.116.1 contrib build,
systemd unit, tail-sampling config (errors + > 500ms always kept).
- infra/ansible/roles/tempo : pin v2.7.1 monolithic, local-disk backend
(S3 deferred to v1.1), 14d retention.
- infra/ansible/playbooks/observability.yml : provisions both Incus
containers + applies common baseline + roles in order.
- inventory/lab.yml : new groups observability, otel_collectors, tempo.
- config/grafana/dashboards/service-map.json : node graph + 4 hot-path
span tables + collector throughput/queue panels.
- docs/ENV_VARIABLES.md §30 : 4 OTEL_* env vars documented.
Acceptance criterion (Day 9) : login → span visible in Tempo UI. To be
validated in the lab with `ansible-playbook -i inventory/lab.yml
playbooks/observability.yml` once roles/postgres_ha is up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-28 01:15:11 +02:00 |
|
senke
|
94dfc80b73
|
feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F
Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").
Gauges (veza_ledger_, sampled every 60s):
* orphan_refund_rows — THE canary. Pending refunds with empty
hyperswitch_refund_id older than 5m = Phase 2 crash in
RefundOrder. Alert: > 0 for 5m → page.
* stuck_orders_pending — order pending > 30m with non-empty
payment_id. Alert: > 0 for 10m → page.
* stuck_refunds_pending — refund pending > 30m with a non-empty
hyperswitch_refund_id (hs_id).
* failed_transfers_at_max_retry — permanently_failed rows.
* reversal_pending_transfers — item B rows stuck > 30m.
Reconciler metrics (veza_reconciler_):
* actions_total{phase} — counter by phase.
* orphan_refunds_total — two-phase-bug canary.
* sweep_duration_seconds — exponential histogram.
* last_run_timestamp — alert: stale > 2h → page (worker dead).
Implementation notes:
* Sampler thresholds are hardcoded to the reconciler defaults. If the
reconciler is reconfigured the two can drift; that mismatch is
accepted (an alert firing while the reconciler is already working
is correct behavior).
* Query error sets gauge to -1 (sentinel for "sampler broken").
* marketplace package routes through monitoring recorders so it
doesn't import prometheus directly.
* Sampler runs regardless of Hyperswitch enablement; gauges
default 0 when pipeline idle.
* Graceful shutdown wired in cmd/api/main.go.
Alert rules in config/alertmanager/ledger.yml with runbook
pointers + detailed descriptions — each alert explains WHAT
happened, WHY the reconciler may not resolve it, and WHERE to
look first.
Grafana dashboard config/grafana/dashboards/ledger-health.json —
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick
+ orphan cumulative.
Tests — 6 cases, all green (sqlite :memory:):
* CountsStuckOrdersPending (includes the filter on
non-empty payment_id)
* StuckOrdersZeroWhenAllCompleted
* CountsOrphanRefunds (THE canary)
* CountsStuckRefundsWithHsID (gauge-orthogonality check)
* CountsFailedAndReversalPendingTransfers
* ReconcilerRecorders (counter + gauge shape)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-18 03:40:14 +02:00 |
|
senke
|
65375a61aa
|
chore(release): v0.952 — Observe (Grafana v1-overview, Prometheus alert_rules_v1)
|
2026-03-02 19:08:55 +01:00 |
|
senke
|
83ed4f315b
|
chore(release): v0.602 - Payout, Technical Debt & E2E Tests
- Stripe Connect: onboarding, balance, SellerDashboardView
- Interceptors: auth.ts, error.ts extracted, facade
- Grafana: dashboards enriched (p50, top endpoints, 4xx, WS, commerce)
- E2E commerce: product->order->review->invoice
- SMOKE_TEST_V0602, RETROSPECTIVE_V0602, PAYOUT_MANUAL
- Archive V0_602 scope, V0_603 placeholder, SCOPE_CONTROL v0.603
- Fix sanitizer regex (Go's regexp package does not support backreferences)
- Marketplace test schema: product_licenses, product_images, orders, licenses
|
2026-02-23 22:32:01 +01:00 |
|
senke
|
c002e74031
|
feat(monitoring): add 3 Grafana dashboards (API, Chat, Commerce)
- api-overview.json: request rate, p95 latency, 5xx errors, DB pool
- chat-overview.json: WebSocket upgrade rate, chat API
- commerce-overview.json: marketplace/commerce/orders metrics
- system-overview.json: replaces veza-dashboard.json
|
2026-02-23 19:54:01 +01:00 |
|
okinrev
|
327ac36a30
|
BASE: completing the initial repo state
|
2025-12-03 22:56:50 +01:00 |
|