Wires distributed tracing end-to-end. Backend exports OTLP/gRPC to a
collector, which tail-samples (errors + slow always, 10% rest) and
ships to Tempo. Grafana service-map dashboard pivots on the 4
instrumented hot paths.
- internal/tracing/otlp_exporter.go: InitOTLPTracer + Provider.Shutdown,
  BatchSpanProcessor (5s/512 batch), ParentBased(TraceIDRatio) sampler,
  W3C trace-context + baggage propagators. OTEL_SDK_DISABLED=true
  short-circuits to a no-op. Failure to dial the collector is non-fatal.
- cmd/api/main.go: init at boot, defer Shutdown(5s) on exit. appVersion
  is ldflag-overridable for resource attributes.
- 4 hot paths instrumented:
* handlers/auth.go::Login → "auth.login"
* core/track/track_upload_handler.go::InitiateChunkedUpload → "track.upload.initiate"
* core/marketplace/service.go::ProcessPaymentWebhook → "payment.webhook"
* handlers/search_handlers.go::Search → "search.query"
PII guarded — email masked, query content not recorded (length only).
- infra/ansible/roles/otel_collector: pin v0.116.1 contrib build,
  systemd unit, tail-sampling config (errors + >500ms always kept).
- infra/ansible/roles/tempo: pin v2.7.1 monolithic, local-disk backend
  (S3 deferred to v1.1), 14d retention.
- infra/ansible/playbooks/observability.yml: provisions both Incus
  containers and applies the common baseline + roles in order.
- inventory/lab.yml: new groups observability, otel_collectors, tempo.
- config/grafana/dashboards/service-map.json: node graph + 4 hot-path
  span tables + collector throughput/queue panels.
- docs/ENV_VARIABLES.md §30: 4 OTEL_* env vars documented.
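The tail-sampling policy above (errors and >500ms spans always kept, 10% of the rest) would map onto the contrib collector's `tail_sampling` processor roughly like this — a sketch, with policy names and the `decision_wait` value assumed, not taken from the repo:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # how long to buffer a trace before deciding
    policies:
      - name: keep-errors       # any span with ERROR status keeps the trace
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow         # traces slower than 500ms always kept
        type: latency
        latency: {threshold_ms: 500}
      - name: sample-rest       # 10% probabilistic sample of everything else
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

Policies are OR-combined: a trace matching any one of them is sampled, which gives the "errors + slow always, 10% rest" behavior.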
Acceptance criterion (Day 9): a login produces a span visible in the
Tempo UI. Lab deployment to validate with `ansible-playbook -i
inventory/lab.yml playbooks/observability.yml` once roles/postgres_ha
is up.
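The exporter bullet above can be sketched as follows. This is an illustrative reconstruction assuming the standard go.opentelemetry.io/otel SDK, not the repository's actual code; the function signature and parameter names (serviceName, version) are assumptions:

```go
package tracing

import (
	"context"
	"os"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

// InitOTLPTracer wires the global tracer provider and returns a shutdown
// func the caller defers with a 5s context at exit.
func InitOTLPTracer(ctx context.Context, serviceName, version string) (func(context.Context) error, error) {
	// OTEL_SDK_DISABLED=true short-circuits to a no-op provider.
	if os.Getenv("OTEL_SDK_DISABLED") == "true" {
		return func(context.Context) error { return nil }, nil
	}

	// Endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT; a dial failure
	// is returned to the caller, who logs it and continues (non-fatal).
	exp, err := otlptracegrpc.New(ctx)
	if err != nil {
		return nil, err
	}

	res, err := resource.Merge(resource.Default(), resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceName(serviceName),
		semconv.ServiceVersion(version), // ldflag-overridable appVersion
	))
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		// Head decision: honor the parent, else sample 10% by trace ID.
		// The collector's tail sampling then rescues errors/slow traces.
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))),
		// BatchSpanProcessor: flush every 5s or at 512 queued spans.
		sdktrace.WithBatcher(exp,
			sdktrace.WithBatchTimeout(5*time.Second),
			sdktrace.WithMaxExportBatchSize(512)),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)

	// W3C trace-context + baggage propagation across service hops.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))
	return tp.Shutdown, nil
}
```

A caller in cmd/api/main.go would then defer the returned shutdown with a 5-second timeout context so buffered spans flush on exit.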
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
inventory/lab.yml (74 lines, 2.5 KiB, YAML):
```yaml
# Lab inventory — the R720's local lab Incus container used to dry-run
# role changes before they touch staging or prod. Override
# ansible_host / ansible_user / ansible_port in `host_vars/<host>.yml`
# (gitignored if it carries credentials, otherwise plain values).
#
# Usage:
#   ansible-playbook -i inventory/lab.yml playbooks/site.yml --check
#   ansible-playbook -i inventory/lab.yml playbooks/site.yml
#
# v1.0.9 Day 6: postgres_ha group added. The 3 containers
# (pgaf-monitor, pgaf-primary, pgaf-replica) live ON the veza-lab
# host and are addressed via the `community.general.incus`
# connection plugin — no SSH setup needed inside the containers.
all:
  hosts:
    veza-lab:
      ansible_host: 10.0.20.150
      ansible_user: senke
      ansible_python_interpreter: /usr/bin/python3
  children:
    incus_hosts:
      hosts:
        veza-lab:
    veza_lab:
      hosts:
        veza-lab:
    postgres_ha:
      hosts:
        pgaf-monitor:
          pg_auto_failover_role: monitor
        pgaf-primary:
          pg_auto_failover_role: node
        pgaf-replica:
          pg_auto_failover_role: node
      vars:
        # Containers reached via Incus exec on the parent host. The
        # plugin lives in the community.general collection — install
        # with `ansible-galaxy collection install community.general`
        # before running this playbook.
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    postgres_ha_monitor:
      hosts:
        pgaf-monitor:
    postgres_ha_nodes:
      # Order matters — primary first so it registers as primary; replica
      # second so it joins as standby.
      hosts:
        pgaf-primary:
        pgaf-replica:
    # v1.0.9 Day 7: pgbouncer fronts the formation. Same
    # community.general.incus connection plugin as postgres_ha.
    pgbouncer:
      hosts:
        pgaf-pgbouncer:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    # v1.0.9 Day 9: otel-collector + Tempo for distributed tracing.
    # Each runs in its own Incus container; the API on the host points
    # at otel-collector.lxd:4317 via OTEL_EXPORTER_OTLP_ENDPOINT.
    observability:
      hosts:
        otel-collector:
        tempo:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    otel_collectors:
      hosts:
        otel-collector:
    tempo:
      hosts:
        tempo:
```