Wires distributed tracing end-to-end. Backend exports OTLP/gRPC to a
collector, which tail-samples (errors + slow always, 10% rest) and
ships to Tempo. Grafana service-map dashboard pivots on the 4
instrumented hot paths.
- internal/tracing/otlp_exporter.go: InitOTLPTracer + Provider.Shutdown,
BatchSpanProcessor (5s/512 batch), ParentBased(TraceIDRatio) sampler,
W3C trace-context + baggage propagators. OTEL_SDK_DISABLED=true
short-circuits to a no-op. Failure to dial the collector is non-fatal.
- cmd/api/main.go: init at boot, defer Shutdown(5s) on exit. appVersion
ldflag-overridable for resource attributes.
- 4 hot paths instrumented:
  * handlers/auth.go::Login → "auth.login"
  * core/track/track_upload_handler.go::InitiateChunkedUpload → "track.upload.initiate"
  * core/marketplace/service.go::ProcessPaymentWebhook → "payment.webhook"
  * handlers/search_handlers.go::Search → "search.query"
  PII guarded: email masked, query content not recorded (length only).
- infra/ansible/roles/otel_collector: pin v0.116.1 contrib build,
systemd unit, tail-sampling config (errors + >500ms always kept).
- infra/ansible/roles/tempo: pin v2.7.1 monolithic, local-disk backend
(S3 deferred to v1.1), 14d retention.
- infra/ansible/playbooks/observability.yml: provisions both Incus
containers + applies common baseline + roles in order.
- inventory/lab.yml: new groups observability, otel_collectors, tempo.
- config/grafana/dashboards/service-map.json: node graph + 4 hot-path
span tables + collector throughput/queue panels.
- docs/ENV_VARIABLES.md §30: 4 OTEL_* env vars documented.
Acceptance criterion (Day 9): login → span visible in Tempo UI. Lab
deployment to validate with `ansible-playbook -i inventory/lab.yml
playbooks/observability.yml` once roles/postgres_ha is up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# `tempo` role — Grafana Tempo trace backend
Single-binary Tempo (monolithic mode) with local-disk storage and ~14-day retention. Receives OTLP/gRPC from `roles/otel_collector` and exposes the query API on `:3200` for Grafana.
## Topology
```
otel-collector ──OTLP/gRPC:4319──▶ tempo ──HTTP:3200──▶ Grafana data source
                                     │
                                     └─── /var/lib/tempo (blocks + WAL)
```
## Defaults
| variable                   | default          | meaning                        |
| -------------------------- | ---------------- | ------------------------------ |
| `tempo_version`            | `2.7.1`          | release tag                    |
| `tempo_otlp_grpc_port`     | `4319`           | OTLP/gRPC listener             |
| `tempo_http_port`          | `3200`           | query API                      |
| `tempo_storage_backend`    | `local`          | `local` (v1.0) or `s3` (v1.1+) |
| `tempo_storage_local_path` | `/var/lib/tempo` | block + WAL root               |
| `tempo_retention_h`        | `336` (14d)      | block retention                |

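As a sketch of how these defaults might render into the generated Tempo config (key names follow the Tempo 2.x configuration schema; the role's actual template may differ):

```yaml
server:
  http_listen_port: 3200           # tempo_http_port
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4319" # tempo_otlp_grpc_port
storage:
  trace:
    backend: local                 # tempo_storage_backend
    local:
      path: /var/lib/tempo/blocks  # under tempo_storage_local_path
    wal:
      path: /var/lib/tempo/wal
compactor:
  compaction:
    block_retention: 336h          # tempo_retention_h
```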
## Operations
```bash
# Status:
sudo systemctl status tempo
sudo journalctl -u tempo -f

# Health:
curl -fsS http://tempo.lxd:3200/ready
curl -fsS http://tempo.lxd:3200/metrics | grep tempo_

# Query a trace by ID:
curl -fsS "http://tempo.lxd:3200/api/traces/<trace_id>"

# Search recent traces by service:
curl -fsS "http://tempo.lxd:3200/api/search?tags=service.name=veza-backend-api"
```
## Grafana data source
In Grafana, add a Tempo data source pointing at `http://tempo.lxd:3200`. The service map in `config/grafana/dashboards/service-map.json` (W2 Day 9) references this data source by the name `tempo`.
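A minimal provisioning sketch for that data source (the file path is hypothetical; the name must stay `tempo` to match the dashboard JSON):

```yaml
# e.g. /etc/grafana/provisioning/datasources/tempo.yml (path is an assumption)
apiVersion: 1
datasources:
  - name: tempo                    # referenced by name in service-map.json
    type: tempo
    access: proxy
    url: http://tempo.lxd:3200
    isDefault: false
```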
## What this role does NOT cover
- **S3-backed storage.** v1.0 = local disk, single-host. v1.1 swaps `storage.trace.backend: s3` to ship blocks to MinIO so Tempo can run multi-replica.
- **Multi-tenancy.** Single tenant (`single-tenant`) until v1.2 brings hosted multi-tenancy in.
- **Metrics generator.** Service-map metrics are computed in the collector pipeline (cheaper than Tempo's `metrics_generator`).
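For reference, the collector-side tail-sampling policy this role depends on (errors and >500ms spans always kept, 10% of the rest) might be sketched as follows — a sketch against the `tail_sampling` processor from opentelemetry-collector-contrib, not the actual `roles/otel_collector` config:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s             # assumption; tune to trace duration
    policies:
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-always
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline-10pct
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```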