Wires distributed tracing end-to-end. Backend exports OTLP/gRPC to a
collector, which tail-samples (errors + slow always, 10% rest) and
ships to Tempo. Grafana service-map dashboard pivots on the 4
instrumented hot paths.
- internal/tracing/otlp_exporter.go : InitOTLPTracer + Provider.Shutdown,
  BatchSpanProcessor (5s/512 batch), ParentBased(TraceIDRatio) sampler,
  W3C trace-context + baggage propagators. OTEL_SDK_DISABLED=true
  short-circuits to a no-op. Failure to dial collector is non-fatal.
- cmd/api/main.go : init at boot, defer Shutdown(5s) on exit. appVersion
  ldflag-overridable for resource attributes.
- 4 hot paths instrumented :
  * handlers/auth.go::Login → "auth.login"
  * core/track/track_upload_handler.go::InitiateChunkedUpload → "track.upload.initiate"
  * core/marketplace/service.go::ProcessPaymentWebhook → "payment.webhook"
  * handlers/search_handlers.go::Search → "search.query"
  PII guarded — email masked, query content not recorded (length only).
- infra/ansible/roles/otel_collector : pin v0.116.1 contrib build,
  systemd unit, tail-sampling config (errors + > 500ms always kept).
- infra/ansible/roles/tempo : pin v2.7.1 monolithic, local-disk backend
  (S3 deferred to v1.1), 14d retention.
- infra/ansible/playbooks/observability.yml : provisions both Incus
  containers + applies common baseline + roles in order.
- inventory/lab.yml : new groups observability, otel_collectors, tempo.
- config/grafana/dashboards/service-map.json : node graph + 4 hot-path
  span tables + collector throughput/queue panels.
- docs/ENV_VARIABLES.md §30 : 4 OTEL_* env vars documented.
Acceptance criterion (Day 9) : login → span visible in Tempo UI. Lab
deployment to validate with `ansible-playbook -i inventory/lab.yml
playbooks/observability.yml` once roles/postgres_ha is up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
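
The tail-sampling rule described in the commit message (errors and traces slower than 500ms always kept, 10% of the rest) could be expressed in a collector config roughly as follows. This is only a sketch and not the contents of the role's otel-collector.yaml.j2, which is not shown here. The policy names, the `decision_wait` value, the listen address, and the Tempo endpoint are assumptions chosen for illustration.

```yaml
# Hypothetical collector config sketch; not the rendered otel-collector.yaml.j2.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317        # assumed listen address for the backend's OTLP/gRPC export

processors:
  tail_sampling:
    decision_wait: 10s                # assumed; how long to buffer spans before deciding
    policies:
      - name: keep-errors             # hypothetical policy name
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow               # hypothetical policy name
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-rest             # hypothetical policy name
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp:
    endpoint: tempo.example.internal:4317   # assumed Tempo OTLP/gRPC address
    tls:
      insecure: true                  # assumed lab-only setting

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```

With policies laid out this way, a trace should be kept as soon as any one policy selects it, so error and slow traces are retained regardless of the probabilistic 10%.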
96 lines
3.2 KiB
YAML
# otel_collector role — installs opentelemetry-collector-contrib as a
# tarball under /opt, drops the systemd unit, renders the config, and
# starts it. Idempotent. Designed to run in an Incus container so the
# collector can be restarted independently of the API process.
---
- name: Ensure /opt/otelcol-contrib exists
  ansible.builtin.file:
    path: /opt/otelcol-contrib
    state: directory
    owner: root
    group: root
    mode: "0755"
  tags: [otel_collector, install]

- name: Check installed otelcol version
  ansible.builtin.stat:
    path: "/opt/otelcol-contrib/otelcol-contrib-{{ otel_collector_version }}"
  register: otelcol_installed
  tags: [otel_collector, install]

- name: Download opentelemetry-collector-contrib tarball
  ansible.builtin.get_url:
    url: "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v{{ otel_collector_version }}/otelcol-contrib_{{ otel_collector_version }}_linux_{{ otel_collector_arch }}.tar.gz"
    dest: "/tmp/otelcol-contrib-{{ otel_collector_version }}.tar.gz"
    mode: "0644"
  when: not otelcol_installed.stat.exists
  tags: [otel_collector, install]

- name: Extract collector binary into versioned slot
  ansible.builtin.unarchive:
    src: "/tmp/otelcol-contrib-{{ otel_collector_version }}.tar.gz"
    dest: /opt/otelcol-contrib
    remote_src: true
    creates: "/opt/otelcol-contrib/otelcol-contrib-{{ otel_collector_version }}"
    extra_opts:
      - "--transform=s|^otelcol-contrib$|otelcol-contrib-{{ otel_collector_version }}|"
  when: not otelcol_installed.stat.exists
  tags: [otel_collector, install]

# /usr/local/bin/otelcol-contrib symlink → versioned binary. Lets us
# bump the version by changing only `otel_collector_version` and
# re-running the role; systemd unit doesn't change.
- name: Symlink /usr/local/bin/otelcol-contrib → versioned binary
  ansible.builtin.file:
    src: "/opt/otelcol-contrib/otelcol-contrib-{{ otel_collector_version }}"
    dest: /usr/local/bin/otelcol-contrib
    state: link
    force: true
  notify: Restart otel-collector
  tags: [otel_collector, install]

- name: Create otel-collector system user
  ansible.builtin.user:
    name: otelcol
    system: true
    home: /var/lib/otel-collector
    shell: /usr/sbin/nologin
    create_home: true
  tags: [otel_collector, install]

- name: Ensure /etc/otel-collector exists
  ansible.builtin.file:
    path: /etc/otel-collector
    state: directory
    owner: root
    group: otelcol
    mode: "0750"
  tags: [otel_collector, config]

- name: Render collector config
  ansible.builtin.template:
    src: otel-collector.yaml.j2
    dest: /etc/otel-collector/otel-collector.yaml
    owner: root
    group: otelcol
    mode: "0640"
  notify: Restart otel-collector
  tags: [otel_collector, config]

- name: Render systemd unit
  ansible.builtin.template:
    src: otel-collector.service.j2
    dest: /etc/systemd/system/otel-collector.service
    owner: root
    group: root
    mode: "0644"
  notify: Restart otel-collector
  tags: [otel_collector, service]

- name: Enable + start otel-collector
  ansible.builtin.systemd:
    name: otel-collector
    state: started
    enabled: true
    daemon_reload: true
  tags: [otel_collector, service]
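
The tasks above notify a `Restart otel-collector` handler and read two variables, `otel_collector_version` and `otel_collector_arch`, none of which are defined in this file. A minimal sketch of what the companion files might contain is given below. The file paths follow the usual Ansible role layout and are assumptions, as is the `amd64` architecture; the version comes from the pin mentioned in the commit message.

```yaml
# handlers/main.yml (hypothetical path and content; the real handler is not shown)
---
- name: Restart otel-collector
  ansible.builtin.systemd:
    name: otel-collector
    state: restarted
    daemon_reload: true   # pick up a re-rendered unit file before restarting

# defaults/main.yml (hypothetical path and content)
---
otel_collector_version: "0.116.1"   # no leading "v"; the download URL adds it
otel_collector_arch: amd64          # assumed; must match the release tarball name
```

Because the symlink, config, and unit tasks all notify the same handler, a version bump or config change should trigger a single restart at the end of the play rather than one per changed task.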