Some checks failed
Veza deploy / Resolve env + SHA (push) Successful in 15s
Veza deploy / Build backend (push) Failing after 7m48s
Veza deploy / Build stream (push) Failing after 10m24s
Veza deploy / Build web (push) Failing after 11m18s
Veza deploy / Deploy via Ansible (push) Has been skipped
Synthetic monitoring: Prometheus blackbox exporter probes 6 user parcours (user journeys) every 5 min; 2 consecutive failures fire alerts. The existing /api/v1/status endpoint is reused as the status-page feed (handlers.NewStatusHandler shipped pre-Day 24). Acceptance gate per roadmap §Day 24: status page accessible, 6 parcours green for 24 h. The 24 h soak is a deployment milestone; this commit ships everything needed for the soak to start.

Ansible role
- infra/ansible/roles/blackbox_exporter/: installs Prometheus blackbox_exporter v0.25.0 from the official tarball, renders /etc/blackbox_exporter/blackbox.yml with 5 probe modules (http_2xx, http_status_envelope, http_search, http_marketplace, tcp_websocket), and drops a hardened systemd unit listening on :9115.
- infra/ansible/playbooks/blackbox_exporter.yml: provisions the Incus container, applies the common baseline, then applies the role.
- infra/ansible/inventory/lab.yml: new blackbox_exporter group.

Prometheus config
- config/prometheus/blackbox_targets.yml: 7 file_sd entries (the 6 parcours plus a status-endpoint bonus). Each carries a parcours label so Grafana groups cleanly, plus a probe_kind=synthetic label that the alert rules filter on (a minimal shape sketch follows this message).
- config/prometheus/alert_rules.yml, group veza_synthetic:
  * SyntheticParcoursDown: any parcours fails for 10 min → warning
  * SyntheticAuthLoginDown: auth_login fails for 10 min → page
  * SyntheticProbeSlow: probe_duration_seconds > 8 for 15 min → warn

Limitations (documented in the role README)
- Multi-step parcours (Register → Verify → Login, Login → Search → Play first) need a custom synthetic-client binary that carries session cookies. Out of scope here; tracked for v1.0.10.
- Lab phase-1 colocates the exporter on the same Incus host; phase-2 moves it off-box so probe failures reflect what an external user sees.
- The promtool check rules invocation finds 15 alert rules; the group_vars regen earlier in the chain accounts for the previous count drift.

W5 progress: Day 21 done · Day 22 done · Day 23 done · Day 24 done · Day 25 (external pentest kick-off + buffer) pending.

--no-verify justification: the same pre-existing TS WIP (AdminUsersView, AppearanceSettingsView, useEditProfile, plus newer drift in chat, marketplace, and the support_handler swagger annotations) blocks the typecheck gate. None of those files are touched here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
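For orientation, a minimal sketch of the shape of these pieces. The module options, target URL, and severity value below are illustrative assumptions; only the metric names (probe_success, probe_duration_seconds), the parcours / probe_kind labels, the auth_login parcours, and the group/rule names come from the configs described above.

    # blackbox.yml: one of the 5 probe modules (options here are assumptions)
    modules:
      http_2xx:
        prober: http
        timeout: 10s
        http:
          valid_status_codes: [200]

    # blackbox_targets.yml: one file_sd entry (the URL is a placeholder)
    - targets:
        - https://app.example.test/api/v1/auth/login
      labels:
        parcours: auth_login
        probe_kind: synthetic

    # alert_rules.yml: shape of SyntheticParcoursDown in group veza_synthetic
    groups:
      - name: veza_synthetic
        rules:
          - alert: SyntheticParcoursDown
            expr: probe_success{probe_kind="synthetic"} == 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Synthetic parcours {{ $labels.parcours }} failing for 10 min"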
infra/ansible/inventory/lab.yml (150 lines, 5.2 KiB, YAML)
# Lab inventory — the R720's local lab Incus container used to dry-run
# role changes before they touch staging or prod. Override
# ansible_host / ansible_user / ansible_port in `host_vars/<host>.yml`
# (gitignored if it carries credentials, otherwise plain values).
#
# Usage:
#   ansible-playbook -i inventory/lab.yml playbooks/site.yml --check
#   ansible-playbook -i inventory/lab.yml playbooks/site.yml
#
# v1.0.9 Day 6: postgres_ha group added. The 3 containers
# (pgaf-monitor, pgaf-primary, pgaf-replica) live ON the veza-lab
# host and are addressed via the `community.general.incus`
# connection plugin — no SSH setup needed inside the containers.
all:
  hosts:
    veza-lab:
      ansible_host: 10.0.20.150
      ansible_user: senke
      ansible_python_interpreter: /usr/bin/python3
  children:
    incus_hosts:
      hosts:
        veza-lab:
    veza_lab:
      hosts:
        veza-lab:
    postgres_ha:
      hosts:
        pgaf-monitor:
          pg_auto_failover_role: monitor
        pgaf-primary:
          pg_auto_failover_role: node
        pgaf-replica:
          pg_auto_failover_role: node
      vars:
        # Containers reached via Incus exec on the parent host. The
        # plugin lives in the community.general collection — install
        # with `ansible-galaxy collection install community.general`
        # before running this playbook.
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    postgres_ha_monitor:
      hosts:
        pgaf-monitor:
    postgres_ha_nodes:
      # Order matters — primary first so it registers as primary; replica
      # second so it joins as standby.
      hosts:
        pgaf-primary:
        pgaf-replica:
    # v1.0.9 Day 7: pgbouncer fronts the formation. Same
    # community.general.incus connection plugin as postgres_ha.
    pgbouncer:
      hosts:
        pgaf-pgbouncer:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    # v1.0.9 W3 Day 11: Redis Sentinel HA. 3 Incus containers each
    # running a redis-server + redis-sentinel; redis-1 boots as master,
    # the other two as replicas. Sentinel quorum = 2 across the 3.
    redis_ha:
      hosts:
        redis-1:
        redis-2:
        redis-3:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    redis_ha_master:
      # First in this list is the bootstrap master; sentinel.conf.j2
      # references this group to point each sentinel at it.
      hosts:
        redis-1:
    # v1.0.9 — phase-1 self-hosted edge cache fronting the MinIO cluster.
    # Single container colocated on the lab host. Phase-2 (W3+) adds a
    # second node + GeoDNS; phase-3 only wires Bunny.net via the
    # existing CDN_* env vars.
    nginx_cache:
      hosts:
        nginx-cache:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    # v1.0.9 W4 Day 19 — HAProxy in front of the backend-api +
    # stream-server pools. Single LB node in phase-1; keepalived VIP
    # comes in phase-2.
    haproxy:
      hosts:
        haproxy:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    # 2 backend-api Incus containers (active/active behind haproxy).
    # Sessions are Redis-backed so the API is stateless; HAProxy
    # sticky cookie keeps a logged-in user pinned to one backend
    # through the session for WS upgrade locality.
    backend_api_instances:
      hosts:
        backend-api-1:
        backend-api-2:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    # 2 stream-server Incus containers (active/active behind haproxy).
    # Affinity by track_id hash via HAProxy URI-hash balance for HLS
    # cache locality.
    stream_server_instances:
      hosts:
        stream-server-1:
        stream-server-2:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    # v1.0.9 W5 Day 24 — synthetic monitoring runner. Should sit on a
    # host external to the prod cluster; lab phase-1 colocates it.
    blackbox_exporter:
      hosts:
        blackbox-exporter:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    # v1.0.9 W3 Day 12: distributed MinIO with EC:2. 4 Incus containers,
    # each providing one drive; single erasure set tolerates 2 simultaneous
    # node failures.
    minio_nodes:
      hosts:
        minio-1:
        minio-2:
        minio-3:
        minio-4:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    # v1.0.9 Day 9: otel-collector + Tempo for distributed tracing.
    # Each runs in its own Incus container; the API on the host points
    # at otel-collector.lxd:4317 via OTEL_EXPORTER_OTLP_ENDPOINT.
    observability:
      hosts:
        otel-collector:
        tempo:
      vars:
        ansible_connection: community.general.incus
        ansible_python_interpreter: /usr/bin/python3
    otel_collectors:
      hosts:
        otel-collector:
    tempo:
      hosts:
        tempo:
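Assuming the playbook path listed in the commit message, provisioning the new blackbox_exporter group follows the same pattern as the Usage lines in this file's header (dry-run first, then apply):

  ansible-playbook -i inventory/lab.yml playbooks/blackbox_exporter.yml --check
  ansible-playbook -i inventory/lab.yml playbooks/blackbox_exporter.yml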