
External Uptime Monitoring Setup

This guide describes how to configure external uptime monitoring for the Veza platform, so that the team is notified when a service becomes unavailable.

Endpoints to Monitor

| Endpoint | Service | Purpose |
| --- | --- | --- |
| GET /health or GET /healthz | Backend API | Basic liveness |
| GET /readyz | Backend API | Readiness (DB, Redis) |
| GET /api/v1/health | Backend API | API health (if different from root) |
| GET /health | Stream Server | Stream service liveness |
| GET /health | Chat Server | Chat service liveness |

Example URLs (replace with your domain):

  • https://api.veza.com/healthz
  • https://api.veza.com/readyz
  • https://api.veza.com/api/v1/health
  • https://stream.veza.com/health
  • https://chat.veza.com/health

UptimeRobot Configuration

1. Create Monitors

  1. Log in to UptimeRobot
  2. Add Monitor → HTTP(s)
  3. For each endpoint:
    • Friendly Name: e.g. "Veza API Health"
    • URL: e.g. https://api.veza.com/healthz
    • Monitoring Interval: 5 minutes
    • Monitor Type: HTTP(s)

2. Configure Alert Contacts

  1. My Settings → Alert Contacts
  2. Add Email: your-team@example.com
  3. Add Slack (optional): webhook URL for #alerts channel

3. Alert Settings

  • Default: Alert when 2 consecutive checks fail
  • Alert frequency: Every 5 minutes until resolved (or configure as needed)

Alert Procedure

  1. On failure: UptimeRobot sends alert to configured contacts
  2. Check: Visit the UptimeRobot dashboard to see which endpoint failed
  3. Investigate: Check service logs, Prometheus metrics, and Grafana dashboards
  4. Resolve: Restart service, fix deployment, or rollback
  5. Post-mortem: Document root cause and preventive actions

Checklist

  • Monitors created for all critical endpoints
  • Alert contacts configured (email, Slack)
  • Alert threshold: 2 consecutive failures
  • Monitoring interval: 5 minutes
  • Runbook or escalation path documented

Integration with Prometheus

If you use the Prometheus Blackbox Exporter, add a scrape job that probes the health endpoints through it:

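# prometheus.yml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe          # Blackbox Exporter probe endpoint
    params:
      module: [http_2xx]          # probe module defined in the Blackbox Exporter config
    static_configs:
      - targets:                  # URLs to probe, not hosts to scrape directly
        - https://api.veza.com/healthz
        - https://api.veza.com/readyz
    relabel_configs:
      # Pass each target URL to the exporter as the "target" query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the exporter itself
      - target_label: __address__
        replacement: blackbox-exporter:9115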
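
The scrape job above refers to a module named http_2xx, which must be defined in the exporter's own configuration file. The following is a minimal sketch of such a module, not the project's actual configuration: the file name blackbox.yml, the 5s timeout, and the HTTPS requirement are assumptions chosen for illustration.

```yaml
# blackbox.yml (assumed file name for the Blackbox Exporter configuration)
modules:
  http_2xx:
    prober: http
    timeout: 5s                 # assumption: fail the probe after 5 seconds
    http:
      method: GET
      valid_status_codes: []    # an empty list means any 2xx status is accepted
      fail_if_not_ssl: true     # assumption: all monitored URLs are served over HTTPS
```

The module name is what links the two files: the value under params.module in prometheus.yml must match a key under modules here.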

Configure alerts in Grafana or Alertmanager for probe failures; the Blackbox Exporter sets the probe_success metric to 0 whenever a probe fails.
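
As one way to alert on probe failures, a Prometheus alerting rule can watch the probe_success metric exposed by the blackbox job. This is a sketch under stated assumptions rather than a recommended production rule: the file name, group name, alert name, severity label, and the 10-minute window (chosen to roughly mirror two failed checks at a 5-minute interval) are all hypothetical.

```yaml
# alert-rules.yml (hypothetical file name; list it under rule_files in prometheus.yml)
groups:
  - name: veza-uptime                      # hypothetical group name
    rules:
      - alert: EndpointProbeFailed         # hypothetical alert name
        # probe_success is 0 when the Blackbox Exporter probe fails
        expr: probe_success{job="blackbox"} == 0
        for: 10m                           # assumption: tolerate short blips before firing
        labels:
          severity: critical               # assumption: adjust to your routing scheme
        annotations:
          summary: "Health probe failing for {{ $labels.instance }}"
          description: "{{ $labels.instance }} has failed its health probe for 10 minutes."
```

Alertmanager then routes the firing alert to the configured receivers, such as email or Slack, based on its labels.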