External Uptime Monitoring Setup

This guide describes how to configure external uptime monitoring for the Veza platform. Use this to get notified when services become unavailable.

Recommended Tools

- UptimeRobot (free tier: 50 monitors): uptimerobot.com
- Better Uptime: betteruptime.com
- Pingdom: pingdom.com
- Prometheus Blackbox Exporter (self-hosted): an option if all infrastructure is self-hosted

Endpoints to Monitor

| Endpoint | Service | Purpose |
|---|---|---|
| `GET /health` or `GET /healthz` | Backend API | Basic liveness |
| `GET /readyz` | Backend API | Readiness (DB, Redis) |
| `GET /api/v1/health` | Backend API | API health (if different from root) |
| `GET /health` | Stream Server | Stream service liveness |
| `GET /health` | Chat Server | Chat service liveness |

Example URLs (replace with your domain):

https://api.veza.com/healthz
https://api.veza.com/readyz
https://api.veza.com/api/v1/health
https://stream.veza.com/health
https://chat.veza.com/health

UptimeRobot Configuration

1. Create Monitors

1. Log in to UptimeRobot.
2. Choose Add Monitor → HTTP(s).
3. For each endpoint, set:
   - Friendly Name: e.g. "Veza API Health"
   - URL: e.g. https://api.veza.com/healthz
   - Monitoring Interval: 5 minutes
   - Monitor Type: HTTP(s)

2. Configure Alert Contacts

- Go to My Settings → Alert Contacts.
- Add Email: your-team@example.com
- Add Slack (optional): webhook URL for the #alerts channel

3. Alert Settings

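- Default: alert when 2 consecutive checks fail. With a 5-minute interval, an outage is therefore reported roughly 5 to 10 minutes after it starts.
- Alert frequency: every 5 minutes until resolved (or configure as needed)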

Alert Procedure

1. On failure: UptimeRobot sends an alert to the configured contacts.
2. Check: Visit the dashboard to see which endpoint failed.
3. Investigate: Check logs, Prometheus metrics, and Grafana.
4. Resolve: Restart the service, fix the deployment, or roll back.
5. Post-mortem: Document the root cause and preventive actions.

Checklist

- [ ] Monitors created for all critical endpoints
- [ ] Alert contacts configured (email, Slack)
- [ ] Alert threshold: 2 consecutive failures
- [ ] Monitoring interval: 5 minutes
- [ ] Runbook or escalation path documented

Integration with Prometheus

If you use Prometheus Blackbox Exporter:

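```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # blackbox module that expects a 2xx response
    static_configs:
      - targets:
        # URLs to probe (replace with your domain)
        - https://api.veza.com/healthz
        - https://api.veza.com/readyz
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself instead of the target
      - target_label: __address__
        replacement: blackbox-exporter:9115
```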
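
The `http_2xx` module referenced in the scrape config is defined on the exporter side, in the Blackbox Exporter's own configuration file. The exporter's bundled example configuration includes a module of this name, so the fragment below is only needed to tighten the defaults. It is a minimal sketch: the timeout and the TLS and IP-protocol settings are assumptions, not values taken from the Veza repository.

```yaml
# blackbox.yml (sketch; all values below are assumptions to adapt)
modules:
  http_2xx:
    prober: http
    timeout: 5s                    # assumed; fail the probe if no answer within 5 seconds
    http:
      method: GET
      valid_status_codes: []       # empty list means "accept any 2xx status"
      fail_if_not_ssl: true        # the example URLs are all https
      preferred_ip_protocol: ip4   # assumed; drop if your endpoints are reached over IPv6
```

With `fail_if_not_ssl: true`, a probe that ends up on plain HTTP is reported as failed, which would also catch a misconfigured TLS redirect.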

Configure alerts in Grafana or Alertmanager for probe failures.
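
For the Alertmanager route, a probe-failure alert can be written as a Prometheus alerting rule on the exporter's `probe_success` metric, which is 1 when a probe succeeds and 0 when it fails. The rule below is a sketch under stated assumptions: the file name, group name, alert name, and `severity` label are hypothetical, and the `for: 10m` window is chosen to roughly mirror the "2 consecutive failures at a 5-minute interval" threshold used for UptimeRobot.

```yaml
# alerts/uptime.yml (hypothetical file name; reference it under rule_files in prometheus.yml)
groups:
  - name: uptime                  # hypothetical group name
    rules:
      - alert: EndpointDown       # hypothetical alert name
        # probe_success is 0 while the blackbox probe for a target is failing
        expr: probe_success{job="blackbox"} == 0
        # Fire only after the probe has failed continuously for 10 minutes (assumed window)
        for: 10m
        labels:
          severity: critical      # assumed label; match it to your Alertmanager routing
        annotations:
          summary: "Probe failing for {{ $labels.instance }}"
          description: "The blackbox probe for {{ $labels.instance }} has failed for 10 minutes."
```

Because the scrape config relabels the probed URL into `instance`, the alert text names the failing endpoint directly. Shorten the `for` window if faster notification matters more than avoiding alerts on brief blips.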
