External Uptime Monitoring Setup
This guide describes how to configure external uptime monitoring for the Veza platform. Use this to get notified when services become unavailable.
Recommended Tools
- UptimeRobot (free tier: 50 monitors) — uptimerobot.com
- Better Uptime — betteruptime.com
- Pingdom — pingdom.com
- Prometheus Blackbox Exporter (self-hosted) — if all infra is self-hosted
Endpoints to Monitor
| Endpoint | Service | Purpose |
|---|---|---|
| GET /health or GET /healthz | Backend API | Basic liveness |
| GET /readyz | Backend API | Readiness (DB, Redis) |
| GET /api/v1/health | Backend API | API health (if different from root) |
| GET /health | Stream Server | Stream service liveness |
| GET /health | Chat Server | Chat service liveness |
Example URLs (replace with your domain):
- https://api.veza.com/healthz
- https://api.veza.com/readyz
- https://api.veza.com/api/v1/health
- https://stream.veza.com/health
- https://chat.veza.com/health
UptimeRobot Configuration
1. Create Monitors
- Log in to UptimeRobot
- Add Monitor → HTTP(s)
- For each endpoint:
  - Friendly Name: e.g. "Veza API Health"
  - URL: e.g. https://api.veza.com/healthz
  - Monitoring Interval: 5 minutes
  - Monitor Type: HTTP(s)
2. Configure Alert Contacts
- My Settings → Alert Contacts
- Add Email: your-team@example.com
- Add Slack (optional): webhook URL for the #alerts channel
3. Alert Settings
- Default: Alert when 2 consecutive checks fail
- Alert frequency: Every 5 minutes until resolved (or configure as needed)
Alert Procedure
- On failure: UptimeRobot sends alert to configured contacts
- Check: Visit the dashboard to see which endpoint failed
- Investigate: Check logs, Prometheus metrics, Grafana
- Resolve: Restart service, fix deployment, or rollback
- Post-mortem: Document root cause and preventive actions
Checklist
- Monitors created for all critical endpoints
- Alert contacts configured (email, Slack)
- Alert threshold: 2 consecutive failures
- Monitoring interval: 5 minutes
- Runbook or escalation path documented
Integration with Prometheus
If you use Prometheus Blackbox Exporter:
    # prometheus.yml
    scrape_configs:
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - https://api.veza.com/healthz
              - https://api.veza.com/readyz
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter:9115
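The scrape job above passes module http_2xx to the exporter, so a module with that name has to exist in the Blackbox Exporter's own configuration file. A minimal sketch of such a module follows; the timeout and IP protocol values are assumptions, not settings taken from the Veza deployment.

```yaml
# blackbox.yml (Blackbox Exporter configuration; all values are illustrative assumptions)
modules:
  http_2xx:
    prober: http              # probe over HTTP(S)
    timeout: 10s              # assumed; keep it below the Prometheus scrape timeout
    http:
      method: GET
      valid_status_codes: []  # an empty list means any 2xx response counts as success
      preferred_ip_protocol: ip4
```

With this module, a non-2xx response or a timeout from /healthz or /readyz makes the probe fail.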
Configure alerts in Grafana or Alertmanager on the probe_success metric, which the Blackbox Exporter reports as 0 when a probe fails.
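One possible way to alert on probe failures from the blackbox job is a Prometheus alerting rule on probe_success. The sketch below is illustrative only: the file name, group name, alert name, severity label, and the 5-minute hold are all assumptions, and the hold should be tuned to your scrape interval so that it roughly matches the "2 consecutive failures" threshold used elsewhere in this guide.

```yaml
# alert-rules.yml (hypothetical file name; reference it from rule_files in prometheus.yml)
groups:
  - name: uptime                # assumed group name
    rules:
      - alert: EndpointDown     # assumed alert name
        # probe_success is 0 when the Blackbox Exporter probe fails
        expr: probe_success{job="blackbox"} == 0
        for: 5m                 # assumed; failure must persist this long before firing
        labels:
          severity: critical    # assumed label, for routing in Alertmanager
        annotations:
          summary: "Probe failed for {{ $labels.instance }}"
```

The instance label carries the probed URL because of the relabel_configs in the scrape job, so the alert summary names the failing endpoint directly.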