# External Uptime Monitoring Setup

This guide describes how to configure external uptime monitoring for the Veza platform. Use this to get notified when services become unavailable.

## Recommended Tools

- **UptimeRobot** (free tier: 50 monitors): [uptimerobot.com](https://uptimerobot.com)
- **Better Uptime**: [betteruptime.com](https://betteruptime.com)
- **Pingdom**: [pingdom.com](https://pingdom.com)
- **Prometheus Blackbox Exporter** (self-hosted): if all infra is self-hosted

## Endpoints to Monitor

| Endpoint | Service | Purpose |
|----------|---------|---------|
| `GET /health` or `GET /healthz` | Backend API | Basic liveness |
| `GET /readyz` | Backend API | Readiness (DB, Redis) |
| `GET /api/v1/health` | Backend API | API health (if different from root) |
| `GET /health` | Stream Server | Stream service liveness |
| `GET /health` | Chat Server | Chat service liveness |

**Example URLs** (replace with your domain):

- `https://api.veza.com/healthz`
- `https://api.veza.com/readyz`
- `https://api.veza.com/api/v1/health`
- `https://stream.veza.com/health`
- `https://chat.veza.com/health`

## UptimeRobot Configuration

### 1. Create Monitors

1. Log in to [UptimeRobot](https://uptimerobot.com)
2. Add Monitor → HTTP(s)
3. For each endpoint:
   - **Friendly Name**: e.g. "Veza API Health"
   - **URL**: e.g. `https://api.veza.com/healthz`
   - **Monitoring Interval**: 5 minutes
   - **Monitor Type**: HTTP(s)

### 2. Configure Alert Contacts

1. My Settings → Alert Contacts
2. Add Email: your-team@example.com
3. Add Slack (optional): webhook URL for `#alerts` channel

### 3. Alert Settings

- **Default**: Alert when 2 consecutive checks fail
- **Alert frequency**: Every 5 minutes until resolved (or configure as needed)

## Alert Procedure

1. **On failure**: UptimeRobot sends alert to configured contacts
2. **Check**: Visit the dashboard to see which endpoint failed
3. **Investigate**: Check logs, Prometheus metrics, Grafana
4. **Resolve**: Restart service, fix deployment, or rollback
5. **Post-mortem**: Document root cause and preventive actions

## Checklist

- [ ] Monitors created for all critical endpoints
- [ ] Alert contacts configured (email, Slack)
- [ ] Alert threshold: 2 consecutive failures
- [ ] Monitoring interval: 5 minutes
- [ ] Runbook or escalation path documented

## Integration with Prometheus

If you use Prometheus Blackbox Exporter:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.veza.com/healthz
          - https://api.veza.com/readyz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

Configure alerts in Grafana or Alertmanager for probe failures.
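
For the Alertmanager route, a probe-failure alert could be expressed as a Prometheus alerting rule. The following is a minimal sketch, not a definitive configuration. The file name `alerts.yml`, the group name, the alert name, the `severity` label, and the `for: 2m` duration are all assumptions made for illustration. `probe_success` is the gauge the Blackbox Exporter reports for each probe (1 on success, 0 on failure), and the `job="blackbox"` selector matches the `job_name` in the scrape config above.

```yaml
# alerts.yml (hypothetical file name; list it under `rule_files:` in prometheus.yml)
groups:
  - name: uptime-probes            # assumed group name
    rules:
      - alert: EndpointDown        # assumed alert name
        # probe_success is 0 when the Blackbox Exporter probe failed
        expr: probe_success{job="blackbox"} == 0
        # Assumed duration: the probe must keep failing this long before
        # the alert fires. Tune it to your scrape interval.
        for: 2m
        labels:
          severity: critical       # assumed label, used for Alertmanager routing
        annotations:
          summary: "Probe failed for {{ $labels.instance }}"
```

The `for` clause plays a role similar to the "2 consecutive failures" threshold used for UptimeRobot: a single failed probe does not page anyone. Choose a duration that spans at least two scrapes of the `blackbox` job.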