veza/docs/MONITORING_SETUP.md


# External Uptime Monitoring Setup
This guide describes how to configure external uptime monitoring for the Veza platform, so that the team is notified when a service becomes unavailable.
## Recommended Tools
- **UptimeRobot** (free tier: 50 monitors) — [uptimerobot.com](https://uptimerobot.com)
- **Better Uptime** — [betteruptime.com](https://betteruptime.com)
- **Pingdom** — [pingdom.com](https://pingdom.com)
- **Prometheus Blackbox Exporter** (self-hosted) — if all infra is self-hosted
## Endpoints to Monitor
| Endpoint | Service | Purpose |
|----------|---------|---------|
| `GET /health` or `GET /healthz` | Backend API | Basic liveness |
| `GET /readyz` | Backend API | Readiness (DB, Redis) |
| `GET /api/v1/health` | Backend API | API health (if it differs from the root health endpoint) |
| `GET /health` | Stream Server | Stream service liveness |
| `GET /health` | Chat Server | Chat service liveness |

**Example URLs** (replace with your domain):
- `https://api.veza.com/healthz`
- `https://api.veza.com/readyz`
- `https://api.veza.com/api/v1/health`
- `https://stream.veza.com/health`
- `https://chat.veza.com/health`
## UptimeRobot Configuration
### 1. Create Monitors
1. Log in to [UptimeRobot](https://uptimerobot.com)
2. Add Monitor → HTTP(s)
3. For each endpoint:
   - **Friendly Name**: e.g. "Veza API Health"
   - **URL**: e.g. `https://api.veza.com/healthz`
   - **Monitoring Interval**: 5 minutes
   - **Monitor Type**: HTTP(s)
### 2. Configure Alert Contacts
1. My Settings → Alert Contacts
2. Add Email: your-team@example.com
3. Add Slack (optional): webhook URL for `#alerts` channel
### 3. Alert Settings
- **Default**: Alert when 2 consecutive checks fail
- **Alert frequency**: Every 5 minutes until resolved (or configure as needed)
## Alert Procedure
1. **On failure**: UptimeRobot sends alert to configured contacts
2. **Check**: Visit the UptimeRobot dashboard to see which endpoint failed
3. **Investigate**: Check logs, Prometheus metrics, Grafana
4. **Resolve**: Restart service, fix deployment, or rollback
5. **Post-mortem**: Document root cause and preventive actions
## Checklist
- [ ] Monitors created for all critical endpoints
- [ ] Alert contacts configured (email, Slack)
- [ ] Alert threshold: 2 consecutive failures
- [ ] Monitoring interval: 5 minutes
- [ ] Runbook or escalation path documented
## Integration with Prometheus
If you use Prometheus Blackbox Exporter:
```yaml
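# prometheus.yml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # probe module defined in the exporter's own config
    static_configs:
      - targets:
          - https://api.veza.com/healthz
          - https://api.veza.com/readyz
    relabel_configs:
      # Pass each target URL to the exporter as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label on the resulting metrics
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the Blackbox Exporter itself instead of the target
      - target_label: __address__
        replacement: blackbox-exporter:9115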
```
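The scrape job above refers to a module named `http_2xx`, which is defined in the Blackbox Exporter's own configuration file rather than in `prometheus.yml`. The following is a minimal sketch of such a module. The file name `blackbox.yml`, the 5-second timeout, and the TLS requirement are assumptions for illustration and should be adjusted to your deployment.

```yaml
# blackbox.yml (assumed file name for the exporter config)
modules:
  http_2xx:
    prober: http
    timeout: 5s                # assumption: fail the probe after 5 seconds
    http:
      method: GET
      valid_status_codes: []   # empty list means any 2xx status is accepted
      follow_redirects: true
      fail_if_not_ssl: true    # assumption: all health endpoints are served over HTTPS
```

With a module like this, a probe counts as successful only if the endpoint answers a `GET` over TLS with a 2xx status within the timeout.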
Configure alerts in Grafana or Alertmanager for probe failures, for example when the exporter's `probe_success` metric is 0.
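As a starting point, a Prometheus alerting rule for probe failures could look like the sketch below. The group name, alert name, `severity` label, and the `for: 10m` duration are assumptions. The 10-minute window is chosen only to roughly mirror the "2 consecutive failures at a 5-minute interval" threshold used for UptimeRobot. The rule file must also be listed under `rule_files` in `prometheus.yml`.

```yaml
# blackbox-alerts.yml (hypothetical rule file name)
groups:
  - name: veza-uptime            # assumed group name
    rules:
      - alert: VezaEndpointDown  # assumed alert name
        # probe_success is 0 when the Blackbox Exporter probe fails
        expr: probe_success{job="blackbox"} == 0
        for: 10m                 # assumption: fire only after 10 minutes of failure
        labels:
          severity: critical
        annotations:
          summary: "Probe failed for {{ $labels.instance }}"
          description: "{{ $labels.instance }} has failed blackbox probes for 10 minutes."
```

Because the relabeling in the scrape job copies the probed URL into the `instance` label, the alert text identifies which endpoint failed.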