# External Uptime Monitoring Setup

This guide describes how to configure external uptime monitoring for the Veza platform, so that you are notified when services become unavailable.

## Recommended Tools

- **UptimeRobot** (free tier: 50 monitors), [uptimerobot.com](https://uptimerobot.com)
- **Better Uptime**, [betteruptime.com](https://betteruptime.com)
- **Pingdom**, [pingdom.com](https://pingdom.com)
- **Prometheus Blackbox Exporter** (self-hosted), suitable if all of your infrastructure is self-hosted

## Endpoints to Monitor

| Endpoint | Service | Purpose |
|----------|---------|---------|
| `GET /health` or `GET /healthz` | Backend API | Basic liveness |
| `GET /readyz` | Backend API | Readiness (DB, Redis) |
| `GET /api/v1/health` | Backend API | API health (if different from root) |
| `GET /health` | Stream Server | Stream service liveness |
| `GET /health` | Chat Server | Chat service liveness |

**Example URLs** (replace with your domain):

- `https://api.veza.com/healthz`
- `https://api.veza.com/readyz`
- `https://api.veza.com/api/v1/health`
- `https://stream.veza.com/health`
- `https://chat.veza.com/health`

## UptimeRobot Configuration

### 1. Create Monitors

1. Log in to [UptimeRobot](https://uptimerobot.com)
2. Add Monitor → HTTP(s)
3. For each endpoint:
   - **Friendly Name**: e.g. "Veza API Health"
   - **URL**: e.g. `https://api.veza.com/healthz`
   - **Monitoring Interval**: 5 minutes
   - **Monitor Type**: HTTP(s)

### 2. Configure Alert Contacts

1. My Settings → Alert Contacts
2. Add Email: your-team@example.com
3. Add Slack (optional): webhook URL for `#alerts` channel

### 3. Alert Settings

- **Default**: Alert when 2 consecutive checks fail
- **Alert frequency**: Every 5 minutes until resolved (or configure as needed)

## Alert Procedure

1. **On failure**: UptimeRobot sends an alert to the configured contacts
2. **Check**: Open the UptimeRobot dashboard to see which endpoint failed
3. **Investigate**: Check service logs, Prometheus metrics, and Grafana dashboards
4. **Resolve**: Restart the service, fix the deployment, or roll back
5. **Post-mortem**: Document the root cause and preventive actions

## Checklist

- [ ] Monitors created for all critical endpoints
- [ ] Alert contacts configured (email, Slack)
- [ ] Alert threshold: 2 consecutive failures
- [ ] Monitoring interval: 5 minutes
- [ ] Runbook or escalation path documented

## Integration with Prometheus

If you use Prometheus Blackbox Exporter:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # module name defined in the exporter's own config
    static_configs:
      - targets:
          - https://api.veza.com/healthz
          - https://api.veza.com/readyz
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label on the resulting metrics
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself rather than the target
      - target_label: __address__
        replacement: blackbox-exporter:9115
```
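
The scrape job above refers to a module named `http_2xx`, which must be defined in the Blackbox Exporter's own configuration file. A minimal sketch of such a module follows. The timeout and the TLS and IP protocol settings are assumptions chosen for illustration, not values taken from the Veza deployment.

```yaml
# blackbox.yml (configuration of the exporter, not of Prometheus)
modules:
  http_2xx:
    prober: http
    timeout: 5s  # assumption: adjust to the latency you expect from /healthz
    http:
      method: GET
      # An empty list means "accept any 2xx status", which is the exporter's default
      valid_status_codes: []
      # Assumption: fail probes that are not served over TLS, since the example targets are https://
      fail_if_not_ssl: true
      # Assumption: probe over IPv4
      preferred_ip_protocol: ip4
```

The module name here has to match the `module` parameter in the scrape job, otherwise the exporter rejects the probe request.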

Then configure alerts in Grafana or Alertmanager for probe failures. The exporter reports each probe result through the `probe_success` metric (1 on success, 0 on failure).
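
As a rough sketch of what a probe-failure alert could look like on the Prometheus side, the rule below fires when a probed endpoint keeps failing. The file name, group name, alert name, `severity` label, and the `for` duration are all assumptions; the `job="blackbox"` selector matches the scrape job name used in the example configuration.

```yaml
# alert-rules.yml (hypothetical file name, loaded through rule_files in prometheus.yml)
groups:
  - name: veza-uptime
    rules:
      - alert: EndpointDown
        # probe_success is 0 when the most recent blackbox probe failed
        expr: probe_success{job="blackbox"} == 0
        # Assumption: wait long enough to cover at least two failed probes;
        # tune this to your scrape_interval
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Health check failing for {{ $labels.instance }}"
          description: "The blackbox probe of {{ $labels.instance }} has been failing for more than 2 minutes."
```

The `for` clause plays a similar role to the "2 consecutive failures" threshold used for UptimeRobot: a single failed probe does not page anyone.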