veza/docs/MONITORING_SETUP.md

# External Uptime Monitoring Setup

This guide describes how to configure external uptime monitoring for the Veza platform. Use this to get notified when services become unavailable.

## Recommended Tools

- **UptimeRobot** (free tier: 50 monitors) — [uptimerobot.com](https://uptimerobot.com)
- **Better Uptime** — [betteruptime.com](https://betteruptime.com)
- **Pingdom** — [pingdom.com](https://pingdom.com)
- **Prometheus Blackbox Exporter** (self-hosted) — if all infra is self-hosted

## Endpoints to Monitor

| Endpoint | Service | Purpose |
|----------|---------|---------|
| `GET /health` or `GET /healthz` | Backend API | Basic liveness |
| `GET /readyz` | Backend API | Readiness (DB, Redis) |
| `GET /api/v1/health` | Backend API | API health (if different from root) |
| `GET /health` | Stream Server | Stream service liveness |
| `GET /health` | Chat Server | Chat service liveness |

**Example URLs** (replace with your domain):

- `https://api.veza.com/healthz`
- `https://api.veza.com/readyz`
- `https://api.veza.com/api/v1/health`
- `https://stream.veza.com/health`
- `https://chat.veza.com/health`

## UptimeRobot Configuration

### 1. Create Monitors

1. Log in to [UptimeRobot](https://uptimerobot.com)
2. Add Monitor → HTTP(s)
3. For each endpoint:
   - **Friendly Name**: e.g. "Veza API Health"
   - **URL**: e.g. `https://api.veza.com/healthz`
   - **Monitoring Interval**: 5 minutes
   - **Monitor Type**: HTTP(s)

### 2. Configure Alert Contacts

1. Go to My Settings → Alert Contacts
2. Add an email contact, e.g. `your-team@example.com`
3. Optionally add a Slack contact using the webhook URL for the `#alerts` channel

### 3. Alert Settings

- **Default**: Alert when 2 consecutive checks fail. With a 5-minute interval, an outage is reported roughly 5 to 10 minutes after it starts.
- **Alert frequency**: Repeat every 5 minutes until resolved (or configure as needed)

## Alert Procedure

1. **On failure**: UptimeRobot sends an alert to the configured contacts
2. **Check**: Open the UptimeRobot dashboard to see which endpoint failed
3. **Investigate**: Check service logs, Prometheus metrics, and Grafana dashboards
4. **Resolve**: Restart the service, fix the deployment, or roll back
5. **Post-mortem**: Document the root cause and preventive actions

## Checklist

- [ ] Monitors created for all critical endpoints
- [ ] Alert contacts configured (email, Slack)
- [ ] Alert threshold: 2 consecutive failures
- [ ] Monitoring interval: 5 minutes
- [ ] Runbook or escalation path documented

## Integration with Prometheus

If you use the Prometheus Blackbox Exporter, add a scrape job that probes the health endpoints through the exporter:

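```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe          # Blackbox Exporter probe endpoint
    params:
      module: [http_2xx]          # module must be defined in the exporter config
    static_configs:
      - targets:                  # URLs to probe, not hosts to scrape
        - https://api.veza.com/healthz
        - https://api.veza.com/readyz
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the Blackbox Exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115
```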
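
The `http_2xx` module named in `params` has to exist in the exporter's own configuration. The configuration shipped with the exporter already defines a module with that name. If you maintain your own file, a minimal sketch could look like the following. The file name, the timeout, and the IP protocol preference are assumptions for illustration, not values taken from the Veza deployment.

```yaml
# blackbox.yml (exporter configuration; file name and values are assumptions)
modules:
  http_2xx:
    prober: http
    timeout: 5s                   # assumed; tune to your latency budget
    http:
      method: GET
      valid_status_codes: []      # empty list means any 2xx status passes
      fail_if_not_ssl: true       # the example URLs above are all HTTPS
      preferred_ip_protocol: ip4  # assumed; drop if you probe over IPv6
```

Setting `fail_if_not_ssl` makes a probe fail if an endpoint is ever served over plain HTTP, which may or may not be what you want in staging environments.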

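Define alert rules on the `probe_success` metric (a value of 0 means the probe failed) in Prometheus or Grafana, and route the notifications through Alertmanager or Grafana contact points.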
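
As an illustration, a Prometheus alerting rule for probe failures might look like the sketch below. The rule file name, group name, alert name, `for` duration, and `severity` label are all hypothetical; only the `blackbox` job name comes from the scrape config above, and `probe_success` is the metric the Blackbox Exporter reports for each probe.

```yaml
# veza-uptime.rules.yml (hypothetical file, loaded via rule_files in prometheus.yml)
groups:
  - name: veza-uptime                  # hypothetical group name
    rules:
      - alert: VezaEndpointDown        # hypothetical alert name
        # probe_success is 1 when the probe passed and 0 when it failed
        expr: probe_success{job="blackbox"} == 0
        for: 2m                        # assumed; should span more than one scrape
        labels:
          severity: critical           # assumed label for Alertmanager routing
        annotations:
          summary: "Health probe failing for {{ $labels.instance }}"
```

The `for` clause plays the same role as the "2 consecutive failures" threshold used for UptimeRobot: a single failed probe does not fire the alert. Choose a duration that covers at least two scrapes at your configured scrape interval.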