Three SLOs with multi-window burn-rate alerts (Google SRE workbook methodology):
* SLO_API_AVAILABILITY: 99.5% on read (GET) endpoints
* SLO_API_LATENCY: 99% on writes, p95 < 500ms
* SLO_PAYMENT_SUCCESS: 99.5% on POST /api/v1/orders -> 2xx
Each SLO has two alerts:
* <name>SLOFastBurn — page-grade, 2% of budget burned in 1h (1h+5m windows)
* <name>SLOSlowBurn — ticket-grade, 5% of budget burned in 6h (6h+30m windows)
- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack
+ PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
+ db-failover, redis-down, disk-full, cert-expiring-soon: one stub
per likely page. Each lists first moves under 5 min + common causes.
Acceptance (Day 10): promtool check rules green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
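
The fast-burn pattern above could look roughly like this as a Prometheus rule — a sketch only; the recording-rule name `slo:api_availability:error_ratio_1h` (and its 5m twin) is an assumption, not necessarily the actual name in `config/prometheus/slo.yml`:

```yaml
groups:
  - name: slo-burn-alerts
    rules:
      - alert: ApiAvailabilitySLOFastBurn
        # 2% of a 30-day error budget in 1h => burn rate 14.4
        # (0.02 * 720h / 1h). Both windows must agree (multi-window).
        expr: |
          slo:api_availability:error_ratio_1h > 14.4 * 0.005
          and
          slo:api_availability:error_ratio_5m > 14.4 * 0.005
        labels:
          severity: page
```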
# Runbook — TLS certificate expiring soon

> **Alert**: `CertExpiringSoon` (warning at 30d, critical at 7d).
> **Owner**: infra on-call.

## Cert inventory
| Domain                     | Issuer         | Auto-renew?           | Where it lives                    |
| -------------------------- | -------------- | --------------------- | --------------------------------- |
| `api.veza.fr`              | Let's Encrypt  | Yes (Caddy)           | Caddy data dir on the prod LB     |
| `app.veza.fr`              | Let's Encrypt  | Yes (Caddy)           | Caddy data dir on the prod LB     |
| `staging.veza.fr` + sub    | Let's Encrypt  | Yes (Caddy)           | Staging Caddy                     |
| `*.lxd` (internal)         | self-signed    | No — manually rotated | Each container's `/etc/ssl/...`   |
| `jwt-private.pem` / public | self-generated | No — rotated yearly   | Backend host (mounted via volume) |
| `pgaf-*.veza.lxd`          | self-signed    | No — rotated yearly   | pg_auto_failover pki dir          |

The alert fires for the **public-facing** certs above. Internal `.lxd` certs are tracked separately by a yearly calendar reminder.
## Auto-renewing certs (Let's Encrypt via Caddy)

Caddy renews 30 days before expiry. If the alert fires at 30d, that's the renewal window starting — confirm the renewal is actually happening:

```bash
# On the LB host:
sudo journalctl -u caddy --since "1 day ago" | grep -i "obtain\|renew\|cert"

# Caddy's internal state:
sudo curl -fsS http://localhost:2019/config/ | jq '.apps.tls.automation'
```

If renewal is failing:

1. **Rate-limit**: Let's Encrypt allows 5 failed validations per account, per hostname, per hour. Check the Caddy log for `429 Too Many Requests`.
2. **DNS not pointing here**: `dig +short api.veza.fr` must resolve to this LB.
3. **Port 80 blocked**: the ACME HTTP-01 challenge needs port 80. `sudo ss -lntp | grep ':80'` should show Caddy.
4. **Disk full**: Caddy writes the new cert to disk before swapping it in. See `disk-full.md`.
## Self-signed `.lxd` certs

These rotate on a yearly cadence (calendar reminder, not automated). When the alert fires:

```bash
# Inspect a cert:
echo | openssl s_client -connect minio.lxd:9000 -servername minio.lxd 2>/dev/null | openssl x509 -noout -dates -subject

# Regenerate (one-shot for self-signed CA + leaf):
cd infra/pki/lab
./regenerate-cert.sh minio.lxd

# Then push to the container:
incus file push minio.crt minio:/etc/ssl/certs/minio.crt
incus file push minio.key minio:/etc/ssl/private/minio.key
incus exec minio -- systemctl reload minio
```

(Script TODO — rotation is currently done with manual openssl commands. W4 backlog.)

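
Until the script exists, here is a minimal sketch of what `regenerate-cert.sh` could do — note this produces a plain self-signed leaf, not the CA + leaf chain mentioned above; the host name, key size, and validity period are assumptions:

```bash
# Hypothetical one-shot self-signed cert for an internal .lxd host
# (requires OpenSSL >= 1.1.1 for -addext).
HOST=minio.lxd
NAME=${HOST%%.*}

openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout "${NAME}.key" -out "${NAME}.crt" \
  -days 365 -subj "/CN=${HOST}" \
  -addext "subjectAltName=DNS:${HOST}"

# Sanity-check before pushing to the container:
openssl x509 -in "${NAME}.crt" -noout -dates -subject
```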
## JWT keys

`jwt-private.pem` / `jwt-public.pem` are RSA keys, not X.509. They don't "expire" but are rotated yearly. Procedure:

1. Generate a new pair:

   ```bash
   ./scripts/generate-jwt-keys.sh
   ```

2. Roll the public key first (backend trusts new + old) — current code only loads one; needs a small extension. **Tracked as v1.1 work.**
3. Until that's wired, rotation = a downtime window where every existing access token becomes invalid (the 5 min token lifetime mitigates this).
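
For reference, the key pair in step 1 can be produced with plain OpenSSL — a sketch of what `scripts/generate-jwt-keys.sh` might contain; the 2048-bit key size is an assumption:

```bash
# RSA keypair for JWT signing (RS256-style): private signing key plus
# the public key the backend uses for verification.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out jwt-private.pem
openssl pkey -in jwt-private.pem -pubout -out jwt-public.pem
chmod 600 jwt-private.pem
```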
## After rotation

1. Hit a public endpoint and confirm the new cert is served:

   ```bash
   echo | openssl s_client -connect api.veza.fr:443 2>/dev/null | openssl x509 -noout -dates
   ```

2. The `CertExpiringSoon` alert clears within one Prometheus scrape interval (~30s) once `probe_ssl_earliest_cert_expiry` is updated by blackbox-exporter.
3. If the cert was rotated under fire (renewal hit a wall, manual replacement), file a postmortem with the timeline.
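
For reference, the alert in step 2 could be defined roughly like this — a sketch only; the real rule (names, thresholds, `for:` durations) may differ:

```yaml
- alert: CertExpiringSoon
  # blackbox-exporter publishes the cert's notAfter as a unix timestamp
  expr: probe_ssl_earliest_cert_expiry - time() < 30 * 86400
  labels:
    severity: warning
- alert: CertExpiringSoonCritical
  expr: probe_ssl_earliest_cert_expiry - time() < 7 * 86400
  labels:
    severity: critical
```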
## What CAN break
- Pinned certs in the mobile app (none today, but keep this in mind for v2+).
- Customer integrations that fetched our public key once and cached it — after a JWT key rotation, tokens signed with the new key will fail verification against their cached copy. Until v1.1 we don't promise stable JWT keys to third parties.