senke/veza

senke c78bf1b765

feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)

Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology) :
  * SLO_API_AVAILABILITY  : 99.5% on read (GET) endpoints
  * SLO_API_LATENCY       : 99% writes p95 < 500ms
  * SLO_PAYMENT_SUCCESS   : 99.5% on POST /api/v1/orders -> 2xx

Each SLO has two alerts :
  * <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h+5m windows)
  * <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h+30m)

- config/prometheus/slo.yml : 12 recording rules + 6 alerts ; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml : routing tree splits page-oncall (slack
  + PagerDuty) from ticket-oncall (slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon : one stub
  per likely page. Each lists first moves under 5min + common causes.

Acceptance (Day 10) : promtool check rules is green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 01:30:34 +02:00


Runbook — TLS certificate expiring soon

Alert : CertExpiringSoon (warning at 30d, critical at 7d). Owner : infra on-call.

Cert inventory

| Domain | Issuer | Auto-renew? | Where it lives |
|---|---|---|---|
| `api.veza.fr` | Let's Encrypt | Yes (Caddy) | Caddy data dir on the prod LB |
| `app.veza.fr` | Let's Encrypt | Yes (Caddy) | Caddy data dir on the prod LB |
| `staging.veza.fr` + sub | Let's Encrypt | Yes (Caddy) | Staging Caddy |
| `*.lxd` (internal) | self-signed | No (manually rotated) | Each container's `/etc/ssl/...` |
| `jwt-private.pem` / public | self-generated | No (rotated yearly) | Backend host (mounted via volume) |
| `pgaf-*.veza.lxd` | self-signed | No (rotated yearly) | pg_auto_failover pki dir |

The alert fires for the public-facing certs above. Internal .lxd certs are tracked separately by a yearly calendar reminder.
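
To make the two alert thresholds concrete, here is a minimal sketch that maps a number of days until expiry to the severity the alert would carry. The function name `cert_severity` is hypothetical, and treating the 30d and 7d thresholds as inclusive is an assumption; the real alert rule may compare differently.

```shell
# Map "days until expiry" to an alert severity.
# Thresholds come from the runbook (warning at 30d, critical at 7d).
# Assumption: both thresholds are inclusive (<=).
cert_severity() {
  local days_left=$1
  if [ "$days_left" -le 7 ]; then
    echo critical
  elif [ "$days_left" -le 30 ]; then
    echo warning
  else
    echo ok
  fi
}
```

For example, `cert_severity 12` prints `warning`. An already expired cert (negative input) falls into the `critical` branch.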

Auto-renewing certs (Let's Encrypt via Caddy)

Caddy renews 30 days before expiry. If the alert fires at 30d, that's the renewal window starting — confirm the renewal is happening :

# On the LB host :
sudo journalctl -u caddy --since "1 day ago" | grep -i "obtain\|renew\|cert"

# Caddy's internal state :
sudo curl -fsS http://localhost:2019/config/ | jq '.apps.tls.automation'

If renewal is failing :

- Rate limit : Let's Encrypt allows 5 failed validations per hostname, per account, per hour. Check the Caddy log for 429 Too Many Requests.
- DNS not pointing here : `dig +short api.veza.fr` must return this LB's public IP.
- Port 80 blocked : the ACME HTTP-01 challenge needs inbound port 80. `sudo ss -lntp | grep ':80'` should show Caddy listening.
- Disk full : Caddy writes the new cert to disk before swapping it in. See disk-full.md.
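
Two of the causes above leave a recognisable trace in the log, so a first pass can be scripted. The sketch below reads log text on stdin and prints a likely cause. The function name `classify_renewal_failure` and the match strings are assumptions: `429` is the status mentioned above and `no space left on device` is the usual ENOSPC message, but Caddy's exact log wording is not confirmed here. DNS and port 80 problems still need the manual checks.

```shell
# Read Caddy log text on stdin and print the first likely cause.
# Assumption: the log contains the literal strings matched below.
classify_renewal_failure() {
  local log
  log=$(cat)
  if printf '%s\n' "$log" | grep -qi 'no space left on device'; then
    echo disk-full
  elif printf '%s\n' "$log" | grep -qE '(^|[^0-9])429([^0-9]|$)'; then
    echo rate-limit
  else
    echo unknown
  fi
}
```

On the LB host it could be fed from the journal: `sudo journalctl -u caddy --since "1 day ago" | classify_renewal_failure`. An `unknown` result only means neither pattern matched, not that renewal is healthy.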

Self-signed `.lxd` certs

These rotate on a yearly cadence (calendar reminder, not automated). When the reminder comes due :

# Inspect a cert :
echo | openssl s_client -connect minio.lxd:9000 -servername minio.lxd 2>/dev/null | openssl x509 -noout -dates -subject

# Regenerate (one-shot for self-signed CA + leaf) :
cd infra/pki/lab
./regenerate-cert.sh minio.lxd
# Then push to the container :
incus file push minio.crt minio:/etc/ssl/certs/minio.crt
incus file push minio.key minio:/etc/ssl/private/minio.key
incus exec minio -- systemctl reload minio

(Script TODO : `regenerate-cert.sh` is not written yet, so rotation is currently done by hand with openssl. W4 backlog.)
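
Until the script exists, the manual openssl step could look roughly like the sketch below. Everything in it is an assumption rather than the team's procedure: the function name `make_selfsigned_cert`, RSA 2048, a 365-day lifetime, and the output file names. It also produces a single self-signed certificate, not the CA + leaf pair described above, and it needs OpenSSL 1.1.1 or newer for `-addext`.

```shell
# Generate a one-year self-signed cert and key for an internal host.
# Writes <host>.crt and <host>.key into the directory given as $2.
# Assumptions: RSA 2048, 365 days, single self-signed cert (no CA).
make_selfsigned_cert() {
  local host=$1 outdir=${2:-.}
  openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout "$outdir/$host.key" \
    -out "$outdir/$host.crt" \
    -days 365 \
    -subj "/CN=$host" \
    -addext "subjectAltName=DNS:$host" || return 1
  # Keep the private key readable by its owner only.
  chmod 600 "$outdir/$host.key"
}
```

A call such as `make_selfsigned_cert minio.lxd /tmp/pki` leaves `minio.lxd.crt` and `minio.lxd.key` in `/tmp/pki`; rename them to match the `incus file push` targets before copying them into the container.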

JWT keys

jwt-private.pem / jwt-public.pem are RSA keys, not X.509. They don't "expire" but are rotated yearly. Procedure :

1. Generate a new pair :
```
./scripts/generate-jwt-keys.sh
```
2. Roll the public key first, so the backend trusts both the new and the old key. The current code loads only one key, so this needs a small extension (tracked as v1.1 work).
3. Until that is wired in, rotation means a short disruption window : every existing access token becomes invalid at once. The 5-minute token lifetime limits the impact.
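
For readers without access to the repository, the key generation step might amount to something like the sketch below. It is not the content of `./scripts/generate-jwt-keys.sh`, which may differ; the function name `generate_jwt_keypair` and the 2048-bit key size are assumptions, and only the two file names come from this runbook.

```shell
# Generate an RSA key pair in PEM form under the directory given as $1.
# Assumption: 2048-bit keys; the real script may choose otherwise.
generate_jwt_keypair() {
  local outdir=${1:-.}
  # Create the private key with owner-only permissions.
  ( umask 077 && openssl genrsa -out "$outdir/jwt-private.pem" 2048 ) || return 1
  # Derive the matching public key.
  openssl rsa -in "$outdir/jwt-private.pem" -pubout -out "$outdir/jwt-public.pem"
}
```

Running `generate_jwt_keypair ./new-keys` into a fresh directory keeps the old pair untouched until the swap is scheduled.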

After rotation

Hit a public endpoint and confirm the new cert is served :

echo | openssl s_client -connect api.veza.fr:443 2>/dev/null | openssl x509 -noout -dates

The CertExpiringSoon alert clears once blackbox-exporter reports the new probe_ssl_earliest_cert_expiry value, typically within one Prometheus scrape interval (~30s) plus the next rule evaluation.
If the cert was rotated under fire (renewal hit a wall, manual replacement), file a postmortem with the timeline.
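
probe_ssl_earliest_cert_expiry is a Unix timestamp in seconds, so a raw value copied from Prometheus can be turned into days remaining with a little arithmetic. The sketch below assumes the value is passed as a plain integer (not scientific notation); the function name `days_until_expiry` and the optional second argument are hypothetical and exist only to make the calculation easy to check.

```shell
# Whole days between "now" and an expiry given as a Unix timestamp,
# which is the form probe_ssl_earliest_cert_expiry reports.
# $2 (current time as a Unix timestamp) is optional, defaults to now.
days_until_expiry() {
  local expiry_epoch=$1 now_epoch=${2:-$(date +%s)}
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}
```

After a successful rotation the result should jump from a small number back to roughly the full lifetime of the new cert.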

What CAN break

- Pinned certs in the mobile app (none today, but keep this in mind for v2+).
- Customer integrations that fetched our JWT public key once and cached it : after rotation, tokens signed with the new private key will fail verification against the cached key. Until v1.1 we don't promise stable JWT keys to third parties.
