veza/docs/runbooks/cert-expiring-soon.md

# Runbook — TLS certificate expiring soon

> **Alert** : `CertExpiringSoon` (warning at 30d, critical at 7d).
> **Owner** : infra on-call.

## Cert inventory

| Domain                     | Issuer            | Auto-renew ?           | Where it lives                           |
| -------------------------- | ----------------- | ---------------------- | ---------------------------------------- |
| `api.veza.fr`              | Let's Encrypt     | Yes (Caddy)            | Caddy data dir on the prod LB            |
| `app.veza.fr`              | Let's Encrypt     | Yes (Caddy)            | Caddy data dir on the prod LB            |
| `staging.veza.fr` + sub    | Let's Encrypt     | Yes (Caddy)            | Staging Caddy                            |
| `*.lxd` (internal)         | self-signed       | No — manually rotated  | Each container's `/etc/ssl/...`          |
| `jwt-private.pem` / public | self-generated    | No — rotated yearly    | Backend host (mounted via volume)        |
| `pgaf-*.veza.lxd`          | self-signed       | No — rotated yearly    | pg_auto_failover pki dir                 |

The alert fires for the **public-facing** certs above. Internal `.lxd` certs are tracked separately by a yearly calendar reminder.
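
To see by hand where a certificate stands against the alert thresholds, `openssl x509 -checkend` is enough. The sketch below mirrors the 30d / 7d thresholds locally ; `cert_severity` is a hypothetical helper for this runbook, not part of the alerting stack, and the `/tmp/api.pem` path in the usage comment is only an example.

```bash
# Hypothetical helper : classify a PEM certificate against the alert
# thresholds (warning at 30d, critical at 7d).
cert_severity() {
  local pem="$1"
  # -checkend N exits non-zero when the cert expires within N seconds.
  # An unreadable or invalid file also exits non-zero, so it reports "critical".
  if ! openssl x509 -in "$pem" -noout -checkend $((7 * 86400)) >/dev/null 2>&1; then
    echo critical
  elif ! openssl x509 -in "$pem" -noout -checkend $((30 * 86400)) >/dev/null 2>&1; then
    echo warning
  else
    echo ok
  fi
}

# Usage against a live endpoint (needs network access) :
#   echo | openssl s_client -connect api.veza.fr:443 -servername api.veza.fr 2>/dev/null \
#     | openssl x509 > /tmp/api.pem
#   cert_severity /tmp/api.pem
```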

## Auto-renewing certs (Let's Encrypt via Caddy)

Caddy renews about 30 days before expiry, so a warning at 30d usually just marks the start of the renewal window. Confirm that the renewal is actually happening :

```bash
# On the LB host :
sudo journalctl -u caddy --since "1 day ago" | grep -i "obtain\|renew\|cert"

# Caddy's internal state :
sudo curl -fsS http://localhost:2019/config/ | jq '.apps.tls.automation'
```

If renewal is failing :

1. **Rate-limit** : Let's Encrypt allows 5 failed validations per hostname, per account, per hour. Check the Caddy log for `429 Too Many Requests`.
2. **DNS not pointing here** : `dig +short api.veza.fr` must point at this LB.
3. **Port 80 blocked** : the ACME HTTP-01 challenge needs port 80. `sudo ss -lntp | grep ':80 '` should show Caddy (the trailing space keeps `:8080` and similar ports out of the match).
4. **Disk full** : Caddy writes the new cert to disk before swapping. See `disk-full.md`.
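
The first two checks above can be scripted roughly as follows. This is a sketch under assumptions : the helper names are hypothetical, the expected IP must be supplied by whoever runs it (the LB address is not recorded in this runbook), and the `rateLimited` pattern is the generic ACME error type from RFC 8555, not a confirmed Caddy log format.

```bash
# Succeeds if DOMAIN has an A record equal to EXPECTED_IP.
dns_points_here() {
  local domain="$1" expected_ip="$2"
  dig +short "$domain" A | grep -Fxq -- "$expected_ip"
}

# Counts log lines on stdin that look like a rate-limit response.
# grep -c exits 1 when the count is 0, hence the "|| true".
count_rate_limited() {
  grep -ciE 'too many requests|ratelimited' || true
}

# Usage on the LB host (replace <LB_IP> with the real address) :
#   dns_points_here api.veza.fr <LB_IP> || echo "DNS does not point at this LB"
#   sudo journalctl -u caddy --since "1 day ago" | count_rate_limited
```

A non-zero count from `count_rate_limited` suggests waiting out the rate-limit window before forcing another attempt.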

## Self-signed `.lxd` certs

These rotate on a yearly cadence (calendar reminder, not automated ; they are not covered by `CertExpiringSoon`). When the reminder comes due :

```bash
# Inspect a cert :
echo | openssl s_client -connect minio.lxd:9000 -servername minio.lxd 2>/dev/null | openssl x509 -noout -dates -subject

# Regenerate (one-shot for self-signed CA + leaf ; planned script, not written yet) :
cd infra/pki/lab
./regenerate-cert.sh minio.lxd
# Then push to the container :
incus file push minio.crt minio:/etc/ssl/certs/minio.crt
incus file push minio.key minio:/etc/ssl/private/minio.key
incus exec minio -- systemctl reload minio
```

(`regenerate-cert.sh` does not exist yet : rotation is currently done by hand with openssl. The script is tracked in the W4 backlog.)
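
Until the script exists, the manual openssl step might look like the sketch below. It rests on several assumptions : it produces a single self-signed leaf (no separate CA, unlike the planned script), and the function name `make_self_signed`, the RSA 2048 key size and the 365-day default are illustrative choices, not the documented procedure. `-addext` needs OpenSSL 1.1.1 or later.

```bash
# Hypothetical helper : write <name>.crt and <name>.key for HOST into OUTDIR,
# where <name> is the first DNS label (minio.lxd -> minio.crt / minio.key).
make_self_signed() {
  local host="$1" days="${2:-365}" outdir="${3:-.}"
  local name="${host%%.*}"
  openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout "$outdir/$name.key" -out "$outdir/$name.crt" \
    -days "$days" -subj "/CN=$host" \
    -addext "subjectAltName=DNS:$host" || return 1
  # Keep the private key readable by its owner only.
  chmod 600 "$outdir/$name.key"
}

# Usage :
#   make_self_signed minio.lxd 365 .
#   openssl x509 -in minio.crt -noout -dates -subject
```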

## JWT keys

`jwt-private.pem` / `jwt-public.pem` are RSA keys, not X.509. They don't "expire" but are rotated yearly. Procedure :

1. Generate a new pair :
   ```bash
   ./scripts/generate-jwt-keys.sh
   ```
2. Roll the public key first, so that the backend trusts both the new and the old key. The current code loads only one key, so this needs a small extension. **Tracked as v1.1 work.**
3. Until that is wired in, rotation means a downtime window in which every existing access token becomes invalid. The 5-minute token lifetime limits the impact.
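
The contents of `./scripts/generate-jwt-keys.sh` are not shown in this runbook. Purely as an illustration, generating an RSA pair with openssl could look like the sketch below ; the function names and the 2048-bit default are assumptions, and the real script may differ. The fingerprint helper gives a quick way to tell the old and new public keys apart during a rotation.

```bash
# Hypothetical helper : write jwt-private.pem / jwt-public.pem into OUTDIR.
# Use a fresh directory so the keys currently in service are not overwritten.
generate_jwt_pair() {
  local outdir="${1:-.}" bits="${2:-2048}"
  openssl genpkey -algorithm RSA -pkeyopt "rsa_keygen_bits:$bits" \
    -out "$outdir/jwt-private.pem" 2>/dev/null || return 1
  chmod 600 "$outdir/jwt-private.pem"
  # Derive the public half from the private key.
  openssl pkey -in "$outdir/jwt-private.pem" -pubout -out "$outdir/jwt-public.pem"
}

# SHA-256 digest of a public key file (computed over its DER encoding).
pubkey_fingerprint() {
  openssl pkey -pubin -in "$1" -outform DER | openssl dgst -sha256 -r | cut -d' ' -f1
}

# Usage :
#   mkdir -p /tmp/jwt-new && generate_jwt_pair /tmp/jwt-new
#   pubkey_fingerprint /tmp/jwt-new/jwt-public.pem
```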

## After rotation

1. Hit a public endpoint and confirm the new cert is served :
   ```bash
   echo | openssl s_client -connect api.veza.fr:443 2>/dev/null | openssl x509 -noout -dates
   ```
2. The `CertExpiringSoon` alert should clear within roughly one Prometheus scrape interval (~30s) after blackbox-exporter's next probe updates `probe_ssl_earliest_cert_expiry`, plus the delay until the alert rule is next evaluated.
3. If the cert was rotated under fire (renewal hit a wall, manual replacement), file a postmortem with the timeline.
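
To sanity-check the value behind the alert, the metric can be turned into days remaining. The sketch assumes the alert rule compares `probe_ssl_earliest_cert_expiry` (a Unix timestamp) with the current time ; the rule itself is not shown in this runbook, and `days_until_expiry` is a hypothetical helper.

```bash
# Whole days between NOW (default : current time) and an expiry given as a
# Unix timestamp. awk is used so that values in scientific notation, as they
# appear in the exporter's text output, are parsed too.
days_until_expiry() {
  local expiry="$1" now="${2:-$(date +%s)}"
  awk -v e="$expiry" -v n="$now" 'BEGIN { printf "%d\n", (e - n) / 86400 }'
}

# Usage : pass the metric value read from Prometheus or the exporter.
#   days_until_expiry 1.7608e+09
```

A result above 30 after rotation means the warning threshold is no longer met.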

## What CAN break

- Pinned certs in the mobile app (none today, but keep this in mind for v2+).
- Customer integrations that fetched our public key once and cached it : after a JWT key rotation, tokens signed with the new private key fail verification against their cached copy. Until v1.1 we don't promise stable JWT keys to third parties.