feat(observability): SLO burn-rate alerts + 7 runbook stubs (W2 Day 10)
Some checks failed
Veza CI / Rust (Stream Server) (push) Successful in 5m4s
Security Scan / Secret Scanning (gitleaks) (push) Failing after 42s
Veza CI / Backend (Go) (push) Failing after 15m45s
Veza CI / Frontend (Web) (push) Successful in 18m7s
Veza CI / Notify on failure (push) Successful in 6s
E2E Playwright / e2e (full) (push) Successful in 24m9s
Three SLOs with multi-window burn-rate alerts (Google SRE workbook
methodology):
* SLO_API_AVAILABILITY: 99.5% non-5xx on read (GET) endpoints
* SLO_API_LATENCY: 99% of write requests complete in < 500ms
* SLO_PAYMENT_SUCCESS: 99.5% of POST /api/v1/orders -> 2xx
Each SLO has two alerts:
* <name>SLOFastBurn — page-grade, 2% budget burned in 1h (1h + 5m windows)
* <name>SLOSlowBurn — ticket-grade, 5% budget burned in 6h (6h + 30m windows)
- config/prometheus/slo.yml: 12 recording rules + 6 alerts; promtool
  check rules => SUCCESS: 18 rules found.
- config/alertmanager/routes.yml: routing tree splits page-oncall (Slack
  + PagerDuty) from ticket-oncall (Slack only).
- docs/runbooks/{api-availability,api-latency,payment-success}-slo-burn.md
  + db-failover, redis-down, disk-full, cert-expiring-soon: one stub
  per likely page. Each lists the first moves (under 5 min) + common causes.
Acceptance (Day 10): promtool check rules is green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
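To reproduce the acceptance check locally, a minimal sketch (assumes promtool is installed and the command is run from the repository root):

```bash
# Validates the syntax of the new SLO rules file and reports the rule count.
promtool check rules config/prometheus/slo.yml
```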
parent 84e92a75e2 · commit c78bf1b765
9 changed files with 909 additions and 0 deletions
96  config/alertmanager/routes.yml  Normal file
@@ -0,0 +1,96 @@
# Alertmanager routing tree (v1.0.9 W2 Day 10).
#
# Two channels:
#   * page-oncall   — Slack #alerts-page + PagerDuty bridge. Wakes
#                     someone up. Reserved for severity=critical AND
#                     page=true.
#   * ticket-oncall — Slack #alerts-ticket. Files a ticket; investigate
#                     during business hours.
#
# Routing key = labels on the alert. The SLO rules in slo.yml set
# `page: "true"` on fast-burn alerts and `page: "false"` on slow-burn,
# so the burn-rate methodology and the routing tree stay coupled.
#
# This file is meant to be merged into the main alertmanager.yml
# (or stitched in via -config-file overrides). Keeping it separate
# makes it easy to diff and review the routing logic without
# touching receiver credentials.
route:
  receiver: 'slack-default'
  group_by: ['alertname', 'job', 'slo']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Page-grade: critical + explicitly tagged page=true.
    - matchers:
        - severity = critical
        - page = "true"
      receiver: page-oncall
      group_wait: 10s        # page faster than the default 30s
      repeat_interval: 30m   # keep paging until ack'd
      continue: false

    # Ticket-grade: warning OR critical-without-page.
    - matchers:
        - page = "false"
      receiver: ticket-oncall
      group_wait: 1m
      repeat_interval: 12h
      continue: false

    # Fallback: critical alerts without a page=… label still go to
    # page-oncall. Better to wake someone up for an unlabelled critical
    # than to silently route it to ticket.
    - matchers:
        - severity = critical
      receiver: page-oncall
      continue: false

receivers:
  - name: page-oncall
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL_PAGE}'
        channel: '#alerts-page'
        send_resolved: true
        title: '🚨 PAGE: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Severity:* {{ .Labels.severity }}
          *SLO:* {{ .Labels.slo | default "n/a" }}
          *Description:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}
    # PagerDuty integration — populate routing_key from
    # ${PAGERDUTY_ROUTING_KEY} once an account is provisioned. Until
    # then the Slack channel is the only page surface.
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'
        severity: '{{ .CommonLabels.severity }}'
        send_resolved: true

  - name: ticket-oncall
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL_TICKET}'
        channel: '#alerts-ticket'
        send_resolved: true
        title: '🎫 TICKET: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Severity:* {{ .Labels.severity }}
          *SLO:* {{ .Labels.slo | default "n/a" }}
          *Description:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}

  # slack-default kept as a no-op fallback so a misconfigured alert
  # that escapes the route tree still hits a receiver — Alertmanager
  # logs noise instead of silently dropping the alert.
  - name: slack-default
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
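A minimal sketch for checking the routing file as well, assuming amtool is installed, this file has been merged into the main alertmanager.yml (path assumed below), and the `${...}` placeholders have been substituted (amtool rejects non-URL api_url values):

```bash
# Validate the merged Alertmanager configuration before deploying it.
amtool check-config config/alertmanager/alertmanager.yml
```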
224  config/prometheus/slo.yml  Normal file
@@ -0,0 +1,224 @@
# SLO definitions + multi-window burn-rate alerts (v1.0.9 W2 Day 10).
#
# Three SLOs, each with two alerts:
#   * <name>SLOFastBurn — page-grade: 2% of monthly error budget burned
#     in 1h. Wakes someone up.
#   * <name>SLOSlowBurn — ticket-grade: 5% burned in 6h. Files a ticket.
#
# Multi-window methodology from the Google SRE workbook:
#   - "Fast burn" = burn_rate > 14.4 over 1h (and confirmation over 5m)
#   - "Slow burn" = burn_rate > 6 over 6h (and confirmation over 30m)
#
# burn_rate = error_rate / (1 - SLO_target).
# For SLO=99.5%: budget=0.5%; burn_rate=14.4 ⇒ error_rate=7.2%.
# For SLO=99.0% (latency): budget=1.0%; burn_rate=14.4 ⇒ error_rate=14.4%.
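# Where the multipliers come from (sanity check, assuming a 30-day ≈ 720h window):
#   fast burn: 2% of budget in 1h ⇒ burn_rate = 0.02 × 720h / 1h = 14.4
#   slow burn: 5% of budget in 6h ⇒ burn_rate = 0.05 × 720h / 6h = 6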
#
# All alerts carry a `runbook_url` annotation pointing into
# docs/runbooks/<alert-name>.md. Add the alert to that file's "What
# tripped me" section if you wake up to one.
groups:
  # ----------------------------------------------------------------------
  # SLO 1 — API availability (read endpoints)
  #   target: 99.5% of read requests return non-5xx
  #   scope: /api/v1/health + GET endpoints (path label-matched)
  # ----------------------------------------------------------------------
  - name: veza_slo_api_availability
    interval: 30s
    rules:
      # Recording rules — express the SLO as ratios so alerts stay
      # readable. Numerator = requests that violated the SLO (5xx).
      # Denominator = total in-scope requests. The window is baked
      # into the rule name so the alert expressions stay declarative.
      - record: veza:slo_api_availability:burnrate_5m
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
      - record: veza:slo_api_availability:burnrate_30m
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[30m]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[30m]))
      - record: veza:slo_api_availability:burnrate_1h
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[1h]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[1h]))
      - record: veza:slo_api_availability:burnrate_6h
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[6h]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[6h]))

      # Page — 2% of the monthly budget burned in 1h.
      - alert: APIAvailabilitySLOFastBurn
        expr: |
          (veza:slo_api_availability:burnrate_1h > (14.4 * 0.005))
          and
          (veza:slo_api_availability:burnrate_5m > (14.4 * 0.005))
        for: 2m
        labels:
          severity: critical
          slo: api_availability
          page: "true"
        annotations:
          summary: "API availability SLO: fast burn (2% budget in 1h)"
          description: |
            5xx rate on read endpoints is burning the monthly error budget at
            14.4× the steady-state rate. At this pace the entire monthly
            budget is consumed in ~2 days. Investigate now.
          runbook_url: "https://docs.veza.fr/runbooks/api-availability-slo-burn"

      # Ticket — 5% of monthly budget burned in 6h. Slower, but signals
      # a degradation that won't fix itself.
      - alert: APIAvailabilitySLOSlowBurn
        expr: |
          (veza:slo_api_availability:burnrate_6h > (6 * 0.005))
          and
          (veza:slo_api_availability:burnrate_30m > (6 * 0.005))
        for: 15m
        labels:
          severity: warning
          slo: api_availability
          page: "false"
        annotations:
          summary: "API availability SLO: slow burn (5% budget in 6h)"
          description: |
            Read endpoints are degrading slowly — the 6h burn rate is 6× the
            steady-state pace. File a ticket to investigate during business hours.
          runbook_url: "https://docs.veza.fr/runbooks/api-availability-slo-burn"

  # ----------------------------------------------------------------------
  # SLO 2 — API latency (write endpoints)
  #   target: 99% of write requests complete in < 500ms
  #   scope: POST/PUT/PATCH/DELETE on /api/v1/*
  # ----------------------------------------------------------------------
  - name: veza_slo_api_latency
    interval: 30s
    rules:
      # "Bad" = requests slower than 500ms. We compute the ratio of slow
      # requests to total directly from the histogram bucket boundaries.
      # The 0.5s bucket is le="0.5" — total minus that bucket = slow.
      - record: veza:slo_api_latency:slow_ratio_5m
        expr: |
          1 - (
            sum(rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE",le="0.5"}[5m]))
            /
            sum(rate(veza_gin_http_request_duration_seconds_count{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[5m]))
          )
      - record: veza:slo_api_latency:slow_ratio_30m
        expr: |
          1 - (
            sum(rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE",le="0.5"}[30m]))
            /
            sum(rate(veza_gin_http_request_duration_seconds_count{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[30m]))
          )
      - record: veza:slo_api_latency:slow_ratio_1h
        expr: |
          1 - (
            sum(rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE",le="0.5"}[1h]))
            /
            sum(rate(veza_gin_http_request_duration_seconds_count{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[1h]))
          )
      - record: veza:slo_api_latency:slow_ratio_6h
        expr: |
          1 - (
            sum(rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE",le="0.5"}[6h]))
            /
            sum(rate(veza_gin_http_request_duration_seconds_count{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[6h]))
          )

      - alert: APILatencySLOFastBurn
        expr: |
          (veza:slo_api_latency:slow_ratio_1h > (14.4 * 0.01))
          and
          (veza:slo_api_latency:slow_ratio_5m > (14.4 * 0.01))
        for: 2m
        labels:
          severity: critical
          slo: api_latency
          page: "true"
        annotations:
          summary: "API latency SLO: fast burn (writes slow)"
          description: |
            More than 14.4% of write requests are taking > 500ms over the
            last hour. Likely DB contention, a lock storm, or a slow query.
          runbook_url: "https://docs.veza.fr/runbooks/api-latency-slo-burn"

      - alert: APILatencySLOSlowBurn
        expr: |
          (veza:slo_api_latency:slow_ratio_6h > (6 * 0.01))
          and
          (veza:slo_api_latency:slow_ratio_30m > (6 * 0.01))
        for: 15m
        labels:
          severity: warning
          slo: api_latency
          page: "false"
        annotations:
          summary: "API latency SLO: slow burn (writes slow)"
          description: |
            6h slow ratio is above 6%. Investigate during business hours.
          runbook_url: "https://docs.veza.fr/runbooks/api-latency-slo-burn"

  # ----------------------------------------------------------------------
  # SLO 3 — Payment success (POST /api/v1/orders → 2xx)
  #   target: 99.5% of order POSTs return 2xx
  # ----------------------------------------------------------------------
  - name: veza_slo_payment_success
    interval: 30s
    rules:
      - record: veza:slo_payment_success:burnrate_5m
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*",status!~"2.."}[5m]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[5m]))
      - record: veza:slo_payment_success:burnrate_30m
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*",status!~"2.."}[30m]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[30m]))
      - record: veza:slo_payment_success:burnrate_1h
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*",status!~"2.."}[1h]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[1h]))
      - record: veza:slo_payment_success:burnrate_6h
        expr: |
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*",status!~"2.."}[6h]))
          /
          sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[6h]))

      - alert: PaymentSuccessSLOFastBurn
        expr: |
          (veza:slo_payment_success:burnrate_1h > (14.4 * 0.005))
          and
          (veza:slo_payment_success:burnrate_5m > (14.4 * 0.005))
        for: 2m
        labels:
          severity: critical
          slo: payment_success
          page: "true"
        annotations:
          summary: "Payment success SLO: fast burn (orders failing)"
          description: |
            > 7% of POST /api/v1/orders are non-2xx in the last hour.
            Hyperswitch or marketplace pipeline failure — every minute
            is a customer charged but no license issued (or vice versa).
          runbook_url: "https://docs.veza.fr/runbooks/payment-success-slo-burn"

      - alert: PaymentSuccessSLOSlowBurn
        expr: |
          (veza:slo_payment_success:burnrate_6h > (6 * 0.005))
          and
          (veza:slo_payment_success:burnrate_30m > (6 * 0.005))
        for: 15m
        labels:
          severity: warning
          slo: payment_success
          page: "false"
        annotations:
          summary: "Payment success SLO: slow burn (orders failing)"
          description: |
            6h burn rate is > 6× — file a ticket, investigate during business hours.
          runbook_url: "https://docs.veza.fr/runbooks/payment-success-slo-burn"
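Once these rules are loaded, the recorded series can be spot-checked directly. A minimal sketch, assuming Prometheus is reachable on localhost:9090 (adjust the host for your environment):

```bash
# Ask Prometheus whether the availability fast-burn condition currently holds.
curl -fsS 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=veza:slo_api_availability:burnrate_1h > (14.4 * 0.005)'
```

An empty result vector means the condition is not met.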
62  docs/runbooks/api-availability-slo-burn.md  Normal file
@@ -0,0 +1,62 @@
# Runbook — API availability SLO burn

> **SLO**: 99.5% of GET requests on `/api/v1/*` return non-5xx (monthly window).
> **Alerts**: `APIAvailabilitySLOFastBurn` (page) · `APIAvailabilitySLOSlowBurn` (ticket)
> **Owner**: backend on-call.

## What tripped me

The 5xx ratio on read endpoints is consuming the monthly error budget faster than the steady-state rate allows:

- **Fast burn** (`page=true`): 14.4× over 1h ⇒ the entire monthly budget gone in ~2 days.
- **Slow burn** (`page=false`): 6× over 6h ⇒ the entire budget gone in ~5 days.

## First moves (under 5 minutes)

1. **Confirm the alert is real**, not a metric-pipeline glitch:
   ```promql
   # Live error rate on the GETs we measure:
   sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m]))
   /
   sum(rate(veza_gin_http_requests_total{job="veza-backend",method="GET"}[5m]))
   ```
   Open Grafana → "Veza API Overview" dashboard, panel "Request rate by path".

2. **Identify the affected endpoint**. The fastest pivot:
   ```promql
   topk(5, sum by (path, status) (
     rate(veza_gin_http_requests_total{job="veza-backend",method="GET",status=~"5.."}[5m])
   ))
   ```

3. **Drop into traces**. Open the "Veza Service Map (Tempo)" dashboard and filter the slowest-spans table for the offending path. If the failures correlate with one downstream (Redis, Postgres, Hyperswitch), the trace will show it.

## Common causes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 5xx concentrated on `/feed`, `/library` | Postgres slow / connection pool exhausted | See `db-failover.md` — check `pg_auto_failover` state |
| 5xx concentrated on `/search`, `/tracks` | Postgres FTS index churn or autovacuum holding row locks | `SELECT pid, query FROM pg_stat_activity WHERE state='active' ORDER BY xact_start LIMIT 5;` |
| 5xx across all paths, sudden | Pod just restarted / migration broken / DB unreachable | `kubectl get pods -n veza` or `systemctl status veza-backend-api` |
| 5xx slowly climbing | Memory leak; container approaching OOMKill | `kubectl top pod -n veza` and bounce the leaking pod |
| 5xx confined to one instance | Single bad replica (config, certs, networking) | Drain that instance from the load balancer |

## If the page is real

1. **Page the secondary on-call** if the primary fix is going to take > 15 min.
2. **Update the status page** (`status.veza.fr`) with "Investigating elevated error rates."
3. **Post in #incident-response** with the alert link + first hypothesis.

## When to silence

- The confirmed degradation is a maintenance window that was already announced: silence for the maintenance window's duration.
- Single-instance issue and that instance has been drained: silence for 1h.
- Otherwise, **do not silence** — let the alert keep firing until the burn rate drops below threshold naturally.

## Recovery verification

After mitigation, both burn-rate windows must drop below threshold for the alert to clear (1h and 5m for fast burn, 6h and 30m for slow burn). The 6h window means the slow-burn alert can keep firing for hours after the issue is fixed — don't be surprised.
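One way to check both fast-burn windows at once, using the recording rules added in `config/prometheus/slo.yml` (a sketch; thresholds as defined there):

```promql
# Both clear conditions must hold for APIAvailabilitySLOFastBurn to resolve.
(veza:slo_api_availability:burnrate_1h < (14.4 * 0.005))
and
(veza:slo_api_availability:burnrate_5m < (14.4 * 0.005))
```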

## Postmortem trigger

A page-grade alert that fires for > 15 minutes triggers a postmortem doc (`docs/postmortems/YYYY-MM-DD-<slug>.md`). Include the timeline, the trace IDs, and the metric query screenshots.
47  docs/runbooks/api-latency-slo-burn.md  Normal file
@@ -0,0 +1,47 @@
# Runbook — API latency SLO burn

> **SLO**: 99% of write requests (POST/PUT/PATCH/DELETE) complete in < 500ms (monthly window).
> **Alerts**: `APILatencySLOFastBurn` (page) · `APILatencySLOSlowBurn` (ticket)
> **Owner**: backend on-call.

## What tripped me

Too many writes are taking longer than 500ms. The fast burn fires when > 14.4% of writes are slow over 1h.

## First moves (under 5 minutes)

1. **Identify the slow endpoints**:
   ```promql
   topk(5, histogram_quantile(0.95,
     sum by (path, le) (rate(veza_gin_http_request_duration_seconds_bucket{job="veza-backend",method=~"POST|PUT|PATCH|DELETE"}[5m]))
   ))
   ```
2. **Open the Tempo service-map dashboard** ("Veza Service Map (Tempo)") and check the slow-spans table for the same paths.

## Common causes

| Symptom | Likely cause | Pointer |
| --- | --- | --- |
| Slow on `/api/v1/orders` (POST) | Hyperswitch upstream latency | `payment-success-slo-burn.md` |
| Slow on `/api/v1/tracks` (POST) | S3 multipart pre-sign / commit latency | Check MinIO health |
| Slow across all writes | Postgres lock contention / autovacuum | `db-failover.md` §autovacuum |
| Slow only on one host | One bad replica (CPU starvation, disk) | Drain & investigate |
| Slow + DB pool exhausted in logs | A slow query holding the pool | `db-failover.md` §pool |

## Mitigation

- If Hyperswitch: nothing to do but wait + a status-page banner.
- If DB lock contention: identify the blocker with `pg_blocking_pids()` (see the query sketch after this list), then cancel the offender:
  ```sql
  SELECT pg_cancel_backend(pid) FROM pg_stat_activity
  WHERE state = 'active' AND xact_start < now() - INTERVAL '30 seconds';
  ```
- If a single bad replica: drain it from the LB and investigate offline.
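To see who is blocking whom before cancelling anything, a sketch using standard `pg_stat_activity` columns (PostgreSQL 9.6+):

```sql
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       now() - xact_start    AS xact_age,
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
ORDER BY xact_age DESC;
```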

## Recovery

The slow-burn alert can take 6h to clear after a fix. Don't silence — let it ride down.

## Postmortem trigger

Same threshold as the availability runbook — a fast burn firing for > 15 min triggers a postmortem.
80  docs/runbooks/cert-expiring-soon.md  Normal file
@@ -0,0 +1,80 @@
# Runbook — TLS certificate expiring soon

> **Alert**: `CertExpiringSoon` (warning at 30d, critical at 7d).
> **Owner**: infra on-call.
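The alert itself lives in the existing `alert_rules.yml`; its condition is roughly the following (a sketch, assuming the blackbox-exporter TLS probes that feed `probe_ssl_earliest_cert_expiry`):

```promql
# Days until the probed certificate expires; warning under 30, critical under 7.
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
```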

## Cert inventory

| Domain | Issuer | Auto-renew? | Where it lives |
| --- | --- | --- | --- |
| `api.veza.fr` | Let's Encrypt | Yes (Caddy) | Caddy data dir on the prod LB |
| `app.veza.fr` | Let's Encrypt | Yes (Caddy) | Caddy data dir on the prod LB |
| `staging.veza.fr` + sub | Let's Encrypt | Yes (Caddy) | Staging Caddy |
| `*.lxd` (internal) | self-signed | No — manually rotated | Each container's `/etc/ssl/...` |
| `jwt-private.pem` / public | self-generated | No — rotated yearly | Backend host (mounted via volume) |
| `pgaf-*.veza.lxd` | self-signed | No — rotated yearly | pg_auto_failover pki dir |

The alert fires for the **public-facing** certs above. Internal `.lxd` certs are tracked separately by a yearly calendar reminder.

## Auto-renewing certs (Let's Encrypt via Caddy)

Caddy renews 30 days before expiry. If the alert fires at 30d, that's the renewal window starting — confirm the renewal is happening:

```bash
# On the LB host:
sudo journalctl -u caddy --since "1 day ago" | grep -i "obtain\|renew\|cert"

# Caddy's internal state:
sudo curl -fsS http://localhost:2019/config/ | jq '.apps.tls.automation'
```

If renewal is failing:

1. **Rate limit**: Let's Encrypt allows 5 failed validations per account, per hostname, per hour. Check the Caddy log for `429 Too Many Requests`.
2. **DNS not pointing here**: `dig +short api.veza.fr` must point at this LB.
3. **Port 80 blocked**: the ACME HTTP-01 challenge needs port 80. `sudo ss -lntp | grep ':80'` should show Caddy.
4. **Disk full**: Caddy writes the new cert to disk before swapping. See `disk-full.md`.

## Self-signed `.lxd` certs

These rotate on a yearly cadence (calendar reminder, not automated). When the alert fires:

```bash
# Inspect a cert:
echo | openssl s_client -connect minio.lxd:9000 -servername minio.lxd 2>/dev/null | openssl x509 -noout -dates -subject

# Regenerate (one-shot for self-signed CA + leaf):
cd infra/pki/lab
./regenerate-cert.sh minio.lxd
# Then push to the container:
incus file push minio.crt minio:/etc/ssl/certs/minio.crt
incus file push minio.key minio:/etc/ssl/private/minio.key
incus exec minio -- systemctl reload minio
```

(Script TODO — currently the rotation is done with manual openssl commands. W4 backlog.)

## JWT keys

`jwt-private.pem` / `jwt-public.pem` are RSA keys, not X.509. They don't "expire" but are rotated yearly. Procedure:

1. Generate a new pair:
   ```bash
   ./scripts/generate-jwt-keys.sh
   ```
2. Roll the public key first (backend trusts new + old) — the current code only loads one key and needs a small extension. **Tracked as v1.1 work.**
3. Until that's wired, rotation means a downtime window in which every existing access token becomes invalid (the 5 min lifetime mitigates this).

## After rotation

1. Hit a public endpoint and confirm the new cert is served:
   ```bash
   echo | openssl s_client -connect api.veza.fr:443 2>/dev/null | openssl x509 -noout -dates
   ```
2. The `CertExpiringSoon` alert clears within one Prometheus scrape interval (~30s) once `probe_ssl_earliest_cert_expiry` is updated by blackbox-exporter.
3. If the cert was rotated under fire (renewal hit a wall, manual replacement), file a postmortem with the timeline.

## What CAN break

- Pinned certs in the mobile app (none today, but keep this in mind for v2+).
- Customer integrations that fetched our public key once and cached it — JWT public key rotation will reject their cached signatures. Until v1.1 we don't promise stable JWT keys to third parties.
108  docs/runbooks/db-failover.md  Normal file
@@ -0,0 +1,108 @@
# Runbook — Postgres failover (`pg_auto_failover`)

> **Alerts**: `PostgresPrimaryUnreachable`, `PostgresReplicationLagHigh` · also reached from `api-availability-slo-burn.md` and `api-latency-slo-burn.md`.
> **Owner**: infra on-call.

## Topology recap

```
        ┌─────────────────┐
        │  pgaf-monitor   │  ← state machine; assigns primary/standby roles
        └────────┬────────┘
                 │  pg_auto_failover protocol
                 │
           ┌─────┴─────┐
           │           │
      ┌────▼─────┐ ┌───▼──────┐
      │  pgaf-   │ │  pgaf-   │
      │  primary │ │  replica │
      └──────────┘ └──────────┘
```

PgBouncer (`pgaf-pgbouncer`, port 6432) sits in front of whichever node is currently primary. The backend reads `DATABASE_URL` from env, and it already points at the bouncer.

## What "failover" looks like

- The primary disappears (crash, host reboot, manual `incus stop`).
- The monitor notices within `pgaf_health_check_interval` (~10s).
- After `pgaf_failover_timeout` (60s), the monitor promotes the replica to primary.
- PgBouncer is reconfigured by the monitor's notify hook; new connections go to the new primary.

**Expected RTO is ~60 seconds.** RPO ≈ 0 if synchronous replication was caught up; up to one tx if async.

## Diagnose state

```bash
# From any node:
sudo -u postgres pg_autoctl show state

# Look for one node with state="primary" and one with state="secondary".
# If both are "wait_for_primary" the formation is wedged.

# Connection-level test (does the bouncer route to a live primary?):
psql "$DATABASE_URL" -c "SELECT now(), pg_is_in_recovery();"
# pg_is_in_recovery = false ⇒ you're hitting the primary
```

## Common failure modes

### A. Monitor is up, primary is down, replica didn't get promoted

Either `pgaf_failover_timeout` hasn't elapsed yet (wait 60s) **or** the replica is too far behind to be promoted safely.

```bash
# On the replica:
sudo -u postgres pg_autoctl show state
# Check the LSN distance — if it's > 1MB the monitor will refuse to promote.
```

If the monitor refused, manual promotion (only if you accept potential data loss):

```bash
sudo -u postgres pg_autoctl perform failover --formation default --group 0
```

### B. Monitor itself is down

The data nodes keep serving their last-known role until the monitor returns. Reads keep working from the standby. **No automatic failover happens** without the monitor — start it before doing anything else.

```bash
sudo systemctl start pg_autoctl@monitor
sudo journalctl -u pg_autoctl@monitor -n 200 --no-pager
```

### C. Both data nodes are down (catastrophe)

Restore from pgBackRest. See the dr-drill runbook in `docs/archive/` (or the `pgbackrest` role README) for the manual procedure. **Estimated RTO ~30 min** with a full+diff already on MinIO.

## Connection routing

PgBouncer holds the routing decision, so during a failover:

```bash
# Confirm which Postgres backend is currently behind the bouncer:
psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer -c "SHOW SERVERS;"
```

If the bouncer is still pointing at the dead primary:

```bash
# Reload the bouncer config (the pg_auto_failover monitor's
# `host_change_hook.sh` should have done this automatically — if not,
# something is broken):
sudo systemctl reload pgbouncer
```

## Backend behavior during failover

The backend's GORM connection pool drops dead connections lazily. Expect a few hundred 5xx during the 30-60s window — this can trip `APIAvailabilitySLOFastBurn`. The alert clears once the pool refills.
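To watch the blip subside, the availability recording rules from `config/prometheus/slo.yml` are the quickest signal (a sketch):

```promql
# Short-window 5xx ratio on reads; it should fall back under 0.072
# (the fast-burn threshold) within a few minutes of the new primary taking over.
veza:slo_api_availability:burnrate_5m
```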

## After recovery

1. Re-add the failed node as standby:
   ```bash
   sudo -u postgres pg_autoctl create postgres ...
   ```
2. Wait for `pg_autoctl show state` to show two healthy nodes.
3. Run the next dr-drill cycle to validate backups against the new primary.
4. Postmortem if downtime > 5 min.
123  docs/runbooks/disk-full.md  Normal file
@@ -0,0 +1,123 @@
# Runbook — Disk full / `/` filesystem at threshold

> **Alerts**: `DiskSpaceLow` (warning at 85%) · `DiskSpaceCritical` (page at 95%).
> **Owner**: infra on-call.

## Hosts to watch

| Host | What fills the disk |
| --- | --- |
| `pgaf-primary` | WAL + autovacuum bloat. WAL fills if pgBackRest archiving falls behind. |
| `pgaf-replica` | Replication lag → WAL not replayed; same WAL accumulation. |
| `pgaf-pgbouncer` | Logs in `/var/log/postgresql/pgbouncer.log` if log_disconnections=on. |
| `tempo` | Trace blocks under `/var/lib/tempo`. Default retention 14d. |
| `otel-collector` | Almost never — no on-disk state by default. |
| API/web hosts (k8s) | Container images, log rotation, build caches. |
| `minio-*` | Object data — the lifecycle policy is supposed to manage this. |

## First moves (under 2 minutes)

```bash
df -h
# Identify the mount that's tight, then:
sudo du -h --max-depth=2 -x /var/lib | sort -hr | head -20
sudo du -h --max-depth=2 -x /var/log | sort -hr | head -20
```

## Postgres data nodes (`pgaf-*`)

### A. WAL piling up

If `/var/lib/postgresql/16/main/pg_wal` is the offender:

```bash
# Is pgBackRest shipping?
sudo -u postgres pgbackrest --stanza=veza info | tail -20

# The last WAL push time should be < 1 minute ago.
```

If pgBackRest is stuck (S3 unreachable, credentials rotated):

1. **Don't** force `pg_resetwal` — that's data loss.
2. Fix the upstream (network, credentials), then push the pending WAL:
   ```bash
   sudo -u postgres pgbackrest --stanza=veza archive-push <wal_file>
   ```

### B. Autovacuum bloat

```bash
sudo -u postgres psql -c "
SELECT relname, n_live_tup, n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
"
```

Manual vacuum on the worst offender:

```bash
sudo -u postgres psql -c "VACUUM (VERBOSE, ANALYZE) <table>;"
# Or VACUUM FULL if you have the downtime — it rewrites the table.
```

## Tempo host

Trace blocks default to 14d retention. If the host is full anyway, the lifecycle compactor isn't keeping up:

```bash
sudo systemctl status tempo
sudo journalctl -u tempo -n 200 --no-pager | grep -i compact
```

Emergency recovery — drop the oldest blocks manually:

```bash
sudo -u tempo find /var/lib/tempo/blocks -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +
```

(This is safe because the blocks are write-once, append-only; the index in `wal/` is rebuilt at restart.)

## API/web hosts (Kubernetes)

```bash
# Images:
kubectl describe node <node> | grep -A 5 "Allocated resources"

# Container logs (rotation should be handling this — check):
sudo du -h --max-depth=1 /var/log/pods | sort -hr | head -10

# If a single pod is logging GB/min, that's a regression. Restart it
# and grep its previous logs for the loop signature.
```

## MinIO

If the storage bucket is full:

```bash
mc admin info veza-minio
mc du veza-minio/ --depth=2
```

Check that the lifecycle policy is applied:

```bash
mc ilm rule list veza-minio/veza-tracks
```

## Recovery verification

Once free space is back:

- Postgres: confirm the `pg_wal` size is bounded (should be < `wal_keep_size` + ~64MB); see the command below.
- Tempo: `df -h /var/lib/tempo` is below 70%.
- The disk-space alert clears within one Prometheus scrape interval (~30s).
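The pg_wal check from the first bullet, spelled out (same path as used in the WAL section above):

```bash
# Current WAL directory size on the data node:
sudo du -sh /var/lib/postgresql/16/main/pg_wal
```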

## Long-term prevention

- pgBackRest archive lag → add an alert for it (the existing `BackupRestoreDrillStale` doesn't catch this); W3 backlog.
- Tempo retention spilling → migrate Tempo to S3-backed storage (`tempo_storage_backend: s3`). W3 day 12 covers this.
- API log volume → tighten log levels in prod (`LOG_LEVEL=INFO`).
86  docs/runbooks/payment-success-slo-burn.md  Normal file
@@ -0,0 +1,86 @@
# Runbook — Payment success SLO burn

> **SLO**: 99.5% of `POST /api/v1/orders` return 2xx (monthly window).
> **Alerts**: `PaymentSuccessSLOFastBurn` (page) · `PaymentSuccessSLOSlowBurn` (ticket)
> **Owner**: payments on-call (rotates with backend on-call until v2.0).

## Why this is critical

A failing checkout means **money lost** (charged customer, no license issued) or **money taken twice** (double-submitted on retry). The worst-case exposure window is the time it takes to roll back the offending change. Treat a fast burn here like a Sev-1 incident.

## First moves (under 5 minutes)

1. **Confirm the alert**:
   ```promql
   sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*",status!~"2.."}[5m]))
   /
   sum(rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[5m]))
   ```

2. **Pivot on status code**:
   ```promql
   sum by (status) (rate(veza_gin_http_requests_total{job="veza-backend",method="POST",path=~"/api/v1/orders.*"}[5m]))
   ```
   - Spike in **502/503** → Hyperswitch unreachable. See "Hyperswitch outage" below.
   - Spike in **400** → marketplace validation failing. A new deploy regressed something — check recent commits to `internal/core/marketplace/`.
   - Spike in **500** → DB / connection / panic. Check the logs for stack traces.

3. **Trace pivot**. "Veza Service Map (Tempo)" → filter `payment.webhook` for recent `status=error` spans.

## Hyperswitch outage

If Hyperswitch is the upstream culprit:

```bash
# Check Hyperswitch's own status:
curl -fsS https://api.hyperswitch.io/health

# Check the last successful webhook landing:
psql "$DATABASE_URL" -c "
SELECT id, hyperswitch_payment_id, status, payment_status, updated_at
FROM orders
WHERE updated_at > NOW() - INTERVAL '15 minutes'
ORDER BY updated_at DESC LIMIT 10;
"
```

If they're all stuck in `payment_status=pending`, Hyperswitch is silently dropping our webhooks. Engage their support **and** queue a manual reconciliation pass once they're back:

```bash
# Manual reconciliation script (still TODO — tracked in W4 day 17):
go run ./cmd/tools/reconcile_orders --since=15m
```

## DB / pool exhaustion

If the failures are 500s and the API logs show `pq: too many connections` or `context deadline exceeded`:

1. Check the pgbouncer queue length:
   ```bash
   psql -h pgaf-pgbouncer.lxd -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
   ```
2. If `cl_waiting > 0` consistently, a slow query is holding pool slots — see `db-failover.md` for finding it.
3. Last resort: restart the backend pod to drop in-flight requests (this loses idempotency on retried requests; only do it if Hyperswitch is *not* in flight on those orders).

## Recovery verification

After the fix:

```bash
# Orders from the last 5 minutes should be `completed` or `pending` (not `failed`):
psql "$DATABASE_URL" -c "
SELECT status, COUNT(*) FROM orders
WHERE created_at > NOW() - INTERVAL '5 minutes'
GROUP BY status;
"
```

The slow-burn window (6h) takes hours to clear after recovery. Don't silence — wait for the metric.

## Reconciliation post-incident

Every fast-burn incident requires a reconciliation pass within 24h:

1. Pull the list of `orders` with `payment_status='pending'` older than 30 minutes (see the query sketch below).
2. For each, query Hyperswitch directly via `GET /payments/{payment_id}` and update.
3. File a postmortem with the count of mismatches resolved.
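A sketch of the step-1 query, using the `orders` columns already shown in this runbook (adjust column names if the schema differs):

```sql
SELECT id, hyperswitch_payment_id, payment_status, created_at
FROM orders
WHERE payment_status = 'pending'
  AND created_at < NOW() - INTERVAL '30 minutes'
ORDER BY created_at;
```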
83  docs/runbooks/redis-down.md  Normal file
@@ -0,0 +1,83 @@
# Runbook — Redis unavailable

> **Alert**: `RedisUnreachable` (existing, in `alert_rules.yml`).
> **Owner**: infra on-call.

## What breaks when Redis is down

Veza uses Redis for several distinct concerns; the impact differs by callsite.

| Subsystem | Effect when Redis is gone | Severity |
| --- | --- | --- |
| Session storage / refresh tokens | Login/refresh fail — users log out on next request | **HIGH** — most users notice within minutes |
| Rate limiter (`UserRateLimiter`) | Fails open — requests stop being rate-limited | MEDIUM — capacity risk if Redis stays down for hours |
| JWT revocation | Revoked tokens accepted again | **SECURITY** — silent failure, no user-visible signal |
| Cache (track lookups, feed pages) | Slow but works — falls back to Postgres | LOW — surfaces as elevated p95 |
| Queue (RabbitMQ-fronted jobs) | Independent — RabbitMQ is the queue, Redis is just metrics | NONE |

## First moves

1. **Confirm Redis is actually down**, not "just unreachable from one host":
   ```bash
   redis-cli -h redis.lxd ping
   ```
2. If it's a single-host issue, skip ahead to "Backend can't reach Redis" below.

## Redis instance is down

```bash
# Check the systemd state on whichever host owns Redis:
sudo systemctl status redis

# If "failed", inspect the logs:
sudo journalctl -u redis -n 200 --no-pager

# Disk full? The dump dir is /var/lib/redis:
df -h /var/lib/redis
```

Common causes:

- **OOM-killed by RDB snapshot.** `maxmemory` reached, no eviction policy, and the snapshot fork doubled the RSS. Set `maxmemory-policy allkeys-lru` and bump `maxmemory` (see the check below).
- **Disk full.** AOF or RDB filling `/var/lib/redis`. Truncate the AOF (`BGREWRITEAOF`) or move the dir.
- **Process crashed.** Bring it back up: `sudo systemctl restart redis`.
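To confirm the memory theory from the first bullet (standard `redis-cli` subcommands; `redis.lxd` as used above):

```bash
# Current memory usage vs. the configured ceiling, and the eviction policy in force.
redis-cli -h redis.lxd info memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'
redis-cli -h redis.lxd config get maxmemory-policy
```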

## Backend can't reach Redis

A network/DNS issue, not a Redis fault. Check:

```bash
# From the API container:
nc -zv redis.lxd 6379

# DNS resolution:
getent hosts redis.lxd
```

Likely culprits: an Incus bridge restart, a security group change on the API host, a stale DNS cache.

## Mitigation while Redis is down

The backend's `internal/cache/redis_cache.go` already has fallback logic for the cache path. The session path fails loud; the rate limiter fails open (see the table above). If recovery is going to take > 5 min:

1. **Drain new logins** by surfacing a maintenance banner on the frontend: flip `MAINTENANCE_MODE=true` in the API env and restart (existing — set in `internal/middleware/maintenance.go`).
2. **Do NOT drop the rate limiter to "always allow"** — temporarily switch it to "always deny" via env (`RATELIMIT_FAIL_CLOSED=true`) so abuse can't ride the outage.

## Recovery

Once Redis is back:

1. Verify connectivity from each backend instance:
   ```bash
   docker exec veza-backend-api redis-cli -u "$REDIS_URL" ping
   ```
2. Refresh tokens were lost, but access tokens (5 min lifetime) keep working until they expire; users will be prompted to log in again as their access tokens roll over.
3. The cache is cold — the next 5-15 min of traffic hits Postgres harder. Monitor "Veza API Overview" → the "p95 latency" panel.

## Postmortem trigger

Any Redis outage > 10 min triggers a postmortem. The session-loss UX is bad; we want to know the time-to-detect and time-to-recover.

## Future-proofing

Redis Sentinel HA is **W3 day 11** on the launch roadmap. Once that's in, this runbook's "instance is down" section reduces to "the failover happened, verify the new master."