veza/scripts/soft-launch/monitor-checks.sh
senke 112c64a22b
feat(soft-launch): cohort tooling + email template + monitor + checklist
The soft-launch report doc (SOFT_LAUNCH_BETA_2026.md) had the
narrative — cohort table, email body inline, monitoring list,
acceptance gate. But the operational pieces were notes-to-self :
"add migration if missing", "Typeform to-do", "schema TBD". The
operator was supposed to assemble them on the day, which on a soft-
launch day is the worst possible time.

Added the missing 6 pieces so the day-of work is "tick boxes",
not "build the tooling" :

  * migrations/990_beta_invites.sql — schema with code (16-char
    base32-ish), email, cohort label, used_at, expires_at + 30d
    default, sent_by FK with ON DELETE SET NULL. Three indexes :
    unique on code (signup-path lookup), cohort (post-launch
    attribution report), partial expires_at WHERE used_at IS NULL
    (cleanup cron).

  * scripts/soft-launch/validate-cohort.sh — sanity check on the
    operator's CSV : header form, malformed emails, duplicates,
    cohort distribution (≥50 total / ≥5 creators / ≥3 distinct
    labels), optional collision check against existing users.
    Exit codes 0 / 1 (block) / 2 (warn-but-proceed). Hard checks
    block, soft checks let the operator override with FORCE=1.

  * scripts/soft-launch/send-invitations.sh — split-phase :
      step 1 (default) inserts beta_invites rows + renders one .eml
        per recipient under scripts/soft-launch/out-<date>/
      step 2 (SEND=1) dispatches via $SEND_CMD (msmtp by default)
    so the operator can review the rendered emls before sending
    100 emails. Per-recipient transactional INSERT so a partial
    failure doesn't poison the table. Failed inserts logged with
    the offending email so the operator can rerun on the subset.

  * templates/email/beta_invite.eml.template — proper MIME multipart
    (text + HTML) eml ready for sendmail-compatible piping. French
    copy aligned with the ethical brand (no FOMO, no urgency
    manipulation, no "limited spots" framing).

  * scripts/soft-launch/monitor-checks.sh — polls the 6 acceptance-
    gate signals defined in SOFT_LAUNCH_BETA_2026.md §"Acceptance
    gate" : testers signed up, Sentry P1 events, status page,
    synthetic parcours, k6 nightly age, HIGH issues. Each gate
    independently emits ✅ / 🔴 / ⚪ (last for "couldn't check").
    Verdict on stdout. LOOP=1 keeps polling every CHECK_INTERVAL
    seconds. Designed for cron + tmux, not for an interactive UI.

  * docs/SOFT_LAUNCH_BETA_2026_CHECKLIST.md — pre-flight gate that
    must reach 100% green before the first invitation goes out.
    T-72h section (database, cohort, email infra, redemption path,
    monitoring, comms), D-day section (last-hour, send, hour-1,
    every-4h), 18:00 UTC decision call section. Linked back to the
    bigger SOFT_LAUNCH_BETA_2026.md so the operator can navigate
    between the "what" (report) and the "how / has-everything-
    been-checked" (this checklist) without losing context.
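
The exit-code contract of validate-cohort.sh (0 / 1 block / 2 warn-but-proceed) lends itself to a small gate in front of the render step. A minimal sketch, assuming the codes mean exactly what the list above says; `next_action` is a hypothetical helper, not part of this commit, and the CSV argument in the usage comment is also an assumption:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map validate-cohort.sh's exit code to the next
# action, following the 0 / 1 (block) / 2 (warn-but-proceed) contract.
next_action() {
    case "$1" in
        0) echo "render" ;;                 # clean cohort: go on to step 1
        2) echo "render-with-warnings" ;;   # soft checks tripped only
        *) echo "stop" ;;                   # 1, or anything unexpected, blocks
    esac
}

# Illustrative use (the CSV argument is an assumption):
#   bash scripts/soft-launch/validate-cohort.sh cohort.csv; rc=$?
#   [ "$(next_action "$rc")" = "stop" ] && exit 1
next_action 2    # prints: render-with-warnings
```

Treating every unknown code as "stop" keeps a crashed validator from being read as a pass.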

What still requires the operator on the day :
  - Build the cohort CSV (curate emails from real sources)
  - Create the Typeform feedback form ; paste its URL into the
    eml template once known
  - Configure msmtp / sendmail ($SEND_CMD)
  - Press the send button
  - Show up at 18:00 UTC for the decision call

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:38:12 +02:00

#!/usr/bin/env bash
# monitor-checks.sh — poll the soft-launch acceptance gate live during
# the beta window so the operator gets a heads-up before the decision
# call instead of discovering at 18:00 UTC that one threshold is red.
#
# Acceptance gate (per docs/SOFT_LAUNCH_BETA_2026.md §"Acceptance gate") :
# - ≥ 50 testers signed up (used_at IS NOT NULL on beta_invites)
# - 0 P1 events in Sentry today
# - Status page green for the last 4 h
# - Synthetic parcours all green for 6 h
# - Nightly k6 load test green
# - < 3 HIGH-severity issues reported
#
# v1.0.10 Cluster 3.4.
#
# Usage :
#   DATABASE_URL=postgres://... \
#   SENTRY_AUTH_TOKEN=... \
#   STATUSPAGE_URL=https://status.veza.fr \
#   PROM_URL=https://prom.veza.fr \
#     bash scripts/soft-launch/monitor-checks.sh
#
# By default the script runs once and exits with the gate's verdict.
# Run it from cron (e.g. every 30 min) or pass LOOP=1 to keep checking
# in-place every CHECK_INTERVAL seconds (default 600 = 10 min).
#
# Optional env :
#   LOOP=1             continuous mode
#   CHECK_INTERVAL     seconds between checks in LOOP mode (default 600)
#   QUIET=1            only emit the verdict line (for cron piping)
#   THRESHOLD_TESTERS  override 50 (default), e.g. set to 100 for
#                      a stricter sub-window
#
# Exit codes :
#   0 — every gate green
#   1 — at least one gate red
#   2 — at least one gate could not be checked (collector down,
#       token wrong, etc.) — operator must verify manually
#   3 — required env / tool missing
set -euo pipefail
DATABASE_URL=${DATABASE_URL:-?}
SENTRY_AUTH_TOKEN=${SENTRY_AUTH_TOKEN:-?}
STATUSPAGE_URL=${STATUSPAGE_URL:-https://status.veza.fr}
PROM_URL=${PROM_URL:-?}
LOOP=${LOOP:-0}
CHECK_INTERVAL=${CHECK_INTERVAL:-600}
QUIET=${QUIET:-0}
THRESHOLD_TESTERS=${THRESHOLD_TESTERS:-50}
[ "$DATABASE_URL" = "?" ] && { echo "DATABASE_URL required" >&2; exit 3; }
[ "$SENTRY_AUTH_TOKEN" = "?" ] && { echo "SENTRY_AUTH_TOKEN required (read scope sufficient)" >&2; exit 3; }
[ "$PROM_URL" = "?" ] && { echo "PROM_URL required" >&2; exit 3; }
command -v psql >/dev/null 2>&1 || { echo "psql required" >&2; exit 3; }
command -v curl >/dev/null 2>&1 || { echo "curl required" >&2; exit 3; }
command -v jq >/dev/null 2>&1 || { echo "jq required" >&2; exit 3; }
# ----------------------------------------------------------------------
# Individual gate checks. Each prints "✅ <name>" / "🔴 <name>" / "⚪ <name>"
# (last for "could not check"), and sets one of GATE_*_OK to 0 / 1 / 2.
# ----------------------------------------------------------------------
GATE_TESTERS_OK=2
GATE_SENTRY_OK=2
GATE_STATUSPAGE_OK=2
GATE_SYNTHETIC_OK=2
GATE_K6_OK=2
GATE_ISSUES_OK=2
check_testers() {
    local count
    count=$(psql "$DATABASE_URL" -A -t -c "
        SELECT count(*) FROM beta_invites WHERE used_at IS NOT NULL;
    " 2>/dev/null | tr -d ' ' || echo "?")
    if [ "$count" = "?" ] || ! [[ "$count" =~ ^[0-9]+$ ]]; then
        echo "⚪ testers signed-up : check failed (psql)"
        GATE_TESTERS_OK=2
        return
    fi
    if [ "$count" -ge "$THRESHOLD_TESTERS" ]; then
        echo "✅ testers signed-up : $count / $THRESHOLD_TESTERS"
        GATE_TESTERS_OK=0
    else
        echo "🔴 testers signed-up : $count / $THRESHOLD_TESTERS"
        GATE_TESTERS_OK=1
    fi
}
check_sentry_p1() {
    # Sentry API : count of unresolved P1 (level:fatal) issues, last 24h.
    # -f makes curl fail on HTTP 4xx/5xx, and the jq type guard rejects
    # anything that is not a JSON array, so an auth-error payload is
    # reported as "check failed" and never counted as an issue.
    local count
    count=$(curl -sf -H "Authorization: Bearer $SENTRY_AUTH_TOKEN" \
        "https://sentry.io/api/0/projects/veza/veza-backend/issues/?statsPeriod=24h&query=is:unresolved%20level:fatal" \
        2>/dev/null \
        | jq 'if type == "array" then length else error("not an array") end' \
        2>/dev/null || echo "?")
    if [ "$count" = "?" ] || ! [[ "$count" =~ ^[0-9]+$ ]]; then
        echo "⚪ Sentry P1 events 24h : check failed (auth or network)"
        GATE_SENTRY_OK=2
        return
    fi
    if [ "$count" -eq 0 ]; then
        echo "✅ Sentry P1 events 24h : 0"
        GATE_SENTRY_OK=0
    else
        echo "🔴 Sentry P1 events 24h : $count (must be 0)"
        GATE_SENTRY_OK=1
    fi
}
check_statuspage() {
    local status
    status=$(curl -s "$STATUSPAGE_URL/api/v1/status" 2>/dev/null \
        | jq -r '.indicator // .status.indicator // ""' 2>/dev/null || echo "")
    case "$status" in
        none|operational)
            echo "✅ status page : $status (green)"
            GATE_STATUSPAGE_OK=0
            ;;
        minor|major|critical)
            echo "🔴 status page : $status"
            GATE_STATUSPAGE_OK=1
            ;;
        *)
            echo "⚪ status page : check failed (got '$status')"
            GATE_STATUSPAGE_OK=2
            ;;
    esac
}
check_synthetic() {
    # PromQL : min_over_time of probe_success over the last 6h ; a
    # parcours whose probe failed at least once in the window has a
    # minimum of 0 and shows up in the result.
    local query='min_over_time(probe_success{probe_kind="synthetic"}[6h]) == 0'
    local resp
    # `|| true` keeps `set -e` from aborting the run when curl fails.
    resp=$(curl -s --get "$PROM_URL/api/v1/query" \
        --data-urlencode "query=$query" 2>/dev/null || true)
    local result_count
    # A Prometheus error payload has no .data ; the status guard stops
    # `null | length` (which is 0) from reading as "all green".
    result_count=$(echo "$resp" \
        | jq 'if .status == "success" then (.data.result | length) else error("query failed") end' \
        2>/dev/null || echo "?")
    if [ "$result_count" = "?" ] || ! [[ "$result_count" =~ ^[0-9]+$ ]]; then
        echo "⚪ synthetic parcours : check failed (Prometheus)"
        GATE_SYNTHETIC_OK=2
        return
    fi
    if [ "$result_count" -eq 0 ]; then
        echo "✅ synthetic parcours : all green"
        GATE_SYNTHETIC_OK=0
    else
        local failing
        failing=$(echo "$resp" | jq -r '.data.result[].metric.parcours' 2>/dev/null | tr '\n' ',' | sed 's/,$//')
        echo "🔴 synthetic parcours : $result_count failing ($failing)"
        GATE_SYNTHETIC_OK=1
    fi
}
check_k6_nightly() {
    # k6 nightly is exposed as veza_k6_nightly_last_success_timestamp_seconds
    # by the Forgejo runner workflow's textfile-collector. Reading via Prom
    # gives "is the last success < 30h old?".
    local query='time() - veza_k6_nightly_last_success_timestamp_seconds'
    local resp age age_int
    # `|| true` on both substitutions : under `set -e` a failing curl or
    # jq would otherwise end the script before any verdict is printed.
    resp=$(curl -s --get "$PROM_URL/api/v1/query" \
        --data-urlencode "query=$query" 2>/dev/null || true)
    age=$(echo "$resp" | jq -r '.data.result[0].value[1] // ""' 2>/dev/null || true)
    if [ -z "$age" ] || [ "$age" = "null" ]; then
        echo "⚪ k6 nightly : check failed (metric absent — runner offline?)"
        GATE_K6_OK=2
        return
    fi
    age_int=$(printf '%.0f' "$age" 2>/dev/null || echo 999999)
    if [ "$age_int" -lt 108000 ]; then # 30h
        echo "✅ k6 nightly : last success $(( age_int / 3600 ))h ago"
        GATE_K6_OK=0
    else
        echo "🔴 k6 nightly : last success $(( age_int / 3600 ))h ago (> 30h)"
        GATE_K6_OK=1
    fi
}
check_high_issues() {
    # The operator-reported issues count lives in the SOFT_LAUNCH_BETA_2026.md
    # report under "Issues reported". Without an external tracker we read it
    # from a known location in the report file. Skip if file absent.
    local report
    report="$(cd "$(dirname "$0")/../.." && pwd)/docs/SOFT_LAUNCH_BETA_2026.md"
    if [ ! -f "$report" ]; then
        echo "⚪ HIGH issues count : report file not found"
        GATE_ISSUES_OK=2
        return
    fi
    local count
    # grep -c prints the count itself (including 0) but exits 1 on zero
    # matches ; `|| true` absorbs that status without adding output.
    count=$(grep -cE '^\| HIGH ' "$report" 2>/dev/null || true)
    if ! [[ "$count" =~ ^[0-9]+$ ]]; then
        echo "⚪ HIGH issues count : report file unreadable"
        GATE_ISSUES_OK=2
        return
    fi
    if [ "$count" -lt 3 ]; then
        echo "✅ HIGH-severity issues reported : $count / < 3"
        GATE_ISSUES_OK=0
    else
        echo "🔴 HIGH-severity issues reported : $count / < 3"
        GATE_ISSUES_OK=1
    fi
}
# ----------------------------------------------------------------------
# Main loop
# ----------------------------------------------------------------------
run_once() {
    if [ "$QUIET" != "1" ]; then
        echo "================================================================"
        echo "Acceptance gate check — $(date -u +'%Y-%m-%d %H:%M:%S UTC')"
        echo "----------------------------------------------------------------"
    fi
    # QUIET=1 is documented as "only emit the verdict line", so per-gate
    # output is discarded in that mode. The brace group runs in the
    # current shell, which keeps the GATE_*_OK assignments visible below.
    if [ "$QUIET" = "1" ]; then
        {
            check_testers
            check_sentry_p1
            check_statuspage
            check_synthetic
            check_k6_nightly
            check_high_issues
        } >/dev/null
    else
        check_testers
        check_sentry_p1
        check_statuspage
        check_synthetic
        check_k6_nightly
        check_high_issues
        echo "----------------------------------------------------------------"
    fi
    local red=0 unknown=0 v
    for v in "$GATE_TESTERS_OK" "$GATE_SENTRY_OK" "$GATE_STATUSPAGE_OK" \
             "$GATE_SYNTHETIC_OK" "$GATE_K6_OK" "$GATE_ISSUES_OK"; do
        case $v in
            1) red=$(( red + 1 )) ;;
            2) unknown=$(( unknown + 1 )) ;;
        esac
    done
    if [ "$red" -eq 0 ] && [ "$unknown" -eq 0 ]; then
        echo "VERDICT : ALL GATES GREEN — soft-launch is GO"
        return 0
    elif [ "$red" -gt 0 ]; then
        echo "VERDICT : $red gate(s) RED — NO-GO until resolved"
        return 1
    else
        echo "VERDICT : $unknown gate(s) UNCHECKABLE — operator must verify manually before decision call"
        return 2
    fi
}
if [ "$LOOP" != "1" ]; then
    run_once
    exit $?
fi
# Continuous mode.
while true; do
    run_once || true
    echo ""
    echo "next check in ${CHECK_INTERVAL}s — Ctrl-C to exit"
    sleep "$CHECK_INTERVAL"
done
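
The HIGH-issues gate depends on the report containing markdown table rows that start with `| HIGH `. A minimal sketch of that counting step on a throwaway file; `count_high` is a hypothetical stand-alone helper, and the two-column sample table is invented for illustration (only the leading `| HIGH ` cell comes from the script's regex):

```shell
#!/usr/bin/env bash
# Hypothetical helper: count table rows whose first cell is HIGH.
# grep -c prints the count itself but exits 1 when the count is 0,
# so `|| true` only absorbs the exit status and adds no output.
count_high() {
    local report=$1
    grep -cE '^\| HIGH ' "$report" 2>/dev/null || true
}

# Sample report fragment; the column layout is made up for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
| Severity | Summary |
|----------|---------|
| HIGH | login loop on Safari |
| LOW | typo in footer |
| HIGH | upload stalls at 99% |
EOF

count_high "$sample"    # prints: 2
rm -f "$sample"
```

Rows that are indented, or that spell the severity differently (`| High |`, `|HIGH|`), would not match this pattern, so the report table has to stick to the exact form.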