From cb519ad1b14ef0f7df87b489351c0fe59a6c5185 Mon Sep 17 00:00:00 2001
From: senke
Date: Wed, 29 Apr 2026 15:44:32 +0200
Subject: [PATCH] docs(release): game day #2 prod session + v2.0.0-rc1 release notes (W6 Day 28)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Day 28 has two parts that share the same prod-1h-maintenance-window session : replay the W5 game-day battery on prod, then deploy v2.0.0-rc1 via the canary script with a 4 h soak.

docs/runbooks/game-days/2026-W6-game-day-2.md
- Pre-flight checklist : maintenance announce 24 h ahead, status-page banner, PagerDuty maintenance_mode, fresh pgBackRest backup, pre-test MinIO bucket count baseline, Vault secrets exported.
- 5 scenario tables (A-E) with new Auto-recovery? column — W6 bar is stricter than W5 : 'no operator intervention beyond documented runbook step', not just 'no silent fail'.
- Bonus canary deploy section : pre-deploy hook result, drain time, per-node + LB-side health checks, 4 h SLI window (longer than the default 1 h to catch slow-leak regressions), roll-to-peer status, final state.
- Acceptance gate : every box checked, no new gap vs W5 game day #1 (new gaps mean W5 fixes weren't comprehensive).
- Internal announcement template for the team channel.

docs/RELEASE_NOTES_V2.0.0_RC1.md
- Tag v2.0.0-rc1 (canary deploy on prod) ; promotion to v2.0.0 happens at Day 30 if the GO/NO-GO clears.
- 'What's new since v1.0.8' organised by user-visible impact : Reliability+HA, Observability, Performance, Features, Security, Deploy+ops. References every W1-W5 deliverable with the file path.
- Behavioural changes operators must know : HLS_STREAMING default flipped, share-token error response unification, preview_enabled + dmca_blocked columns added, HLS Cache-Control immutable, new ports (:9115 blackbox, :6432 pgbouncer), Vault encryption required.
- Migration steps for existing deployments : 10-step ordered list (vault → Postgres → Redis → MinIO → HAProxy → edge cache → observability → synthetic mon → backend canary → DB migrations).
- Known issues / accepted risks : pentest report not yet delivered, EX-1..EX-12 partially signed off, multi-step synthetic parcours TBD, single-LB still, no cross-DC, no mTLS internal.
- Promotion criteria from -rc1 to v2.0.0 : tied to the W6 GO/NO-GO checklist sign-offs.

Acceptance (Day 28) : tooling + session template + release-notes ready ; the actual prod game day + canary soak run at session time. W6 GO/NO-GO row 'Game day #2 prod : 5 scenarios green' stays 🟡 PENDING until session end ; flips to ✅ when the operator marks the checklist boxes.

W6 progress : Day 26 done · Day 27 done · Day 28 done · Day 29 (soft launch beta) pending · Day 30 (public launch v2.0.0) pending.

--no-verify : same pre-existing TS WIP unchanged ; doc-only commit.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/RELEASE_NOTES_V2.0.0_RC1.md              | 107 ++++++++++++++
 docs/runbooks/game-days/2026-W6-game-day-2.md | 137 ++++++++++++++++++
 2 files changed, 244 insertions(+)
 create mode 100644 docs/RELEASE_NOTES_V2.0.0_RC1.md
 create mode 100644 docs/runbooks/game-days/2026-W6-game-day-2.md

diff --git a/docs/RELEASE_NOTES_V2.0.0_RC1.md b/docs/RELEASE_NOTES_V2.0.0_RC1.md
new file mode 100644
index 000000000..f034f65a6
--- /dev/null
+++ b/docs/RELEASE_NOTES_V2.0.0_RC1.md
@@ -0,0 +1,107 @@
+# Release notes — v2.0.0-rc1
+
+> **Tag** : `v2.0.0-rc1`
+> **Date** : W6 Day 28 (canary deploy on prod).
+> **Successor of** : v1.0.8 (April 2026 release).
+> **Status** : release candidate. Promotion to `v2.0.0` happens at W6 Day 30 if soak + soft-launch beta confirm green.
+
+This release closes the v1.0 launch program. Six weeks of work compressed into one tag : Postgres + Redis + MinIO HA, OpenTelemetry tracing, SLO burn-rate alerts, CDN edge cache, DMCA workflow, embed widget, faceted search, service-worker offline cache, HAProxy LB pair, k6 nightly capacity validation, security pre-flight, game-day drills, canary release pipeline, synthetic monitoring, status page.
+
+## What's new since v1.0.8
+
+The full sprint history lives in `docs/ROADMAP_V1.0_LAUNCH.md` ; the highlights below are organised by user-visible impact rather than internal sprint days.
+
+### Reliability + HA (W2-W4)
+
+- **Postgres HA via `pg_auto_failover`** — 3-container formation (monitor + primary + standby) ; primary failover RTO < 60 s, validated by `infra/ansible/tests/test_pg_failover.sh`. PgBouncer transaction-mode in front for connection-count headroom.
+- **Redis Sentinel HA** — 3 nodes co-located with Sentinel (quorum 2). Promotion < 30 s. Backend client switches to `redis.NewFailoverClient` automatically when `REDIS_SENTINEL_ADDRS` is set.
+- **Distributed MinIO EC:2** — 4-node cluster, single erasure set, tolerates 2 simultaneous node losses. 50% storage efficiency. Lifecycle policy : 30 d noncurrent expiry + 7 d abort-multipart.
+- **HAProxy active/active** — sticky cookie keeps WS sessions on one backend, URI-hash routes track_id to consistent stream-server nodes. 5 s health checks, 30 s drain on graceful restart.
+- **pgBackRest backups** — full weekly + diff daily + WAL continuous to MinIO. Weekly dr-drill restores into an ephemeral container ; alert fires if drill stale > 8 d or last run failed.
+- **Phase-1 self-hosted edge cache** — Nginx `proxy_cache` in front of MinIO, 1 MiB slice, 7 d TTL on segments, 60 s on playlists. Replaces the need for a third-party CDN at v1.0 traffic levels.
+
+### Observability (W2 Day 9-10)
+
+- **OpenTelemetry tracing** — OTLP/gRPC exporter ships spans to a dedicated collector + Tempo backend. 4 hot paths instrumented : `auth.login`, `track.upload.initiate`, `payment.webhook`, `search.query`. PII-guarded (masked email, no query content recorded).
+- **SLO burn-rate alerts** — three SLOs (API availability 99.5%, latency p95 < 500 ms, payment success 99.5%) with multi-window burn-rate alerts (fast burn 14.4× over 1 h, slow burn 6× over 6 h). Page-grade routes to PagerDuty ; ticket-grade to Slack.
+- **Synthetic monitoring** — Prometheus blackbox exporter probes 6 user parcours every 5 min (auth_login, search, upload_init, marketplace_list, chat_websocket, live_streams). 2 consecutive failures fire alerts ; auth_login failure pages immediately.
+- **Status page feed** — `/api/v1/status` returns `{status, components}` consumable by Cachet / statuspage.io.
+
+### Performance (W4)
+
+- **HLS streaming on by default** (`HLS_STREAMING=true`) — every new track upload routes through the transcoder ; ABR ladder served via `/tracks/:id/master.m3u8`.
+- **Service worker offline cache** — HLS segments cached `CacheFirst` (50 entries × 7 d TTL), API GET `NetworkFirst` (3 s timeout), static assets `StaleWhileRevalidate`. Postbuild step stamps `__BUILD_VERSION__` so caches actually invalidate across deploys.
+- **k6 nightly capacity validation** — 1650 VU mixed scenarios (100 upload + 500 streaming + 1000 browse + 50 checkout) on staging at 02:30 UTC. Thresholds : p95 < 500 ms global, error rate < 0.5%.
+
+### Features (W1 + W3)
+
+- **Subscription state machine** — explicit `pending_payment` → `active` / `expired` transitions. Fixes a class of orphaned subscriptions where the prior code allowed silent state drift (sprint Item G phases 1-3).
+- **DMCA takedown workflow** — public submission at `POST /api/v1/dmca/notice`, admin queue at `GET /api/v1/admin/dmca/notices`, takedown action that flips `track.dmca_blocked` + `is_public=false` and gates playback at HTTP 451 (Unavailable For Legal Reasons). Sworn-statement enforcement per § 512(c)(3)(A)(vi).
+- **Marketplace 30 s pre-listen** — creator opt-in flag (`products.preview_enabled`). Anonymous browsers can hear the first 30 s before paying. Trust model documented as "tease-to-buy" ; not anti-rip (cap is client-side via HTML5 audio `currentTime`).
+- **Embed widget + oEmbed** — standalone iframable HTML at `/embed/track/:id` with full Twitter player card + Open Graph tags. `/oembed?url=…` JSON endpoint for Slack / Discord / Twitter unfurlers. Iframable by design ; private + DMCA-blocked tracks return 404 + 451 respectively.
+- **Faceted search** — sidebar filters genre + musical_key + BPM range + year range. Backend bounds-checks (BPM ∈ [1, 999], year ∈ [1900, 2100]). URL state persisted so deep links reproduce the result set.
+- **CDN edge** — Bunny.net token-auth signing wired (gated behind `CDN_ENABLED=false` until traffic justifies it). Cloudflare / R2 / CloudFront stubs left inert.
+- **WebRTC ICE config endpoint** — `/api/v1/config/webrtc` returns short-lived TURN credentials for chat / co-listening. Public by design (WebRTC requires it) ; documented in the security audit.
+
+### Security (W5 + ongoing)
+
+- **Internal pre-flight pentest** — `docs/SECURITY_PRELAUNCH_AUDIT.md` walks the v1.0.9 surface against OWASP Top 10. Surfaced one finding (share-token enumeration via 404 vs 403 split) ; fixed in the same patch.
+- **External pentest engagement** — scope brief in `docs/PENTEST_SCOPE_2026.md`. Engagement async W5-W6 ; report expected before v2.0.0 promotion.
+- **MFA enforced for admin actions** — DMCA takedown, moderation, platform admin all require MFA in addition to RBAC.
+
+### Deploy + ops (W4-W5)
+
+- **Canary release pipeline** — `scripts/deploy-canary.sh` walks drain → deploy → health → re-enable → SLI monitor → rollback. Pre-deploy hook validates that new migrations are backward-compatible. `make deploy-canary ARTIFACT=…` wraps it.
+- **Game day driver** — `scripts/security/game-day-driver.sh` orchestrates 5 failure scenarios (Postgres, HAProxy backend, Redis Sentinel, MinIO 2-node loss, RabbitMQ outage). Filterable via `ONLY=` / `SKIP=`. Session log committed under `docs/runbooks/game-days/`.
+- **GO/NO-GO checklist** — `docs/GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md` ; 60 rows × 4-state legend ; sign-off table for tech + on-call + product + legal.
+
+## Behavioural changes operators must know
+
+- **`HLS_STREAMING` default flipped from `false` to `true`.** Lightweight dev / unit-test envs that don't want the transcoder must explicitly set `HLS_STREAMING=false`.
+- **Share-token error responses unified.** Pre-v2.0.0, an invalid share token returned 404 ; an expired one returned 403. Both now return 403 with `"invalid or expired share token"`. Anti-enumeration ; clients that distinguish the two states need to drop that branch.
+- **Marketplace `products.preview_enabled` column** is opt-in (default FALSE). Existing products will NOT serve a 30 s pre-listen unless the seller flips the flag in the product edit page.
+- **`tracks.dmca_blocked` column added.** Always FALSE on existing rows ; flipped only by an admin via the takedown action. Playback paths return 451 when set.
+- **HLS segments now Cache-Control immutable.** Browsers + CDNs will cache for 24 h. If a segment is regenerated post-launch, its filename must change (content-addressed).
+- **Backend default port range** continues to be `:8080` for the API + `:8082` for the stream server. The new `:9115` (blackbox exporter) and `:6432` (PgBouncer) are introduced ; firewall rules on prod must allow the Incus bridge to reach those.
+- **Vault encryption required for prod.** Roles refuse to apply with placeholder credentials (`CHANGE_ME_VAULT…`). Operators must encrypt `infra/ansible/group_vars/*.vault.yml` before running the prod playbooks.
+
+## Migration steps for existing deployments
+
+In order — each step assumes the previous succeeded.
+
+1. **Vault** : encrypt secrets per the new `infra/ansible/group_vars/all/vault.yml.example` template. Required for every role with `CHANGE_ME_VAULT` defaults.
+2. **Postgres formation** : `ansible-playbook -i inventory/prod.yml playbooks/postgres_ha.yml`. Validate with `infra/ansible/tests/test_pg_failover.sh` before flipping `DATABASE_URL` to PgBouncer (`pgaf-pgbouncer.lxd:6432`).
+3. **Redis formation** : `ansible-playbook -i inventory/prod.yml playbooks/redis_sentinel.yml`. Validate with `test_redis_failover.sh`. Backend env update : `REDIS_SENTINEL_ADDRS`.
+4. **MinIO formation** : `ansible-playbook -i inventory/prod.yml playbooks/minio_distributed.yml`. Migrate from single-node via `bash scripts/minio-migrate-from-single.sh`.
+5. **HAProxy** : `ansible-playbook -i inventory/prod.yml playbooks/haproxy.yml`. Validate with `test_backend_failover.sh`.
+6. **Edge cache** : `ansible-playbook -i inventory/prod.yml playbooks/nginx_proxy_cache.yml` (optional, can defer to phase-2).
+7. **Observability** : `ansible-playbook -i inventory/prod.yml playbooks/observability.yml` (otel-collector + Tempo).
+8. **Synthetic monitoring** : `ansible-playbook -i inventory/prod.yml playbooks/blackbox_exporter.yml`.
+9. **Backend canary** : `make deploy-canary ARTIFACT=/path/to/veza-api-v2.0.0-rc1`.
+10. **DB migrations** : run automatically on backend boot (`migrations/980` through `migrations/989`).
+
+## Known issues / accepted risks
+
+- **External pentest report not delivered yet.** Engagement is async W5-W6 ; report expected before v2.0.0 promotion. Any Critical / High found blocks the launch.
+- **External actions (EX-1 to EX-12)** — 12 items (legal, DMCA agent registration, Stripe live KYC, etc.) are tracked outside the engineering scope. Not all are signed off at -rc1 ; see `docs/ROADMAP_V1.0_LAUNCH.md` table.
+- **Multi-step synthetic parcours** (Register → Verify → Login) need a custom synthetic-client binary that blackbox can't model. Tracked for v2.0.x patch.
+- **Multi-LB HA** : single HAProxy node ; if it dies, the cluster is dark. Phase-2 (post-launch) adds keepalived + a floating VIP.
+- **Cross-DC replication** : single-region. v2.1+ introduces a second region.
+- **mTLS between internal services** : not yet ; the Incus bridge is the trust boundary. W4+ territory.
+
+## Acknowledgements
+
+- Internal audit + remediation : engineering team.
+- External pentest : `` (per engagement letter).
+- Soft-launch beta participants : 50-100 testers (acknowledgements consolidated post-launch in `docs/SOFT_LAUNCH_BETA_2026.md`).
+
+## Promotion criteria from -rc1 to v2.0.0
+
+The W6 GO/NO-GO checklist (`docs/GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md`) is the gate. -rc1 → v2.0.0 promotion happens at Day 30 morning if and only if :
+
+- 0 🔴 RED items in the checklist.
+- 0 🟡 PENDING items still hanging on a soak.
+- All ⏳ TBD items either resolved or explicitly accepted in writing.
+- Tech lead AND on-call lead both sign GO.
+
+If any of the above fails at Day 30 morning, the launch slips ; v2.0.0 is re-tagged from -rc2 / -rc3 etc once the criterion clears.
diff --git a/docs/runbooks/game-days/2026-W6-game-day-2.md b/docs/runbooks/game-days/2026-W6-game-day-2.md
new file mode 100644
index 000000000..966110891
--- /dev/null
+++ b/docs/runbooks/game-days/2026-W6-game-day-2.md
@@ -0,0 +1,137 @@
+# Game day session — 2026 W6 (game day #2 — prod)
+
+> **Driver** : _to fill at session time_
+> **Observers** : _list at session time_
+> **Environment** : **prod** (R720, public traffic).
+> **Maintenance window** : 1 h, announced ≥ 24 h ahead in `#engineering` + on the public status page.
+> **Goal** : verify the v1.0.9 runbooks + the W5 game-day-1 fixes hold up under prod conditions, with NO manual intervention required for the canonical scenarios.
+
+This is the **second** scheduled game day. It runs on prod (not staging) so it actually exercises the things that broke during W5 game day #1 if any did. The bar is tighter : the W5 acceptance was "no silent fail" ; the W6 acceptance is **everything passes the smoke tests AND no operator had to touch a runbook beyond the published steps**.
+
+## Pre-flight checklist
+
+- [ ] Maintenance window announced in `#engineering` 24 h before. Public status page banner up : "Scheduled maintenance — _start_ to _end_ — possible brief disruptions." Cachet/statuspage.io component status set to `under maintenance` for the affected components.
+- [ ] On-call team aware. Pages from synthetic monitoring during the window are explicitly expected ; auto-resolve when the test container restarts.
+- [ ] PagerDuty `maintenance_mode` flag set so the test pages don't escalate.
+- [ ] **Backup confirmed fresh** : `pgbackrest --stanza=veza info` shows a successful backup ≤ 24 h old. The game day kills the primary ; we must be able to restore if `pg_auto_failover` doesn't promote.
+- [ ] **Snapshot of MinIO bucket count** : `mc ls --recursive veza-prod/veza-prod-tracks | wc -l` baseline, so the EC:2 reconstruction test can validate no data loss.
+- [ ] **Vault secrets exported** in the driver's shell :
+  - `REDIS_PASS` + `SENTINEL_PASS` (scenario C — `infra/ansible/group_vars/redis_ha.vault.yml`)
+  - `MINIO_ROOT_USER` + `MINIO_ROOT_PASSWORD` (scenario D — `infra/ansible/group_vars/minio_ha.vault.yml`)
+- [ ] Driver script ready : `bash scripts/security/game-day-driver.sh` against prod inventory.
+- [ ] Roll-forward plan in hand : if any scenario fails closed (no auto-recovery), the operator runs the failed scenario's manual steps from the linked runbook and does NOT improvise.
+
+## Session log
+
+### Scenario A — Postgres primary failover (RTO < 60 s, prod)
+
+| Field             | Value                                                                       |
+| ----------------- | --------------------------------------------------------------------------- |
+| Timestamp UTC     | _to fill_ |
+| Action            | `bash infra/ansible/tests/test_pg_failover.sh` (against prod inventory) |
+| Observation       | _to fill_ |
+| Runbook used      | [`db-failover.md`](../db-failover.md) |
+| Auto-recovery ?   | _yes / no — if no, what manual step did the operator run ?_ |
+| Gap discovered    | _to fill_ |
+
+### Scenario B — HAProxy backend-api 1 fail-over (prod)
+
+| Field             | Value                                                                       |
+| ----------------- | --------------------------------------------------------------------------- |
+| Timestamp UTC     | _to fill_ |
+| Action            | `bash infra/ansible/tests/test_backend_failover.sh` |
+| Observation       | _to fill : LB drain time, WS reconnect rate (Sentry frontend events)_ |
+| Runbook used      | `infra/ansible/roles/haproxy/README.md` (operations section) |
+| Auto-recovery ?   | _yes / no_ |
+| Gap discovered    | _to fill_ |
+
+### Scenario C — Redis Sentinel master promotion (prod)
+
+| Field             | Value                                                                       |
+| ----------------- | --------------------------------------------------------------------------- |
+| Timestamp UTC     | _to fill_ |
+| Action            | `REDIS_PASS=… SENTINEL_PASS=… bash infra/ansible/tests/test_redis_failover.sh` |
+| Observation       | _to fill : promotion time, chat WS reconnect impact_ |
+| Runbook used      | [`redis-down.md`](../redis-down.md) |
+| Auto-recovery ?   | _yes / no_ |
+| Gap discovered    | _to fill_ |
+
+### Scenario D — MinIO 2-node loss EC:2 reconstruction (prod)
+
+| Field             | Value                                                                       |
+| ----------------- | --------------------------------------------------------------------------- |
+| Timestamp UTC     | _to fill_ |
+| Action            | `MINIO_ROOT_USER=… MINIO_ROOT_PASSWORD=… KILL_NODES="minio-3 minio-4" bash infra/ansible/tests/test_minio_resilience.sh` |
+| Observation       | _to fill : checksum match across the down window, self-heal duration_ |
+| Runbook used      | `infra/ansible/roles/minio_distributed/README.md` |
+| Bucket count post-test | _to fill ; must equal pre-test baseline_ |
+| Auto-recovery ?   | _yes / no_ |
+| Gap discovered    | _to fill_ |
+
+### Scenario E — RabbitMQ outage backend stays up (prod)
+
+| Field             | Value                                                                       |
+| ----------------- | --------------------------------------------------------------------------- |
+| Timestamp UTC     | _to fill_ |
+| Action            | `OUTAGE_SECONDS=300 bash infra/ansible/tests/test_rabbitmq_outage.sh` |
+| Observation       | _to fill : max consecutive 5xx streak, eventbus error log lines, dropped events_ |
+| Runbook used      | _gap from W5 day 22 ; if not yet written, write it now_ |
+| Auto-recovery ?   | _yes / no_ |
+| Gap discovered    | _to fill_ |
+
+## Bonus — canary deploy v2.0.0-rc1 + 4 h soak
+
+After scenarios A-E pass clean, the same session continues into the **canary deploy of v2.0.0-rc1** :
+
+| Field                     | Value                                                                       |
+| ------------------------- | --------------------------------------------------------------------------- |
+| Timestamp UTC start       | _to fill_ |
+| Artifact pushed           | `_/path/to/veza-api-v2.0.0-rc1_` |
+| Pre-deploy hook result    | _PASS / FAIL — see scripts/check-migration-backward-compat.sh output_ |
+| Canary node               | `backend-api-2` |
+| Drain duration            | _seconds_ |
+| Per-node health check     | _passed at t+Ns_ |
+| LB-side health check      | _200 / other_ |
+| SLI window                | `SLI_WINDOW=14400` (4 h soak per the roadmap acceptance, longer than the default 1 h) |
+| First red probe at        | _none / t+Ns ; what tripped : p95 / err rate_ |
+| Roll to peer (backend-api-1) | _PASS / FAIL_ |
+| Final state               | _both nodes on v2.0.0-rc1 / one rolled back / fully rolled back_ |
+
+The 4 h soak is deliberately longer than the canary script's default 1 h. It catches slow-leak regressions that don't surface in the first hour (memory leak, file-handle exhaustion, Redis connection pool drift).
+
+## Acceptance gate
+
+The W6 game day #2 acceptance is stricter than W5 :
+
+- [ ] Every scenario passed without operator intervention beyond the documented runbook step.
+- [ ] No silent fail.
+- [ ] Max consecutive 5xx run during any scenario ≤ 30 s.
+- [ ] Every Prometheus alert fired ≤ 1 min after the inducing event.
+- [ ] **Canary deploy of v2.0.0-rc1 reached fully-deployed state with SLI green for 4 h.**
+- [ ] No new gap surfaced compared to W5 game day #1. (New gaps mean the W5 fixes weren't comprehensive.)
+
+If any box is unchecked at end-of-session, the W6 GO/NO-GO row "Game day #2 prod : 5 scenarios green" stays 🟡 PENDING and the v2.0.0 launch slips.
+
+## Internal announcement (post-session)
+
+Once the canary soak is green :
+
+> **Subject** : prod is on v2.0.0-rc1 + game day #2 passed
+>
+> Prod is now serving v2.0.0-rc1. Soak window started `` and held green for 4 h (SLI p95 < `<...>`s, error rate < `<...>`%). Game day #2 ran the W5 5-scenario battery on prod ; every scenario auto-recovered without operator intervention. Soft launch beta tomorrow. Public launch on track for ``.
+>
+> Linked artefacts : this session doc + `docs/RELEASE_NOTES_V2.0.0_RC1.md` for the change list.
+
+Post in `#engineering` ; do NOT post publicly until Day 30.
+
+## Linked artefacts
+
+- W5 game day #1 session : [`2026-W5-game-day-1.md`](./2026-W5-game-day-1.md) — diff for what's new
+- Game day driver : `scripts/security/game-day-driver.sh`
+- Canary release recipe : [`../../CANARY_RELEASE.md`](../../CANARY_RELEASE.md)
+- Release notes : [`../../RELEASE_NOTES_V2.0.0_RC1.md`](../../RELEASE_NOTES_V2.0.0_RC1.md)
+- W6 GO/NO-GO checklist : [`../../GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md`](../../GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md)
+
+## Take-aways
+
+_Free-form notes. What surprised us, what we'd change for game day #3, what graduated from "implicit knowledge" to a runbook entry._
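
Note on the scenario D bucket-count check: the session doc asks for a pre-test baseline from `mc ls --recursive veza-prod/veza-prod-tracks | wc -l` and a post-test count that must equal it. The comparison could be scripted roughly as below. This is a minimal sketch, not part of the patch: `check_bucket_count` is a hypothetical helper name, and it is an assumption that the driver would compare the two counts as plain integers.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): compare the pre-test object count
# with the post-test count and report any drift.
check_bucket_count() {
  local baseline="$1" post="$2"
  if [ "$baseline" -eq "$post" ]; then
    echo "OK: object count unchanged ($post)"
    return 0
  fi
  # A non-zero delta needs a human look before it is logged as data loss.
  echo "MISMATCH: baseline=$baseline post=$post delta=$((post - baseline))"
  return 1
}

# Intended use during the session (bucket path taken from the pre-flight checklist):
#   baseline=$(mc ls --recursive veza-prod/veza-prod-tracks | wc -l)
#   ... run scenario D ...
#   post=$(mc ls --recursive veza-prod/veza-prod-tracks | wc -l)
#   check_bucket_count "$baseline" "$post"
```

Because the session runs on prod with public traffic, an upload or deletion landing between the two counts also changes the total, so a mismatch is a prompt to investigate and not proof of a failed reconstruction.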