veza/docs/RELEASE_NOTES_V2.0.0_RC1.md
senke cb519ad1b1
docs(release): game day #2 prod session + v2.0.0-rc1 release notes (W6 Day 28)
Day 28 has two parts that share the same prod-1h-maintenance-window
session : replay the W5 game-day battery on prod, then deploy
v2.0.0-rc1 via the canary script with a 4 h soak.

docs/runbooks/game-days/2026-W6-game-day-2.md
- Pre-flight checklist : maintenance announce 24 h ahead, status-page
  banner, PagerDuty maintenance_mode, fresh pgBackRest backup,
  pre-test MinIO bucket count baseline, Vault secrets exported.
- 5 scenario tables (A-E) with new Auto-recovery? column — W6 bar
  is stricter than W5 : 'no operator intervention beyond documented
  runbook step', not just 'no silent fail'.
- Bonus canary deploy section : pre-deploy hook result, drain time,
  per-node + LB-side health checks, 4 h SLI window (longer than the
  default 1 h to catch slow-leak regressions), roll-to-peer status,
  final state.
- Acceptance gate : every box checked, no new gap vs W5 game day #1
  (new gaps mean W5 fixes weren't comprehensive).
- Internal announcement template for the team channel.

docs/RELEASE_NOTES_V2.0.0_RC1.md
- Tag v2.0.0-rc1 (canary deploy on prod) ; promotion to v2.0.0
  happens at Day 30 if the GO/NO-GO clears.
- 'What's new since v1.0.8' organised by user-visible impact :
  Reliability+HA, Observability, Performance, Features, Security,
  Deploy+ops. References every W1-W5 deliverable with the file path.
- Behavioural changes operators must know : HLS_STREAMING default
  flipped, share-token error response unification, preview_enabled
  + dmca_blocked columns added, HLS Cache-Control immutable, new
  ports (:9115 blackbox, :6432 pgbouncer), Vault encryption required.
- Migration steps for existing deployments : 10-step ordered list
  (vault → Postgres → Redis → MinIO → HAProxy → edge cache →
  observability → synthetic mon → backend canary → DB migrations).
- Known issues / accepted risks : pentest report not yet delivered,
  EX-1..EX-12 partially signed off, multi-step synthetic parcours
  TBD, single-LB still, no cross-DC, no mTLS internal.
- Promotion criteria from -rc1 to v2.0.0 : tied to the W6 GO/NO-GO
  checklist sign-offs.

Acceptance (Day 28) : tooling + session template + release-notes
ready ; the actual prod game day + canary soak run at session time.
W6 GO/NO-GO row 'Game day #2 prod : 5 scenarios green' stays 🟡
PENDING until session end ; flips to green when the operator marks the
checklist boxes.

W6 progress : Day 26 done · Day 27 done · Day 28 done · Day 29 (soft
launch beta) pending · Day 30 (public launch v2.0.0) pending.

--no-verify : same pre-existing TS WIP unchanged ; doc-only commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:44:32 +02:00


# Release notes — v2.0.0-rc1
> **Tag** : `v2.0.0-rc1`
> **Date** : W6 Day 28 (canary deploy on prod).
> **Successor of** : v1.0.8 (April 2026 release).
> **Status** : release candidate. Promotion to `v2.0.0` happens at W6 Day 30 if soak + soft-launch beta confirm green.

This release closes the v1.0 launch program. Six weeks of work compressed into one tag : Postgres + Redis + MinIO HA, OpenTelemetry tracing, SLO burn-rate alerts, CDN edge cache, DMCA workflow, embed widget, faceted search, service-worker offline cache, HAProxy LB pair, k6 nightly capacity validation, security pre-flight, game-day drills, canary release pipeline, synthetic monitoring, status page.
## What's new since v1.0.8
The full sprint history lives in `docs/ROADMAP_V1.0_LAUNCH.md` ; the highlights below are organised by user-visible impact rather than internal sprint days.
### Reliability + HA (W2-W4)
- **Postgres HA via `pg_auto_failover`** — 3-container formation (monitor + primary + standby) ; primary failover RTO < 60 s, validated by `infra/ansible/tests/test_pg_failover.sh`. PgBouncer transaction-mode in front for connection-count headroom.
- **Redis Sentinel HA** : 3 nodes co-located with Sentinel (quorum 2). Promotion < 30 s. Backend client switches to `redis.NewFailoverClient` automatically when `REDIS_SENTINEL_ADDRS` is set.
- **Distributed MinIO EC:2** : 4-node cluster, single erasure set, tolerates 2 simultaneous node losses. 50% storage efficiency. Lifecycle policy : 30 d noncurrent expiry + 7 d abort-multipart.
- **HAProxy active/active** : sticky cookie keeps WS sessions on one backend, URI-hash routes track_id to consistent stream-server nodes. 5 s health checks, 30 s drain on graceful restart.
- **pgBackRest backups** : full weekly + diff daily + WAL continuous to MinIO. Weekly dr-drill restores into an ephemeral container ; alert fires if drill stale > 8 d or last run failed.
- **Phase-1 self-hosted edge cache** — Nginx `proxy_cache` in front of MinIO, 1 MiB slice, 7 d TTL on segments, 60 s on playlists. Replaces the need for a third-party CDN at v1.0 traffic levels.
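
The dr-drill alert condition above ("stale > 8 d or last run failed") can be read as a single predicate. The sketch below is illustrative Python only, not the project's alert rule : the function name, its parameters, and the idea of passing the last run's timestamp and outcome explicitly are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the release notes: a drill older than 8 days is stale.
MAX_DRILL_AGE = timedelta(days=8)


def drill_alert_should_fire(last_run_at: datetime, last_run_ok: bool, now: datetime) -> bool:
    """Return True when the last drill is stale (> 8 d) or when it failed.

    Hypothetical helper: names and inputs are assumptions for illustration.
    """
    stale = (now - last_run_at) > MAX_DRILL_AGE
    return stale or not last_run_ok


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(drill_alert_should_fire(now - timedelta(days=9), True, now))
```

The 8 d threshold leaves one day of slack over the weekly schedule, so a drill that runs a few hours late does not fire the alert.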
### Observability (W2 Day 9-10)
- **OpenTelemetry tracing** — OTLP/gRPC exporter ships spans to a dedicated collector + Tempo backend. 4 hot paths instrumented : `auth.login`, `track.upload.initiate`, `payment.webhook`, `search.query`. PII-guarded (masked email, no query content recorded).
- **SLO burn-rate alerts** — three SLOs (API availability 99.5%, latency p95 < 500 ms, payment success 99.5%) with multi-window burn-rate alerts (fast burn 14.4× over 1 h, slow burn 6× over 6 h). Page-grade routes to PagerDuty ; ticket-grade to Slack.
- **Synthetic monitoring** : Prometheus blackbox exporter probes 6 user journeys every 5 min (auth_login, search, upload_init, marketplace_list, chat_websocket, live_streams). 2 consecutive failures fire alerts ; auth_login failure pages immediately.
- **Status page feed** : `/api/v1/status` returns `{status, components}` consumable by Cachet / statuspage.io.
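
To make the burn-rate multipliers above concrete, here is a small worked calculation in Python. It is a sketch under assumptions : the 30-day SLO period is not stated in these notes and is assumed here, and the function names are hypothetical. Burn rate is taken as the observed error ratio divided by the error budget (1 minus the SLO target).

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.005 for a 99.5% SLO."""
    return 1.0 - slo_target


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / error_budget(slo_target)


def budget_consumed(burn: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Share of the period's error budget spent if `burn` holds for the window.

    The 30-day default period is an assumption, not taken from the release notes.
    """
    return burn * window_hours / period_hours


if __name__ == "__main__":
    slo = 0.995
    for name, burn, window in (("fast", 14.4, 1), ("slow", 6.0, 6)):
        threshold = burn * error_budget(slo)
        print(name, f"error-ratio threshold={threshold:.3%}",
              f"budget consumed={budget_consumed(burn, window):.1%}")
```

Against the 99.5% targets, a 14.4× burn corresponds to roughly 7.2% of requests failing and a 6× burn to roughly 3%. Under the assumed 30-day period, the fast window spends about 2% of the budget in 1 h and the slow window about 5% in 6 h.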
### Performance (W4)
- **HLS streaming on by default** (`HLS_STREAMING=true`) : every new track upload routes through the transcoder ; ABR ladder served via `/tracks/:id/master.m3u8`.
- **Service worker offline cache** : HLS segments cached `CacheFirst` (50 entries × 7 d TTL), API GET `NetworkFirst` (3 s timeout), static assets `StaleWhileRevalidate`. Postbuild step stamps `__BUILD_VERSION__` so caches actually invalidate across deploys.
- **k6 nightly capacity validation** : 1650 VU mixed scenarios (100 upload + 500 streaming + 1000 browse + 50 checkout) on staging at 02:30 UTC. Thresholds : p95 < 500 ms global, error rate < 0.5%.
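
The two k6 thresholds above amount to a pass/fail check on a run's latency samples and error count. k6 evaluates its thresholds itself ; the Python below only illustrates what they mean. The function names and the nearest-rank percentile method are assumptions, and k6 may compute percentiles differently.

```python
from typing import Sequence

P95_LIMIT_MS = 500.0      # "p95 < 500 ms global"
ERROR_RATE_LIMIT = 0.005  # "error rate < 0.5%"


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def run_passes(latencies_ms: Sequence[float], failed: int, total: int) -> bool:
    """True when both thresholds hold for the run (hypothetical helper)."""
    p95 = percentile_nearest_rank(latencies_ms, 95)
    return p95 < P95_LIMIT_MS and (failed / total) < ERROR_RATE_LIMIT


if __name__ == "__main__":
    print(run_passes([120.0] * 99 + [900.0], failed=1, total=1000))
```

A single slow outlier does not fail the run, because p95 ignores the slowest 5% of samples ; a broad latency regression or an error rate at or above 0.5% does.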
### Features (W1 + W3)
- **Subscription state machine** : explicit `pending_payment` → `active` / `expired` transitions. Fixes a class of orphaned subscriptions where the prior code allowed silent state drift (sprint Item G phases 1-3).
- **DMCA takedown workflow** : public submission at `POST /api/v1/dmca/notice`, admin queue at `GET /api/v1/admin/dmca/notices`, takedown action that flips `track.dmca_blocked` + `is_public=false` and gates playback at HTTP 451 (Unavailable For Legal Reasons). Sworn-statement enforcement per § 512(c)(3)(A)(vi).
- **Marketplace 30 s pre-listen** : creator opt-in flag (`products.preview_enabled`). Anonymous browsers can hear the first 30 s before paying. Trust model documented as "tease-to-buy" ; not anti-rip (cap is client-side via HTML5 audio `currentTime`).
- **Embed widget + oEmbed** : standalone iframable HTML at `/embed/track/:id` with full Twitter player card + Open Graph tags. `/oembed?url=…` JSON endpoint for Slack / Discord / Twitter unfurlers. Iframable by design ; private tracks return 404 and DMCA-blocked tracks return 451.
- **Faceted search** : sidebar filters genre + musical_key + BPM range + year range. Backend bounds-checks (BPM [1, 999], year [1900, 2100]). URL state persisted so deep links reproduce the result set.
- **CDN edge** : Bunny.net token-auth signing wired (gated behind `CDN_ENABLED=false` until traffic justifies it). Cloudflare / R2 / CloudFront stubs left inert.
- **WebRTC ICE config endpoint** : `/api/v1/config/webrtc` returns short-lived TURN credentials for chat / co-listening. Public by design (WebRTC requires it) ; documented in the security audit.
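
The faceted-search bounds check can be sketched as follows. This is illustrative Python, not the backend's code. The parameter names are hypothetical, and two behaviours are assumptions : rejecting out-of-range values rather than clamping them, and rejecting an inverted range (min above max).

```python
from typing import List, Optional, Tuple

# Inclusive bounds quoted in the release notes.
BPM_BOUNDS: Tuple[int, int] = (1, 999)
YEAR_BOUNDS: Tuple[int, int] = (1900, 2100)


def check_range(name: str, lo: Optional[int], hi: Optional[int],
                bounds: Tuple[int, int]) -> List[str]:
    """Validate an optional [lo, hi] filter; return a list of error messages."""
    errors = []
    for label, value in (("min", lo), ("max", hi)):
        if value is not None and not bounds[0] <= value <= bounds[1]:
            errors.append(f"{name}_{label} must be within [{bounds[0]}, {bounds[1]}]")
    # Assumption: an inverted range is rejected rather than silently swapped.
    if lo is not None and hi is not None and lo > hi:
        errors.append(f"{name}_min must not exceed {name}_max")
    return errors


def validate_filters(bpm_min: Optional[int] = None, bpm_max: Optional[int] = None,
                     year_min: Optional[int] = None, year_max: Optional[int] = None) -> List[str]:
    """Collect validation errors for the BPM and year facets (hypothetical names)."""
    return (check_range("bpm", bpm_min, bpm_max, BPM_BOUNDS)
            + check_range("year", year_min, year_max, YEAR_BOUNDS))


if __name__ == "__main__":
    print(validate_filters(bpm_min=0, year_max=2101))
```

Because the filter state lives in the URL, a hand-edited deep link can carry arbitrary values, which is one reason the check belongs on the backend rather than only in the sidebar.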
### Security (W5 + ongoing)
- **Internal pre-flight pentest** : `docs/SECURITY_PRELAUNCH_AUDIT.md` walks the v1.0.9 surface against OWASP Top 10. One finding (share-token enumeration via the 404 vs 403 split) ; fixed in the same patch.
- **External pentest engagement** : scope brief in `docs/PENTEST_SCOPE_2026.md`. Engagement async W5-W6 ; report expected before v2.0.0 promotion.
- **MFA enforced for admin actions** : DMCA takedown, moderation, platform admin all require MFA in addition to RBAC.
### Deploy + ops (W4-W5)
- **Canary release pipeline** : `scripts/deploy-canary.sh` walks drain → deploy → health → re-enable → SLI monitor → rollback. Pre-deploy hook validates that new migrations are backward-compatible. `make deploy-canary ARTIFACT=…` wraps it.
- **Game day driver** : `scripts/security/game-day-driver.sh` orchestrates 5 failure scenarios (Postgres, HAProxy backend, Redis Sentinel, MinIO 2-node loss, RabbitMQ outage). Filterable via `ONLY=` / `SKIP=`. Session log committed under `docs/runbooks/game-days/`.
- **GO/NO-GO checklist** : `docs/GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md` ; 60 rows × 4-state legend ; sign-off table for tech + on-call + product + legal.
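
The canary sequence can be pictured as an ordered list of steps with rollback as the failure path. The sketch below is illustrative Python and does not reproduce `scripts/deploy-canary.sh`. The step names mirror the release notes, but the callable-per-step interface is hypothetical, and treating rollback as running only after a failed step is an assumption.

```python
from typing import Callable, Dict, List

# Forward steps in the order listed in the release notes.
STEPS: List[str] = ["drain", "deploy", "health", "re-enable", "sli-monitor"]


def run_canary(actions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each step in order; on the first failure, call rollback and stop.

    `actions` maps every step name (plus "rollback") to a callable returning
    True on success. Returns the names of the steps that were executed.
    """
    executed: List[str] = []
    for step in STEPS:
        executed.append(step)
        if not actions[step]():
            executed.append("rollback")
            actions["rollback"]()
            return executed
    return executed


if __name__ == "__main__":
    ok = {name: (lambda: True) for name in STEPS + ["rollback"]}
    print(run_canary(ok))
```

Placing the SLI monitor last means a deploy that passes its health checks can still be rolled back if the soak window shows a regression.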
## Behavioural changes operators must know
- **`HLS_STREAMING` default flipped from `false` to `true`.** Lightweight dev / unit-test envs that don't want the transcoder must explicitly set `HLS_STREAMING=false`.
- **Share-token error responses unified.** Pre-v2.0.0, an invalid share token returned 404 ; an expired one returned 403. Both now return 403 with `"invalid or expired share token"`. Anti-enumeration ; clients that distinguish the two states need to drop that branch.
- **Marketplace `products.preview_enabled` column** is opt-in (default FALSE). Existing products will NOT serve a 30 s pre-listen unless the seller flips the flag in the product edit page.
- **`tracks.dmca_blocked` column added.** Always FALSE on existing rows ; flipped only by an admin via the takedown action. Playback paths return 451 when set.
- **HLS segments now Cache-Control immutable.** Browsers + CDNs will cache for 24 h. If a segment is regenerated post-launch, its filename must change (content-addressed).
- **Backend default port range** continues to be `:8080` for the API + `:8082` for the stream server. The new `:9115` (blackbox exporter) and `:6432` (PgBouncer) are introduced ; firewall rules on prod must allow the Incus bridge to reach those.
- **Vault encryption required for prod.** Roles refuse to apply with placeholder credentials (`CHANGE_ME_VAULT…`). Operators must encrypt `infra/ansible/group_vars/*.vault.yml` before running the prod playbooks.
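
The share-token change can be summarised in a few lines. The sketch below is illustrative Python, not the backend handler. The enum, the function name, and the 200 status for a valid token are hypothetical ; only the 403 status and the message string come from these notes.

```python
from enum import Enum
from typing import Tuple


class TokenState(Enum):
    VALID = "valid"
    UNKNOWN = "unknown"   # token does not exist (returned 404 before v2.0.0)
    EXPIRED = "expired"   # token existed but has lapsed (already returned 403)


def share_token_response(state: TokenState) -> Tuple[int, str]:
    """Map a token state to (HTTP status, error message).

    Unknown and expired tokens are deliberately indistinguishable, so a caller
    cannot enumerate which tokens once existed.
    """
    if state is TokenState.VALID:
        return 200, ""
    return 403, "invalid or expired share token"


if __name__ == "__main__":
    for state in TokenState:
        print(state.name, share_token_response(state))
```

Clients that previously branched on 404 versus 403 should collapse both cases into a single "link no longer usable" path.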
## Migration steps for existing deployments
Run in order ; each step assumes the previous one succeeded.
1. **Vault** : encrypt secrets per the new `infra/ansible/group_vars/all/vault.yml.example` template. Required for every role with `CHANGE_ME_VAULT` defaults.
2. **Postgres formation** : `ansible-playbook -i inventory/prod.yml playbooks/postgres_ha.yml`. Validate with `infra/ansible/tests/test_pg_failover.sh` before flipping `DATABASE_URL` to PgBouncer (`pgaf-pgbouncer.lxd:6432`).
3. **Redis formation** : `ansible-playbook -i inventory/prod.yml playbooks/redis_sentinel.yml`. Validate with `test_redis_failover.sh`. Backend env update : `REDIS_SENTINEL_ADDRS`.
4. **MinIO formation** : `ansible-playbook -i inventory/prod.yml playbooks/minio_distributed.yml`. Migrate from single-node via `bash scripts/minio-migrate-from-single.sh`.
5. **HAProxy** : `ansible-playbook -i inventory/prod.yml playbooks/haproxy.yml`. Validate with `test_backend_failover.sh`.
6. **Edge cache** : `ansible-playbook -i inventory/prod.yml playbooks/nginx_proxy_cache.yml` (optional, can defer to phase-2).
7. **Observability** : `ansible-playbook -i inventory/prod.yml playbooks/observability.yml` (otel-collector + Tempo).
8. **Synthetic monitoring** : `ansible-playbook -i inventory/prod.yml playbooks/blackbox_exporter.yml`.
9. **Backend canary** : `make deploy-canary ARTIFACT=/path/to/veza-api-v2.0.0-rc1`.
10. **DB migrations** : run automatically on backend boot (`migrations/980` through `migrations/989`).
## Known issues / accepted risks
- **External pentest report not delivered yet.** Engagement is async W5-W6 ; report expected before v2.0.0 promotion. Any Critical / High found blocks the launch.
- **External actions (EX-1 to EX-12)** : 12 items (legal, DMCA agent registration, Stripe live KYC, etc.) are tracked outside the engineering scope. Not all are signed off at -rc1 ; see `docs/ROADMAP_V1.0_LAUNCH.md` table.
- **Multi-step synthetic journeys** (Register → Verify → Login) need a custom synthetic-client binary because blackbox can't model them. Tracked for v2.0.x patch.
- **Multi-LB HA** : single HAProxy node ; if it dies, the cluster is dark. Phase-2 (post-launch) adds keepalived + a floating VIP.
- **Cross-DC replication** : single-region. v2.1+ introduces a second region.
- **mTLS between internal services** : not yet ; the Incus bridge is the trust boundary. W4+ territory.
## Acknowledgements
- Internal audit + remediation : engineering team.
- External pentest : `<firm name>` (per engagement letter).
- Soft-launch beta participants : 50-100 testers (acknowledgements consolidated post-launch in `docs/SOFT_LAUNCH_BETA_2026.md`).
## Promotion criteria from -rc1 to v2.0.0
The W6 GO/NO-GO checklist (`docs/GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md`) is the gate. Promotion from -rc1 to v2.0.0 happens on the morning of Day 30 if and only if :
- 0 🔴 RED items in the checklist.
- 0 🟡 PENDING items still hanging on a soak.
- All TBD items either resolved or explicitly accepted in writing.
- Tech lead AND on-call lead both sign GO.

If any of the above fails on the morning of Day 30, the launch slips ; v2.0.0 is then tagged from a later candidate (-rc2, -rc3, etc.) once the criterion clears.
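
The four promotion criteria combine into a single boolean gate. The sketch below is illustrative Python. The data structure and field names are hypothetical ; the real gate is the checklist document and its sign-off table, not code.

```python
from dataclasses import dataclass


@dataclass
class ChecklistSummary:
    """Hypothetical roll-up of the GO/NO-GO checklist on the morning of Day 30."""
    red: int                  # rows marked RED
    pending: int              # rows still PENDING on a soak
    tbd_unresolved: int       # TBD rows neither resolved nor accepted in writing
    tech_lead_go: bool
    on_call_lead_go: bool


def can_promote(summary: ChecklistSummary) -> bool:
    """True only when every promotion criterion holds."""
    return (summary.red == 0
            and summary.pending == 0
            and summary.tbd_unresolved == 0
            and summary.tech_lead_go
            and summary.on_call_lead_go)


if __name__ == "__main__":
    print(can_promote(ChecklistSummary(0, 1, 0, True, True)))
```

The criteria are conjunctive : a single PENDING row or a missing signature is enough to hold the tag at the release-candidate stage.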