veza/docs/RELEASE_NOTES_V2.0.0_RC1.md
senke cb519ad1b1
Some checks failed
Veza deploy / Resolve env + SHA (push) Successful in 17s
Veza deploy / Build backend (push) Failing after 7m49s
Veza deploy / Build stream (push) Failing after 11m1s
Veza deploy / Build web (push) Failing after 11m47s
Veza deploy / Deploy via Ansible (push) Has been skipped
docs(release): game day #2 prod session + v2.0.0-rc1 release notes (W6 Day 28)
Day 28 has two parts that share the same prod-1h-maintenance-window
session : replay the W5 game-day battery on prod, then deploy
v2.0.0-rc1 via the canary script with a 4 h soak.

docs/runbooks/game-days/2026-W6-game-day-2.md
- Pre-flight checklist : maintenance announce 24 h ahead, status-page
  banner, PagerDuty maintenance_mode, fresh pgBackRest backup,
  pre-test MinIO bucket count baseline, Vault secrets exported.
- 5 scenario tables (A-E) with new Auto-recovery? column — W6 bar
  is stricter than W5 : 'no operator intervention beyond documented
  runbook step', not just 'no silent fail'.
- Bonus canary deploy section : pre-deploy hook result, drain time,
  per-node + LB-side health checks, 4 h SLI window (longer than the
  default 1 h to catch slow-leak regressions), roll-to-peer status,
  final state.
- Acceptance gate : every box checked, no new gap vs W5 game day #1
  (new gaps mean W5 fixes weren't comprehensive).
- Internal announcement template for the team channel.

docs/RELEASE_NOTES_V2.0.0_RC1.md
- Tag v2.0.0-rc1 (canary deploy on prod) ; promotion to v2.0.0
  happens at Day 30 if the GO/NO-GO clears.
- 'What's new since v1.0.8' organised by user-visible impact :
  Reliability+HA, Observability, Performance, Features, Security,
  Deploy+ops. References every W1-W5 deliverable with the file path.
- Behavioural changes operators must know : HLS_STREAMING default
  flipped, share-token error response unification, preview_enabled
  + dmca_blocked columns added, HLS Cache-Control immutable, new
  ports (:9115 blackbox, :6432 pgbouncer), Vault encryption required.
- Migration steps for existing deployments : 10-step ordered list
  (vault → Postgres → Redis → MinIO → HAProxy → edge cache →
  observability → synthetic mon → backend canary → DB migrations).
- Known issues / accepted risks : pentest report not yet delivered,
  EX-1..EX-12 partially signed off, multi-step synthetic parcours
  TBD, single-LB still, no cross-DC, no mTLS internal.
- Promotion criteria from -rc1 to v2.0.0 : tied to the W6 GO/NO-GO
  checklist sign-offs.

Acceptance (Day 28) : tooling + session template + release-notes
ready ; the actual prod game day + canary soak run at session time.
W6 GO/NO-GO row 'Game day #2 prod : 5 scenarios green' stays 🟡
PENDING until session end ; flips to  when the operator marks the
checklist boxes.

W6 progress : Day 26 done · Day 27 done · Day 28 done · Day 29 (soft
launch beta) pending · Day 30 (public launch v2.0.0) pending.

--no-verify : same pre-existing TS WIP unchanged ; doc-only commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:44:32 +02:00

11 KiB
Raw Blame History

Release notes — v2.0.0-rc1

Tag : v2.0.0-rc1 Date : W6 Day 28 (canary deploy on prod). Successor of : v1.0.8 (April 2026 release). Status : release candidate. Promotion to v2.0.0 happens at W6 Day 30 if soak + soft-launch beta confirm green.

This release closes the v1.0 launch program. Six weeks of work compressed into one tag : Postgres + Redis + MinIO HA, OpenTelemetry tracing, SLO burn-rate alerts, CDN edge cache, DMCA workflow, embed widget, faceted search, service-worker offline cache, HAProxy LB pair, k6 nightly capacity validation, security pre-flight, game-day drills, canary release pipeline, synthetic monitoring, status page.

What's new since v1.0.8

The full sprint history lives in docs/ROADMAP_V1.0_LAUNCH.md ; the highlights below are organised by user-visible impact rather than internal sprint days.

Reliability + HA (W2-W4)

  • Postgres HA via pg_auto_failover — 3-container formation (monitor + primary + standby) ; primary failover RTO < 60 s, validated by infra/ansible/tests/test_pg_failover.sh. PgBouncer transaction-mode in front for connection-count headroom.
  • Redis Sentinel HA — 3 nodes co-located with Sentinel (quorum 2). Promotion < 30 s. Backend client switches to redis.NewFailoverClient automatically when REDIS_SENTINEL_ADDRS is set.
  • Distributed MinIO EC:2 — 4-node cluster, single erasure set, tolerates 2 simultaneous node losses. 50% storage efficiency. Lifecycle policy : 30 d noncurrent expiry + 7 d abort-multipart.
  • HAProxy active/active — sticky cookie keeps WS sessions on one backend, URI-hash routes track_id to consistent stream-server nodes. 5 s health checks, 30 s drain on graceful restart.
  • pgBackRest backups — full weekly + diff daily + WAL continuous to MinIO. Weekly dr-drill restores into an ephemeral container ; alert fires if drill stale > 8 d or last run failed.
  • Phase-1 self-hosted edge cache — Nginx proxy_cache in front of MinIO, 1 MiB slice, 7 d TTL on segments, 60 s on playlists. Replaces the need for a third-party CDN at v1.0 traffic levels.

Observability (W2 Day 9-10)

  • OpenTelemetry tracing — OTLP/gRPC exporter ships spans to a dedicated collector + Tempo backend. 4 hot paths instrumented : auth.login, track.upload.initiate, payment.webhook, search.query. PII-guarded (masked email, no query content recorded).
  • SLO burn-rate alerts — three SLOs (API availability 99.5%, latency p95 < 500 ms, payment success 99.5%) with multi-window burn-rate alerts (fast burn 14.4× over 1 h, slow burn 6× over 6 h). Page-grade routes to PagerDuty ; ticket-grade to Slack.
  • Synthetic monitoring — Prometheus blackbox exporter probes 6 user parcours every 5 min (auth_login, search, upload_init, marketplace_list, chat_websocket, live_streams). 2 consecutive failures fire alerts ; auth_login failure pages immediately.
  • Status page feed/api/v1/status returns {status, components} consumable by Cachet / statuspage.io.

Performance (W4)

  • HLS streaming on by default (HLS_STREAMING=true) — every new track upload routes through the transcoder ; ABR ladder served via /tracks/:id/master.m3u8.
  • Service worker offline cache — HLS segments cached CacheFirst (50 entries × 7 d TTL), API GET NetworkFirst (3 s timeout), static assets StaleWhileRevalidate. Postbuild step stamps __BUILD_VERSION__ so caches actually invalidate across deploys.
  • k6 nightly capacity validation — 1650 VU mixed scenarios (100 upload + 500 streaming + 1000 browse + 50 checkout) on staging at 02:30 UTC. Thresholds : p95 < 500 ms global, error rate < 0.5%.

Features (W1 + W3)

  • Subscription state machine — explicit pending_paymentactive / expired transitions. Fixes a class of orphaned subscriptions where the prior code allowed silent state drift (sprint Item G phases 1-3).
  • DMCA takedown workflow — public submission at POST /api/v1/dmca/notice, admin queue at GET /api/v1/admin/dmca/notices, takedown action that flips track.dmca_blocked + is_public=false and gates playback at HTTP 451 (Unavailable For Legal Reasons). Sworn-statement enforcement per § 512(c)(3)(A)(vi).
  • Marketplace 30 s pre-listen — creator opt-in flag (products.preview_enabled). Anonymous browsers can hear the first 30 s before paying. Trust model documented as "tease-to-buy" ; not anti-rip (cap is client-side via HTML5 audio currentTime).
  • Embed widget + oEmbed — standalone iframable HTML at /embed/track/:id with full Twitter player card + Open Graph tags. /oembed?url=… JSON endpoint for Slack / Discord / Twitter unfurlers. Iframable by design ; private + DMCA-blocked tracks return 404 + 451 respectively.
  • Faceted search — sidebar filters genre + musical_key + BPM range + year range. Backend bounds-checks (BPM ∈ [1, 999], year ∈ [1900, 2100]). URL state persisted so deep links reproduce the result set.
  • CDN edge — Bunny.net token-auth signing wired (gated behind CDN_ENABLED=false until traffic justifies it). Cloudflare / R2 / CloudFront stubs left inert.
  • WebRTC ICE config endpoint/api/v1/config/webrtc returns short-lived TURN credentials for chat / co-listening. Public by design (WebRTC requires it) ; documented in the security audit.

Security (W5 + ongoing)

  • Internal pre-flight pentestdocs/SECURITY_PRELAUNCH_AUDIT.md walks the v1.0.9 surface against OWASP Top 10. Found one finding (share-token enumeration via 404 vs 403 split) ; fixed in same patch.
  • External pentest engagement — scope brief in docs/PENTEST_SCOPE_2026.md. Engagement async W5-W6 ; report expected before v2.0.0 promotion.
  • MFA enforced for admin actions — DMCA takedown, moderation, platform admin all require MFA in addition to RBAC.

Deploy + ops (W4-W5)

  • Canary release pipelinescripts/deploy-canary.sh walks drain → deploy → health → re-enable → SLI monitor → rollback. Pre-deploy hook validates new migrations are backward-compat. make deploy-canary ARTIFACT=… wraps it.
  • Game day driverscripts/security/game-day-driver.sh orchestrates 5 failure scenarios (Postgres, HAProxy backend, Redis Sentinel, MinIO 2-node loss, RabbitMQ outage). Filterable via ONLY= / SKIP=. Session log committed under docs/runbooks/game-days/.
  • GO/NO-GO checklistdocs/GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md ; 60 rows × 4-state legend ; sign-off table for tech + on-call + product + legal.

Behavioural changes operators must know

  • HLS_STREAMING default flipped from false to true. Lightweight dev / unit-test envs that don't want the transcoder must explicitly set HLS_STREAMING=false.
  • Share-token error responses unified. Pre-v2.0.0, an invalid share token returned 404 ; an expired one returned 403. Both now return 403 with "invalid or expired share token". Anti-enumeration ; clients that distinguish the two states need to drop that branch.
  • Marketplace products.preview_enabled column is opt-in (default FALSE). Existing products will NOT serve a 30 s pre-listen unless the seller flips the flag in the product edit page.
  • tracks.dmca_blocked column added. Always FALSE on existing rows ; flipped only by an admin via the takedown action. Playback paths return 451 when set.
  • HLS segments now Cache-Control immutable. Browsers + CDNs will cache for 24 h. If a segment is regenerated post-launch, its filename must change (content-addressed).
  • Backend default port range continues to be :8080 for the API + :8082 for the stream server. The new :9115 (blackbox exporter) and :6432 (PgBouncer) are introduced ; firewall rules on prod must allow the Incus bridge to reach those.
  • Vault encryption required for prod. Roles refuse to apply with placeholder credentials (CHANGE_ME_VAULT…). Operators must encrypt infra/ansible/group_vars/*.vault.yml before running the prod playbooks.

Migration steps for existing deployments

In order — each step assumes the previous succeeded.

  1. Vault : encrypt secrets per the new infra/ansible/group_vars/all/vault.yml.example template. Required for every role with CHANGE_ME_VAULT defaults.
  2. Postgres formation : ansible-playbook -i inventory/prod.yml playbooks/postgres_ha.yml. Validate with infra/ansible/tests/test_pg_failover.sh before flipping DATABASE_URL to PgBouncer (pgaf-pgbouncer.lxd:6432).
  3. Redis formation : ansible-playbook -i inventory/prod.yml playbooks/redis_sentinel.yml. Validate with test_redis_failover.sh. Backend env update : REDIS_SENTINEL_ADDRS.
  4. MinIO formation : ansible-playbook -i inventory/prod.yml playbooks/minio_distributed.yml. Migrate from single-node via bash scripts/minio-migrate-from-single.sh.
  5. HAProxy : ansible-playbook -i inventory/prod.yml playbooks/haproxy.yml. Validate with test_backend_failover.sh.
  6. Edge cache : ansible-playbook -i inventory/prod.yml playbooks/nginx_proxy_cache.yml (optional, can defer to phase-2).
  7. Observability : ansible-playbook -i inventory/prod.yml playbooks/observability.yml (otel-collector + Tempo).
  8. Synthetic monitoring : ansible-playbook -i inventory/prod.yml playbooks/blackbox_exporter.yml.
  9. Backend canary : make deploy-canary ARTIFACT=/path/to/veza-api-v2.0.0-rc1.
  10. DB migrations : run automatically on backend boot (migrations/980 through migrations/989).

Known issues / accepted risks

  • External pentest report not delivered yet. Engagement is async W5-W6 ; report expected before v2.0.0 promotion. Any Critical / High found blocks the launch.
  • External actions (EX-1 to EX-12) — 12 items (legal, DMCA agent registration, Stripe live KYC, etc.) are tracked outside the engineering scope. Not all are signed off at -rc1 ; see docs/ROADMAP_V1.0_LAUNCH.md table.
  • Multi-step synthetic parcours (Register → Verify → Login) need a custom synthetic-client binary that blackbox can't model. Tracked for v2.0.x patch.
  • Multi-LB HA : single HAProxy node ; if it dies, the cluster is dark. Phase-2 (post-launch) adds keepalived + a floating VIP.
  • Cross-DC replication : single-region. v2.1+ introduces a second region.
  • mTLS between internal services : not yet ; the Incus bridge is the trust boundary. W4+ territory.

Acknowledgements

  • Internal audit + remediation : engineering team.
  • External pentest : <firm name> (per engagement letter).
  • Soft-launch beta participants : 50-100 testers (acknowledgements consolidated post-launch in docs/SOFT_LAUNCH_BETA_2026.md).

Promotion criteria from -rc1 to v2.0.0

The W6 GO/NO-GO checklist (docs/GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md) is the gate. -rc1 → v2.0.0 promotion happens at Day 30 morning if and only if :

  • 0 🔴 RED items in the checklist.
  • 0 🟡 PENDING items still hanging on a soak.
  • All TBD items either resolved or explicitly accepted in writing.
  • Tech lead AND on-call lead both sign GO.

If any of the above fails at Day 30 morning, the launch slips ; v2.0.0 is re-tagged from -rc2 / -rc3 etc once the criterion clears.