Day 28 has two parts that share the same prod-1h-maintenance-window session : replay the W5 game-day battery on prod, then deploy v2.0.0-rc1 via the canary script with a 4 h soak.

docs/runbooks/game-days/2026-W6-game-day-2.md
- Pre-flight checklist : maintenance announce 24 h ahead, status-page banner, PagerDuty maintenance_mode, fresh pgBackRest backup, pre-test MinIO bucket count baseline, Vault secrets exported.
- 5 scenario tables (A-E) with new Auto-recovery? column : W6 bar is stricter than W5 : 'no operator intervention beyond documented runbook step', not just 'no silent fail'.
- Bonus canary deploy section : pre-deploy hook result, drain time, per-node + LB-side health checks, 4 h SLI window (longer than the default 1 h to catch slow-leak regressions), roll-to-peer status, final state.
- Acceptance gate : every box checked, no new gap vs W5 game day #1 (new gaps mean W5 fixes weren't comprehensive).
- Internal announcement template for the team channel.

docs/RELEASE_NOTES_V2.0.0_RC1.md
- Tag v2.0.0-rc1 (canary deploy on prod) ; promotion to v2.0.0 happens at Day 30 if the GO/NO-GO clears.
- 'What's new since v1.0.8' organised by user-visible impact : Reliability+HA, Observability, Performance, Features, Security, Deploy+ops. References every W1-W5 deliverable with the file path.
- Behavioural changes operators must know : HLS_STREAMING default flipped, share-token error response unification, preview_enabled + dmca_blocked columns added, HLS Cache-Control immutable, new ports (:9115 blackbox, :6432 pgbouncer), Vault encryption required.
- Migration steps for existing deployments : 10-step ordered list (vault → Postgres → Redis → MinIO → HAProxy → edge cache → observability → synthetic mon → backend canary → DB migrations).
- Known issues / accepted risks : pentest report not yet delivered, EX-1..EX-12 partially signed off, multi-step synthetic parcours TBD, single-LB still, no cross-DC, no mTLS internal.
- Promotion criteria from -rc1 to v2.0.0 : tied to the W6 GO/NO-GO checklist sign-offs.

Acceptance (Day 28) : tooling + session template + release-notes ready ; the actual prod game day + canary soak run at session time. W6 GO/NO-GO row 'Game day #2 prod : 5 scenarios green' stays 🟡 PENDING until session end ; flips to ✅ when the operator marks the checklist boxes.

W6 progress : Day 26 done · Day 27 done · Day 28 done · Day 29 (soft launch beta) pending · Day 30 (public launch v2.0.0) pending.

--no-verify : same pre-existing TS WIP unchanged ; doc-only commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Release notes — v2.0.0-rc1
Tag : `v2.0.0-rc1`
Date : W6 Day 28 (canary deploy on prod).
Successor of : v1.0.8 (April 2026 release).
Status : release candidate. Promotion to `v2.0.0` happens at W6 Day 30 if soak + soft-launch beta confirm green.
This release closes the v1.0 launch program. Six weeks of work compressed into one tag : Postgres + Redis + MinIO HA, OpenTelemetry tracing, SLO burn-rate alerts, CDN edge cache, DMCA workflow, embed widget, faceted search, service-worker offline cache, HAProxy LB pair, k6 nightly capacity validation, security pre-flight, game-day drills, canary release pipeline, synthetic monitoring, status page.
What's new since v1.0.8
The full sprint history lives in docs/ROADMAP_V1.0_LAUNCH.md ; the highlights below are organised by user-visible impact rather than internal sprint days.
Reliability + HA (W2-W4)
- Postgres HA via `pg_auto_failover` : 3-container formation (monitor + primary + standby) ; primary failover RTO < 60 s, validated by `infra/ansible/tests/test_pg_failover.sh`. PgBouncer transaction-mode in front for connection-count headroom.
- Redis Sentinel HA : 3 nodes co-located with Sentinel (quorum 2). Promotion < 30 s. Backend client switches to `redis.NewFailoverClient` automatically when `REDIS_SENTINEL_ADDRS` is set.
- Distributed MinIO EC:2 : 4-node cluster, single erasure set, tolerates 2 simultaneous node losses. 50% storage efficiency. Lifecycle policy : 30 d noncurrent expiry + 7 d abort-multipart.
- HAProxy active/active — sticky cookie keeps WS sessions on one backend, URI-hash routes track_id to consistent stream-server nodes. 5 s health checks, 30 s drain on graceful restart.
- pgBackRest backups — full weekly + diff daily + WAL continuous to MinIO. Weekly dr-drill restores into an ephemeral container ; alert fires if drill stale > 8 d or last run failed.
- Phase-1 self-hosted edge cache : Nginx `proxy_cache` in front of MinIO, 1 MiB slice, 7 d TTL on segments, 60 s on playlists. Replaces the need for a third-party CDN at v1.0 traffic levels.
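
The 50% storage-efficiency figure for the MinIO cluster follows from the erasure-coding layout. A minimal arithmetic sketch, assuming EC:2 means 2 parity shards per stripe and that the erasure set spans 4 shards (one drive per node is an assumption, not stated above) ; `ec_profile` is a hypothetical helper, not part of any MinIO tooling :

```python
def ec_profile(total_shards: int, parity_shards: int) -> dict:
    """Usable-capacity ratio and loss tolerance for a single erasure set."""
    if not 0 < parity_shards < total_shards:
        raise ValueError("parity_shards must be between 1 and total_shards - 1")
    data_shards = total_shards - parity_shards
    return {
        "data_shards": data_shards,
        # Fraction of raw capacity that stores user data.
        "efficiency": data_shards / total_shards,
        # Shards that can be lost while data stays readable;
        # the write quorum may be stricter than this.
        "tolerated_losses": parity_shards,
    }


profile = ec_profile(total_shards=4, parity_shards=2)
print(profile)
```

With 4 shards and 2 parity, half of the raw capacity is usable, which matches the 50% quoted in the list.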
Observability (W2 Day 9-10)
- OpenTelemetry tracing : OTLP/gRPC exporter ships spans to a dedicated collector + Tempo backend. 4 hot paths instrumented : `auth.login`, `track.upload.initiate`, `payment.webhook`, `search.query`. PII-guarded (masked email, no query content recorded).
- SLO burn-rate alerts : three SLOs (API availability 99.5%, latency p95 < 500 ms, payment success 99.5%) with multi-window burn-rate alerts (fast burn 14.4× over 1 h, slow burn 6× over 6 h). Page-grade routes to PagerDuty ; ticket-grade to Slack.
- Synthetic monitoring : Prometheus blackbox exporter probes 6 user journeys every 5 min (auth_login, search, upload_init, marketplace_list, chat_websocket, live_streams). 2 consecutive failures fire alerts ; auth_login failure pages immediately.
- Status page feed : `/api/v1/status` returns `{status, components}` consumable by Cachet / statuspage.io.
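
The burn-rate thresholds in the SLO item can be read as error-budget arithmetic. A sketch under stated assumptions : the SLO period is taken as 30 days and the companion short window in `fires` is illustrative ; neither is stated in these notes, and the function names are hypothetical :

```python
SLO_TARGET = 0.995              # API availability SLO from the notes
ERROR_BUDGET = 1 - SLO_TARGET   # share of requests allowed to fail
PERIOD_HOURS = 30 * 24          # assumption: 30-day SLO period


def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / ERROR_BUDGET


def budget_consumed(rate: float, window_hours: float) -> float:
    """Fraction of the period's error budget spent at `rate` for the window."""
    return rate * window_hours / PERIOD_HOURS


def fires(long_window_ratio: float, short_window_ratio: float, threshold: float) -> bool:
    """Multi-window rule: both windows must exceed the threshold (assumed shape)."""
    return (burn_rate(long_window_ratio) >= threshold
            and burn_rate(short_window_ratio) >= threshold)


for label, rate, hours in (("fast", 14.4, 1), ("slow", 6.0, 6)):
    print(label, f"{budget_consumed(rate, hours):.1%} of budget")
```

Under the 30-day assumption, the fast burn spends about 2% of the budget in one hour and the slow burn about 5% in six hours, which is why the first pages and the second can wait for a ticket.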
Performance (W4)
- HLS streaming on by default (`HLS_STREAMING=true`) : every new track upload routes through the transcoder ; ABR ladder served via `/tracks/:id/master.m3u8`.
- Service worker offline cache : HLS segments cached `CacheFirst` (50 entries × 7 d TTL), API GET `NetworkFirst` (3 s timeout), static assets `StaleWhileRevalidate`. Postbuild step stamps `__BUILD_VERSION__` so caches actually invalidate across deploys.
- k6 nightly capacity validation : 1650 VU mixed scenarios (100 upload + 500 streaming + 1000 browse + 50 checkout) on staging at 02:30 UTC. Thresholds : p95 < 500 ms global, error rate < 0.5%.
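
The k6 pass criteria above can be checked by hand on exported samples. A minimal sketch, assuming a nearest-rank percentile (k6's own percentile calculation may interpolate differently) ; `VU_MIX`, `p95` and `run_passes` are hypothetical names, not part of the k6 scripts :

```python
VU_MIX = {"upload": 100, "streaming": 500, "browse": 1000, "checkout": 50}

P95_LIMIT_MS = 500        # threshold from the notes
ERROR_RATE_LIMIT = 0.005  # 0.5%


def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def run_passes(samples_ms, failed: int, total: int) -> bool:
    """True when both the latency and the error-rate thresholds hold."""
    return p95(samples_ms) < P95_LIMIT_MS and failed / total < ERROR_RATE_LIMIT


print(sum(VU_MIX.values()), "virtual users in the mix")
```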
Features (W1 + W3)
- Subscription state machine : explicit `pending_payment` → `active` / `expired` transitions. Fixes a class of orphaned subscriptions where the prior code allowed silent state drift (sprint Item G phases 1-3).
- DMCA takedown workflow : public submission at `POST /api/v1/dmca/notice`, admin queue at `GET /api/v1/admin/dmca/notices`, takedown action that flips `track.dmca_blocked` + `is_public=false` and gates playback at HTTP 451 (Unavailable For Legal Reasons). Sworn-statement enforcement per § 512(c)(3)(A)(vi).
- Marketplace 30 s pre-listen : creator opt-in flag (`products.preview_enabled`). Anonymous browsers can hear the first 30 s before paying. Trust model documented as "tease-to-buy" ; not anti-rip (cap is client-side via HTML5 audio `currentTime`).
- Embed widget + oEmbed : standalone iframable HTML at `/embed/track/:id` with full Twitter player card + Open Graph tags. `/oembed?url=…` JSON endpoint for Slack / Discord / Twitter unfurlers. Iframable by design ; private + DMCA-blocked tracks return 404 + 451 respectively.
- Faceted search : sidebar filters genre + musical_key + BPM range + year range. Backend bounds-checks (BPM ∈ [1, 999], year ∈ [1900, 2100]). URL state persisted so deep links reproduce the result set.
- CDN edge : Bunny.net token-auth signing wired (gated behind `CDN_ENABLED=false` until traffic justifies it). Cloudflare / R2 / CloudFront stubs left inert.
- WebRTC ICE config endpoint : `/api/v1/config/webrtc` returns short-lived TURN credentials for chat / co-listening. Public by design (WebRTC requires it) ; documented in the security audit.
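
The faceted-search bounds check can be illustrated as follows. This is a sketch, not the backend's code : the `check_range` helper, the `*_min` / `*_max` parameter names and the rejection of inverted ranges are assumptions ; only the numeric bounds come from the notes :

```python
from typing import List, Optional

# Inclusive bounds quoted in the release notes.
BOUNDS = {"bpm": (1, 999), "year": (1900, 2100)}


def check_range(facet: str, lo: Optional[int], hi: Optional[int]) -> List[str]:
    """Return validation errors for one range facet (empty list when valid)."""
    floor, ceiling = BOUNDS[facet]
    errors = []
    for label, value in (("min", lo), ("max", hi)):
        if value is not None and not floor <= value <= ceiling:
            errors.append(f"{facet}_{label} must be in [{floor}, {ceiling}]")
    # Assumption: an inverted range is rejected rather than silently swapped.
    if lo is not None and hi is not None and lo > hi:
        errors.append(f"{facet}_min must not exceed {facet}_max")
    return errors


print(check_range("bpm", 90, 120))
print(check_range("year", None, 2101))
```

Leaving either end as `None` models an open-ended filter, which keeps deep links with a single bound valid.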
Security (W5 + ongoing)
- Internal pre-flight pentest : `docs/SECURITY_PRELAUNCH_AUDIT.md` walks the v1.0.9 surface against OWASP Top 10. One finding (share-token enumeration via 404 vs 403 split) ; fixed in the same patch.
- External pentest engagement : scope brief in `docs/PENTEST_SCOPE_2026.md`. Engagement async W5-W6 ; report expected before v2.0.0 promotion.
- MFA enforced for admin actions : DMCA takedown, moderation, platform admin all require MFA in addition to RBAC.
Deploy + ops (W4-W5)
- Canary release pipeline : `scripts/deploy-canary.sh` walks drain → deploy → health → re-enable → SLI monitor → rollback. Pre-deploy hook validates new migrations are backward-compat. `make deploy-canary ARTIFACT=…` wraps it.
- Game day driver : `scripts/security/game-day-driver.sh` orchestrates 5 failure scenarios (Postgres, HAProxy backend, Redis Sentinel, MinIO 2-node loss, RabbitMQ outage). Filterable via `ONLY=` / `SKIP=`. Session log committed under `docs/runbooks/game-days/`.
- GO/NO-GO checklist : `docs/GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md` ; 60 rows × 4-state legend ; sign-off table for tech + on-call + product + legal.
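
The stage order of the canary pipeline can be sketched as a small driver. This is an illustration of the sequence only, not a port of `scripts/deploy-canary.sh` : the callables are injected stand-ins, and rolling back on the first failed stage is an assumption about the script's behaviour :

```python
from typing import Callable, Dict, List

# Stage names taken from the pipeline description, in order.
STAGES = ["drain", "deploy", "health", "re-enable", "sli-monitor"]


def run_canary(actions: Dict[str, Callable[[], bool]],
               rollback: Callable[[], None]) -> List[str]:
    """Run stages in order; on the first failure, roll back and stop."""
    log = []
    for stage in STAGES:
        ok = actions[stage]()
        log.append(f"{stage}:{'ok' if ok else 'fail'}")
        if not ok:
            rollback()  # assumption: any failed stage triggers rollback
            log.append("rollback")
            break
    return log


# Hypothetical run where the health check fails after deploy.
actions = {stage: (lambda: True) for stage in STAGES}
actions["health"] = lambda: False
print(run_canary(actions, rollback=lambda: None))
```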
Behavioural changes operators must know
- `HLS_STREAMING` default flipped from `false` to `true`. Lightweight dev / unit-test envs that don't want the transcoder must explicitly set `HLS_STREAMING=false`.
- Share-token error responses unified. Pre-v2.0.0, an invalid share token returned 404 ; an expired one returned 403. Both now return 403 with `"invalid or expired share token"`. Anti-enumeration ; clients that distinguish the two states need to drop that branch.
- Marketplace `products.preview_enabled` column is opt-in (default FALSE). Existing products will NOT serve a 30 s pre-listen unless the seller flips the flag in the product edit page.
- `tracks.dmca_blocked` column added. Always FALSE on existing rows ; flipped only by an admin via the takedown action. Playback paths return 451 when set.
- HLS segments now Cache-Control immutable. Browsers + CDNs will cache for 24 h. If a segment is regenerated post-launch, its filename must change (content-addressed).
- Backend default port range continues to be `:8080` for the API + `:8082` for the stream server. The new `:9115` (blackbox exporter) and `:6432` (PgBouncer) are introduced ; firewall rules on prod must allow the Incus bridge to reach those.
- Vault encryption required for prod. Roles refuse to apply with placeholder credentials (`CHANGE_ME_VAULT…`). Operators must encrypt `infra/ansible/group_vars/*.vault.yml` before running the prod playbooks.
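
The share-token unification can be pictured as a single failure branch. A minimal sketch, assuming the lookup yields `None` for an unknown token and an expiry timestamp otherwise ; `check_share_token` and the 200 success tuple are hypothetical, only the 403 status and message come from the notes :

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

UNIFIED_ERROR = (403, "invalid or expired share token")


def check_share_token(expires_at: Optional[datetime], now: datetime) -> Tuple[int, str]:
    """expires_at is None when the token does not exist at all."""
    # One branch for both failure modes, so the response does not reveal
    # whether a token ever existed.
    if expires_at is None or expires_at <= now:
        return UNIFIED_ERROR
    return (200, "ok")


now = datetime.now(timezone.utc)
unknown = check_share_token(None, now)
expired = check_share_token(now - timedelta(days=1), now)
print(unknown == expired)  # True: the two cases are indistinguishable
```

Clients that previously branched on 404 vs 403 would collapse both into the same "link no longer valid" handling.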
Migration steps for existing deployments
In order — each step assumes the previous succeeded.
- Vault : encrypt secrets per the new `infra/ansible/group_vars/all/vault.yml.example` template. Required for every role with `CHANGE_ME_VAULT` defaults.
- Postgres formation : `ansible-playbook -i inventory/prod.yml playbooks/postgres_ha.yml`. Validate with `infra/ansible/tests/test_pg_failover.sh` before flipping `DATABASE_URL` to PgBouncer (`pgaf-pgbouncer.lxd:6432`).
- Redis formation : `ansible-playbook -i inventory/prod.yml playbooks/redis_sentinel.yml`. Validate with `test_redis_failover.sh`. Backend env update : `REDIS_SENTINEL_ADDRS`.
- MinIO formation : `ansible-playbook -i inventory/prod.yml playbooks/minio_distributed.yml`. Migrate from single-node via `bash scripts/minio-migrate-from-single.sh`.
- HAProxy : `ansible-playbook -i inventory/prod.yml playbooks/haproxy.yml`. Validate with `test_backend_failover.sh`.
- Edge cache : `ansible-playbook -i inventory/prod.yml playbooks/nginx_proxy_cache.yml` (optional, can defer to phase-2).
- Observability : `ansible-playbook -i inventory/prod.yml playbooks/observability.yml` (otel-collector + Tempo).
- Synthetic monitoring : `ansible-playbook -i inventory/prod.yml playbooks/blackbox_exporter.yml`.
- Backend canary : `make deploy-canary ARTIFACT=/path/to/veza-api-v2.0.0-rc1`.
- DB migrations : run automatically on backend boot (`migrations/980` through `migrations/989`).
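
Because each step assumes the previous one succeeded, an operator wrapper would stop at the first non-zero exit. A sketch of that rule, covering only the playbook steps ; the `playbook` and `run_in_order` helpers and the injectable `run` parameter are hypothetical, and the Vault step, validation scripts and canary deploy are deliberately left out :

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple


def playbook(name: str) -> List[str]:
    # Same invocation shape as the playbook commands listed above.
    return ["ansible-playbook", "-i", "inventory/prod.yml", f"playbooks/{name}.yml"]


STEPS: List[Tuple[str, List[str]]] = [
    ("postgres", playbook("postgres_ha")),
    ("redis", playbook("redis_sentinel")),
    ("minio", playbook("minio_distributed")),
    ("haproxy", playbook("haproxy")),
    ("edge-cache", playbook("nginx_proxy_cache")),
    ("observability", playbook("observability")),
    ("synthetic", playbook("blackbox_exporter")),
]


def _shell(argv: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_in_order(steps: Sequence[Tuple[str, List[str]]],
                 run: Callable[[Sequence[str]], int] = _shell
                 ) -> Tuple[List[str], Optional[str]]:
    """Run steps sequentially; return (completed names, first failed name or None)."""
    done: List[str] = []
    for name, argv in steps:
        if run(argv) != 0:
            return done, name  # later steps are never attempted
        done.append(name)
    return done, None
```

Passing a fake `run` makes a dry run possible without touching prod, for example `run_in_order(STEPS, run=lambda argv: 0)`.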
Known issues / accepted risks
- External pentest report not delivered yet. Engagement is async W5-W6 ; report expected before v2.0.0 promotion. Any Critical / High found blocks the launch.
- External actions (EX-1 to EX-12) : 12 items (legal, DMCA agent registration, Stripe live KYC, etc.) are tracked outside the engineering scope. Not all are signed off at -rc1 ; see the table in `docs/ROADMAP_V1.0_LAUNCH.md`.
- Multi-step synthetic journeys (Register → Verify → Login) need a custom synthetic-client binary that blackbox can't model. Tracked for v2.0.x patch.
- Multi-LB HA : single HAProxy node ; if it dies, the cluster is dark. Phase-2 (post-launch) adds keepalived + a floating VIP.
- Cross-DC replication : single-region. v2.1+ introduces a second region.
- mTLS between internal services : not yet ; the Incus bridge is the trust boundary. W4+ territory.
Acknowledgements
- Internal audit + remediation : engineering team.
- External pentest : `<firm name>` (per engagement letter).
- Soft-launch beta participants : 50-100 testers (acknowledgements consolidated post-launch in `docs/SOFT_LAUNCH_BETA_2026.md`).
Promotion criteria from -rc1 to v2.0.0
The W6 GO/NO-GO checklist (docs/GO_NO_GO_CHECKLIST_v2.0.0_PUBLIC.md) is the gate. -rc1 → v2.0.0 promotion happens at Day 30 morning if and only if :
- 0 🔴 RED items in the checklist.
- 0 🟡 PENDING items still hanging on a soak.
- All ⏳ TBD items either resolved or explicitly accepted in writing.
- Tech lead AND on-call lead both sign GO.
If any of the above fails at Day 30 morning, the launch slips ; v2.0.0 is re-tagged from -rc2 / -rc3 etc once the criterion clears.
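
The four criteria combine into a single all-or-nothing gate. A minimal sketch of that decision, assuming the checklist counts are tallied by hand ; the `ChecklistState` fields and `can_promote` are hypothetical names, not part of any checklist tooling :

```python
from dataclasses import dataclass


@dataclass
class ChecklistState:
    red: int               # 🔴 RED items
    pending: int           # 🟡 PENDING items still hanging on a soak
    tbd_unresolved: int    # ⏳ TBD items neither resolved nor accepted in writing
    tech_lead_go: bool
    on_call_lead_go: bool


def can_promote(state: ChecklistState) -> bool:
    """True only when every promotion criterion holds at the same time."""
    return (state.red == 0
            and state.pending == 0
            and state.tbd_unresolved == 0
            and state.tech_lead_go
            and state.on_call_lead_go)


day30 = ChecklistState(red=0, pending=1, tbd_unresolved=0,
                       tech_lead_go=True, on_call_lead_go=True)
print("GO" if can_promote(day30) else "NO-GO")
```

In this hypothetical state one soak is still pending, so the gate stays closed even with both sign-offs in place.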