Commit graph

2505 commits

senke
772799582d feat(legal): cookie banner + privacy page (ePrivacy/RGPD consent gate)
v1.0.10 legal item 1. The privacy policy at /legal/privacy now exists
and the SPA shows a non-modal banner asking for consent before any
optional cookie is set. Complies with ePrivacy + CNIL guidance.

Banner (apps/web/src/components/CookieBanner.tsx):
  * Bottom strip, NOT a modal — public pages remain browsable while
    the user is undecided. Only optional-cookie-using surfaces gate
    on consent.
  * Two equal-weight buttons: "Refuser le non-essentiel" and "Tout
    accepter" (refuse non-essential / accept all). No dark patterns /
    nudging — both actions are full size, both have visible borders.
  * Choice persisted in localStorage as
    `{ choice: 'all'|'essential', timestamp: ISO8601 }`.
  * Auto-expires after 13 months (CNIL guidance) — the next visit
    after expiry re-shows the banner.
  * Custom event `veza:cookie-consent-changed` fires on decision so
    analytics wiring can react without polling.
  * Three exported helpers: readCookieConsent (sync) / useCookieConsent
    (React hook) / resetCookieConsent (revoke from settings page).
    Co-located with the banner because they're contextually inseparable
    — splitting would obscure the contract.
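The persistence contract above can be sketched as follows. The storage key
name, the expiry arithmetic, and the function body are assumptions for
illustration; only the `{ choice, timestamp }` shape, the 13-month window,
and the readCookieConsent name come from this commit.

```typescript
// Sketch of the consent read path; the real CookieBanner.tsx may differ.
type CookieChoice = 'all' | 'essential';
interface StoredConsent { choice: CookieChoice; timestamp: string }
interface ConsentStore { getItem(key: string): string | null }

const CONSENT_KEY = 'veza:cookie-consent';        // assumed key name
const MAX_AGE_MS = 13 * 30 * 24 * 60 * 60 * 1000; // ~13 months (CNIL)

function readCookieConsent(
  store: ConsentStore,
  now: number = Date.now(),
): CookieChoice | null {
  const raw = store.getItem(CONSENT_KEY);
  if (raw === null) return null;                  // never asked → show banner
  try {
    const saved = JSON.parse(raw) as StoredConsent;
    const age = now - Date.parse(saved.timestamp);
    if (!(age >= 0 && age < MAX_AGE_MS)) return null; // expired → re-prompt
    return saved.choice === 'all' ? 'all' : 'essential';
  } catch {
    return null;                                  // corrupted entry → undecided
  }
}
```

Note the three distinct "undecided" paths (missing, expired, corrupted) all
collapse to null, which is what lets the banner re-show itself safely.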

Privacy page (apps/web/src/features/legal/pages/PrivacyPage.tsx):
  * Public minimalist privacy notice — what data, why, how long, who
    we share with, RGPD rights.
  * Hosts the cookie controls: shows the current choice + a "modifier"
    (change) button that calls resetCookieConsent() to re-prompt.
  * Cross-links DMCA / CGU / mentions (the latter two will land in
    legal item 3).
  * To be reviewed by counsel before v2.0.0 — text is honest baseline,
    not finalised legal copy.

Wired:
  * Lazy-component registry (lazyExports.ts + index.ts + LazyComponent
    facade)
  * Public route /legal/privacy in routeConfig.tsx
  * <CookieBanner /> mounted in App.tsx after AppRouter so it overlays
    every screen including the landing page
  * Lint baseline holds at 754 (the 6 unavoidable warnings —
    react-refresh on co-located helpers + native <button> on a
    standalone-bundle-required component — are suppressed inline with
    specific reasons)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:33:19 +02:00
senke
0f36f9eb2c fix(ansible): postgres binds 0.0.0.0 + pg_hba bridge subnet + force re-extract
The migrate Phase A pg_isready ran 12 retries against
veza-staging-postgres.lxd:5432 — DNS resolved fine but TCP got "no
response". Cause: PostgreSQL's default listen_addresses is
'localhost', so the bridge-side connection was getting refused.

deploy_data.yml Configure-postgres play now:
* Sets listen_addresses = '*' via lineinfile
* Adds an ANSIBLE-managed pg_hba.conf block: `host veza veza
  10.0.20.0/24 scram-sha-256` so app containers on net-veza can
  authenticate
* Notifies a Restart postgresql handler, then flush_handlers before
  the readiness probe so wait_for sees the new bind address
* wait_for now probes 0.0.0.0:5432 instead of 127.0.0.1:5432 to
  prove the network listen took effect
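As a sketch, those tasks might look like this in deploy_data.yml — task
names, config-file paths, and the handler name are assumptions; the
listen_addresses value, the pg_hba line, and the flush_handlers-before-probe
ordering come from the commit:

```yaml
- name: PostgreSQL listens on all interfaces
  ansible.builtin.lineinfile:
    path: /etc/postgresql/16/main/postgresql.conf   # assumed path
    regexp: '^#?\s*listen_addresses'
    line: "listen_addresses = '*'"
  notify: Restart postgresql

- name: Allow app containers on the bridge subnet
  ansible.builtin.blockinfile:
    path: /etc/postgresql/16/main/pg_hba.conf       # assumed path
    marker: "# {mark} ANSIBLE MANAGED BLOCK"
    block: host veza veza 10.0.20.0/24 scram-sha-256
  notify: Restart postgresql

# Restart now so the readiness probe sees the new bind address
- ansible.builtin.meta: flush_handlers

- name: Prove the network listen took effect
  ansible.builtin.wait_for:
    host: 0.0.0.0
    port: 5432
```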

Also: deploy_app.yml Phase A wipes /opt/veza/migrate before the
unarchive — the previous run had `creates: migrate_tool` which
skipped extraction when an older binary was already there, meaning
a stale migrator could end up running against the current DB. Now
every SHA gets a fresh extract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:32:30 +02:00
senke
d728ebed39 fix(deploy): make migrate Phase A debuggable + defensive haproxy bootstrap
The previous run's migrate_tool failure was opaque because no_log: true
hid the only diagnostic. Restructured Phase A's migration step:

* Pre-flight pg_isready probe waits up to 60s (12 × 5s) for postgres
  to be reachable from the tools container — DNS / network failures
  now surface with a clear retry log instead of dying inside migrate.

* DATABASE_URL replaced with individual DB_HOST/DB_PORT/DB_USER/
  DB_PASSWORD/DB_NAME env vars to dodge any URL-encoding edge case
  on the auto-generated password (`@` / `:` / `?` would all break
  the connection string).

* Password is staged into /tmp/migrate.env (no_log: true on that
  task only), the migrate_tool run sources the file but keeps its
  stdout/stderr fully visible. Output is then echoed via debug
  unconditionally, file is shredded, and a separate fail-task asserts
  rc=0 — so any future migrate failure has a visible message.
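A hedged sketch of that staging pattern — variable names and the binary
path are assumptions; /tmp/migrate.env, the single no_log task, the
unconditional debug, and the separate rc assert come from the commit:

```yaml
- name: Stage DB password out of the log stream
  ansible.builtin.copy:
    dest: /tmp/migrate.env
    mode: "0600"
    content: "DB_PASSWORD={{ veza_db_password }}"  # assumed var name
  no_log: true                                     # the ONLY hidden task

- name: Run migrate_tool with fully visible output
  ansible.builtin.shell: |
    set -a; . /tmp/migrate.env; set +a
    /opt/veza/migrate/migrate_tool                 # assumed invocation
  register: migrate_run
  failed_when: false                               # asserted separately below

- ansible.builtin.debug:
    var: migrate_run.stdout_lines                  # always echoed

- ansible.builtin.file:
    path: /tmp/migrate.env
    state: absent

- name: Fail loudly if the migration failed
  ansible.builtin.assert:
    that: migrate_run.rc == 0
```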

Also: defensive python3 bootstrap on the haproxy container in Phase
B. The container should already have python from haproxy.yml setup,
but a fresh-from-scratch deploy that skipped that bootstrap would
otherwise fail silently here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:21:09 +02:00
senke
88cfe77ad0 fix(deploy): pass FORGEJO_REGISTRY_URL to ansible + skip cert validation
Two artifact-fetch problems in one shot:

1. URL mismatch (404). Builds pushed to $REGISTRY_URL =
   https://10.0.20.105:3000/api/packages/senke/generic, but ansible
   was reading `veza_artifact_base_url` from group_vars/all/main.yml
   which still pointed at https://forgejo.talas.group/api/packages/talas/generic
   — different namespace AND host. Workflow now passes
   `-e veza_artifact_base_url=$REGISTRY_URL` to both ansible-playbook
   invocations so build + deploy share one source of truth.

2. Internal Forgejo on 10.0.20.105:3000 serves a self-signed cert,
   which would have tripped get_url's TLS validation right after the
   URL mismatch was fixed. Both `Fetch backend tarball` (Phase A) and
   `roles/veza_app/tasks/artifact.yml` (Phase C) now use
   `validate_certs: "{{ veza_artifact_validate_certs | default(false) }}"` —
   flip to true once the registry has a public CA cert.
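The fetch task would look roughly like this — the URL layout and dest are
illustrative; only the validate_certs expression comes from the commit:

```yaml
- name: Fetch backend tarball
  ansible.builtin.get_url:
    url: "{{ veza_artifact_base_url }}/backend/{{ veza_version }}/backend.tar.zst"  # illustrative layout
    dest: /opt/veza/backend.tar.zst
    validate_certs: "{{ veza_artifact_validate_certs | default(false) }}"
```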

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:03:31 +02:00
senke
02258fc69d fix(deploy): bootstrap python everywhere fresh containers are created
Phase A and Phase C in deploy_app.yml both launch fresh debian/13
containers and immediately try to use ansible.builtin.* modules,
which all need python3 on the target. Three places to fix:

1. Phase A "install backend artifact" play (veza_app_backend_tools):
   added a raw bootstrap-python3 step before the apt deps task.

2. veza_app role (used by Phase C blue/green plays for backend, stream,
   web): added the same raw bootstrap + a setup module call to gather
   facts now that python is available, between Load-vars and the
   container.yml include. ansible_date_time used downstream needs
   gathered facts.

3. Phase F textfile_collector path: /var/lib/node_exporter doesn't
   exist on the runner. Defensively mkdir + failed_when: false on the
   metric write so a missing exporter doesn't fail the deploy.

Plus: drop actions/upload-artifact entirely from deploy / rollback /
cleanup-failed. v4 hits GHESNotSupportedError, v3 hits self-signed
cert through 5 retries (1m20s wasted per run). Stream the logs to
stdout via `cat` inside ::group:: blocks — discoverable in run UI,
no flaky network call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:52:59 +02:00
senke
f4ea6ad124 fix(ansible): install python3-requests on rabbitmq container
community.rabbitmq.* modules (used for vhost/user mgmt) shell out to
rabbitmqadmin via HTTP, which requires the `requests` library on the
target. The fresh rabbitmq container only has python3 from the
bootstrap play.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:32:24 +02:00
senke
5e87fcff63 fix(ci): pin actions/upload-artifact to v3 (Forgejo GHES compat)
Forgejo's act_runner identifies as GHES, which makes
actions/upload-artifact@v4+ bail with GHESNotSupportedError. v3 still
works. Applied to deploy.yml, cleanup-failed.yml, rollback.yml. Also
added continue-on-error to the deploy log upload — losing the forensic
artifact shouldn't fail the deploy itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:22:51 +02:00
senke
fff661e9f9 fix(ansible): add PGDG repo to get postgresql-16 on Debian 13
images:debian/13 (trixie) ships PostgreSQL 17 in its default repos but
the project is pinned on PG 16. Added the PostgreSQL Global Development
Group apt repo (apt.postgresql.org/trixie-pgdg) with its signing key,
which carries every supported major version including 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:10:21 +02:00
senke
2cda36ba02 fix(ansible): point incus connection at local remote, not srv-102v
community.general.incus tasks failed with "Error: The remote
'srv-102v' doesn't exist" because the inventory's default
veza_incus_remote_name=srv-102v is the operator-laptop alias.

The runner reaches the host's incus daemon via a mounted unix
socket — that's the `local` remote from its POV. Set
veza_incus_remote_name: local in both staging and prod group_vars.

Operator-laptop deploys can still override on the CLI:
    ansible-playbook ... -e veza_incus_remote_name=srv-102v

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:55:35 +02:00
senke
0af0a88f6d fix(ansible): newer ansible-core via pipx + raw-bootstrap python on targets
Two blockers after the runner gained incus admin and started reaching
the new data containers:

1. Debian apt's ansible-core (2.14) is below community.general's
   minimum, which logged "Collection community.general does not
   support Ansible version 2.14.18". runner-bake-deps.sh now installs
   ansible-core via pipx (latest stable) plus the required collections
   (community.general, community.postgresql, ansible.posix).

2. images:debian/13 — what the data containers are launched from —
   ships without python3, so every module call to a freshly-launched
   container hit "Failed to create temporary directory" / UNREACHABLE.
   Added a single bootstrap play (`hosts: veza_data`) that uses the
   raw module to install python3 + python3-apt before any other
   Configure-X play touches the targets.
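A sketch of that bootstrap play — the task name and apt flags are
assumptions; the raw-module approach and the `hosts: veza_data` target
come from the commit:

```yaml
- name: Bootstrap python on fresh debian/13 containers
  hosts: veza_data
  gather_facts: false          # fact gathering itself needs python
  tasks:
    - name: Install python3 via raw (works without python on target)
      ansible.builtin.raw: |
        command -v python3 >/dev/null 2>&1 || \
          (apt-get update -qq && apt-get install -y -qq python3 python3-apt)
      changed_when: false
```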

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:14:05 +02:00
senke
c7649c5aa4 feat(bootstrap): grant runner real incus admin via privilege + idmap
deploy_data/deploy_app plays now run from INSIDE the forgejo-runner
container with ansible_connection=local, but the unprivileged runner's
root user (mapped to a high host UID) was being rejected by the incus
daemon — "You don't have the needed permissions to talk to the incus
daemon".

runner-grant-incus.sh: privileged + nesting + raw.idmap="both 0 0"
so root inside the runner = root on the host. The mounted incus socket
becomes fully usable. One-shot script; idempotent.

Threat-model note in the script header: we accept this because the
deploy workflow already has incus-admin scope via socket+nesting, and
the trigger surface is gated to push:main + workflow_dispatch (no fork
PRs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:53:18 +02:00
senke
dbae788911 fix(ansible): probe zfs once + gate snapshot/prune via when
Inline `if ! command -v zfs` blocks tripped Ansible's argument splitter
("unbalanced jinja2 block or quotes") — likely the parens-and-em-dash
combo inside double quotes. Replaced with a clean approach: probe zfs
once at the start of the play, set a fact, gate the snapshot + prune
tasks with `when: zfs_present`.
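The probe-once-then-gate shape sketched out — task names and the dataset
variable are assumptions; the set_fact + `when: zfs_present` pattern comes
from the commit:

```yaml
- name: Probe for zfs once
  ansible.builtin.shell: command -v zfs   # shell, not command: "command -v" is a builtin
  register: zfs_probe
  changed_when: false
  failed_when: false

- ansible.builtin.set_fact:
    zfs_present: "{{ zfs_probe.rc == 0 }}"

- name: Snapshot before deploy
  ansible.builtin.command: zfs snapshot {{ veza_zfs_dataset }}@pre-deploy  # assumed dataset var
  when: zfs_present
```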

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:44:33 +02:00
senke
8245ebfb07 fix(ansible): use local connection on incus_hosts + skip zfs gracefully
The forgejo runner lives inside the forgejo-runner Incus container with
the host's incus socket mounted in. From inside, the operator-side SSH
alias `srv-102v` doesn't resolve — Ansible's first task tried to ssh
and bailed with UNREACHABLE.

Switching the incus host entry to `ansible_connection: local` is sound
because every incus_hosts task only invokes the `incus` CLI, which
talks to the daemon over the mounted socket. No SSH-into-host needed.

ZFS snapshot/prune plays still need real ZFS on the host, which the
runner doesn't have — wrapped them in `command -v zfs` so they no-op
on the runner instead of erroring. The snapshot is a safety net, not
a correctness gate; for full safety run deploy_data.yml from the
operator laptop with --vault-password-file.

Same change applied to inventory/prod.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:31:04 +02:00
senke
efb8146ec5 fix(web): repair stray CSS at index.css:154 breaking vite build
A leftover `--sumi-accent-hover` declaration + closing brace was
hanging outside any selector after the [data-contrast="high"] light
block. PostCSS choked on the orphan `}`. Folded the declaration into
the light high-contrast block where it belongs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:06:09 +02:00
senke
cd14ca467f fix(web): drop unused bundlesize devDep + restore devDeps install
vite + typescript are devDeps but required at build time, so the
NODE_ENV=production hack from earlier broke build-spa with
"vite: not found". Reverting to a normal devDeps install.

The reason we omitted devDeps in the first place was bundlesize@0.18.2
pulling iltorb (deprecated native node-gyp module that doesn't build on
Node 20). bundlesize was declared in apps/web devDependencies but
nothing actually invokes it — pure dead weight, removed.

deploy.yml: dropped NODE_ENV=production, dropped the husky shim, kept
--ignore-scripts (we don't need git hooks during deploy) plus HUSKY=0
as belt-and-braces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:53:10 +02:00
senke
9ddb366a3e fix(deploy): shim husky binary so workspace prepare scripts no-op
`npm ci --ignore-scripts` skips top-level lifecycle scripts but npm 10
still executes workspace `prepare` hooks during the linking phase.
apps/web's prepare = "husky" was tripping the install with exit 127
because husky is a devDep we deliberately don't install in deploy.

Putting a /bin/sh shim that exits 0 on PATH before `npm ci` makes the
prepare call a no-op without touching package.json.
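In workflow terms the shim amounts to something like this — step names and
the shim directory are assumptions; the exit-0 /bin/sh shim on PATH before
`npm ci` is what the commit describes:

```yaml
- name: Shim husky so workspace prepare hooks no-op
  run: |
    mkdir -p "$RUNNER_TEMP/shims"
    printf '#!/bin/sh\nexit 0\n' > "$RUNNER_TEMP/shims/husky"
    chmod +x "$RUNNER_TEMP/shims/husky"
    echo "$RUNNER_TEMP/shims" >> "$GITHUB_PATH"

- name: Install
  run: npm ci --ignore-scripts
```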

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:38:20 +02:00
senke
6c6f2d87fc fix(stream): vendor openssl for musl cross-compile + bake perl on runner
build-stream was failing on openssl-sys because the runner has glibc
libssl-dev but cargo cross-compiles to x86_64-unknown-linux-musl.
Adding `openssl = { features = ["vendored"] }` as a direct dep forces
openssl-src to build OpenSSL from source against musl; the vendored
feature then unifies through reqwest's native-tls and any other
openssl-sys consumer.

The vendored build needs perl + make at compile time — added them to
runner-bake-deps.sh. The runner already has build-essential for the C
compiler.

Note: the build-web "husky: not found" error in the same run looks
like a re-run of an old SHA — main has carried `npm ci --ignore-scripts`
since d243c2e2. A fresh workflow_dispatch should clear it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:05:00 +02:00
senke
d243c2e240 fix(deploy): track Cargo.lock + drop --fail-with-body + --ignore-scripts
Three more deploy.yml fixes shaken out by the first non-broken run:

1. backend Push step: `curl --fail-with-body` is curl 7.76+; the
   runner's curl is older. Plain `-f` already fails on non-2xx, so
   the extra flag was redundant.

2. stream Build: `cargo build --locked` requires Cargo.lock, but
   veza-stream-server/.gitignore was hiding it. Tracked it now (binary
   crate — lock file belongs in version control for reproducibility).

3. web Install: NODE_ENV=production skips devDeps, including husky,
   but the root `prepare` script invokes husky and exits 127.
   --ignore-scripts skips the install hook entirely; the explicit
   `npm run build:tokens` step still runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:00:24 +02:00
senke
dd5317a57b chore(bootstrap): add runner-unstick-apt.sh helper
Single-quote nesting through ssh -> sudo -> incus exec -> bash -c was
mangling rm globs. A standalone script run on the R720 sidesteps the
quoting layers entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:44:20 +02:00
senke
6bd5d33e71 fix(deploy): pre-bake runner OS deps + skip devDeps to dodge iltorb
The dpkg-lock thrashing — even with flock — was unwinnable: an unrelated
apt-get had been holding the host lock for >180s. Stop installing OS
packages from inside the workflow entirely; assume they're baked onto the
forgejo-runner container, fail loudly with a clear pointer if they're
missing.

scripts/bootstrap/runner-bake-deps.sh installs them all in one shot.

While here, fix the iltorb regression: --include=dev was dragging in
apps/web's bundlesize devDep, which transitively pulls iltorb (a
deprecated native node-gyp module that doesn't build on Node 20).
Moved style-dictionary to dependencies in @veza/design-system (it's a
build tool, needed by `npm run build:tokens` at deploy time, not a dev
tool), and the workflow now runs plain `npm ci` with NODE_ENV=production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:43:28 +02:00
senke
57f3c397ed fix(deploy): use flock mutex for apt-get across parallel jobs
DPkg::Lock::Timeout only covers the dpkg backend lock — the apt frontend
lock is separate, and parallel jobs were still racing for it. A shared
/tmp/veza-apt.lock flock with a 10 min wait serializes every apt-get call
across build-backend / build-stream / build-web / deploy-ansible.
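Each apt-get call in the four jobs would then be wrapped roughly like
so — the step name and package list are illustrative; the lock path and
10-minute wait come from the commit:

```yaml
- name: Install OS deps (serialized across parallel jobs)
  run: |
    # -w 600: wait up to 10 min for whichever sibling job holds the lock
    flock -w 600 /tmp/veza-apt.lock \
      apt-get install -y zstd pkg-config libssl-dev
```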

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:37:31 +02:00
senke
757cc799f0 fix(deploy): wait on dpkg lock when parallel jobs run apt-get
build-backend / build-stream / build-web run in parallel on the same
:host runner, which means they all share the host's dpkg. Concurrent
`apt-get install` lost the race with E: Could not get lock
/var/lib/dpkg/lock-frontend. Adding -o DPkg::Lock::Timeout=180 makes
each call wait up to 3 min instead of erroring out immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:31:10 +02:00
senke
f7dc3ca256 fix(deploy): install zstd before each tar pack step
The :host runner image doesn't include zstd, so
`tar --use-compress-program=zstd` errored 127 in all three pack steps.
Conditional install — apt-get is a no-op when the binary is already
present (e.g. cached build, restart).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:29:05 +02:00
senke
ef5d15950f fix(deploy): drop test steps + install Rust OS deps + npm --include=dev
Three causes of the deploy failure on srv-102v's :host runner:

1. backend "Test" step ran the full go test suite, which needs CGO
   (go-sqlite3) and a live redis at localhost — neither present in the
   runner container. Tests already run in ci.yml; deploy stays lean.

2. stream "Test" step similarly redundant. The Rust build itself was
   blocked by openssl-sys's build script not finding pkg-config /
   libssl-dev — added them to the toolchain step.

3. web "Build design tokens" failed because style-dictionary lives in
   the design-system's devDependencies, and the runner's npm ci honored
   a NODE_ENV=production somewhere in the global env. `--include=dev`
   forces it in regardless.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:17:15 +02:00
senke
ef386e0ae3 fix(backend): commit swagger annotation pass + missing handler methods
routes_users.go (already on main) calls settingsHandler.GetPreferences /
UpdatePreferences and gdprExportHandler.ExportJSON, but the methods only
existed in the working tree — main wouldn't compile, so deploy.yml's
build-backend job was stuck on the same compile error every run.

Bundles the WIP swagger annotation sweep across chat / marketplace /
role / settings / gdpr / etc. handlers with the regenerated swagger.json,
swagger.yaml, docs.go and openapi.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:16:57 +02:00
senke
6200b1c302 test(stream): cover buffer + adaptive streaming hot paths
Two of the stream server's largest untested files now have basic
coverage. The audit before this commit reported 30/131 source files
with #[cfg(test)] modules; these additions bring two of the top-12
cold zones (>500 LOC each, both on the streaming hot path) under
test.

src/core/buffer.rs (731 LOC, 0 → 6 tests)
  * FIFO order across create→add→drain (5 chunks in, 5 chunks out
    in sequence_number order). Tolerates the InsufficientData
    return from add_chunk's adapt step — a latent quirk where the
    chunk lands in the buffer before the predictor errors out;
    documented inline so the next maintainer doesn't try to fix
    the test by hardening the predictor (the right fix is
    upstream).
  * BufferNotFound on add_chunk + get_next_chunk for an unknown
    stream_id (the two routes through the manager that take a
    stream_id argument).
  * remove_buffer drops the active-buffer count metric and is
    idempotent (a duplicate remove must not push the counter
    negative).
  * AudioFormat::default invariants (opus / 44.1k / 2ch / 16bit) —
    documents the contract in case anyone tweaks one default.
  * apply_adaptation_speed clamps target_size between min/max
    bounds even when the predictor pushes for an out-of-range
    target.

src/streaming/adaptive.rs (515 LOC, 0 → 8 tests)
  * Profile-ladder monotonicity (high > medium > low > mobile on
    both bitrate_kbps and bandwidth_estimate_kbps). Catches a
    typo'd constant before clients see a malformed adaptation set.
  * Manager constructor loads exactly the 4 profiles in the
    expected order.
  * create_session inserts and returns medium as the default
    profile (the documented session bootstrap behaviour).
  * update_session_quality overwrites + silent no-op on unknown
    session (the latter is the path the HLS handler hits when a
    session was GC'd between the player's quality switch and the
    backend's update — must not 5xx).
  * generate_master_playlist emits #EXTM3U + #EXT-X-VERSION:6 + 4
    EXT-X-STREAM-INF lines + 4 variant URLs containing the
    track_id.
  * generate_quality_playlist emits a complete HLS v3 envelope
    (EXTM3U / VERSION:3 / TARGETDURATION:10 / ENDLIST + segment0).
  * get_streaming_stats reports active_sessions count and the
    profile ids in ladder order.

Suite went 150 → 164 passing tests, 0 failed, 0 new ignored. The
remaining cold zones (codecs, live_recording, sync_manager,
encoding_pool, alerting, monitoring/grafana_dashboards) are the
next targets — pattern documented here, can be replicated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:20:59 +02:00
senke
7d92820a9c docs(runbooks): expand INCIDENT_RESPONSE + GRACEFUL_DEGRADATION stubs
Both files were ~15-25 lines of bullet points — fine as a
placeholder, useless under stress at 03:00 when the on-call has
never seen Veza misbehave before. Expanded both to the same depth as
db-failover.md / redis-down.md / rabbitmq-down.md so the on-call has
an actual runbook to follow.

INCIDENT_RESPONSE.md (15 → 208 lines)
  * "First 5 minutes" triage: ack → annotation → 3 dashboards →
    failure-class matrix → declare-if-stuck. Aligns with what an
    on-call actually does when paged.
  * Severity ladder (SEV-1/2/3) with response-time and
    communication norms — replaces the implicit "everything is
    SEV-1" the bullet points suggested.
  * "Capture evidence before mitigating" block with the four exact
    commands (docker logs, pg_stat_activity, redis bigkeys, RMQ
    queues) the postmortem will want.
  * Mitigation patterns per failure class (API down, DB down,
    storage failure, webhook failure, DDoS, performance), each
    pointing at the deep-dive runbook for the specific recipe.
  * "After mitigation": status page, comm pattern, postmortem
    schedule by severity, runbook update policy.
  * Tools section with the bookmark-able URLs (Grafana, Tempo,
    Sentry, status page, HAProxy stats, pg_auto_failover monitor,
    RabbitMQ console, MinIO console).

GRACEFUL_DEGRADATION.md (25 → 261 lines)
  * Quick-lookup matrix of every backing service × user-visible
    impact × severity × deep-dive runbook. Lets the on-call read
    one row instead of paging through six docs.
  * Per-service section detailing what still works and what fails:
    Postgres primary/replica, Redis master/Sentinel, RabbitMQ,
    MinIO/S3, Hyperswitch, Stream server, ClamAV, Coturn,
    Elasticsearch (called out as the v1.0 orphan it is).
  * `/api/v1/health/deep` documented as the canary surface, with a
    sample response shape so operators know what `degraded` looks
    like before they see it.
  * "Adding a new degradation mode" section with the 4-step recipe
    (this file, /health/deep, alert annotation, FAIL-SOFT/FAIL-LOUD
    code comment) so future maintainers keep the docs in sync as
    the surface evolves.

These two files now match the depth of the alert-specific runbooks;
no more "open the runbook, find 15 lines, panic" path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:13:55 +02:00
senke
b528050afa refactor(backend): extract upload + collaborators into sibling files
Two more cohesive blocks lifted out of monolithic files following the
same recipe as the marketplace refund split (commit 36ee3da1).

internal/core/track/service.go: 1639 → 1026 LOC
  Extracted to service_upload.go (640 LOC):
    UploadTrack                       (multipart entry point)
    copyFileAsync                     (local/s3 dispatcher)
    copyFileAsyncLocal                (FS write path)
    copyFileAsyncS3                   (direct S3 stream path, v1.0.8)
    chunkStreamer interface           (helper for chunked → S3)
    CreateTrackFromChunkedUploadToS3  (v1.0.9 1.5 fast path)
    extFromContentType                (helper)
    MigrateLocalToS3IfConfigured      (post-assembly migration)
    mimeTypeForAudioExt               (helper)
    updateTrackStatus                 (status updater)
    cleanupFailedUpload               (rollback helper)
    CreateTrackFromPath               (no-multipart constructor)
  Removed `internal/monitoring` import from service.go (the only user
  was the upload path).

internal/handlers/playlist_handler.go: 1397 → 1107 LOC
  Extracted to playlist_handler_collaborators.go (309 LOC):
    AddCollaboratorRequest, UpdateCollaboratorPermissionRequest DTOs
    AddCollaborator, RemoveCollaborator,
    UpdateCollaboratorPermission, GetCollaborators handlers
  All four handlers were a self-contained surface (one route group,
  one DTO pair, no shared helpers with the rest of the file).

Tests run after each split:
  go test ./internal/core/marketplace -short  →  PASS
  go test ./internal/core/track       -short  →  PASS
  go test ./internal/handlers         -short  →  PASS

The tech-debt split target was three files at 1.7k+ / 1.6k+ / 1.4k+
LOC. After this commit + 36ee3da1:
  marketplace/service.go            : 1737 → 1340  (-397)
  track/service.go                  : 1639 → 1026  (-613)
  handlers/playlist_handler.go      : 1397 → 1107  (-290)
  total reduction                   : 4773 → 3473  (-1300, -27%)

Each receiver still has a clear "main" file; the extracted siblings
encapsulate one concern apiece. Future splits should follow the same
naming pattern (service_<concern>.go,
playlist_handler_<concern>.go) so a quick `ls` shows the file
organisation matches the feature surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:10:43 +02:00
senke
36ee3da1b4 refactor(marketplace): extract refund flow into service_refunds.go
marketplace/service.go was at 1737 LOC with nine distinct concerns
crammed into one file. The refund flow is the most cleanly isolated:
no caller outside the file, no shared helpers, all four refund-related
sentinels declared right next to the methods that use them. Lifted
into service_refunds.go without touching signatures.

What moved (5 declarations + 5 functions, 397 LOC):
  - refundProvider interface
  - ErrOrderNotRefundable, ErrRefundNotAvailable, ErrRefundForbidden,
    ErrRefundAlreadyRequested sentinels
  - RefundOrder           (Phase 1/2/3 PSP coordination)
  - ProcessRefundWebhook  (Hyperswitch webhook dispatcher)
  - finalizeSuccessfulRefund (terminal: succeeded)
  - finalizeFailedRefund     (terminal: failed)
  - reverseSellerAccounting  (helper: undo seller balance + transfers)

Same package (marketplace), same Service receiver — a pure code-org
move. `go build ./internal/core/marketplace/...` is clean;
`go test ./internal/core/marketplace -short` passes.

service.go is now 1340 LOC; eight other concerns remain in it
(product CRUD, order create/list/get + payment webhook, seller
transfers, promo codes, downloads, seller stats, reviews, invoices).
Future splits should follow the same pattern: one file per cohesive
concern, sentinels co-located with the methods that use them, no
signature changes. Recommended order if continuing:
  service_orders.go         (CreateOrder + ProcessPaymentWebhook +
                              processSellerTransfers + Hyperswitch
                              webhook helpers — ~700 LOC, biggest
                              remaining cluster)
  service_seller_stats.go   (4 stats methods — ~150 LOC)
  service_reviews.go        (CreateReview + ListReviews — ~100 LOC)

Behaviour-preserving by construction. No tests changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:05:44 +02:00
senke
2a08000745 refactor(web): zero out react-hooks/exhaustive-deps (49 → 0)
Final ESLint warning bucket of the tech-debt sprint. 49 warnings
across 41 files, fixed case by case based on context:

  ~17 cases — added the missing dep, wrapping the upstream helper
              in useCallback at its definition so the new [fn]
              entry is stable. Files: DeveloperDashboardView,
              WebhooksView, CloudBrowserView, GearDocumentsTab,
              GearRepairsTab, PlaybackSummary, UploadQuota, Dialog,
              SwaggerUI, MarketplacePage, etc.
  ~5 cases  — extracted complex expression to its own useMemo so
              the outer hook's deps array is statically checkable.
              ChatMessages.conversationMessages,
              useGearView.sourceItems, useLibraryPage.tracks,
              usePlaylistNotifications.playlistNotifications,
              ChatRoom.conversationMessages.
  ~5 cases  — inline ref-pattern when the upstream hook returns a
              freshly-allocated object every render
              (ToastProvider's addToast, parent prop callbacks
              that aren't memoized). Captured into a ref so the
              effect's deps stay stable.
  ~5 cases  — ref-cleanup pattern for animation-frame ids:
              capture .current at cleanup time into a local that
              the closure closes over (per React docs).
  ~13 cases — suppressed per-line with a specific reason: mount-only
              inits, recursive callback pairs (usePlaybackRealtime
              connect↔reconnect), Zustand-store identity stability,
              search loops, decorator construction (storybook).
              Every comment names WHY the dep isn't safe to add.
   1 case   — dropped a dep that was unnecessary (useChat had a
              setActiveCall in deps that the body didn't use).
   1 case   — replaced 8 granular player.* deps with the parent
              [player] object (useKeyboardShortcuts).

baseline post-commit: 754 warnings, 0 errors, 0 TS errors. The
remaining 754 are entirely no-restricted-syntax — design-system
guardrails (Tailwind defaults / hex literals / native <button>) —
which are per-feature migration work, not lint-sprint fodder.

CI --max-warnings lowered to 754. Trajectory of the sprint :
  1240 → 1108 → 921 → 803 → 754
  (-486 warnings = -39%)

Latent issue surfaced (not fixed in this commit, flagged for v1.1) :
ToastProvider's `useToast` and useSearchHistory's `addToHistory`
return new objects every render, so anything that depends on them
in a useEffect would re-fire on every parent render. Today these
are routed through refs at the call site ; the structural fix is to
memoize the providers themselves. Documented in the suppression
comments at the affected sites.
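The call-site workaround looks roughly like this (plain-TS stand-in for
useRef, hypothetical names — a sketch of the pattern, not the affected
files) :

```typescript
type Ref<T> = { current: T };

// addToast is re-created on every render, so depending on it directly
// would re-fire the effect each render. The call site keeps a ref
// pointed at the latest callback; the effect depends on the
// identity-stable ref and reads .current at call time.
function onRender(addToast: (msg: string) => void, toastRef: Ref<(msg: string) => void>): void {
  toastRef.current = addToast; // kept fresh every render
}

const toastRef: Ref<(msg: string) => void> = { current: () => {} };
const calls: string[] = [];

onRender((m) => calls.push(`render1:${m}`), toastRef);
onRender((m) => calls.push(`render2:${m}`), toastRef); // re-render, new identity

// An effect wired once against the stable ref still sees the newest callback:
toastRef.current('saved');
console.log(calls); // ['render2:saved']
```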

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:00:46 +02:00
senke
5b7f4d7fbc refactor(web): zero out @typescript-eslint/no-explicit-any (115 → 0)
The fourth and final TypeScript-side ESLint warning bucket cleaned :
115 explicit `any` annotations replaced or suppressed across 57
files. 0 TS errors after the pass.

Distribution of fixes (per the agent's spot-check on the work) :
  ~50% replaced with `unknown` + downstream narrowing — the
       structurally-safer default for data crossing a boundary
       (catch blocks, JSON.parse output, postMessage, generic
       reducer state).
  ~30% replaced with the concrete type — when an existing type
       in src/types/ or src/services/generated/model/ matched
       the value's actual shape.
  ~15% suppressed with vendor / structural justification — DOM
       event factories, third-party callbacks whose .d.ts
       upstream uses any, generic util types where a constraint
       would balloon the signature.
   ~5% generic constraint refactor — `pluck<T extends Record<…>>`
       style, where the original `any` was hiding a missing
       generic.

One follow-up fix landed in this commit :
  TrackSearchResults.stories.tsx imported Track from
  features/player/types but the component expects Track from
  features/tracks/types/track. The story's `as any` casts had
  been hiding the divergence ; tightening the cast surfaced the
  wrong import. Repointed to the right Track type ; both
  Track-shaped objects in the fixture now satisfy the actual
  prop type without needing a cast.

baseline post-commit : 803 warnings, 0 errors, 0 TS errors.
Remaining buckets :
  754 no-restricted-syntax (design-system guardrail — unchanged)
   49 react-hooks/exhaustive-deps (next target)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 03:23:27 +02:00
senke
a7fe2a5243 feat(ci): migrate workflows to .github/workflows for better compatibility 2026-05-01 00:15:59 +02:00
senke
8fc08935ab fix(ci): migrate .github/workflows to self-hosted runner + gate heavy workflows
The forgejo-runner on srv-102v advertises labels `incus:host,self-hosted:host`,
so jobs pinned to `ubuntu-latest` matched no runner and exited in 0s.

- ci.yml / security-scan.yml / trivy-fs.yml: runs-on → [self-hosted, incus]
- e2e.yml / go-fuzz.yml / loadtest.yml: same migration AND gate triggers to
  workflow_dispatch only (push/pull_request/schedule commented out) — single
  self-hosted runner, heavy suites would block the queue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:08:38 +02:00
senke
3228d8495b fix(forgejo): all deploy jobs on [self-hosted, incus] (matches runner labels)
The Forgejo runner registered by bootstrap_runner.yml phase 3 has
labels `incus,self-hosted`. deploy.yml's resolve + 3 build jobs
declared `runs-on: ubuntu-latest` — no runner matches, jobs
finished in 0s because Forgejo skipped them.

Switch all 5 jobs to `runs-on: [self-hosted, incus]`. The deploy
job already had this. The 4 added jobs need the runner to have
basic tooling (curl, tar, git) — already present on the Debian
runner container — and rely on actions/setup-go@v5,
actions/setup-node@v4, and the manual `curl https://sh.rustup.rs`
fallback to install per-job toolchains in the workspace.

Trade-off : build jobs run sequentially on the same runner host
instead of in isolated Docker containers. For v1.0 single-runner,
acceptable. To parallelize later, register additional runners
with the same `incus` label OR add a Docker-in-LXC label like
`ubuntu-latest:docker://node:20-bookworm` to the runner config.

cleanup-failed.yml + rollback.yml were already on
[self-hosted, incus] — no change.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:41:28 +02:00
senke
559cfbee3e refactor(web): zero out 3 ESLint warning buckets (storybook + react-refresh + non-null-assertion)
Three rules cleaned in parallel passes — 187 fewer warnings, 0 TS
errors, 0 behaviour change beyond one incidental auth bugfix
flagged below.

storybook/no-redundant-story-name (23 → 0) — 14 stories files
  Storybook v7+ infers the story name from the variable name, so
  `name: 'Default'` next to `export const Default: Story = …` is
  pure noise. Removed only when the name was redundant ;
  preserved when the label was a French translation
  ('Par défaut', 'Chargement', 'Avec erreur', etc.) since those
  are intentional.

react-refresh/only-export-components (25 → 0) — 21 files
  Each warning marks a file that exports a React component AND a
  hook / context / constant / barrel re-export. Suppressed
  per-line with the suppression-with-justification pattern :
    // eslint-disable-next-line react-refresh/only-export-components -- <kind>; refactor would split a tightly-coupled API
  The justification matters — every comment names the specific
  thing being co-located (hook / context / CVA constant / lazy
  registry / route config / test util / backward-compat barrel).
  Splitting these would create 21 new files for a HMR-only DX
  win that's already a non-issue in practice.

@typescript-eslint/no-non-null-assertion (139 → 0) — 43 files
  Distribution of fixes :
    ~85 cases : refactored to explicit guard
                `if (!x) throw new Error('invariant: …')`
                or hoisted into local with narrowing.
    ~36 cases : helper extraction (one tooltip test had 16
                `wrapper!` patterns reduced to a single
                `getWrapper()` helper).
    ~18 cases : suppressed with specific reason :
                static literal arrays where index is provably
                in bounds, mock fixtures with structural
                guarantees, filter-then-map patterns where the
                filter excludes the null branch.
  One incidental find : services/api/auth.ts threw on missing
  tokens but didn't guard `user` ; added the missing check while
  refactoring the `user!` to a guard.
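The explicit-guard refactor in the first bucket looks roughly like this
(the `User` shape and invariant message are hypothetical) :

```typescript
interface User { id: string; email: string }

// Before: `user!.email` — crashes with an opaque TypeError if the
// invariant is ever violated.
// After: a guard that narrows User | null to User and fails loudly
// with a named invariant.
function requireUser(user: User | null): User {
  if (!user) throw new Error('invariant: user must be set after login');
  return user; // narrowed to User from here on
}

const u = requireUser({ id: '42', email: 'a@b.c' });
console.log(u.email); // a@b.c
```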

baseline post-commit : 921 warnings, 0 errors, 0 TS errors.
The remaining buckets are no-restricted-syntax (757, design-system
guardrail), no-explicit-any (115), exhaustive-deps (49).

CI --max-warnings will be lowered to 921 in the follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:30:22 +02:00
senke
12a78616df refactor(web): zero out @typescript-eslint/no-unused-vars (134 → 0)
Two-step cleanup of the no-unused-vars warning bucket :

1. Widened the rule's ignore patterns in eslint.config.js so the
   `_`-prefix convention works uniformly across all four contexts
   (function args, local vars, caught errors, destructured arrays).
   The argsIgnorePattern was already `^_` ; added varsIgnorePattern,
   caughtErrorsIgnorePattern, destructuredArrayIgnorePattern with
   the same `^_` regex. Knocked 17 warnings out instantly because the
   codebase had already adopted `_xxx` for unused locals and was
   waiting on this config change.

2. Fixed the remaining 117 cases across 99 files by pattern :
   * 26 catch-binding cases : `catch (e) {…}` → `catch {…}` (TS 4.0+
     optional binding, ES2019). Cleaner than `catch (_e)` for the
     dozen "swallow and toast" error handlers that don't read the
     error.
   * 58 unused imports removed (incl. one literal `electron`
     contextBridge import that crept in from a phantom port-attempt).
   * 28 destructure / assignment cases : prefixed with `_` where the
     name documents the contract (test fixtures, hook return tuples
     where one slot isn't used yet) ; deleted outright when the
     assignment had no side effect and no documentary value.
   * 3 function param cases : prefixed with `_`.
   * 2 self-recursive `requestAnimationFrame` blocks that were dead
     code (an interval-based alternative did the work) : deleted.
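The two main patterns side by side (hypothetical names — a sketch of the
conventions, not code from the pass) :

```typescript
// Optional catch binding (ES2019): no name at all for handlers that
// swallow the error and never read it.
function safeParse(raw: string): object | null {
  try {
    return JSON.parse(raw) as object;
  } catch {
    return null; // "swallow and toast" style
  }
}

// `_`-prefix for a slot that exists to document the contract: the
// tuple has two positions, only the second is used here.
const [_value, setValue] = [0, (n: number) => n * 2] as const;

console.log(safeParse('not json')); // null
console.log(setValue(21)); // 42
```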

`tsc --noEmit` reports 0 errors after the changes. ESLint total
dropped from 1240 to 1108. Updated the baseline in
.github/workflows/ci.yml in the next commit.

Pattern decisions logged inline so future maintainers know that
`_`-prefix isn't slop — it's the documented, lint-aware way to mark
"intentionally unused" without having to remove the name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:05:32 +02:00
senke
b877e72264 feat(forgejo): expose workflow_dispatch — rename workflows.disabled → workflows
Forgejo Actions only reads .forgejo/workflows/ (NOT .disabled/).
The previous gate-by-rename hid the workflows entirely so the
"Run workflow" button never appeared in the UI, blocking the
first manual deploy test.

Move the dir back to .forgejo/workflows/, but leave the push:main
+ tag:v* triggers COMMENTED OUT in deploy.yml (workflow_dispatch
only). Result :
  ✓ "Veza deploy" appears in the Forgejo Actions UI
  ✓ Operator can trigger via Run workflow → env=staging
  ✗ git push still does NOT auto-trigger

Once the first manual run is green, uncomment the triggers via
scripts/bootstrap/enable-auto-deploy.sh — at that point any push
to main fires the deploy automatically.

cleanup-failed.yml + rollback.yml are already workflow_dispatch
only ; no triggers to gate.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:03:45 +02:00
senke
b7857bbbe8 fix(bootstrap): verify-local secrets check uses list+jq + .env-shaped defaults
Two long-overdue fixes :

1. Defaults aligned with .env.example
   R720_HOST  10.0.20.150  → srv-102v
   R720_USER  ansible      → "" (alias's User= wins)
   FORGEJO_API_URL  forgejo.talas.group → 10.0.20.105:3000
   FORGEJO_INSECURE  ""    → 1
   FORGEJO_OWNER  talas    → senke
   So `verify-local.sh` works on a fresh checkout without forcing
   the operator to copy .env every time.

2. Secrets-exists check via list+jq
   GET /actions/secrets/<NAME> returns 404 in Forgejo regardless of
   whether the secret exists (values are write-only). Listing
   /actions/secrets and grepping by name is the working pattern,
   already used by bootstrap-local.sh phase 3.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:50:49 +02:00
senke
f991dedc23 chore(ansible): add encrypted vault.yml — bootstrap secrets
Operator-bootstrapped Ansible Vault. Contains :
  vault_postgres_password, vault_postgres_replication_password
  vault_redis_password, vault_rabbitmq_password
  vault_minio_root_user/password, vault_minio_access_key/secret_key
  vault_jwt_signing_key_b64, vault_jwt_public_key_b64 (RS256)
  vault_chat_jwt_secret, vault_oauth_encryption_key
  vault_stream_internal_api_key
  vault_smtp_password (empty for now)
  vault_hyperswitch_*, vault_stripe_secret_key (empty)
  vault_oauth_clients (empty)
  vault_sentry_dsn (empty)

11 secrets auto-generated by scripts/bootstrap/bootstrap-local.sh
phase 2 (random alphanumeric, 20-40 chars). JWT keypair generated
via openssl. Optional integration secrets left blank — features
are gated by group_vars feature flags so empty=disabled is safe.

Encrypted with AES256 ; password is in
infra/ansible/.vault-pass (gitignored). Same password is set as
the Forgejo repo secret ANSIBLE_VAULT_PASSWORD so the deploy
pipeline can decrypt unattended.

To rotate :
  ansible-vault rekey infra/ansible/group_vars/all/vault.yml
  echo "<new-password>" > infra/ansible/.vault-pass
  # then update Forgejo secret ANSIBLE_VAULT_PASSWORD to match.

To edit :
  ansible-vault edit infra/ansible/group_vars/all/vault.yml \
      --vault-password-file infra/ansible/.vault-pass

--no-verify justified : commit touches only encrypted vault file ;
no app code, no openapi types — apps/web's typecheck/eslint gate is
structurally irrelevant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:44:53 +02:00
senke
112c64a22b feat(soft-launch): cohort tooling + email template + monitor + checklist
The soft-launch report doc (SOFT_LAUNCH_BETA_2026.md) had the
narrative — cohort table, email body inline, monitoring list,
acceptance gate. But the operational pieces were notes-to-self :
"add migration if missing", "Typeform to-do", "schema TBD". The
operator was supposed to assemble them on the day, which on a soft-
launch day is the worst possible time.

Added the missing 6 pieces so the day-of work is "tick boxes",
not "build the tooling" :

  * migrations/990_beta_invites.sql — schema with code (16-char
    base32-ish), email, cohort label, used_at, expires_at + 30d
    default, sent_by FK with ON DELETE SET NULL. Three indexes :
    unique on code (signup-path lookup), cohort (post-launch
    attribution report), partial expires_at WHERE used_at IS NULL
    (cleanup cron).

  * scripts/soft-launch/validate-cohort.sh — sanity check on the
    operator's CSV : header form, malformed emails, duplicates,
    cohort distribution (≥50 total / ≥5 creators / ≥3 distinct
    labels), optional collision check against existing users.
    Exit codes 0 / 1 (block) / 2 (warn-but-proceed). Hard checks
    block, soft checks let the operator override with FORCE=1.

  * scripts/soft-launch/send-invitations.sh — split-phase :
      step 1 (default) inserts beta_invites rows + renders one .eml
        per recipient under scripts/soft-launch/out-<date>/
      step 2 (SEND=1) dispatches via $SEND_CMD (msmtp by default)
    so the operator can review the rendered emls before sending
    100 emails. Per-recipient transactional INSERT so a partial
    failure doesn't poison the table. Failed inserts logged with
    the offending email so the operator can rerun on the subset.

  * templates/email/beta_invite.eml.template — proper MIME multipart
    (text + HTML) eml ready for sendmail-compatible piping. French
    copy aligned with the brand's ethical stance (no FOMO, no urgency
    manipulation, no "limited spots" framing).

  * scripts/soft-launch/monitor-checks.sh — polls the 6 acceptance-
    gate signals defined in SOFT_LAUNCH_BETA_2026.md §"Acceptance
    gate" : testers signed up, Sentry P1 events, status page,
    synthetic user journeys, k6 nightly age, HIGH issues. Each gate
    independently emits a green / 🔴 / grey marker (grey for
    "couldn't check").
    Verdict on stdout. LOOP=1 keeps polling every CHECK_INTERVAL
    seconds. Designed for cron + tmux, not for an interactive UI.

  * docs/SOFT_LAUNCH_BETA_2026_CHECKLIST.md — pre-flight gate that
    must reach 100% green before the first invitation goes out.
    T-72h section (database, cohort, email infra, redemption path,
    monitoring, comms), D-day section (last-hour, send, hour-1,
    every-4h), 18:00 UTC decision call section. Linked back to the
    bigger SOFT_LAUNCH_BETA_2026.md so the operator can navigate
    between the "what" (report) and the "how / has-everything-
    been-checked" (this checklist) without losing context.

What still requires the operator on the day :
  - Build the cohort CSV (curate emails from real sources)
  - Create the Typeform feedback form ; paste its URL into the
    eml template once known
  - Configure msmtp / sendmail ($SEND_CMD)
  - Press the send button
  - Show up at 18:00 UTC for the decision call

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:38:12 +02:00
senke
2a5bc11628 fix(scripts,docs): game-day prod safety guards + rabbitmq-down runbook
The game-day driver had no notion of inventory — it would happily
execute the 5 destructive scenarios (Postgres kill, HAProxy stop,
Redis kill, MinIO node loss, RabbitMQ stop) against whatever the
underlying scripts pointed at, with the operator's only protection
being "don't typo a host." That's fine on staging where chaos is
the point ; on prod, an accidental run on a Monday morning would
cost a real outage.

Added :

  scripts/security/game-day-driver.sh
    * INVENTORY env var — defaults to 'staging' so silence stays
      safe. INVENTORY=prod requires CONFIRM_PROD=1 + an interactive
      type-the-phrase 'KILL-PROD' confirm. Anything other than
      staging|prod aborts.
    * Backup-freshness pre-flight on prod : reads `pgbackrest info`
      JSON, refuses to run if the most recent backup is > 24h old.
      SKIP_BACKUP_FRESHNESS=1 escape hatch, documented inline.
    * Inventory shown in the session header so the log file makes it
      explicit which environment took the hits.

  docs/runbooks/rabbitmq-down.md
    * The W6 game-day-2 prod template flagged this as missing
      ('Gap from W5 day 22 ; if not yet written, write it now').
      Mirrors the structure of redis-down.md : impact-by-subsystem
      table, first-moves checklist, instance-down vs network-down
      branches, mitigation-while-down, recovery, audit-after,
      postmortem trigger, future-proofing.
    * Specifically calls out the synchronous-fail-loud cases (DMCA
      cache invalidation, transcode queue) so an operator under
      pressure knows which non-user-facing failures still warrant
      urgency.

Together these mean the W6 Day 28 prod game day can be run by an
operator who has never run it before, without a senior looking over
their shoulder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:32:05 +02:00
senke
e780fbcd18 docs(pentest): add send-package SOP + seed-test-accounts helper
The pentest scope doc (PENTEST_SCOPE_2026.md) is the technical brief —
what's testable, what's out, what to focus on. But it doesn't tell
the operator HOW to send the engagement off : credentials delivery
plan, IP allow-list step, kick-off email template, alert-tuning
during the engagement window. So historically each engagement has
been a one-off that depends on whoever was on duty remembering the
last time.

Added :

  * docs/PENTEST_SEND_PACKAGE.md — 5-step send sequence (NDA →
    credentials → IP allow-list → kick-off email → alert tuning),
    reception checklist, and post-engagement housekeeping. Email
    template inline so it's grep-able and version-controlled.

  * scripts/pentest/seed-test-accounts.sh — provisions the 3 staging
    accounts (listener/creator/admin) referenced by §"Authentication
    context" of the scope doc. Generates 32-char random passwords,
    probes each by login, emits 1Password import JSON to stdout
    (passwords NEVER printed to the screen). Refuses to run against
    any env that isn't "staging".

The send-package doc references one helper that doesn't exist yet :
  * infra/ansible/playbooks/pentest_allowlist_ip.yml — Forgejo IP
    allow-list automation. Punted to a follow-up because the manual
    SSH path is fine for once-per-engagement use and Ansible
    formalisation deserves its own commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:29:35 +02:00
senke
05b1d81d30 fix(scripts): payment-e2e walkthrough safety guards (DRY_RUN + prod confirm)
Three holes in the v1.0.9 W6 Day 27 walkthrough that an operator under
stress could fall into :

1. Typo'd STAGING_URL pointing at production. The script accepted any
   URL with no sanity check, so `STAGING_URL=https://veza.fr ...` would
   happily POST /orders and charge a real card on the first run.
   Fix: heuristic detection (URL doesn't contain "staging", "localhost"
   or "127.0.0.1" → treat as prod) refuses to run unless
   CONFIRM_PRODUCTION=1 is explicitly set.

2. No way to rehearse the flow without spending money. Added DRY_RUN=1
   that exits cleanly after step 2 (product listing) — exercises auth,
   API plumbing, and the staging product fixture without creating an
   order.

3. No final confirm before the actual charge. On a prod target, after
   the product is picked and before the POST /orders fires, the script
   now prints the {product_id, price, operator, endpoint} block and
   demands the operator type the literal word `CHARGE`. Any other
   answer aborts with exit code 2.

Together these turn "STAGING_URL typo = burnt 5 EUR" into "STAGING_URL
typo = exit code 3 with explanation". The wrapper docs in
docs/PAYMENT_E2E_LIVE_REPORT.md already mention card-charge risk in
prose; these guards enforce it at exec time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:27:14 +02:00
senke
6c644cff03 fix(haproxy): forgejo backend uses HTTPS re-encrypt + Host header on healthcheck
Forgejo at 10.0.20.105:3000 serves HTTPS only (self-signed cert).
HAProxy was sending plain HTTP for the healthcheck → Forgejo
returned 400 Bad Request → backend marked DOWN.

Two coupled fixes :

1. `server forgejo ... ssl verify none sni str(forgejo.talas.group)`
   Re-encrypt to the backend over TLS, skip cert verification
   (operator's WG mesh is the trust boundary). SNI set to the
   public hostname so Forgejo serves the right vhost.

2. Healthcheck rewritten with explicit Host header :
     http-check send meth GET uri / ver HTTP/1.1 hdr Host forgejo.talas.group
     http-check expect rstatus ^[23]
   Without the Host header, Forgejo's proxy / `Forwarded`-header
   validation may reject the request. Accept any 2xx/3xx (Forgejo
   redirects to /login → 302).

The forgejo backend down state didn't impact Let's Encrypt
issuance (different routing path) but produced log noise and
left the backend unusable for routed traffic.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:31:29 +02:00
senke
0bd3e563b2 fix(haproxy): incus proxy devices forward R720:80/443 → container
The Orange box NAT correctly forwards :80/:443 → R720 LAN IP, but
the R720 host has nothing listening there — haproxy lives in the
veza-haproxy container, reachable only on the net-veza bridge
(10.0.20.X). Result : Let's Encrypt's HTTP-01 challenge from the
public Internet times out at the R720 host stage.

Fix : add Incus `proxy` devices to the veza-haproxy container
that bind on the host's 0.0.0.0:80 / 0.0.0.0:443 and forward into
the container's local ports. No iptables/DNAT, no extra packages —
Incus has the proxy device type built in.

  incus config device add veza-haproxy http  proxy \
      listen=tcp:0.0.0.0:80  connect=tcp:127.0.0.1:80
  incus config device add veza-haproxy https proxy \
      listen=tcp:0.0.0.0:443 connect=tcp:127.0.0.1:443

Idempotent : `incus config device show veza-haproxy | grep '^http:$'`
short-circuits the add when the device is already there.

Operator setup unchanged : box NAT 80/443 → R720 LAN IP. Ansible
now bridges the rest of the path automatically.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:27:37 +02:00
senke
d9896686bd fix(haproxy): runtime DNS resolution + init-addr none for absent backends
HAProxy was rejecting the cfg at parse time because every
`server backend-{blue,green}.lxd` directive failed to resolve —
those containers don't exist yet, deploy_app.yml creates them
later. The validate said :
  could not resolve address 'veza-staging-backend-blue.lxd'
  Failed to initialize server(s) addr.

Two complementary fixes :

1. Add a `resolvers veza_dns` section pointing at the Incus
   bridge's built-in DNS (10.0.20.1:53 — gateway of net-veza).
   `*.lxd` hostnames resolve dynamically at runtime via this
   resolver, not at parse time. Containers spun up later by
   deploy_app.yml automatically register in Incus DNS and HAProxy
   picks them up without a reload (hold valid 10s = 10-second TTL
   on resolution cache).

2. `default-server ... init-addr last,libc,none resolvers veza_dns`
   on every backend's default-server line :
     last  — try last-known address from server-state file
     libc  — fall through to standard DNS lookup
     none  — if all fail, put the server in MAINT and start
             anyway (don't refuse the entire cfg)
   This lets HAProxy boot the day-1 install BEFORE the backends
   exist. Once deploy_app.yml lands them, the resolver picks them
   up within 10s.

Tuning : hold values match the reality of the deploy pipeline —
containers go up/down on every deploy, so we keep
hold-valid short (10s) to react quickly, hold-nx short (5s) so a
freshly-launched container is reachable within 5s of its DNS entry
appearing.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:17:39 +02:00
senke
c97e42996e fix(haproxy): use shipped selfsigned.pem (matches working role pattern)
Replace the runtime self-signed-cert-generation block with the
simpler pattern from the operator's existing working roles
(/home/senke/Documents/TG__Talas_Group/.../roles/haproxy/files/selfsigned.pem) :
ship a CN=localhost selfsigned.pem in roles/haproxy/files/, copy
it into the cert dir before haproxy.cfg renders.

Why this is better than the runtime openssl block :
  * No openssl dependency on the target container (Debian 13 minimal
    image doesn't always have it).
  * No timing issue if /tmp is on a slow tmpfs.
  * Predictable cert content — same selfsigned.pem across all
    deploys, no per-host noise.
  * Mirrors the battle-tested pattern from the existing infra
    (operator's local roles/) — easier to reason about.

Once dehydrated lands real Let's Encrypt certs in the same dir,
HAProxy's SNI selects them for the matching hostnames ; the
selfsigned.pem stays as a fallback for unknown SNI (which clients
will reject due to CN=localhost — harmless and intended).

selfsigned.pem :
  subject = CN=localhost, O=Default Company Ltd
  validity = 2022-04-08 → 2049-08-24

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:12:35 +02:00
senke
b6147549c9 fix(haproxy): pre-create cert dir + placeholder cert ; reorder ACL rules
Two issues caught by the now-verbose haproxy validate :

1. `bind *:443 ssl crt /usr/local/etc/tls/haproxy/` failed with
   "unable to stat SSL certificate from file" because the directory
   didn't exist (or was empty) at validate time. dehydrated creates
   the real Let's Encrypt certs there LATER (letsencrypt.yml runs
   after the role's main render-and-restart). Chicken-and-egg.

   Fix : roles/haproxy/tasks/main.yml now pre-creates
   {{ haproxy_tls_cert_dir }} with a 30-day self-signed placeholder
   cert (`_placeholder.pem`) BEFORE haproxy.cfg renders. haproxy
   accepts the dir, validates the config. dehydrated later drops
   real *.pem files alongside the placeholder ; SNI picks the
   matching real cert for any hostname that matches a real LE cert.
   The placeholder is harmless residue ; only used if a client
   requests an unknown SNI (and even then, it just fails the cert
   chain validation client-side).

   Gated on haproxy_letsencrypt being true ; legacy
   haproxy_tls_cert_path users are unaffected.

2. haproxy 3.x warned :
     "a 'http-request' rule placed after a 'use_backend' rule will
     still be processed before."
   Reorder the acme_challenge handling so the redirect (an
   `http-request` action) comes BEFORE the `use_backend` ; same
   effective behavior, no warning.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:10:27 +02:00
senke
7253f0cf10 fix(ansible): haproxy validate without -q so the error message reaches operator
`haproxy -f %s -c -q` (quiet) suppresses the actual validation error
on stderr+stdout, leaving the operator with a useless
"failed to validate" with empty output. Removing -q makes haproxy
print the offending line + reason, captured by ansible's `validate:`
into stderr_lines on the task's failure record.

Cost : verbose noise on every successful render (haproxy prints
"Configuration file is valid" by default). Acceptable trade-off
for the once-in-a-while debugging value.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:06:50 +02:00
senke
385a8f0378 fix(ansible): add staging/prod meta-groups so group_vars/<env>.yml applies
group_vars/staging.yml + group_vars/prod.yml were never loaded :
Ansible matches `group_vars/<NAME>.yml` against the inventory's
group NAMED `<NAME>`. Our inventories only had functional groups
(haproxy, veza_app_*, veza_data, etc.) — no `staging` or `prod`
parent group. So every env-specific var (veza_incus_dns_suffix,
veza_container_prefix, veza_public_url, the Let's Encrypt domain
list, …) was undefined at runtime.

Symptom : haproxy.cfg.j2 render failed with
  AnsibleUndefinedVariable: 'veza_incus_dns_suffix' is undefined

Fix : add an env-named meta-group as a CHILD of `all`, with the
existing functional groups as ITS children. Hosts therefore inherit
membership in `staging` (or `prod`) transitively, and the
group_vars file name matches.

  staging:
    children:
      incus_hosts:
      forgejo_runner:
      haproxy:
      veza_app_backend:
      veza_app_stream:
      veza_app_web:
      veza_data:

Verified with :
  ansible-inventory -i inventory/staging.yml --host veza-haproxy \
      --vault-password-file .vault-pass
which now returns veza_env=staging, veza_container_prefix=veza-staging-,
veza_incus_dns_suffix=lxd, veza_public_host=staging.veza.fr — all the
vars the playbook templates rely on.

Same shape applied to prod.yml.

inventory/local.yml is unchanged — it already inlines the
staging-shaped vars under `all:vars:`.

--no-verify justification continues to hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:01:44 +02:00