senke/veza - Talas Project: Beyond coding. We Forge.

Commit graph

Author	SHA1	Message	Date

senke

3f326e8266

fix(ci): unblock CI red — gofmt + e2e webserver reuse + orders.hyperswitch_payment_id (Day 4)

Veza CI / Rust (Stream Server) (push) Successful in 4m22s
Security Scan / Secret Scanning (gitleaks) (push) Successful in 1m5s
Veza CI / Frontend (Web) (push) Failing after 17m19s
E2E Playwright / e2e (full) (push) Failing after 20m28s
Veza CI / Backend (Go) (push) Successful in 21m31s
Veza CI / Notify on failure (push) Successful in 4s

Three pre-existing infra issues surfaced by the Day 1→Day 3 push wave.
Each is independent — bundled here because the goal is "ci.yml + e2e.yml
green" before the v1.0.9 tag, and they're all small.

(1) gofmt — ci.yml golangci-lint v2 step

  Five files were unformatted on main. Pre-existing (untouched by my
  Item G work, but the formatter caught them now):
    - internal/api/router.go
    - internal/core/marketplace/reconcile_hyperswitch_test.go
    - internal/models/user.go
    - internal/monitoring/ledger_metrics.go
    - internal/monitoring/ledger_metrics_test.go
  Pure whitespace via `gofmt -w` — no behavior change.

(2) e2e silent-fail — playwright webServer port collision

  The e2e workflow pre-starts the backend in step 9 ("Build + start
  backend API") so it can fail-fast on a non-ok health check. But
  playwright.config.ts had `reuseExistingServer: !process.env.CI` on
  the backend webServer entry — meaning in CI Playwright tried to
  spawn a SECOND backend on port 18080. The spawn collided with
  EADDRINUSE and Playwright silently exited before printing any test
  output. The artifact upload then warned "No files were found"
  because tests/e2e/playwright-report/ never got written, and the job
  ended in `Failure` for an unrelated reason (the artifact upload
  step's GHESNotSupportedError).

  Fix: backend `reuseExistingServer: true` always — workflow + dev
  both pre-start backend on 18080. Vite stays `!CI` because the
  workflow doesn't pre-start it. Comment in playwright.config.ts
  documents the symptom so the next person debugging gets the
  pointer immediately.

(3) orders.hyperswitch_payment_id missing in fresh DBs — migration 080
    skip-branch + 099 ordering drift

  Migration 080 (`add_payment_fields`) wraps its ALTERs in
  "skip if orders doesn't exist". At authoring time orders existed
  earlier in the migration sequence; that ordering has since shifted
  (orders is now created at 099_z_create_orders.sql, AFTER 080).
  Result: in any freshly-migrated DB (CI, fresh dev, future restore
  drills) migration 080 takes the skip branch and the columns are
  never added — even though the Order model and the marketplace code
  rely on them.

  Symptom: every CI run logs
    pq: column "hyperswitch_payment_id" does not exist
  from the periodic ledger_metrics worker. Order checkout would also
  fail to persist payment_id at write time, breaking reconciliation.

  Fix: append-only migration 987 with idempotent
  `ADD COLUMN IF NOT EXISTS` + a partial index on the reconciliation
  hot path. In production envs that did pick up 080 in the original
  order, migration 987 is a no-op; fresh envs converge to the same
  end state. Rollback in migrations/rollback/.
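
  The ordering drift and the convergence argument can be checked with a toy
  model. The sketch below is an assumption-laden simulation, not the real
  migration runner: the schema is a plain map, and `m080`, `createOrders`
  and `m987` are hypothetical stand-ins for the SQL files named above.

```go
package main

import "fmt"

// schema maps table name -> set of column names (toy stand-in for a DB).
type schema map[string]map[string]bool

type migration struct {
	name  string
	apply func(schema)
}

const col = "hyperswitch_payment_id"

var (
	// 080 skips its ALTERs entirely when orders does not exist yet.
	m080 = migration{name: "080_add_payment_fields", apply: func(s schema) {
		if _, ok := s["orders"]; !ok {
			return // skip branch
		}
		s["orders"][col] = true
	}}
	// Creates orders; today this runs at 099, i.e. after 080.
	createOrders = migration{name: "create_orders", apply: func(s schema) {
		if _, ok := s["orders"]; !ok {
			s["orders"] = map[string]bool{"id": true}
		}
	}}
	// 987 mimics ADD COLUMN IF NOT EXISTS: setting the key twice is harmless.
	// (Real PostgreSQL would raise an error if the table were missing.)
	m987 = migration{name: "987", apply: func(s schema) {
		if cols, ok := s["orders"]; ok {
			cols[col] = true
		}
	}}
)

// run applies migrations in the given order to an empty schema.
func run(ms ...migration) schema {
	s := schema{}
	for _, m := range ms {
		m.apply(s)
	}
	return s
}

func main() {
	fresh := run(m080, createOrders)        // current order, without the fix
	fixed := run(m080, createOrders, m987)  // current order, with 987
	legacy := run(createOrders, m080, m987) // original order, with 987
	fmt.Println(fresh["orders"][col], fixed["orders"][col], legacy["orders"][col])
}
```

  Only the first run ends without the column; both runs that include 987
  end in the same state regardless of where orders was created.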

Verified locally:
  $ cd veza-backend-api && go build ./... && VEZA_SKIP_INTEGRATION=1 \
      go test -short -count=1 ./internal/...
  (all green)

SKIP_TESTS=1: backend-only Go + Playwright config + SQL. Frontend
unit tests irrelevant to this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 12:03:55 +02:00

senke

94dfc80b73

feat(metrics): ledger-health gauges + alert rules — v1.0.7 item F

Five Prometheus gauges + reconciler metrics + Grafana dashboard +
three alert rules. Closes axis-1 P1.8 and adds observability for
item C's reconciler (user review: "F should include reconciler_*
metrics, otherwise tag is blind on the worker we just shipped").

Gauges (veza_ledger_, sampled every 60s):
  * orphan_refund_rows — THE canary. Pending refunds with empty
    hyperswitch_refund_id older than 5m = Phase 2 crash in
    RefundOrder. Alert: > 0 for 5m → page.
  * stuck_orders_pending — order pending > 30m with non-empty
    payment_id. Alert: > 0 for 10m → page.
  * stuck_refunds_pending — refund pending > 30m with hs_id.
  * failed_transfers_at_max_retry — permanently_failed rows.
  * reversal_pending_transfers — item B rows stuck > 30m.

Reconciler metrics (veza_reconciler_):
  * actions_total{phase} — counter by phase.
  * orphan_refunds_total — two-phase-bug canary.
  * sweep_duration_seconds — exponential histogram.
  * last_run_timestamp — alert: stale > 2h → page (worker dead).

Implementation notes:
  * Sampler thresholds are hardcoded to the reconciler's default
    values. If the reconciler is configured differently, the
    mismatch is intentional (alerts fire while reconciler already
    working = correct behavior).
  * Query error sets gauge to -1 (sentinel for "sampler broken").
  * marketplace package routes through monitoring recorders so it
    doesn't import prometheus directly.
  * Sampler runs regardless of Hyperswitch enablement; gauges
    default 0 when pipeline idle.
  * Graceful shutdown wired in cmd/api/main.go.

Alert rules in config/alertmanager/ledger.yml with runbook
pointers + detailed descriptions — each alert explains WHAT
happened, WHY the reconciler may not resolve it, and WHERE to
look first.

Grafana dashboard config/grafana/dashboards/ledger-health.json —
top row = 5 stat panels (orphan first, color-coded red on > 0),
middle row = trend timeseries + reconciler action rate by phase,
bottom row = sweep duration p50/p95/p99 + seconds-since-last-tick
+ orphan cumulative.

Tests — 6 cases, all green (sqlite :memory:):
  * CountsStuckOrdersPending (includes the filter on
    non-empty payment_id)
  * StuckOrdersZeroWhenAllCompleted
  * CountsOrphanRefunds (THE canary)
  * CountsStuckRefundsWithHsID (gauge-orthogonality check)
  * CountsFailedAndReversalPendingTransfers
  * ReconcilerRecorders (counter + gauge shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 03:40:14 +02:00

2 commits