Demo — Visdom Code Review

MEASURED METRICS — REAL RUNS

Findings by Severity

Per scenario · real run output

Cost Per Layer

L0+L1 free · L3 only runs for HIGH/CRITICAL

F1 Score vs Market

50 PRs · 5 repos · advisor judge · honest numbers

4× Hidden Tax Breakdown

Illustrative cost model — proportions vary by team

DEMO SCENARIOS — REAL RUNS, REAL FINDINGS

Each scenario is a real PR that looks fine — until VCR runs

Every PR below passed CI, had a clean description, and claimed tests were green. VCR found the hidden issues anyway.

META: VCR reviewing its own codebase (metacircular)
STANDALONE: external real-world code

META · TypeScript

Securing the AI Client

PR: feat: add retry and caching to AI client

"All existing tests pass."

VCR found:

Hardcoded API key, PII in logs, retry without backoff

● 1 critical ● 5 high ● 3 medium

$0.06 · 23s

View PR #43 on GitHub →
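One of the findings in this scenario is a retry without backoff. The PR itself is TypeScript and its code is not reproduced here; the following is only a minimal Python sketch of the usual remedy, exponential backoff with jitter. The names `retry_with_backoff`, `backoff_delays`, `base_delay`, and `max_delay` are illustrative assumptions, not identifiers from the PR.

```python
import random
import time


def backoff_delays(max_retries, base_delay=0.5, max_delay=8.0):
    """Upper bound of the wait before each retry: base_delay * 2^n, capped."""
    return [min(max_delay, base_delay * (2 ** n)) for n in range(max_retries)]


def retry_with_backoff(call, max_retries=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Run call() once, then retry up to max_retries times with growing waits.

    `call` is a hypothetical zero-argument function that raises on failure.
    Real code should catch only errors known to be transient.
    """
    last_error = None
    # A cap of 0.0 marks the first attempt, which runs without waiting.
    for cap in [0.0] + backoff_delays(max_retries, base_delay, max_delay):
        if cap:
            # Full jitter: wait a random time between 0 and the current cap,
            # so many clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
        try:
            return call()
        except Exception as exc:
            last_error = exc
    raise last_error
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a retry loop that calls the service again immediately, by contrast, can amplify an outage instead of riding it out.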

META · TypeScript

Refactoring the Gate

PR: refactor: simplify deterministic gate pattern matching

"Behavior unchanged. All tests pass."

VCR found:

Weakened SQL check, timing-unsafe compare, SSRF rule disabled

● 1 critical ● 8 high ● 2 medium

$0.04 · 14s

View PR #44 on GitHub →
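To illustrate what a timing-unsafe compare means, here is a small Python sketch. It is not the code from the PR (which is TypeScript); the function names are hypothetical, and only `hmac.compare_digest` is a real standard-library call.

```python
import hmac


def naive_equal(expected: str, provided: str) -> bool:
    # `==` may stop at the first differing character, so response time can
    # hint at how much of a secret an attacker has guessed correctly.
    return expected == provided


def constant_time_equal(expected: str, provided: str) -> bool:
    # hmac.compare_digest is designed so that its running time does not
    # depend on where the two inputs differ.
    return hmac.compare_digest(expected.encode("utf-8"),
                               provided.encode("utf-8"))
```

Both functions return the same answers; the difference is only in how much the running time reveals, which is why a refactor that swaps one for the other can pass every existing test.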

META · TypeScript

Hollow Test Suite

PR: test: add comprehensive pipeline layer tests

"100% line coverage. All green."

VCR found:

15 tests mocking their own subjects — zero behavioral assertions

● 3 high ● 1 medium

$0.04 · 16s

View PR #45 on GitHub →
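A test that mocks its own subject can stay green no matter what the real code does. The sketch below shows the pattern in Python under stated assumptions: `apply_discount` is a made-up subject, and neither test comes from the PR (which is TypeScript).

```python
from unittest.mock import Mock


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical subject under test."""
    return round(price * (1 - percent / 100), 2)


def hollow_test():
    # The subject is replaced by a mock, so the assertions below only
    # verify the mock's canned answer, never the real function.
    fake_apply_discount = Mock(return_value=90.0)
    result = fake_apply_discount(100.0, 10)
    fake_apply_discount.assert_called_once_with(100.0, 10)
    assert result == 90.0


def behavioral_test(subject=apply_discount):
    # Calls the real subject and checks observable behavior.
    assert subject(100.0, 10) == 90.0
    assert subject(80.0, 25) == 60.0
    assert subject(50.0, 0) == 50.0
```

If `apply_discount` were broken, `hollow_test` would still pass while `behavioral_test` would fail, which is the gap a review of test quality is meant to catch.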

STANDALONE · Python

Payment Service

PR: feat: add payment processing endpoint

"Tested against Stripe sandbox. All passing."

VCR found:

SQL injection via f-string, card data in logs, weak JWT secret

● 3 critical ● 6 high ● 1 medium

$0.05 · 19s

Local run — no GitHub PR
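The SQL injection finding refers to building a query with an f-string. The reviewed code is not shown on this page, so the following is a self-contained sketch with a hypothetical `payments` table and function names, using the standard-library `sqlite3` module to contrast the vulnerable form with a parameterized query.

```python
import sqlite3


def find_payments_unsafe(conn, customer_id: str):
    # Vulnerable: the input becomes part of the SQL text itself.
    query = f"SELECT id, amount FROM payments WHERE customer_id = '{customer_id}'"
    return conn.execute(query).fetchall()


def find_payments_safe(conn, customer_id: str):
    # Parameterized: the value is passed separately from the SQL text.
    query = "SELECT id, amount FROM payments WHERE customer_id = ?"
    return conn.execute(query, (customer_id,)).fetchall()


# In-memory database with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, customer_id TEXT, amount INTEGER)")
conn.executemany("INSERT INTO payments VALUES (?, ?, ?)",
                 [(1, "alice", 1200), (2, "bob", 900)])

malicious = "x' OR '1'='1"
leaked = find_payments_unsafe(conn, malicious)   # every row in the table
blocked = find_payments_safe(conn, malicious)    # no rows
```

With the crafted input, the f-string version returns both rows because the input rewrites the WHERE clause, while the parameterized version treats the same input as a literal customer id and matches nothing.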

PRODUCTION REPOSITORY RUN

On a real repository

The same engine, run end-to-end on llama3-java-hat — a real Java LLM-inference project by the site's author — with tool defaults and no config in the repo.

38 PRs · 154 findings · $2.98 total ($0.078/PR) · ~22 s/PR

Deep review (L3) ran on 27 of 38 PRs — the cheap triage gate stopped the rest.
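The gating described here can be sketched in a few lines. This assumes the triage step assigns each PR a single risk level and that L3 runs only for HIGH or CRITICAL, as the page states; the function and constant names are hypothetical, and the four-PR list is a made-up example, not data from the run. The last lines only re-derive the per-PR figures from the totals reported above.

```python
DEEP_REVIEW_LEVELS = {"HIGH", "CRITICAL"}  # levels that trigger L3, per the page


def should_run_l3(triage_risk: str) -> bool:
    """Return True when the triage result justifies the deep review."""
    return triage_risk.upper() in DEEP_REVIEW_LEVELS


def l3_share(risks):
    """Count how many PRs in `risks` would reach L3."""
    ran = sum(should_run_l3(r) for r in risks)
    return ran, len(risks)


# Made-up triage results for four PRs.
ran, total = l3_share(["LOW", "HIGH", "MEDIUM", "CRITICAL"])

# Arithmetic on the reported totals: 38 PRs, $2.98, L3 on 27 of them.
cost_per_pr = round(2.98 / 38, 3)          # dollars per PR
l3_percent = round(27 / 38 * 100)          # share of PRs that reached L3
```

Because the earlier layers are reported as free, the cost per PR is driven by how often the gate lets a PR through to L3.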

Config-as-code was then introduced in a single PR, PR #50, and the review of that PR used the configuration the PR itself carries. Three results trace back to that configuration: a repo-specific regex rule fired (llama3/no-stdout-logging), an LLM org rule fired with a ready-to-apply fix suggestion (llama3/tensor-ops-document-shape), and a finding cited the repo's own docs/standards/error-handling.md.

4 findings · $0.0476 · 18.6 s
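As a rough illustration of how a repo-specific regex rule could work, the sketch below applies one rule to a source file line by line. Only the rule id comes from the review; the rule layout, the pattern, the severity, and the function name are assumptions, and VCR's actual configuration schema is not shown on this page.

```python
import re

# Hypothetical rule definition; only the id is taken from the review.
RULE = {
    "id": "llama3/no-stdout-logging",
    "pattern": r"\bSystem\.out\.print(ln|f)?\s*\(",  # assumed pattern
    "severity": "medium",                             # assumed severity
}


def apply_regex_rule(rule, path, source):
    """Return one finding per line of `source` that matches the rule."""
    regex = re.compile(rule["pattern"])
    return [
        {"rule": rule["id"], "path": path, "line": number,
         "severity": rule["severity"]}
        for number, line in enumerate(source.splitlines(), start=1)
        if regex.search(line)
    ]


# Made-up Java snippet used only to exercise the rule.
java_source = 'class T {\n  void f() {\n    System.out.println("x");\n  }\n}\n'
findings = apply_regex_rule(RULE, "T.java", java_source)
```

A rule like this is deterministic: the same file always yields the same findings, with no model call involved.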

Actual review output:

PR #50 — demo: VISDOM config-as-code review showcase
4 files, +63/-0 lines

HIGH (2)
  • InterruptedException swallowed without re-interrupt [src/main/java/com/arturskowronski/llama3babylon/hat/TensorDiagnostics.java]
  • rowChecksum accepts float[] with no shape/dimension validation [src/main/java/com/arturskowronski/llama3babylon/hat/TensorDiagnostics.java]

MEDIUM (2)
  • System.out logging in library code [src/main/java/com/arturskowronski/llama3babylon/hat/TensorDiagnostics.java]
  • rowChecksum accepts any-length array with no dimension guard [src/main/java/com/arturskowronski/llama3babylon/hat/TensorDiagnostics.java]

Duration: 18.6s · Cost: $0.0476 · L3: yes

The full SARIF export for this review is vendored at docs/demos/llama3-v8/review.sarif for readers wiring code-scanning tooling. The review comments visible on the PR itself come from a separate run of the same engine, which found 3 of these 4 findings: the deterministic layers agree from run to run, but the LLM lenses do not always.
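For readers wiring the SARIF export into their own tooling, a minimal sketch of reading it might look like this. It relies only on the general SARIF 2.1.0 layout (`runs`, then `results`, each with an optional `level`); the inline sample document is made up, and how VCR maps its severities to SARIF levels is not stated here, so nothing below should be read as the contents of the vendored file.

```python
import json
from collections import Counter


def tally_sarif_levels(sarif: dict) -> Counter:
    """Count results per SARIF level across all runs in the document."""
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Assumption: a result without an explicit level is treated as
            # "warning", ignoring any per-rule default configuration.
            counts[result.get("level", "warning")] += 1
    return counts


# Made-up SARIF document for illustration.
sample = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "example-tool"}},
    "results": [
      {"ruleId": "a", "level": "error", "message": {"text": "x"}},
      {"ruleId": "b", "level": "warning", "message": {"text": "y"}},
      {"ruleId": "c", "message": {"text": "z"}}
    ]
  }]
}
""")
levels = tally_sarif_levels(sample)
```

To run it against the vendored export, load docs/demos/llama3-v8/review.sarif with `json.load` and pass the result to `tally_sarif_levels`.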

Every PR on this repo is reviewed by VCR

How each PR moves through the pipeline

From PR opened to human decision

Team health — measured on every PR