Most AI code review tools report how many findings they generate. VCR measures whether those findings are correct,
complete, and worth the cost. This page documents the evaluation methodology used by `npm run demo:triage`.
💡 Run it yourself
`git clone https://github.com/VirtusLab/visdom-code-review && cd visdom-code-review/demo && npm install && npm run demo:triage`

Why evaluate?
A code review tool that generates 50 comments per PR with 80% noise is worse than no tool at all. Developer trust erodes at roughly a 15% false-positive rate. Beyond that point, teams ignore all findings, including the real ones.
The Cry Wolf Effect: noise destroys signal
VCR's design philosophy is precision over recall — better to miss a LOW finding than erode trust with false positives. The evaluation methodology quantifies whether we deliver on that promise.
Framework overview
The evaluation combines three established approaches:
CR-Bench Classification
arxiv:2603.11078
Every finding is classified against ground truth into one of three categories:
Signal-to-Noise Ratio
SNR Framework (Jet Xu)
Findings are tiered by impact, and the ratio of signal to noise determines developer trust:
Cost-Quality Tradeoff
Triage Framework (arxiv:2604.07494) & Spotify Verification Loop
Per-layer cost efficiency determines whether the layered architecture delivers on its economic promise:
- Layer 1 (deterministic) should catch maximum findings at $0
- Layer 2 (Haiku) should triage correctly for ~$0.02
- Layer 3 (Sonnet) should justify its ~$0.40 cost with unique deep findings
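For a feel of what Layer 1 can do at zero cost, a deterministic check can be as simple as a pattern rule. This is an illustrative sketch, not VCR's actual rule set; the function name and patterns are invented here:

```typescript
// Illustrative Layer-1 style deterministic rule: flag Math.random() in
// security-sensitive code at zero API cost. Not VCR's actual implementation.
function flagsInsecureRandom(source: string): boolean {
  const usesMathRandom = /Math\.random\s*\(\s*\)/.test(source);
  const securityContext = /(token|secret|nonce)/i.test(source);
  return usesMathRandom && securityContext;
}
```

Rules like this are cheap to run on every PR, which is why the layered design pushes as many findings as possible into Layer 1 before any model is invoked.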
Metrics computed
| Metric | Formula | What it measures | Target |
|---|---|---|---|
| Precision | bug_hits / total_findings | How often findings are real | ≥80% |
| Recall | matched_GT / total_GT | How many known bugs were found | ≥80% |
| F1 Score | 2 × (P × R) / (P + R) | Harmonic mean — penalizes imbalance | ≥70% |
| Usefulness Rate | (hits + valid) / total | Findings a developer would act on (CR-Bench) | ≥80% |
| Signal Ratio | (T1 + T2) / total | Share of findings that matter | ≥80% |
| SNR | (hits + valid) / noise | Signal-to-noise ratio | ≥5:1 |
| False Positive Rate | noise / total | Developer trust erosion risk | ≤5% |
| Cost per Bug Hit | total_cost / bug_hits | Economic efficiency of detection | <$0.10 |
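The table's formulas can be computed from raw counts, as in this sketch. The field names are assumptions, not VCR's actual schema; the `Math.max(noise, 1)` denominator is an assumption consistent with reporting a finite SNR (e.g. 14:1) when noise is zero:

```typescript
// Minimal sketch of the metric formulas in the table above; names illustrative.
interface Counts {
  bugHits: number; // findings matching a planted vulnerability
  valid: number;   // useful suggestions outside ground truth
  noise: number;   // false positives
  totalGT: number; // planted vulnerabilities in the scenario
  costUSD: number; // total pipeline cost
}

function evaluate(c: Counts) {
  const total = c.bugHits + c.valid + c.noise;
  const precision = c.bugHits / total;
  const recall = c.bugHits / c.totalGT;
  return {
    precision,
    recall,
    f1: (2 * precision * recall) / (precision + recall),
    usefulness: (c.bugHits + c.valid) / total,
    // Assumption: with zero noise, SNR is reported as signal:1.
    snr: (c.bugHits + c.valid) / Math.max(c.noise, 1),
    falsePositiveRate: c.noise / total,
    costPerBugHit: c.costUSD / c.bugHits,
  };
}
```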
Evaluation pipeline
Define ground truth
Each scenario declares every planted vulnerability with tier classification (T1/T2), file location, and description.
Run VCR pipeline
All four layers execute. Findings are collected with metadata: layer, severity, confidence, file, and line.
Classify findings
Each finding is matched against ground truth using bidirectional keyword overlap on file + title + description. Classified as Bug Hit, Valid Suggestion, or Noise.
Compute metrics
Precision, Recall, F1, Usefulness Rate, SNR, FPR, and cost metrics. Per-layer breakdown shows which layers contribute what.
Identify gaps
Ground truth entries without a matching finding are surfaced as missed vulnerabilities — the recall gap.
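The classification step above can be sketched as follows. Here `matches` stands in for the keyword-overlap matcher; how VCR distinguishes valid suggestions from noise is not specified on this page, so the severity heuristic below is purely an assumption:

```typescript
// Illustrative classification of one finding against ground truth.
type Category = "bug_hit" | "valid_suggestion" | "noise";

interface Finding { file: string; title: string; severity: string }
interface GroundTruthEntry { file: string; description: string }

function classify(
  f: Finding,
  groundTruth: GroundTruthEntry[],
  // Stand-in for the bidirectional keyword-overlap matcher.
  matches: (f: Finding, g: GroundTruthEntry) => boolean,
): Category {
  if (groundTruth.some((g) => matches(f, g))) return "bug_hit";
  // Assumption: severity is one plausible signal for separating
  // actionable suggestions from noise; VCR's real criterion may differ.
  return f.severity === "info" ? "noise" : "valid_suggestion";
}
```

Anything classified as `bug_hit` also marks its ground-truth entry as matched; entries never matched become the recall gap surfaced in the final step.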
Current results: "The Perfect PR"
The demo scenario plants 14 known vulnerabilities (9 Tier-1 security, 5 Tier-2 architecture/test-quality) in a deliberately flawed authentication service. Here are the evaluation results:
Per-layer contribution
Layer 1 catches 4 findings at zero cost. Layer 2 adds 3 findings and makes the gate decision for $0.02. Layer 3 adds 7 unique findings that only deep analysis with repository context can find — justifying its $0.42 cost.
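As a back-of-envelope check on those per-layer numbers (an upper-bound sketch that assumes every finding is a bug hit, which the demo does not quite claim):

```typescript
// Cost arithmetic from the per-layer contribution quoted above.
const layerCosts = [0, 0.02, 0.42];      // L1, L2, L3 in USD
const layerFindings = [4, 3, 7];         // findings per layer

const totalCost = layerCosts.reduce((a, b) => a + b, 0);        // ≈ $0.44
const totalFindings = layerFindings.reduce((a, b) => a + b, 0); // 14

// If every finding were a bug hit, cost per hit would be about $0.03,
// comfortably under the <$0.10 target even with a few misses.
const costPerHit = totalCost / totalFindings;
```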
Ground truth definition
Each scenario declares its ground truth: every vulnerability planted, classified by tier. The evaluator matches pipeline findings against this list using bidirectional keyword overlap (file match required, semantic similarity threshold of 20%).
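A minimal sketch of such a matcher follows. Only the required file match and the 20% threshold come from the description above; the tokenization and the exact overlap formula are assumptions:

```typescript
// Illustrative bidirectional keyword-overlap matcher. Requires an exact
// file match, then checks that keyword overlap in BOTH directions clears
// the similarity threshold (20% per the description above).
function tokens(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Fraction of tokens in `a` that also appear in `b`.
function overlap(a: Set<string>, b: Set<string>): number {
  if (a.size === 0) return 0;
  let shared = 0;
  for (const t of a) if (b.has(t)) shared++;
  return shared / a.size;
}

function matchesGroundTruth(
  finding: { file: string; title: string; description: string },
  gt: { file: string; description: string },
  threshold = 0.2,
): boolean {
  if (finding.file !== gt.file) return false;
  const f = tokens(finding.title + " " + finding.description);
  const g = tokens(gt.description);
  return overlap(f, g) >= threshold && overlap(g, f) >= threshold;
}
```

Requiring the overlap in both directions keeps a one-word finding from matching a long ground-truth description (and vice versa) on a single shared keyword.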
Tier 1 — Critical signal (9 entries)
| ID | File | Vulnerability |
|---|---|---|
| GT-001 | .env.test | Hardcoded JWT secret in version control |
| GT-002 | auth.model.ts | SQL injection via string interpolation |
| GT-003 | auth.controller.ts | Timing-unsafe token comparison |
| GT-004 | auth.service.ts | Math.random() for security tokens |
| GT-005 | auth.service.ts | bcrypt cost factor 4 (brute-forceable) |
| GT-006 | auth.middleware.ts | JWT accepts algorithm "none", ignores expiry |
| GT-007 | auth.controller.ts | No rate limiting on login endpoint |
| GT-008 | auth.controller.ts | User enumeration via error messages |
| GT-009 | auth.controller.ts | No input validation (bcrypt DoS) |
Tier 2 — Important signal (5 entries)
| ID | File | Issue |
|---|---|---|
| GT-010 | auth.test.ts | 8/12 tests are circular (mock-on-mock) |
| GT-011 | auth.test.ts | Spy-only assertions instead of value assertions |
| GT-012 | auth.test.ts | Zero negative and edge case tests |
| GT-013 | auth.controller.ts | Business logic coupled to HTTP handler |
| GT-014 | auth.model.ts | SELECT * returns password hash to all callers |
Industry context
For reference, here is how VCR's results compare to published benchmarks:
| Metric | Industry range | VCR (demo) |
|---|---|---|
| F1 Score | 45–64% (CR-Bench 2026, top tools) | 96% |
| False Positive Rate | 5–15% (Graphite benchmark) | 0% |
| SNR | ~5:1 (best single-shot, SNR framework) | 14:1 |
| Spotify LLM Judge veto rate | ~25% of agent output vetoed | 0% noise (nothing to veto) |
💡 Fair comparison caveat
Sources
- CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents (arxiv:2603.11078)
- Signal-to-Noise Framework for AI Code Review (Jet Xu)
- Triage: Routing SE Tasks to Cost-Effective LLM Tiers via Code Quality Signals (arxiv:2604.07494)
- Spotify: Feedback Loops for Background Coding Agents
- Expected False-Positive Rate from AI Code Review Tools (Graphite)
- AI Code Review Benchmark 2026 (CodeAnt / Martian)