
Evaluation Methodology

How VCR measures its own effectiveness — precision, recall, signal-to-noise, and cost efficiency against known ground truth.

Most AI code review tools report how many findings they generate. VCR measures whether those findings are correct, complete, and worth the cost. This page documents the evaluation methodology used by npm run demo:triage.

💡 Run it yourself

git clone https://github.com/VirtusLab/visdom-code-review && cd visdom-code-review/demo && npm install && npm run demo:triage

Why evaluate?

A code review tool that generates 50 comments per PR with 80% noise is worse than no tool at all. Developer trust begins to erode at around a 15% false positive rate; beyond that threshold, teams ignore all findings — including the real ones.

The Cry Wolf Effect: 50 comments = 40 noise + 10 real, and 0 acted on. Noise destroys signal.

VCR's design philosophy is precision over recall — better to miss a LOW finding than erode trust with false positives. The evaluation methodology quantifies whether we deliver on that promise.

Framework overview

The evaluation combines three established approaches:

1. CR-Bench Classification (arxiv:2603.11078)

Every finding is classified against ground truth into one of three categories:

  • Bug Hit — correctly identifies a known vulnerability
  • Valid Suggestion — technically sound and actionable, but not in ground truth
  • Noise — incorrect, hallucinated, or not actionable
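The three categories can be modeled as a small classifier. This is an illustrative sketch, not VCR's actual types; the field names (`matchesGroundTruth`, `actionable`) are assumptions:

```typescript
// Hypothetical sketch of CR-Bench-style classification; names are illustrative.
type FindingClass = "bug_hit" | "valid_suggestion" | "noise";

interface Finding {
  file: string;
  title: string;
  matchesGroundTruth: boolean; // set by the ground-truth matcher
  actionable: boolean;         // would a developer act on this finding?
}

function classify(f: Finding): FindingClass {
  if (f.matchesGroundTruth) return "bug_hit";  // known planted vulnerability
  if (f.actionable) return "valid_suggestion"; // sound, but not in ground truth
  return "noise";                              // incorrect or not actionable
}
```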

2. Signal-to-Noise Ratio (SNR Framework, Jet Xu)

Findings are tiered by impact, and the ratio of signal to noise determines developer trust:

  • Tier 1 — Critical signal: security vulnerabilities, runtime errors, breaking changes
  • Tier 2 — Important signal: architecture issues, test quality, maintainability risks
  • Noise — style preferences, micro-optimizations, hallucinated issues

3. Cost-Quality Tradeoff (Triage Framework, arxiv:2604.07494 & Spotify Verification Loop)

Per-layer cost efficiency determines whether the layered architecture delivers on its economic promise:

  • Layer 1 (deterministic) should catch maximum findings at $0
  • Layer 2 (Haiku) should triage correctly for ~$0.02
  • Layer 3 (Sonnet) should justify its ~$0.40 cost with unique deep findings

Metrics computed

Metric                Formula                      What it measures                               Target
Precision             bug_hits / total_findings    How often findings are real                    ≥80%
Recall                matched_GT / total_GT        How many known bugs were found                 ≥80%
F1 Score              2 × (P × R) / (P + R)        Harmonic mean — penalizes imbalance            ≥70%
Usefulness Rate       (hits + valid) / total       Findings a developer would act on (CR-Bench)   ≥80%
Signal Ratio          (T1 + T2) / total            Share of findings that matter                  ≥80%
SNR                   (hits + valid) / noise       Signal-to-noise ratio                          ≥5:1
False Positive Rate   noise / total                Developer trust erosion risk                   ≤5%
Cost per Bug Hit      total_cost / bug_hits        Economic efficiency of detection               <$0.10
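The formulas above can be sketched directly from the raw counts. This is an illustrative implementation, not VCR's actual code; in particular, flooring the noise denominator at 1 to keep SNR finite when there are zero noise findings is an assumption, not documented behavior:

```typescript
// Sketch of the metric formulas from the table above (names illustrative).
interface Counts {
  bugHits: number;
  validSuggestions: number;
  noise: number;
  matchedGroundTruth: number;
  groundTruthTotal: number;
  totalCostUsd: number;
}

function computeMetrics(c: Counts) {
  const total = c.bugHits + c.validSuggestions + c.noise;
  const signal = c.bugHits + c.validSuggestions;
  return {
    precision: c.bugHits / total,
    recall: c.matchedGroundTruth / c.groundTruthTotal,
    usefulnessRate: signal / total,
    falsePositiveRate: c.noise / total,
    // Assumption: noise floored at 1 so SNR stays finite when noise is 0.
    snr: signal / Math.max(c.noise, 1),
    costPerBugHit: c.totalCostUsd / c.bugHits,
  };
}

// F1 as the harmonic mean of precision and recall, as in the table.
const f1 = (p: number, r: number) => (2 * p * r) / (p + r);
```

Feeding in the demo counts (14 bug hits, 0 noise, 13 of 14 ground-truth entries matched, $0.44 total) reproduces the headline numbers reported below.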

Evaluation pipeline

1. Define ground truth

Each scenario declares every planted vulnerability with tier classification (T1/T2), file location, and description.

2. Run VCR pipeline

All 4 layers execute. Findings are collected with metadata: layer, severity, confidence, file, line.

3. Classify findings

Each finding is matched against ground truth using bidirectional keyword overlap on file + title + description. Classified as Bug Hit, Valid Suggestion, or Noise.
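The bidirectional keyword overlap can be sketched as follows. The tokenizer and scoring here are assumptions for illustration; only the file-match requirement and the 20% threshold come from this page:

```typescript
// Illustrative sketch of bidirectional keyword overlap; VCR's actual
// tokenizer may differ. The 20% threshold is from the ground-truth section.
const tokenize = (s: string): Set<string> =>
  new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter(w => w.length > 2));

function overlapScore(findingText: string, groundTruthText: string): number {
  const a = tokenize(findingText);
  const b = tokenize(groundTruthText);
  let shared = 0;
  for (const w of a) if (b.has(w)) shared++;
  // Bidirectional: take the better of the two directions, so a short
  // description is not penalized for being a subset of a longer one.
  return Math.max(shared / a.size, shared / b.size);
}

function matches(findingFile: string, gtFile: string, score: number): boolean {
  return findingFile === gtFile && score >= 0.2; // file match required, 20% threshold
}
```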

4. Compute metrics

Precision, Recall, F1, Usefulness Rate, SNR, FPR, and cost metrics. Per-layer breakdown shows which layers contribute what.

5. Identify gaps

Ground truth entries without a matching finding are surfaced as missed vulnerabilities — the recall gap.
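Step 5 amounts to a set difference. A minimal sketch, assuming each ground-truth entry carries an `id` and the matcher records the IDs it has claimed:

```typescript
// Sketch: ground-truth entries with no matching finding form the recall gap.
interface GroundTruthEntry {
  id: string;
  file: string;
  description: string;
}

function recallGap(
  groundTruth: GroundTruthEntry[],
  matchedIds: Set<string>,
): GroundTruthEntry[] {
  return groundTruth.filter(gt => !matchedIds.has(gt.id));
}
```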

Current results: "The Perfect PR"

The demo scenario plants 14 known vulnerabilities (9 Tier-1 security, 5 Tier-2 architecture/test-quality) in a deliberately flawed authentication service. Here are the evaluation results:

  • Precision: 100% (0 false positives)
  • Recall: 93% (13 of 14 ground-truth entries matched)
  • F1 Score: 96% (harmonic mean of precision and recall)
  • SNR: 14:1 (excellent; target ≥5:1)
  • False Positive Rate: 0% (zero noise)
  • Cost per Bug Hit: $0.032 ($0.44 total run cost)

Per-layer contribution

  • L1 Deterministic — 4 hits at $0.00
  • L2 AI Quick Scan — 3 hits at $0.02
  • L3 AI Deep Review — 7 hits at $0.42

Layer 1 catches 4 findings at zero cost. Layer 2 adds 3 findings and makes the gate decision for $0.02. Layer 3 adds 7 unique findings that only deep analysis with repository context can find — justifying its $0.42 cost.
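The per-layer economics above can be aggregated in a few lines. A sketch using the demo numbers (the structure is illustrative, not VCR's actual reporting code):

```typescript
// Sketch: per-layer cost efficiency from the demo numbers (names illustrative).
interface LayerResult {
  layer: string;
  hits: number;
  costUsd: number;
}

const layers: LayerResult[] = [
  { layer: "L1 Deterministic",  hits: 4, costUsd: 0.00 },
  { layer: "L2 AI Quick Scan",  hits: 3, costUsd: 0.02 },
  { layer: "L3 AI Deep Review", hits: 7, costUsd: 0.42 },
];

const totalHits = layers.reduce((n, l) => n + l.hits, 0);   // 14 hits
const totalCost = layers.reduce((n, l) => n + l.costUsd, 0); // ≈ $0.44
const costPerHit = totalCost / totalHits; // well under the $0.10 target
```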

Ground truth definition

Each scenario declares its ground truth: every vulnerability planted, classified by tier. The evaluator matches pipeline findings against this list using bidirectional keyword overlap (file match required, semantic similarity threshold of 20%).

Tier 1 — Critical signal (9 entries)

ID      File                Vulnerability
GT-001  .env.test           Hardcoded JWT secret in version control
GT-002  auth.model.ts       SQL injection via string interpolation
GT-003  auth.controller.ts  Timing-unsafe token comparison
GT-004  auth.service.ts     Math.random() for security tokens
GT-005  auth.service.ts     bcrypt cost factor 4 (brute-forceable)
GT-006  auth.middleware.ts  JWT accepts algorithm "none", ignores expiry
GT-007  auth.controller.ts  No rate limiting on login endpoint
GT-008  auth.controller.ts  User enumeration via error messages
GT-009  auth.controller.ts  No input validation (bcrypt DoS)

Tier 2 — Important signal (5 entries)

ID      File                Issue
GT-010  auth.test.ts        8/12 tests are circular (mock-on-mock)
GT-011  auth.test.ts        Spy-only assertions instead of value assertions
GT-012  auth.test.ts        Zero negative and edge case tests
GT-013  auth.controller.ts  Business logic coupled to HTTP handler
GT-014  auth.model.ts       SELECT * returns password hash to all callers
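A scenario's ground-truth declaration might look like the following sketch. The field names are illustrative, not VCR's actual schema:

```typescript
// Hypothetical ground-truth declaration for the demo scenario.
type Tier = 1 | 2;

interface GroundTruth {
  id: string;
  tier: Tier;
  file: string;
  description: string;
}

const groundTruth: GroundTruth[] = [
  { id: "GT-001", tier: 1, file: ".env.test",
    description: "Hardcoded JWT secret in version control" },
  { id: "GT-002", tier: 1, file: "auth.model.ts",
    description: "SQL injection via string interpolation" },
  { id: "GT-010", tier: 2, file: "auth.test.ts",
    description: "8/12 tests are circular (mock-on-mock)" },
  // ...remaining entries follow the same shape
];
```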

Industry context

For reference, here is how VCR's results compare to published benchmarks:

Metric                        Industry range                           VCR (demo)
F1 Score                      45–64% (CR-Bench 2026, top tools)        96%
False Positive Rate           5–15% (Graphite benchmark)               0%
SNR                           ~5:1 (best single-shot, SNR framework)   14:1
Spotify LLM Judge veto rate   ~25% of agent output vetoed              0% noise (nothing to veto)

💡 Fair comparison caveat

VCR's demo results are on a prepared scenario with known ground truth, not arbitrary open-source PRs. The industry benchmarks above are measured on real-world PRs with diverse codebases. VCR's production results will be lower than demo results — but the evaluation methodology itself is directly applicable to production deployments.

Sources