
Evaluation Methodology

How VCR measures its own effectiveness — precision, recall, signal-to-noise, and cost efficiency against known ground truth.

Most AI code review tools report how many findings they generate. VCR measures whether those findings are correct, complete, and worth the cost. This page documents the evaluation methodology used by npm run demo:triage.

💡 Run it yourself

git clone https://github.com/VirtusLab/visdom-code-review && cd visdom-code-review/demo && npm install && npm run demo:triage

Why evaluate?

A code review tool that generates 50 comments per PR with 80% noise is worse than no tool at all. Developer trust begins to erode at around a 15% false positive rate; beyond that threshold, teams ignore all findings — including the real ones.

The Cry Wolf Effect: 50 comments = 40 noise + 10 real, and 0 acted on. Noise destroys signal.

VCR's design philosophy is precision over recall — better to miss a LOW finding than erode trust with false positives. The evaluation methodology quantifies whether we deliver on that promise.

Framework overview

The evaluation combines three established approaches:

1. CR-Bench Classification (arxiv:2603.11078)

Every finding is classified against ground truth into one of three categories:

  • Bug Hit — correctly identifies a known vulnerability
  • Valid Suggestion — technically sound and actionable, but not in ground truth
  • Noise — incorrect, hallucinated, or not actionable
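The three categories can be modeled as a small classifier. This is an illustrative sketch, not VCR's actual types; the field names (`matchesGroundTruth`, `actionable`) are assumptions:

```typescript
// Hypothetical sketch of CR-Bench-style classification; names are illustrative.
type FindingClass = "bug_hit" | "valid_suggestion" | "noise";

interface Finding {
  file: string;
  title: string;
  matchesGroundTruth: boolean; // set by the ground-truth matcher
  actionable: boolean;         // would a developer act on this finding?
}

function classify(f: Finding): FindingClass {
  if (f.matchesGroundTruth) return "bug_hit";  // known planted vulnerability
  if (f.actionable) return "valid_suggestion"; // sound, but not in ground truth
  return "noise";                              // incorrect or not actionable
}
```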

2. Signal-to-Noise Ratio (SNR Framework, Jet Xu)

Findings are tiered by impact, and the ratio of signal to noise determines developer trust:

  • Tier 1 — Critical signal: security vulnerabilities, runtime errors, breaking changes
  • Tier 2 — Important signal: architecture issues, test quality, maintainability risks
  • Noise — style preferences, micro-optimizations, hallucinated issues

3. Cost-Quality Tradeoff (Triage Framework, arxiv:2604.07494 & Spotify Verification Loop)

Per-layer cost efficiency determines whether the layered architecture delivers on its economic promise:

  • Layer 1 (deterministic) should catch maximum findings at $0
  • Layer 2 (Haiku) should triage correctly for ~$0.02
  • Layer 3 (Sonnet) should justify its ~$0.40 cost with unique deep findings

Metrics computed

Metric                Formula                      What it measures                               Target
Precision             bug_hits / total_findings    How often findings are real                    ≥80%
Recall                matched_GT / total_GT        How many known bugs were found                 ≥80%
F1 Score              2 × (P × R) / (P + R)        Harmonic mean — penalizes imbalance            ≥70%
Usefulness Rate       (hits + valid) / total       Findings a developer would act on (CR-Bench)   ≥80%
Signal Ratio          (T1 + T2) / total            Share of findings that matter                  ≥80%
SNR                   (hits + valid) / noise       Signal-to-noise ratio                          ≥5:1
False Positive Rate   noise / total                Developer trust erosion risk                   ≤5%
Cost per Bug Hit      total_cost / bug_hits        Economic efficiency of detection               <$0.10
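The formulas above can be sketched directly from the raw counts. This is an illustrative implementation, not VCR's actual code; in particular, flooring the noise denominator at 1 to keep SNR finite when there are zero noise findings is an assumption, not documented behavior:

```typescript
// Sketch of the metric formulas from the table above (names illustrative).
interface Counts {
  bugHits: number;
  validSuggestions: number;
  noise: number;
  matchedGroundTruth: number;
  groundTruthTotal: number;
  totalCostUsd: number;
}

function computeMetrics(c: Counts) {
  const total = c.bugHits + c.validSuggestions + c.noise;
  const signal = c.bugHits + c.validSuggestions;
  return {
    precision: c.bugHits / total,
    recall: c.matchedGroundTruth / c.groundTruthTotal,
    usefulnessRate: signal / total,
    falsePositiveRate: c.noise / total,
    // Assumption: noise floored at 1 so SNR stays finite when noise is 0.
    snr: signal / Math.max(c.noise, 1),
    costPerBugHit: c.totalCostUsd / c.bugHits,
  };
}

// F1 as the harmonic mean of precision and recall, as in the table.
const f1 = (p: number, r: number) => (2 * p * r) / (p + r);
```

Feeding in the demo counts (14 bug hits, 0 noise, 13 of 14 ground-truth entries matched, $0.44 total) reproduces the headline numbers reported below.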

Evaluation pipeline

1. Define ground truth

Each scenario declares every planted vulnerability with tier classification (T1/T2), file location, and description.

2. Run VCR pipeline

All 4 layers execute. Findings are collected with metadata: layer, severity, confidence, file, line.

3. Classify findings

Each finding is matched against ground truth using bidirectional keyword overlap on file + title + description. Classified as Bug Hit, Valid Suggestion, or Noise.
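The bidirectional keyword overlap can be sketched as follows. The tokenizer and scoring here are assumptions for illustration; only the file-match requirement and the 20% threshold come from this page:

```typescript
// Illustrative sketch of bidirectional keyword overlap; VCR's actual
// tokenizer may differ. The 20% threshold is from the ground-truth section.
const tokenize = (s: string): Set<string> =>
  new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter(w => w.length > 2));

function overlapScore(findingText: string, groundTruthText: string): number {
  const a = tokenize(findingText);
  const b = tokenize(groundTruthText);
  let shared = 0;
  for (const w of a) if (b.has(w)) shared++;
  // Bidirectional: take the better of the two directions, so a short
  // description is not penalized for being a subset of a longer one.
  return Math.max(shared / a.size, shared / b.size);
}

function matches(findingFile: string, gtFile: string, score: number): boolean {
  return findingFile === gtFile && score >= 0.2; // file match required, 20% threshold
}
```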

4. Compute metrics

Precision, Recall, F1, Usefulness Rate, SNR, FPR, and cost metrics. Per-layer breakdown shows which layers contribute what.

5. Identify gaps

Ground truth entries without a matching finding are surfaced as missed vulnerabilities — the recall gap.
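Step 5 amounts to a set difference. A minimal sketch, assuming each ground-truth entry carries an `id` and the matcher records the IDs it has claimed:

```typescript
// Sketch: ground-truth entries with no matching finding form the recall gap.
interface GroundTruthEntry {
  id: string;
  file: string;
  description: string;
}

function recallGap(
  groundTruth: GroundTruthEntry[],
  matchedIds: Set<string>,
): GroundTruthEntry[] {
  return groundTruth.filter(gt => !matchedIds.has(gt.id));
}
```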

Current results: "The Perfect PR"

The demo scenario plants 14 known vulnerabilities (9 Tier-1 security, 5 Tier-2 architecture/test-quality) in a deliberately flawed authentication service. Here are the evaluation results:

  • Precision: 100% (0 false positives)
  • Recall: 93% (13 of 14 ground-truth entries matched)
  • F1 Score: 96% (harmonic mean of precision and recall)
  • SNR: 14:1 (excellent; target ≥5:1)
  • False Positive Rate: 0% (zero noise)
  • Cost per Bug Hit: $0.032 ($0.44 total run cost)

Per-layer contribution

  • L1 Deterministic — 4 hits at $0.00
  • L2 AI Quick Scan — 3 hits at $0.02
  • L3 AI Deep Review — 7 hits at $0.42

Layer 1 catches 4 findings at zero cost. Layer 2 adds 3 findings and makes the gate decision for $0.02. Layer 3 adds 7 unique findings that only deep analysis with repository context can find — justifying its $0.42 cost.
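The per-layer economics above can be aggregated in a few lines. A sketch using the demo numbers (the structure is illustrative, not VCR's actual reporting code):

```typescript
// Sketch: per-layer cost efficiency from the demo numbers (names illustrative).
interface LayerResult {
  layer: string;
  hits: number;
  costUsd: number;
}

const layers: LayerResult[] = [
  { layer: "L1 Deterministic",  hits: 4, costUsd: 0.00 },
  { layer: "L2 AI Quick Scan",  hits: 3, costUsd: 0.02 },
  { layer: "L3 AI Deep Review", hits: 7, costUsd: 0.42 },
];

const totalHits = layers.reduce((n, l) => n + l.hits, 0);   // 14 hits
const totalCost = layers.reduce((n, l) => n + l.costUsd, 0); // ≈ $0.44
const costPerHit = totalCost / totalHits; // well under the $0.10 target
```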

Ground truth definition

Each scenario declares its ground truth: every vulnerability planted, classified by tier. The evaluator matches pipeline findings against this list using bidirectional keyword overlap (file match required, semantic similarity threshold of 20%).

Tier 1 — Critical signal (9 entries)

ID      File                Vulnerability
GT-001  .env.test           Hardcoded JWT secret in version control
GT-002  auth.model.ts       SQL injection via string interpolation
GT-003  auth.controller.ts  Timing-unsafe token comparison
GT-004  auth.service.ts     Math.random() for security tokens
GT-005  auth.service.ts     bcrypt cost factor 4 (brute-forceable)
GT-006  auth.middleware.ts  JWT accepts algorithm "none", ignores expiry
GT-007  auth.controller.ts  No rate limiting on login endpoint
GT-008  auth.controller.ts  User enumeration via error messages
GT-009  auth.controller.ts  No input validation (bcrypt DoS)

Tier 2 — Important signal (5 entries)

ID      File                Issue
GT-010  auth.test.ts        8/12 tests are circular (mock-on-mock)
GT-011  auth.test.ts        Spy-only assertions instead of value assertions
GT-012  auth.test.ts        Zero negative and edge case tests
GT-013  auth.controller.ts  Business logic coupled to HTTP handler
GT-014  auth.model.ts       SELECT * returns password hash to all callers
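A scenario's ground-truth declaration might look like the following sketch. The field names are illustrative, not VCR's actual schema:

```typescript
// Hypothetical ground-truth declaration for the demo scenario.
type Tier = 1 | 2;

interface GroundTruth {
  id: string;
  tier: Tier;
  file: string;
  description: string;
}

const groundTruth: GroundTruth[] = [
  { id: "GT-001", tier: 1, file: ".env.test",
    description: "Hardcoded JWT secret in version control" },
  { id: "GT-002", tier: 1, file: "auth.model.ts",
    description: "SQL injection via string interpolation" },
  { id: "GT-010", tier: 2, file: "auth.test.ts",
    description: "8/12 tests are circular (mock-on-mock)" },
  // ...remaining entries follow the same shape
];
```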

Industry context

For reference, here is how VCR's results compare to published benchmarks:

Metric                        Industry range                           VCR (demo)
F1 Score                      45–64% (CR-Bench 2026, top tools)        96%
False Positive Rate           5–15% (Graphite benchmark)               0%
SNR                           ~5:1 (best single-shot, SNR framework)   14:1
Spotify LLM Judge veto rate   ~25% of agent output vetoed              0% noise (nothing to veto)

💡 Fair comparison caveat

VCR's demo results are on a prepared scenario with known ground truth, not arbitrary open-source PRs. The industry benchmarks above are measured on real-world PRs with diverse codebases. VCR's production results will be lower than demo results — but the evaluation methodology itself is directly applicable to production deployments.

Sources