Each layer has its own metrics, measured at its stage. End-to-end metrics align with the
Visdom AI-Native SDLC metrics framework (ITS, CPI, TORS).
Layer 0: Context Collection
| Metric | Target | Measured when |
|---|---|---|
| Context build time | <10s | After Layer 0 completes |
| Knowledge layer cache hit rate | >90% | On each query |
| Context completeness | 100% required fields | Validation of review-context.json |
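Completeness validation can be sketched as a required-field check over the parsed file. The field names below are illustrative, not the actual `review-context.json` schema:

```python
import json

# Assumed required fields for illustration; the real schema is defined by Layer 0.
REQUIRED_FIELDS = ["repo", "pr_number", "changed_files", "knowledge_layers"]

def context_completeness(context: dict) -> float:
    """Fraction of required fields present and non-empty (target: 1.0)."""
    present = [f for f in REQUIRED_FIELDS if context.get(f) not in (None, "", [], {})]
    return len(present) / len(REQUIRED_FIELDS)

raw = '{"repo": "acme/app", "pr_number": 42, "changed_files": ["a.py"], "knowledge_layers": {"conventions": "..."}}'
score = context_completeness(json.loads(raw))  # 1.0: all assumed fields present
```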
Layer 1: Deterministic Gate
| Metric | Target | Measured when |
|---|---|---|
| Gate execution time | <60s | After Layer 1 completes |
| Secret detection recall | 100% (zero false negatives) | Quarterly audit with known secrets |
| SAST findings per PR | Trending down | After each PR |
| Blocking rate | <10% PRs blocked | After each PR |
| TORS | >85% | Computed from test reliability data |
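TORS is the share of observed test failures that turn out to be real regressions. A minimal sketch, assuming each failure record carries a triaged `cause` label (a hypothetical field):

```python
def tors(failures: list[dict]) -> float:
    """Test Oracle Reliability Score: share of test failures that are
    real regressions rather than flaky/environmental noise (target: >0.85)."""
    if not failures:
        return 1.0  # no failures observed, so no unreliability observed
    real = sum(1 for f in failures if f["cause"] == "regression")
    return real / len(failures)

failures = [
    {"test": "test_login", "cause": "regression"},
    {"test": "test_cache", "cause": "flaky"},
    {"test": "test_api", "cause": "regression"},
    {"test": "test_io", "cause": "regression"},
]
score = tors(failures)  # 3 of 4 failures are real -> 0.75, below the 0.85 target
```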
Layer 2: AI Quick Scan
| Metric | Target | Measured when |
|---|---|---|
| Scan time | <2 min | After Layer 2 completes |
| Risk classification accuracy | >85% agreement with human | Comparison: VCR risk vs reviewer judgment |
| Quick findings acceptance rate | >60% | Developer reactions to findings |
| Layer 3 trigger rate | 30-50% of PRs | After classification (too low misses risk; too high wastes deep-review capacity) |
| Token cost per scan | <$0.05 avg | Per invocation |
| AI-code detection precision | >80% | Comparison with known AI-generated PRs |
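The trigger-rate band can be monitored with a simple rollup. The per-PR `escalated` flag is an assumed field, not part of the documented schema:

```python
def trigger_rate(prs: list[dict]) -> float:
    """Share of PRs that Layer 2 escalated to a Layer 3 deep review."""
    return sum(1 for p in prs if p["escalated"]) / len(prs)

def in_healthy_band(rate: float, low: float = 0.30, high: float = 0.50) -> bool:
    # Too low risks missing real issues; too high wastes deep-review budget.
    return low <= rate <= high

# Illustrative data: every third PR gets escalated.
prs = [{"id": i, "escalated": i % 3 == 0} for i in range(100)]
rate = trigger_rate(prs)  # 0.34, inside the 30-50% band
```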
Layer 3: AI Deep Review
| Metric | Target | Measured when |
|---|---|---|
| Deep review time | <10 min | After Layer 3 completes |
| Finding severity distribution | More HIGH/CRITICAL than LOW | If Layer 3 produces mainly LOW findings, the risk classifier is escalating too aggressively |
| False positive rate | <15% | Developer reaction per finding |
| Actionable finding rate | >80% | Finding has concrete fix suggestion |
| Token cost per review | <$2.00 avg | Per invocation |
| Circular test detection rate | Tracked, no target | Per PR with new/modified tests |
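The severity-distribution check reduces to comparing counts. A sketch, assuming findings are labeled with the severity names used above:

```python
from collections import Counter

def severity_skew_ok(severities: list[str]) -> bool:
    """Layer 3 should surface more HIGH/CRITICAL than LOW findings;
    the reverse suggests PRs are being escalated too aggressively."""
    counts = Counter(severities)
    return counts["HIGH"] + counts["CRITICAL"] > counts["LOW"]

healthy = ["CRITICAL", "HIGH", "HIGH", "LOW"]      # skews severe: OK
suspect = ["LOW", "LOW", "LOW", "HIGH"]            # skews trivial: classifier too eager
```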
Reporter
| Metric | Target | Measured when |
|---|---|---|
| Time to first comment | <5 min (L2 only), <15 min (L2+L3) | Timestamp PR open to first VCR comment |
| Comment engagement rate | >50% PRs have reaction | 24h after comment |
| Reviewer guidance accuracy | >70% | Human reviewer confirms focus areas matched |
| Suggested reviewer acceptance | >60% | Was the expertise-based suggestion used? |
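Time-to-first-comment is a timestamp difference checked against a limit that depends on whether Layer 3 ran. A minimal sketch:

```python
from datetime import datetime, timedelta

def time_to_first_comment(pr_opened: datetime, first_vcr_comment: datetime) -> timedelta:
    """Elapsed time from PR open to the first VCR comment."""
    return first_vcr_comment - pr_opened

def meets_target(delta: timedelta, deep_review: bool) -> bool:
    # <5 min when only the quick scan ran, <15 min when Layer 3 was triggered.
    limit = timedelta(minutes=15 if deep_review else 5)
    return delta < limit

delta = time_to_first_comment(
    datetime(2025, 3, 1, 10, 0), datetime(2025, 3, 1, 10, 4)
)  # 4 minutes
```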
Proactive Scanner
| Metric | Target | Measured when |
|---|---|---|
| Scan completion rate | 100% scheduled scans | After each cron run |
| Trend detection lead time | >2 weeks before incident | Comparison: when the scanner flagged the trend vs. when it became an incident |
| Created issues resolution rate | >50% within 30 days | Tracking auto-created issues |
| Convention drift detection | Drift flagged within 2 weeks | Cross-team pattern comparison |
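Lead time is the gap between the scanner flagging a trend and that trend becoming an incident. A minimal sketch using dates:

```python
from datetime import date

def lead_time_days(flagged: date, incident: date) -> int:
    """Days of warning the scanner gave before the incident
    (target: >14, i.e. more than two weeks)."""
    return (incident - flagged).days

lead = lead_time_days(date(2025, 3, 1), date(2025, 3, 20))  # 19 days of lead time
```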
End-to-End: Visdom SDLC Metrics Integration
📦 Visdom AI-Native SDLC
These metrics connect VCR to the broader Visdom AI-Native SDLC framework described in the blog series.
ITS, CPI, and TORS are the three core metrics for measuring AI agent effectiveness across the software delivery lifecycle.
| Metric | Definition | Target | Connection to Visdom SDLC |
|---|---|---|---|
| ITS (Iterations-to-Success) | Iterations from task assignment to passing CI | 1-3 healthy, 5-10 warning, 20+ structural failure | VCR reduces ITS by filtering flaky tests (TORS) and providing early feedback before agent iterates |
| CPI (Cost-per-Iteration) | Tokens + compute + CI + review per iteration | Trending down | VCR reduces review component of CPI; TORS reduces wasted iterations |
| TORS (Test Oracle Reliability Score) | % of test failures that are real regressions | >85% | Directly measured by Layer 1; feeds into Layer 2 risk classification |
| Escaped defects | Bugs in production in areas covered by VCR | Trending down | Primary outcome metric |
| 4x Hidden Tax visibility | License + compute + tokens + review breakdown | Fully tracked | VCR dashboard provides real-time cost breakdown |
| Senior review time | Time seniors spend on code review | -30% vs baseline | VCR pre-annotates PRs, focuses reviewer attention |
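The ITS bands and the CPI definition above translate directly into code. Folding the unspecified ranges (4 and 11-19 iterations) into "warning" is an assumption of this sketch:

```python
def its_band(iterations: int) -> str:
    """Map Iterations-to-Success onto the Visdom bands
    (1-3 healthy, 5-10 warning, 20+ structural failure)."""
    if iterations <= 3:
        return "healthy"
    if iterations < 20:
        return "warning"  # assumption: gaps at 4 and 11-19 fold into "warning"
    return "structural failure"

def cpi(token_cost: float, compute_cost: float, ci_cost: float,
        review_cost: float, iterations: int) -> float:
    """Cost-per-Iteration: total spend (tokens + compute + CI + review)
    divided by iterations taken; the target is a downward trend."""
    return (token_cost + compute_cost + ci_cost + review_cost) / iterations
```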
Feedback Mechanism
Each VCR finding supports developer reactions that feed back into per-layer metrics:
| Reaction | Meaning |
|---|---|
| 👍 | Finding was helpful, fixed it |
| 👎 | False positive / not relevant |
| 🤔 | Not sure, needs discussion |
✅ Feedback loop maturity
During pilot, VirtusLab analyzes reactions manually. In mature deployments, reactions inform
prompt tuning and risk classifier calibration automatically.
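Rolling reactions up into the per-layer rates is a straightforward aggregation; the targets in the comments refer to the Layer 2 and Layer 3 tables above:

```python
from collections import Counter

def reaction_metrics(reactions: list[str]) -> dict:
    """Aggregate per-finding reactions into the per-layer metrics:
    👍 feeds acceptance rate, 👎 feeds false positive rate."""
    counts = Counter(reactions)
    total = len(reactions) or 1  # avoid division by zero on empty input
    return {
        "acceptance_rate": counts["👍"] / total,       # Layer 2 target: >0.60
        "false_positive_rate": counts["👎"] / total,   # Layer 3 target: <0.15
        "needs_discussion": counts["🤔"] / total,
    }

metrics = reaction_metrics(["👍", "👍", "👍", "👎", "🤔"])
# acceptance_rate 0.6, false_positive_rate 0.2, needs_discussion 0.2
```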