
Metrics Framework

Per-layer metrics with concrete targets, plus end-to-end Visdom SDLC integration.

Each layer has its own metrics, measured at its stage. End-to-end metrics align with the Visdom AI-Native SDLC metrics framework (ITS, CPI, TORS).

Layer 0: Context Collection

| Metric | Target | Measured when |
| --- | --- | --- |
| Context build time | <10 s | After Layer 0 completes |
| Knowledge layer cache hit rate | >90% | On each query |
| Context completeness | 100% of required fields | Validation of `review-context.json` |

Layer 1: Deterministic Gate

| Metric | Target | Measured when |
| --- | --- | --- |
| Gate execution time | <60 s | After Layer 1 completes |
| Secret detection recall | 100% (zero false negatives) | Quarterly audit with known secrets |
| SAST findings per PR | Trending down | After each PR |
| Blocking rate | <10% of PRs blocked | After each PR |
| TORS | >85% | Computed from test reliability data |
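
TORS is defined later in this document as the percentage of test failures that are real regressions. A minimal sketch of that computation, assuming failure records already carry a root-cause label (the `cause` field name is illustrative):

```python
def tors(failures: list[dict]) -> float:
    """Test Oracle Reliability Score: share of CI test failures that are
    real regressions rather than flaky or environmental noise."""
    if not failures:
        return 1.0  # no failures observed, so the oracle was never contradicted
    real = sum(1 for f in failures if f["cause"] == "regression")
    return real / len(failures)

# 9 real regressions out of 10 failures -> 90%, which meets the >85% target
sample = [{"cause": "regression"}] * 9 + [{"cause": "flaky"}]
print(f"TORS: {tors(sample):.0%}")  # prints "TORS: 90%"
```

How root causes are labeled (manual triage, retry-based flake detection, etc.) is deployment-specific and not prescribed here.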

Layer 2: AI Quick Scan

| Metric | Target | Measured when |
| --- | --- | --- |
| Scan time | <2 min | After Layer 2 completes |
| Risk classification accuracy | >85% agreement with human reviewers | VCR risk label vs. reviewer judgment |
| Quick-findings acceptance rate | >60% | Developer reactions on findings |
| Layer 3 trigger rate | 30-50% of PRs | After classification (too low misses risk; too high wastes tokens) |
| Token cost per scan | <$0.05 avg | Per invocation |
| AI-code detection precision | >80% | Comparison against known AI-generated PRs |
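
The 30-50% trigger-rate band lends itself to a simple health check. A sketch, with function and message wording being illustrative rather than part of the framework:

```python
def trigger_rate_status(deep_reviews: int, total_prs: int) -> str:
    """Check the Layer 3 trigger rate against the 30-50% healthy band."""
    rate = deep_reviews / total_prs
    if rate < 0.30:
        return "too low: risky PRs may be skipping deep review"
    if rate > 0.50:
        return "too high: tokens wasted on low-risk PRs"
    return "healthy"

print(trigger_rate_status(40, 100))  # 40% of PRs triggered Layer 3 -> "healthy"
```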

Layer 3: AI Deep Review

| Metric | Target | Measured when |
| --- | --- | --- |
| Deep review time | <10 min | After Layer 3 completes |
| Finding severity distribution | More HIGH/CRITICAL than LOW | Mostly LOW findings indicate the risk classifier triggers too aggressively |
| False positive rate | <15% | Developer reaction per finding |
| Actionable finding rate | >80% | Finding includes a concrete fix suggestion |
| Token cost per review | <$2.00 avg | Per invocation |
| Circular test detection rate | Tracked, no target | Per PR with new or modified tests |

Reporter

| Metric | Target | Measured when |
| --- | --- | --- |
| Time to first comment | <5 min (L2 only), <15 min (L2+L3) | From PR open to first VCR comment |
| Comment engagement rate | >50% of PRs | Reaction present 24 h after the comment |
| Reviewer guidance accuracy | >70% | Human reviewer confirms the suggested focus areas matched |
| Suggested reviewer acceptance | >60% | Whether the expertise-based suggestion was used |

Proactive Scanner

| Metric | Target | Measured when |
| --- | --- | --- |
| Scan completion rate | 100% of scheduled scans | After each cron run |
| Trend detection lead time | >2 weeks before the problem surfaces | When the scanner flagged it vs. when it became an incident |
| Created-issues resolution rate | >50% within 30 days | Tracking of auto-created issues |
| Convention drift detection | Drift flagged within 2 weeks | Cross-team pattern comparison |
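
The lead-time comparison above reduces to a timestamp difference. A minimal sketch, assuming flag and incident timestamps are available (the variable names are illustrative):

```python
from datetime import datetime, timedelta

def lead_time_met(flagged_at: datetime, incident_at: datetime) -> bool:
    """True if the scanner flagged the trend more than two weeks
    before it became an incident (the >2 weeks target above)."""
    return incident_at - flagged_at > timedelta(weeks=2)

flagged = datetime(2025, 1, 1)
incident = datetime(2025, 1, 20)
print(lead_time_met(flagged, incident))  # 19 days of lead time -> True
```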

End-to-End: Visdom SDLC Metrics Integration

📦 Visdom AI-Native SDLC

These metrics connect VCR to the broader Visdom AI-Native SDLC framework described in the blog series. ITS, CPI, and TORS are the three core metrics for measuring AI agent effectiveness across the software delivery lifecycle.

| Metric | Definition | Target | Connection to Visdom SDLC |
| --- | --- | --- | --- |
| ITS (Iterations-to-Success) | Iterations from task assignment to passing CI | 1-3 healthy, 5-10 warning, 20+ structural failure | VCR reduces ITS by filtering flaky tests (TORS) and providing early feedback before the agent iterates |
| CPI (Cost-per-Iteration) | Tokens + compute + CI + review per iteration | Trending down | VCR reduces the review component of CPI; TORS reduces wasted iterations |
| TORS (Test Oracle Reliability Score) | % of test failures that are real regressions | >85% | Directly measured by Layer 1; feeds into Layer 2 risk classification |
| Escaped defects | Bugs in production in areas covered by VCR | Trending down | Primary outcome metric |
| 4x Hidden Tax visibility | License + compute + tokens + review breakdown | Fully tracked | VCR dashboard provides a real-time cost breakdown |
| Senior review time | Time seniors spend on code review | -30% vs. baseline | VCR pre-annotates PRs and focuses reviewer attention |
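
The ITS bands can be turned into a classifier for dashboards or alerts. Note the published bands (1-3, 5-10, 20+) leave gaps at 4 and 11-19; mapping those in-between values to the nearest worse band is an assumption in this sketch, not part of the framework:

```python
def its_band(iterations: int) -> str:
    """Classify Iterations-to-Success into the Visdom bands.
    Gap values (4, 11-19) are treated as the nearest worse band
    -- an assumption, since the source only defines 1-3 / 5-10 / 20+."""
    if iterations <= 3:
        return "healthy"
    if iterations < 20:
        return "warning"
    return "structural failure"
```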

Feedback Mechanism

Each VCR finding supports developer reactions that feed back into per-layer metrics:

| Reaction | Meaning |
| --- | --- |
| 👍 | Finding was helpful; fixed it |
| 👎 | False positive / not relevant |
| 🤔 | Not sure; needs discussion |
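
These reactions feed the Layer 3 false positive rate (target <15%). A sketch of that aggregation, where `"up"`, `"down"`, and `"unsure"` stand for 👍, 👎, and 🤔; excluding 🤔 from the denominator is an assumption, since the framework does not specify how undecided reactions count:

```python
from collections import Counter

def false_positive_rate(reactions: list[str]) -> float:
    """Share of reacted-on findings marked as false positives (down).
    'unsure' reactions are excluded from the denominator -- an assumption."""
    counts = Counter(reactions)
    decided = counts["up"] + counts["down"]
    return counts["down"] / decided if decided else 0.0

reactions = ["up"] * 8 + ["down"] * 1 + ["unsure"] * 2
print(f"{false_positive_rate(reactions):.0%}")  # 1/9 -> 11%, within the <15% target
```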

Feedback loop maturity

During the pilot, VirtusLab analyzes reactions manually. In mature deployments, reactions automatically inform prompt tuning and risk-classifier calibration.