Each layer has its own metrics, measured at its stage. End-to-end metrics align with the
Visdom AI-Native SDLC metrics framework (ITS, CPI, TORS).
Layer 0: Context Collection
| Metric | Target | Measured when |
|---|---|---|
| Context build time | <10s | After Layer 0 completes |
| Knowledge layer cache hit rate | >90% | On each query |
| Context completeness | 100% required fields | Validation of review-context.json |
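Completeness validation can be sketched as a required-field check over the parsed file. The field names below are illustrative, not the actual `review-context.json` schema:

```python
import json

# Assumed required fields for illustration; the real schema is defined by Layer 0.
REQUIRED_FIELDS = ["repo", "pr_number", "changed_files", "knowledge_layers"]

def context_completeness(context: dict) -> float:
    """Fraction of required fields present and non-empty (target: 1.0)."""
    present = [f for f in REQUIRED_FIELDS if context.get(f) not in (None, "", [], {})]
    return len(present) / len(REQUIRED_FIELDS)

raw = '{"repo": "acme/app", "pr_number": 42, "changed_files": ["a.py"], "knowledge_layers": {"conventions": "..."}}'
score = context_completeness(json.loads(raw))  # 1.0: all assumed fields present
```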
Layer 1: Deterministic Gate
| Metric | Target | Measured when |
|---|---|---|
| Gate execution time | <60s | After Layer 1 completes |
| Secret detection recall | 100% (zero false negatives) | Quarterly audit with known secrets |
| SAST findings per PR | Trending down | After each PR |
| Blocking rate | <10% PRs blocked | After each PR |
| TORS | >85% | Computed from test reliability data |
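TORS is the share of observed test failures that turn out to be real regressions. A minimal sketch, assuming each failure record carries a triaged `cause` label (a hypothetical field):

```python
def tors(failures: list[dict]) -> float:
    """Test Oracle Reliability Score: share of test failures that are
    real regressions rather than flaky/environmental noise (target: >0.85)."""
    if not failures:
        return 1.0  # no failures observed, so no unreliability observed
    real = sum(1 for f in failures if f["cause"] == "regression")
    return real / len(failures)

failures = [
    {"test": "test_login", "cause": "regression"},
    {"test": "test_cache", "cause": "flaky"},
    {"test": "test_api", "cause": "regression"},
    {"test": "test_io", "cause": "regression"},
]
score = tors(failures)  # 3 of 4 failures are real -> 0.75, below the 0.85 target
```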
Layer 2: AI Quick Scan
| Metric | Target | Measured when |
|---|---|---|
| Scan time | <2 min | After Layer 2 completes |
| Risk classification accuracy | >85% agreement with human | Comparison: VCR risk vs reviewer judgment |
| Quick findings acceptance rate | >60% | Developer reactions to findings |
| Layer 3 trigger rate | 30-50% of PRs | After classification (too low misses risk; too high wastes deep-review capacity) |
| Token cost per scan | <$0.05 avg | Per invocation |
| AI-code detection precision | >80% | Comparison with known AI-generated PRs |
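The trigger-rate band can be monitored with a simple rollup. The per-PR `escalated` flag is an assumed field, not part of the documented schema:

```python
def trigger_rate(prs: list[dict]) -> float:
    """Share of PRs that Layer 2 escalated to a Layer 3 deep review."""
    return sum(1 for p in prs if p["escalated"]) / len(prs)

def in_healthy_band(rate: float, low: float = 0.30, high: float = 0.50) -> bool:
    # Too low risks missing real issues; too high wastes deep-review budget.
    return low <= rate <= high

# Illustrative data: every third PR gets escalated.
prs = [{"id": i, "escalated": i % 3 == 0} for i in range(100)]
rate = trigger_rate(prs)  # 0.34, inside the 30-50% band
```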
Layer 3: AI Deep Review
| Metric | Target | Measured when |
|---|---|---|
| Deep review time | <10 min | After Layer 3 completes |
| Finding severity distribution | More HIGH/CRITICAL than LOW | If Layer 3 produces mainly LOW findings, the risk classifier is escalating too aggressively |
| False positive rate | <15% | Developer reaction per finding |
| Actionable finding rate | >80% | Finding has concrete fix suggestion |
| Token cost per review | <$2.00 avg | Per invocation |
| Circular test detection rate | Tracked, no target | Per PR with new/modified tests |
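The severity-distribution check reduces to comparing counts. A sketch, assuming findings are labeled with the severity names used above:

```python
from collections import Counter

def severity_skew_ok(severities: list[str]) -> bool:
    """Layer 3 should surface more HIGH/CRITICAL than LOW findings;
    the reverse suggests PRs are being escalated too aggressively."""
    counts = Counter(severities)
    return counts["HIGH"] + counts["CRITICAL"] > counts["LOW"]

healthy = ["CRITICAL", "HIGH", "HIGH", "LOW"]      # skews severe: OK
suspect = ["LOW", "LOW", "LOW", "HIGH"]            # skews trivial: classifier too eager
```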
Reporter
| Metric | Target | Measured when |
|---|---|---|
| Time to first comment | <5 min (L2 only), <15 min (L2+L3) | Timestamp PR open to first VCR comment |
| Comment engagement rate | >50% PRs have reaction | 24h after comment |
| Reviewer guidance accuracy | >70% | Human reviewer confirms focus areas matched |
| Suggested reviewer acceptance | >60% | Was the expertise-based suggestion used? |
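Time-to-first-comment is a timestamp difference checked against a limit that depends on whether Layer 3 ran. A minimal sketch:

```python
from datetime import datetime, timedelta

def time_to_first_comment(pr_opened: datetime, first_vcr_comment: datetime) -> timedelta:
    """Elapsed time from PR open to the first VCR comment."""
    return first_vcr_comment - pr_opened

def meets_target(delta: timedelta, deep_review: bool) -> bool:
    # <5 min when only the quick scan ran, <15 min when Layer 3 was triggered.
    limit = timedelta(minutes=15 if deep_review else 5)
    return delta < limit

delta = time_to_first_comment(
    datetime(2025, 3, 1, 10, 0), datetime(2025, 3, 1, 10, 4)
)  # 4 minutes
```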
Proactive Scanner
| Metric | Target | Measured when |
|---|---|---|
| Scan completion rate | 100% scheduled scans | After each cron run |
| Trend detection lead time | >2 weeks before incident | Comparison: when the scanner flagged the trend vs. when it became an incident |
| Created issues resolution rate | >50% within 30 days | Tracking auto-created issues |
| Convention drift detection | Drift flagged within 2 weeks | Cross-team pattern comparison |
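Lead time is the gap between the scanner flagging a trend and that trend becoming an incident. A minimal sketch using dates:

```python
from datetime import date

def lead_time_days(flagged: date, incident: date) -> int:
    """Days of warning the scanner gave before the incident
    (target: >14, i.e. more than two weeks)."""
    return (incident - flagged).days

lead = lead_time_days(date(2025, 3, 1), date(2025, 3, 20))  # 19 days of lead time
```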
End-to-End: Visdom SDLC Metrics Integration
📦 Visdom AI-Native SDLC
These metrics connect VCR to the broader Visdom AI-Native SDLC framework described in the blog series.
ITS, CPI, and TORS are the three core metrics for measuring AI agent effectiveness across the software delivery lifecycle.
| Metric | Definition | Target | Connection to Visdom SDLC |
|---|---|---|---|
| ITS (Iterations-to-Success) | Iterations from task assignment to passing CI | 1-3 healthy, 5-10 warning, 20+ structural failure | VCR reduces ITS by filtering flaky tests (TORS) and providing early feedback before agent iterates |
| CPI (Cost-per-Iteration) | Tokens + compute + CI + review per iteration | Trending down | VCR reduces review component of CPI; TORS reduces wasted iterations |
| TORS (Test Oracle Reliability Score) | % of test failures that are real regressions | >85% | Directly measured by Layer 1; feeds into Layer 2 risk classification |
| Escaped defects | Bugs in production in areas covered by VCR | Trending down | Primary outcome metric |
| 4x Hidden Tax visibility | License + compute + tokens + review breakdown | Fully tracked | VCR dashboard provides real-time cost breakdown |
| Senior review time | Time seniors spend on code review | -30% vs baseline | VCR pre-annotates PRs, focuses reviewer attention |
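The ITS bands and the CPI definition above translate directly into code. Folding the unspecified ranges (4 and 11-19 iterations) into "warning" is an assumption of this sketch:

```python
def its_band(iterations: int) -> str:
    """Map Iterations-to-Success onto the Visdom bands
    (1-3 healthy, 5-10 warning, 20+ structural failure)."""
    if iterations <= 3:
        return "healthy"
    if iterations < 20:
        return "warning"  # assumption: gaps at 4 and 11-19 fold into "warning"
    return "structural failure"

def cpi(token_cost: float, compute_cost: float, ci_cost: float,
        review_cost: float, iterations: int) -> float:
    """Cost-per-Iteration: total spend (tokens + compute + CI + review)
    divided by iterations taken; the target is a downward trend."""
    return (token_cost + compute_cost + ci_cost + review_cost) / iterations
```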
Feedback Mechanism
Each VCR finding supports developer reactions that feed back into per-layer metrics:
| Reaction | Meaning |
|---|---|
| 👍 | Finding was helpful, fixed it |
| 👎 | False positive / not relevant |
| 🤔 | Not sure, needs discussion |
✅ Feedback loop maturity
During pilot, VirtusLab analyzes reactions manually. In mature deployments, reactions inform
prompt tuning and risk classifier calibration automatically.
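Rolling reactions up into the per-layer rates is a straightforward aggregation; the targets in the comments refer to the Layer 2 and Layer 3 tables above:

```python
from collections import Counter

def reaction_metrics(reactions: list[str]) -> dict:
    """Aggregate per-finding reactions into the per-layer metrics:
    👍 feeds acceptance rate, 👎 feeds false positive rate."""
    counts = Counter(reactions)
    total = len(reactions) or 1  # avoid division by zero on empty input
    return {
        "acceptance_rate": counts["👍"] / total,       # Layer 2 target: >0.60
        "false_positive_rate": counts["👎"] / total,   # Layer 3 target: <0.15
        "needs_discussion": counts["🤔"] / total,
    }

metrics = reaction_metrics(["👍", "👍", "👍", "👎", "🤔"])
# acceptance_rate 0.6, false_positive_rate 0.2, needs_discussion 0.2
```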