Traditional testing metrics (line coverage, branch coverage, test count) measure effort, not effectiveness. Visdom Testing tracks metrics that answer the question that matters: would your tests catch a bug if one existed?
Testing Metrics That Matter
| Metric | What it measures | Why it matters |
|---|---|---|
| Mutation score | % of code mutations detected by tests | The only metric that measures whether tests would actually catch a bug |
| TORS | % of test failures that are real regressions | Measures trust in the test suite. Below 50%, CI is a liability |
| Defect escape rate | % of bugs that reach production despite testing | The ultimate outcome metric. Are the layers working? |
| Flake rate | % of test runs that fail non-deterministically | Flaky tests erode trust and waste CI resources |
Why Coverage Is Insufficient
Line coverage is the most widely tracked testing metric. It is also the most misleading. Coverage measures which lines executed during testing, not whether the tests would detect a bug on those lines.
⚠️ The PBT Case Study
In a controlled experiment with a CRUD invoice service, traditional tests achieved 90% line coverage and 70% branch coverage across 10 test methods with 16 assertions. The test suite looked excellent by every standard metric. It caught 0 out of 2 computation bugs.
The bugs: early rounding of the discount rate (rounding before multiplication instead of after) and wrong rounding mode on VAT calculation (HALF_UP instead of HALF_EVEN). Both bugs were in covered lines. Both passed all assertions.
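The first bug is easy to reproduce in a few lines. A minimal sketch (Python stands in for the JVM service; `discount_buggy` and `discount_correct` are illustrative names, not the study's code):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def discount_buggy(amount: Decimal, rate: Decimal) -> Decimal:
    # Bug: the rate is rounded to 2 places BEFORE multiplying,
    # discarding precision that can move the final cent.
    rounded_rate = rate.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
    return (amount * rounded_rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

def discount_correct(amount: Decimal, rate: Decimal) -> Decimal:
    # Correct: multiply at full precision, round once at the end.
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

# A hand-picked "nice" rate never exposes the difference, so example-based
# tests pass:
assert discount_buggy(Decimal("100"), Decimal("0.10")) == Decimal("10.00")

# A generated rate with a third decimal place does:
assert discount_buggy(Decimal("200"), Decimal("0.125")) == Decimal("24.00")
assert discount_correct(Decimal("200"), Decimal("0.125")) == Decimal("25.00")
```

A property such as "discount equals a high-precision reference computation" surfaces this class of bug without anyone having to hand-pick the awkward input.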
| Metric | Traditional (10 tests) | PBT (8 properties) | Combined |
|---|---|---|---|
| Line coverage | 90% | 80% | 90% |
| Branch coverage | 70% | 50% | 70% |
| Mutation score (PIT) | 73% | 55% | 73% |
| Computation bugs found | 0/2 | 2/2 | 2/2 |
The critical insight: by every standard metric, the traditional suite looked better. Higher coverage, higher mutation score, more tests. If you relied on metrics alone, you would conclude PBT added no value. Meanwhile, both computation bugs would ship to production. 90% line coverage while missing every computation error that matters.
Per-Layer Metrics
Layer 0: Architecture Testing
| Metric | Target | Measured when |
|---|---|---|
| Rule execution time | <10s | After L0 completes |
| Violation detection rate | 100% of defined rules | Any code change touching constrained packages |
| False positive rate | 0% | Rules are deterministic; false positives indicate a rule error |
| Rule coverage | All critical architecture decisions encoded | Architecture review audit |
Layer 1: Property-Based Testing
| Metric | Target | Measured when |
|---|---|---|
| Property execution time | ~2s per property | After L1 completes |
| Inputs generated per property | >1000 | Per property run |
| Shrinking effectiveness | Minimal counterexample found | On property failure |
| Computation bugs found | Tracked per release | Bugs found by PBT that other tests missed |
| Property coverage | >80% of business-critical computations | Architecture review |
Layer 2: Mutation Testing
| Metric | Target | Measured when |
|---|---|---|
| Mutation score | >80% critical, >60% utilities | After PIT run completes |
| Execution time | ~5 min (incremental) | After L2 completes |
| Survived mutants | Trending down | Per module, per release |
| Equivalent mutant rate | <10% | Manual review of survived mutants |
| AI-augmented mutant acceptance | >70% | LLM-generated mutants that are valid, non-equivalent |
Layer 3: Contract Testing
| Metric | Target | Measured when |
|---|---|---|
| Contract verification time | ~10 min | After L3 completes |
| Contract coverage | 100% of public API endpoints | Per service boundary |
| Breaking change detection rate | 100% detected before merge | Provider changes that break consumer contracts |
| False alarm rate | <5% | Contract failures that are not real compatibility issues |
Mutation Score Explained
Mutation testing systematically changes the code (introduces mutants) and checks whether the test suite detects each change. It answers the question: "If the code were wrong, would my tests tell me?"
How PIT Works
PIT (PITest) is the standard mutation testing tool for the JVM. It:
- Analyzes the codebase to identify mutation points (operators, return values, conditionals)
- Creates mutants: modified versions of the code with one change each
- Runs the test suite against each mutant
- Reports which mutants were killed (test failed) vs survived (all tests passed)
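The kill/survive check can be sketched conceptually. This is not PIT's implementation (PIT mutates JVM bytecode); a hand-written mutant in Python illustrates the loop:

```python
def total(items):
    # Sum of the positive entries.
    return sum(p for p in items if p > 0)

def total_mutant(items):
    # PIT's "negate conditionals" operator would turn `>` into `<=`.
    return sum(p for p in items if p <= 0)

def detected_by_suite(fn) -> bool:
    # Run the test suite against an implementation; True means a test failed.
    try:
        assert fn([]) == 0          # weak on its own: both versions return 0
        assert fn([5, -3]) == 5     # kills the mutant, which returns -3
        return False
    except AssertionError:
        return True

status = "killed" if detected_by_suite(total_mutant) else "survived"
```

Drop the second assertion and the mutant survives: every test passes against code with an inverted condition, which is exactly the gap mutation score measures.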
Mutant Classification
| Status | Meaning | Action |
|---|---|---|
| Killed | At least one test failed when the mutant was introduced | Good. The test suite detects this kind of bug. |
| Survived | No test failed. The mutant is indistinguishable from correct code. | Write a new test or strengthen an assertion. This is a gap. |
| Equivalent | The mutant produces identical behavior to the original code | No action needed. Exclude from score calculation. |
| Timed out | The mutant caused an infinite loop or timeout | Counted as killed (the mutant was detected via resource limits). |
Mutation Score = killed / (killed + survived)
               = detected mutations / total non-equivalent mutations
💡 AI-augmented mutation testing
Standard PIT mutators are syntactic (negate conditionals, replace operators). AI-augmented mutant generation creates semantic mutants based on common domain errors: swapping currency codes, using wrong tax rates, reversing sort order. Meta's ACH system reported a 73% acceptance rate for LLM-generated mutation tests.
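The score formula, with the timed-out convention from the classification table, as a short helper (a sketch; PIT reports these counts itself):

```python
def mutation_score(killed: int, survived: int, timed_out: int = 0) -> float:
    # Timed-out mutants count as detected; equivalent mutants are assumed
    # to be excluded from the counts already (per the classification table).
    detected = killed + timed_out
    return detected / (detected + survived)

mutation_score(80, 20)               # 0.80: at the critical-code target
mutation_score(70, 28, timed_out=2)  # 0.72: below target, 28 gaps to close
```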
TORS: Test Oracle Reliability Score
TORS measures whether your test suite is a reliable signal or a source of noise. In AI-assisted development environments, unreliable tests cause a specific failure mode: AI agents "fix" code in response to tests that failed due to flakiness rather than a real defect, introducing real bugs in the process.
Definition
TORS = real_failures / total_failures
Where:
real_failures = test failures caused by actual code defects
total_failures = all test failures (real + flaky + environment)
Threshold:
TORS > 0.85 → test is reliable, include in feedback signal
TORS ≤ 0.85 → test is unreliable, quarantine from agent feedback
TORS Benchmarks
| TORS Range | Assessment | Action |
|---|---|---|
| >85% | Healthy | Test failures are real signals. Include in all feedback loops. |
| 50-85% | Degraded | Flake remediation needed. Prioritize top offenders. Quarantine worst tests from agent feedback. |
| <50% | Unreliable | Test suite is a liability. Developers and agents cannot trust CI. Mandatory remediation. |
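The definition and threshold translate directly into a quarantine check. A minimal sketch, assuming failures have already been classified as real, flaky, or environmental (that classification is the hard part):

```python
RELIABILITY_THRESHOLD = 0.85

def tors(real: int, flaky: int, env: int) -> float:
    total = real + flaky + env
    if total == 0:
        return 1.0  # assumption: a test with no observed failures is trusted
    return real / total

def feedback_eligible(real: int, flaky: int, env: int) -> bool:
    # Tests at or below the threshold are quarantined from agent feedback.
    return tors(real, flaky, env) > RELIABILITY_THRESHOLD
```

For example, 17 real failures out of 20 total gives TORS = 0.85: exactly at the threshold, so the test is quarantined rather than trusted.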
Defect Escape Rate
Defect escape rate measures the ultimate outcome: how many bugs reach production despite the testing layers. It is the metric that tells you whether the layers are actually working.
Defect Escape Rate = production_bugs / (production_bugs + bugs_caught_in_testing)
Benchmarks
| Rate | Assessment | Context |
|---|---|---|
| <1% | Exceptional | Mature testing with all four layers active. Typical of teams with mutation score >80%. |
| 1-3% | Good | Standard for teams with property-based testing and contract testing. Industry competitive. |
| 3-5% | Average | Typical for teams relying on unit tests and integration tests only. |
| >5% | Investigate | Testing layers have gaps. Check which defect class is escaping and strengthen the corresponding layer. |
✅ Trace escapes back to layers
When a bug reaches production, classify it: structural (L0 should have caught it), computation (L1), weak test (L2), or integration (L3). This tells you exactly which layer needs investment. Over time, the escape pattern reveals your testing strategy's blind spots.
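A sketch of the two computations, with assumed layer labels for the classification:

```python
from collections import Counter

def defect_escape_rate(production_bugs: int, caught_in_testing: int) -> float:
    return production_bugs / (production_bugs + caught_in_testing)

# Each escaped bug gets a label naming the layer that should have caught it:
escapes = ["computation", "computation", "integration"]
by_layer = Counter(escapes)

defect_escape_rate(3, 97)   # 0.03: "good" per the benchmarks above
by_layer.most_common(1)     # [("computation", 2)]: invest in L1 first
```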
Flaky Test Management
Flaky tests are the single largest source of wasted engineering time in CI. They erode trust, cause false urgency, and in AI-assisted workflows, trigger agents to "fix" code that is not broken.
Industry Data
📦 Google: Flaky Tests at Scale (Micco, ICST 2017)
Google's testing infrastructure data (4.2M tests, 150M executions/day) shows that 84% of pass-to-fail test transitions are caused by flakiness, not real regressions. Google manages flakiness to a 2% budget: at any given time, no more than 2% of tests may be flaky. Tests exceeding the budget are quarantined automatically. (Source: John Micco, "The State of Continuous Integration Testing @Google," ICST 2017)
📦 AI Code Quality: External Data
DORA 2025: 90% AI adoption correlates with 9% bug rate increase and 91% more code review time.
(Google DORA Report)
CodeRabbit 2025: AI PRs average 10.83 issues vs 6.45 for human PRs (1.7x) across 470 pull requests.
(CodeRabbit Report)
OOPSLA 2025: Each property-based test finds ~50x as many mutations as the average unit test.
(UC San Diego)
Meta ACH 2025: 49% of AI-generated mutation tests caught faults invisible to line coverage.
(Meta Engineering)
Detection Strategies
| Strategy | How it works | Cost |
|---|---|---|
| Historical analysis | Track pass/fail ratio over last N runs. Flag tests with inconsistent results. | Low (metadata only) |
| Re-run on failure | Re-run failed tests 2-3 times. If any re-run passes, classify as flaky. | Medium (extra CI time on failures) |
| Quarantine | Move known flaky tests to a separate suite. Track but do not block. | Low (organizational) |
| TORS integration | Feed flake data into TORS. Exclude unreliable tests from agent feedback signals. | Low (automated) |
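The historical-analysis strategy is the cheapest to sketch. The thresholds below (20-run window, more than 2 transitions) are assumptions to tune; the history must come from runs of the same code, so a real regression is not misread as a flake:

```python
def is_flaky(history: list[bool], min_runs: int = 20, max_transitions: int = 2) -> bool:
    # history: chronological pass (True) / fail (False) results for one test
    # across CI runs of unchanged code.
    if len(history) < min_runs:
        return False  # not enough signal to judge
    transitions = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    # A real regression flips once (pass -> fail). Repeated flips are flakes.
    return transitions > max_transitions
```

A run of 10 passes followed by 10 failures (one transition) looks like a regression; alternating pass/fail results on the same commit get flagged.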
Flake Budget
Following Google's model, Visdom Testing recommends a 2% flake budget:
- At any time, no more than 2% of tests may have a TORS below the reliability threshold
- Tests that breach the budget are automatically quarantined from the agent feedback signal
- Quarantined tests generate a remediation ticket with flake frequency data
- Goal is not zero flakes (impossible at scale) but managed flakes that do not erode trust
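The budget check itself is small. A sketch, assuming a per-test TORS map is maintained, with the 2% budget and 0.85 threshold from above:

```python
FLAKE_BUDGET = 0.02           # no more than 2% of tests below the threshold
RELIABILITY_THRESHOLD = 0.85

def over_budget(test_tors: dict[str, float]) -> bool:
    unreliable = sum(1 for score in test_tors.values() if score < RELIABILITY_THRESHOLD)
    return unreliable / len(test_tors) > FLAKE_BUDGET

def quarantine_order(test_tors: dict[str, float]) -> list[str]:
    # Worst offenders first, so remediation tickets are prioritized by impact.
    flaky = [t for t, s in test_tors.items() if s < RELIABILITY_THRESHOLD]
    return sorted(flaky, key=lambda t: test_tors[t])
```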
Industry Benchmarks
These benchmarks are drawn from published data by teams operating testing at scale:
| Organization | Key Finding | Source |
|---|---|---|
| Google | 84% of CI failures are flaky. 2% flake budget. Historical analysis for detection. | Google Testing Blog |
| Meta | 73% acceptance rate for LLM-generated mutation tests (ACH system). AI augments traditional PIT mutators with semantic mutations. | Meta Engineering |
| Atlassian | Rovo Dev CLI closed mutation coverage gaps from 56% to 80% on real Jira projects using AI-generated tests. | Atlassian Blog |
| Spotify | Honeycomb testing model for microservices. Contract-focused testing reduces integration environment dependency. | Spotify Engineering |
CI Integration Metrics
Testing effectiveness depends on CI pipeline speed. Slow feedback loops cause developers to context-switch away and merge without waiting for results. Visdom Testing tracks CI-specific metrics to ensure the testing layers provide value within the developer's attention window.
Test Selection
Not every change requires every test. Test impact analysis determines which tests are affected by a code change and runs only those:
| Metric | Target | Why |
|---|---|---|
| Test selection accuracy | >95% | Selected tests must include all actually-affected tests |
| Test reduction ratio | >50% | Typical change affects a fraction of the test suite |
| Selection overhead | <5s | Analysis time must be negligible compared to test execution |
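A sketch of the selection step, assuming a per-test dependency map already exists (building that map, from a build graph or per-test coverage data, is the real work):

```python
def select_tests(changed_files: set[str], test_deps: dict[str, set[str]]) -> set[str]:
    # A test is selected if it depends on any changed file.
    return {test for test, deps in test_deps.items() if deps & changed_files}

deps = {
    "test_invoice": {"invoice.py", "tax.py"},
    "test_user":    {"user.py"},
    "test_tax":     {"tax.py"},
}
selected = select_tests({"tax.py"}, deps)   # {"test_invoice", "test_tax"}
reduction = 1 - len(selected) / len(deps)   # one third of the suite skipped
```

Accuracy matters more than the reduction ratio: an under-selected suite silently skips an affected test, which is why the accuracy target (>95%) is stricter than the reduction target.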
Parallelization
| Metric | Target | Why |
|---|---|---|
| L0 + L1 total time | <30s | Fast feedback on structure and computation. Must complete before developer context-switches. |
| L2 incremental time | <5 min | Mutation testing on changed code only. Full run on nightly. |
| L3 verification time | <10 min | Contract verification. Parallelized across consumer-provider pairs. |
| Total pipeline time | <15 min | End-to-end from push to all-green. Beyond 15 min, developers stop waiting. |
Feedback Time
💡 The 15-minute threshold
Research consistently shows that CI feedback beyond 15 minutes degrades developer productivity. Developers context-switch to other tasks and lose the mental model of their change. Visdom Testing is designed to provide L0 + L1 feedback in under 30 seconds (immediate signal) and full L2 + L3 feedback within 15 minutes.
| Feedback Stage | Time | What the developer learns |
|---|---|---|
| Immediate (L0) | <10s | Architecture violations, banned APIs, naming issues |
| Fast (L0 + L1) | <30s | Plus computation correctness from property-based tests |
| Standard (+ L2) | <5 min | Plus mutation score: are the tests actually verifying the code? |
| Complete (+ L3) | <15 min | Plus contract verification: will this break other services? |
End-to-End: Visdom SDLC Metrics Integration
📦 Visdom AI-Native SDLC
These metrics connect Visdom Testing to the broader Visdom AI-Native SDLC framework. ITS, CPI, and TORS are the three core metrics for measuring AI agent effectiveness across the software delivery lifecycle.
| Metric | Definition | Target | Connection to Visdom SDLC |
|---|---|---|---|
| ITS (Iterations-to-Success) | Iterations from task assignment to passing CI | 1-3 healthy, 5-10 warning, 20+ structural failure | Faster feedback (L0 in 10s) reduces iterations. Each layer caught earlier = one fewer iteration. |
| CPI (Cost-per-Iteration) | Tokens + compute + CI + review per iteration | Trending down | L0/L1 near-zero cost. Incremental L2 reduces mutation testing cost. Total per-layer cost tracked. |
| TORS (Test Oracle Reliability Score) | % of test failures that are real regressions | >85% | Directly measured. Flaky tests quarantined. Prevents agents from "fixing" tests that aren't broken. |
| Mutation score | % of code mutations detected by tests | >80% critical code | Leading indicator of defect escape rate. Predicts production bugs before they happen. |
| Defect escape rate | Bugs reaching production despite testing | <1% exceptional, <3% good | Primary outcome metric. Traced back to layer gaps for targeted investment. |
| Flake rate | % of non-deterministic test failures | <2% budget | Managed, not eliminated. Exceeding budget triggers automatic quarantine. |