Mental Model
Every code change passes through four testing layers. Each layer catches a different class of defect at a different cost and speed. Lower layers are deterministic and fast; upper layers are thorough and expensive. Together they form a complete testing strategy that addresses the specific failure modes of AI-generated code.
The key insight: no single testing technique catches all defect classes. Architecture tests catch structural violations. Property-based tests catch computation bugs. Mutation testing measures whether tests would actually catch a bug. Contract tests verify cross-service correctness. Each layer compensates for the blind spots of the others.
Layer Diagram
┌──────────────────────────────────────────────────────────┐
│ Code Change (PR / Commit) │
└──────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 0: Architecture Testing (<10s) │
│ ArchUnit rules. Deterministic. Zero AI dependency. │
│ Layer boundaries, banned APIs, naming conventions. │
└──────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 1: Property-Based Testing (~2s) │
│ jqwik properties. 1000+ generated inputs/property. │
│ Computation correctness. Breaks circular test problem. │
└──────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 2: Mutation Testing (~5 min) │
│ PIT + AI-augmented mutants. Mutation score as the │
│ true measure of test effectiveness. Kill the mutants. │
└──────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 3: Contract Testing (~10 min) │
│ Pact consumer-driven contracts. Cross-service API │
│ compatibility. No full integration env required. │
└──────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ DASHBOARD: Quality Intelligence │
│ Mutation score trends, TORS, defect escape rate, │
│ flake rate, per-layer health. CI feedback loop. │
└──────────────────────────────────────────────────────────┘
Layer Summary
| Layer | Type | Time | Defect Class | Purpose |
|---|---|---|---|---|
| L0 | Deterministic | <10s | Structural | Enforce architecture constraints: layer boundaries, banned APIs, naming, dependency rules |
| L1 | Generative | ~2s | Computation | Property-based testing with 1000+ inputs. Catch logic errors, rounding bugs, edge cases |
| L2 | Evaluative | ~5 min | Test quality | Mutation testing. Measure whether tests would actually catch a bug if one existed |
| L3 | Contractual | ~10 min | Integration | Consumer-driven contracts. Verify API compatibility across services without E2E environments |
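To give a taste of what the upper layer looks like in practice, here is a minimal L3 consumer-driven contract sketch using Pact JVM's JUnit 5 support. The service names, provider state, and payload fields are hypothetical illustrations, not part of Visdom Testing:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslJsonBody;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;

import static org.junit.jupiter.api.Assertions.assertEquals;

// The consumer declares the exact API shape it depends on. Pact later
// replays this contract against the real provider in the provider's CI,
// so a breaking change fails there without a shared E2E environment.
@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "order-service")
class OrderClientPactTest {

    @Pact(consumer = "billing-service")
    RequestResponsePact orderById(PactDslWithProvider builder) {
        return builder
            .given("order 42 exists")                 // provider state
            .uponReceiving("a request for order 42")
            .path("/orders/42")
            .method("GET")
            .willRespondWith()
            .status(200)
            .body(new PactDslJsonBody()
                .integerType("id", 42L)
                .decimalType("total", 19.99))         // type matchers, not literals
            .toPact();
    }

    @Test
    void fetchesOrder(MockServer mockServer) throws Exception {
        // The consumer's real HTTP call runs against Pact's mock server.
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(mockServer.getUrl() + "/orders/42")).build(),
            HttpResponse.BodyHandlers.ofString());
        assertEquals(200, resp.statusCode());
    }
}
```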
Key Design Decisions
Why Layers
Each layer catches a fundamentally different class of defect. Architecture tests catch structural violations (controller calling repository directly). Property-based tests catch computation bugs (wrong rounding mode on VAT calculation). Mutation testing catches weak tests (tests that pass regardless of code correctness). Contract tests catch integration breaks (provider changed the API shape).
No single layer covers all four. Running all four layers costs less than one production incident.
💡 Defect class coverage
In our controlled experiment, ArchUnit (L0) caught 10/10 structural violations that compiled successfully. Property-based testing (L1) found 2 computation bugs that 90% line coverage and 16 hand-written tests missed. These are orthogonal defect classes: running one does not reduce the need for the other.
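The VAT rounding bug is exactly the kind of defect a few lines of jqwik can pin down. A minimal property sketch, assuming a hypothetical `VatCalculator` with a 25% rate and banker's rounding as its spec (neither is from the case study code):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

import net.jqwik.api.ForAll;
import net.jqwik.api.Property;
import net.jqwik.api.constraints.BigRange;

class VatProperties {

    // Property: for any net amount, VAT equals net * 25% rounded HALF_EVEN
    // to 2 decimals. jqwik generates 1000 amounts per run and shrinks any
    // failure to a minimal counterexample.
    @Property(tries = 1000)
    boolean vatUsesBankersRounding(
            @ForAll @BigRange(min = "0.01", max = "100000.00") BigDecimal net) {
        BigDecimal expected = net.multiply(new BigDecimal("0.25"))
                                 .setScale(2, RoundingMode.HALF_EVEN);
        return VatCalculator.vat(net).equals(expected);  // hypothetical SUT
    }
}
```

A hand-written test checks the one amount its author thought of; the property states the rule itself, so any generated amount where the implementation rounds early or with the wrong mode becomes a failing counterexample.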
Deterministic Backstop
Layer 0 (ArchUnit) is immune to AI regression. It requires no AI model, no LLM call, no network access. The rules are deterministic Java code that runs in the test suite. Even if every AI model degrades, hallucinates, or becomes unavailable, L0 continues to enforce architecture constraints.
This is the floor, the minimum guarantee. In a world where AI-generated code is the majority of new code, having a deterministic backstop that prevents the most common structural violations is non-negotiable.
- 10/10 AI generations violated layer boundaries without ArchUnit rules
- 0/10 AI generations violated layer boundaries with ArchUnit in the test loop
- Violations compile successfully, so only a test gate catches them
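Such a gate is ordinary JUnit-executed Java. A minimal ArchUnit sketch (package names are illustrative):

```java
import com.tngtech.archunit.junit.AnalyzeClasses;
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

// Runs as part of the normal test suite: no AI model, no network.
// A controller that reaches past the service layer into a repository
// compiles fine but fails this rule in seconds.
@AnalyzeClasses(packages = "com.example.app")
class ArchitectureTest {

    @ArchTest
    static final ArchRule controllersMustNotTouchRepositories =
        noClasses().that().resideInAPackage("..controller..")
            .should().dependOnClassesThat().resideInAPackage("..repository..");

    @ArchTest
    static final ArchRule noJavaUtilLogging =
        noClasses().should().dependOnClassesThat()
            .resideInAPackage("java.util.logging..");   // banned-API example
}
```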
Mutation Score Over Coverage
Line coverage measures which lines executed. Mutation testing measures which lines are actually verified. The difference is critical: a test that runs a function but asserts nothing increases coverage but catches zero bugs.
⚠️ The coverage trap
In the PBT case study, traditional tests achieved 90% line coverage and 70% branch coverage while missing both computation bugs (early rounding of discount rate, wrong rounding mode on VAT). By every coverage metric, the test suite looked excellent. Mutation testing would have revealed these as survived mutants: change the rounding mode, tests still pass.
Mutation score is the true metric for test effectiveness:
- Killed mutant: at least one test failed when the code was changed. The test is doing its job.
- Survived mutant: no test failed. Either a missing test or a weak assertion.
- Target: mutation score >80% for business-critical code, >60% for utilities.
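The contrast is easy to see in code. Both tests below execute a hypothetical `Prices.vat` (net × 25%, rounded HALF_EVEN to 2 decimals), so both earn identical line coverage; only the second kills a mutant that flips the rounding mode, the kind of domain-aware mutation L2's AI-augmented mutants add on top of PIT's built-in operators:

```java
import java.math.BigDecimal;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;

class VatTest {

    // Survives mutation: executes vat() (full line coverage) but pins no
    // value. Flip HALF_EVEN to HALF_UP, or 0.25 to 0.26, and it still passes.
    @Test
    void weak_vatIsNotNull() {
        assertNotNull(Prices.vat(new BigDecimal("10.10")));
    }

    // Kills the mutant: 10.10 * 0.25 = 2.525, which HALF_EVEN rounds to
    // 2.52 and HALF_UP rounds to 2.53, so the mutated code fails here.
    @Test
    void strong_vatRoundsHalfEven() {
        assertEquals(new BigDecimal("2.52"), Prices.vat(new BigDecimal("10.10")));
    }
}
```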
Test Oracle Reliability Score (TORS)
TORS measures the reliability of the test suite itself. In environments where 84% of test failures are flaky (Google data), treating every failure as a real regression wastes engineering time and erodes trust in CI.
TORS = (real failures) / (total failures)
If TORS < threshold → test is flagged as unreliable
Unreliable tests are quarantined: they still run, but their results are tracked separately and do not block the pipeline. This prevents the common failure mode where developers stop trusting CI and start merging despite failures.
- TORS > 85%: healthy test suite, failures are real signals
- TORS 50-85%: flake remediation needed, prioritize top offenders
- TORS < 50%: test suite is a liability, not an asset
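A sketch of the mechanics (the record and threshold names are illustrative, not a real Visdom API):

```java
import java.util.List;

// Illustrative only: computes TORS per test from a labeled failure history
// and flags unreliable tests for quarantine (tracked, non-blocking).
record FailureRecord(String testId, boolean wasRealRegression) {}

class TorsGate {
    static final double QUARANTINE_THRESHOLD = 0.85;

    static double tors(List<FailureRecord> failures) {
        long real = failures.stream().filter(FailureRecord::wasRealRegression).count();
        return failures.isEmpty() ? 1.0 : (double) real / failures.size();
    }

    static boolean shouldQuarantine(List<FailureRecord> failuresForTest) {
        return tors(failuresForTest) < QUARANTINE_THRESHOLD;
    }
}
```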
Testing Strategy Selection
The classic test pyramid is not wrong; it is incomplete. The right testing shape depends on the architecture and risk profile of the system. Visdom Testing supports multiple strategies:
| Strategy | Shape | Best For | Emphasis |
|---|---|---|---|
| Pyramid | Many unit, fewer integration, minimal E2E | Monoliths, well-bounded domains | L0 + L1 heavy, L3 light |
| Trophy | Integration-heavy (Kent C. Dodds model) | Modern frontend, full-stack apps | L1 + L2 heavy, static analysis + integration |
| Honeycomb | Contract-focused (Spotify model) | Microservices | L3 heavy, L0 for shared conventions |
| Diamond | Wide integration layer | Domain services, data pipelines | L1 + L2 heavy, wide property coverage |
✅ Start with the pyramid, adapt to your architecture
If you don't know which strategy fits, start with the pyramid (maximize L0 and L1, add L2 for critical paths, add L3 when you have service boundaries). The dashboard will show where defects escape, telling you which layer needs investment.
Visdom SDLC Integration
Visdom Testing connects to the broader Visdom AI-Native SDLC metrics framework:
| Metric | What it measures | Visdom Testing's role |
|---|---|---|
| ITS (Iterations-to-Success) | Iterations from task to passing CI | Reduces ITS by catching defects earlier (L0 in <10s vs L3 in ~10min). Fast feedback = fewer iterations. |
| CPI (Cost-per-Iteration) | Tokens + compute + CI + review per iteration | L0 and L1 are near-zero cost. L2 runs only on changed code (incremental). Total testing cost tracked per layer. |
| TORS (Test Oracle Reliability Score) | % of test failures that are real regressions | Directly measured. Flaky tests quarantined. Prevents the lying oracle problem where agents "fix" tests that aren't broken. |