
Architecture

Layered Testing Model: each layer catches a different class of defect.

Mental Model

Every code change passes through four testing layers. Each layer catches a different class of defect at a different cost and speed. Lower layers are deterministic and fast; upper layers are thorough and expensive. Together they form a complete testing strategy that addresses the specific failure modes of AI-generated code.

The key insight: no single testing technique catches all defect classes. Architecture tests catch structural violations. Property-based tests catch computation bugs. Mutation testing measures whether tests would actually catch a bug. Contract tests verify cross-service correctness. Each layer compensates for the blind spots of the others.

Layer Diagram

┌──────────────────────────────────────────────────────────┐
│                   Code Change (PR / Commit)              │
└──────────────┬───────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────┐
│  LAYER 0: Architecture Testing               (<10s)      │
│  ArchUnit rules. Deterministic. Zero AI dependency.      │
│  Layer boundaries, banned APIs, naming conventions.      │
└──────────────┬───────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────┐
│  LAYER 1: Property-Based Testing              (~2s)      │
│  jqwik properties. 1000+ generated inputs/property.      │
│  Computation correctness. Breaks circular test problem.  │
└──────────────┬───────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────┐
│  LAYER 2: Mutation Testing                    (~5 min)   │
│  PIT + AI-augmented mutants. Mutation score as the       │
│  true measure of test effectiveness. Kill the mutants.   │
└──────────────┬───────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────┐
│  LAYER 3: Contract Testing                    (~10 min)  │
│  Pact consumer-driven contracts. Cross-service API       │
│  compatibility. No full integration env required.        │
└──────────────┬───────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────┐
│  DASHBOARD: Quality Intelligence                         │
│  Mutation score trends, TORS, defect escape rate,        │
│  flake rate, per-layer health. CI feedback loop.         │
└──────────────────────────────────────────────────────────┘

Layer Summary

| Layer | Type          | Time    | Defect Class | Purpose |
|-------|---------------|---------|--------------|---------|
| L0    | Deterministic | <10s    | Structural   | Enforce architecture constraints: layer boundaries, banned APIs, naming, dependency rules |
| L1    | Generative    | ~2s     | Computation  | Property-based testing with 1000+ inputs. Catch logic errors, rounding bugs, edge cases |
| L2    | Evaluative    | ~5 min  | Test quality | Mutation testing. Measure whether tests would actually catch a bug if one existed |
| L3    | Contractual   | ~10 min | Integration  | Consumer-driven contracts. Verify API compatibility across services without E2E environments |

Key Design Decisions

Why Layers

Each layer catches a fundamentally different class of defect. Architecture tests catch structural violations (controller calling repository directly). Property-based tests catch computation bugs (wrong rounding mode on VAT calculation). Mutation testing catches weak tests (tests that pass regardless of code correctness). Contract tests catch integration breaks (provider changed the API shape).

No single layer covers all four. Running all four layers costs less than one production incident.
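To make the contract layer concrete, a consumer-driven contract in Pact JVM might look like the sketch below. The service names (`order-service`, `checkout-ui`), provider state, and endpoint are illustrative placeholders, and the exact annotations depend on the Pact version on your classpath; treat this as the shape of a consumer test, not a drop-in file.

```java
import au.com.dius.pact.consumer.dsl.PactDslJsonBody;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;
import org.junit.jupiter.api.extension.ExtendWith;

@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "order-service")
class OrderApiPactTest {

    // The consumer declares exactly the response shape it depends on.
    @Pact(consumer = "checkout-ui")
    RequestResponsePact orderById(PactDslWithProvider builder) {
        return builder
            .given("order 42 exists")              // provider state
            .uponReceiving("a request for order 42")
                .path("/orders/42")
                .method("GET")
            .willRespondWith()
                .status(200)
                .body(new PactDslJsonBody()
                    .integerType("id", 42)
                    .decimalType("total", 19.99)
                    .stringType("status", "PAID"))
            .toPact();
    }

    // A @Test method annotated with @PactTestFor then exercises the HTTP client
    // against a local mock server; the recorded pact is later replayed against
    // the real provider, with no shared integration environment required.
}
```

The recorded pact is the artifact: the provider build verifies it independently, which is what lets L3 catch "provider changed the API shape" without an E2E environment.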

💡 Defect class coverage

In our controlled experiment, ArchUnit (L0) caught 10/10 structural violations that compiled successfully. Property-based testing (L1) found 2 computation bugs that 90% line coverage and 16 hand-written tests missed. These are orthogonal defect classes: running one does not reduce the need for the other.
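The discount-rounding class of bug is easy to reproduce. In Layer 1, jqwik generates the inputs; the dependency-free sketch below shows the same idea with `java.util.Random`, comparing a hypothetical early-rounding implementation against a full-precision reference. All names and the VAT rate are illustrative.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Random;

public class DiscountProperty {

    // Hypothetical buggy version: rounds the discounted net to cents BEFORE adding VAT.
    static BigDecimal grossEarlyRounding(BigDecimal net, BigDecimal discountRate) {
        BigDecimal discounted = net.multiply(BigDecimal.ONE.subtract(discountRate))
                                   .setScale(2, RoundingMode.HALF_UP); // premature rounding
        return discounted.multiply(new BigDecimal("1.20"))
                         .setScale(2, RoundingMode.HALF_UP);
    }

    // Reference version: keep full precision, round once at the end.
    static BigDecimal grossFullPrecision(BigDecimal net, BigDecimal discountRate) {
        return net.multiply(BigDecimal.ONE.subtract(discountRate))
                  .multiply(new BigDecimal("1.20"))
                  .setScale(2, RoundingMode.HALF_UP);
    }

    // Property: both implementations must agree on every generated input.
    static int countDisagreements(int trials, long seed) {
        Random rnd = new Random(seed);
        int disagreements = 0;
        for (int i = 0; i < trials; i++) {
            BigDecimal net  = BigDecimal.valueOf(rnd.nextInt(100_000), 2); // 0.00 .. 999.99
            BigDecimal rate = BigDecimal.valueOf(rnd.nextInt(2_000), 4);   // 0.0000 .. 0.1999
            if (!grossEarlyRounding(net, rate).equals(grossFullPrecision(net, rate))) {
                disagreements++;
            }
        }
        return disagreements;
    }

    public static void main(String[] args) {
        // A handful of hand-picked inputs can easily miss the divergence;
        // 1000 generated inputs expose it reliably.
        System.out.println(countDisagreements(1000, 42));
    }
}
```

A couple of hand-written examples will usually land where both versions agree, which is exactly why line coverage looks fine while the bug survives.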

Deterministic Backstop

Layer 0 (ArchUnit) is immune to AI regression. It requires no AI model, no LLM call, no network access. The rules are deterministic Java code that runs in the test suite. Even if every AI model degrades, hallucinates, or becomes unavailable, L0 continues to enforce architecture constraints.

This is the floor, the minimum guarantee. In a world where AI-generated code is the majority of new code, having a deterministic backstop that prevents the most common structural violations is non-negotiable.
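A Layer 0 rule is ordinary Java. The sketch below uses ArchUnit's public DSL; the root package (`com.example.app`) and the `..controller..` / `..repository..` package patterns are placeholders for your own layout, and the method would normally carry a JUnit `@Test` annotation.

```java
import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

class LayerBoundaryTest {

    // Deterministic: bytecode in, pass/fail out. No AI model, no LLM call, no network.
    void controllers_must_not_touch_repositories() {
        JavaClasses classes = new ClassFileImporter().importPackages("com.example.app");

        ArchRule rule = noClasses()
            .that().resideInAPackage("..controller..")
            .should().dependOnClassesThat().resideInAPackage("..repository..")
            .because("controllers must go through the service layer");

        rule.check(classes); // fails the build on any violation, well inside the <10s budget
    }
}
```

Because the rule is compiled test code, it runs on every commit regardless of which model, agent, or human wrote the change.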

Mutation Score Over Coverage

Line coverage measures which lines executed. Mutation testing measures which lines are actually verified. The difference is critical: a test that runs a function but asserts nothing increases coverage but catches zero bugs.
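That gap is easy to demonstrate in miniature. In the plain-Java sketch below (illustrative names and values), a PIT-style mutant flips `HALF_UP` to `HALF_DOWN`: the assertion-free test executes the line and lets the mutant survive, while the pinned-value test kills it.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class MutationDemo {

    // Production code under test; PIT would mutate the rounding-mode argument.
    static BigDecimal vat(BigDecimal net, RoundingMode mode) {
        return net.multiply(new BigDecimal("0.20")).setScale(2, mode);
    }

    // Weak oracle: executes the line (100% coverage) but asserts nothing.
    static boolean weakTest(RoundingMode mode) {
        vat(new BigDecimal("10.125"), mode);
        return true; // passes for the original AND the mutant -> mutant survives
    }

    // Strong oracle: pins the expected value -> the mutant is killed.
    static boolean strongTest(RoundingMode mode) {
        return vat(new BigDecimal("10.125"), mode).equals(new BigDecimal("2.03"));
    }
}
```

Here `10.125 × 0.20 = 2.025`, so `HALF_UP` yields 2.03 and the `HALF_DOWN` mutant yields 2.02: identical coverage, opposite mutation score.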

⚠️ The coverage trap

In the PBT case study, traditional tests achieved 90% line coverage and 70% branch coverage while missing both computation bugs (early rounding of discount rate, wrong rounding mode on VAT). By every coverage metric, the test suite looked excellent. Mutation testing would have revealed these as survived mutants: change the rounding mode, tests still pass.

Mutation score is the true metric for test effectiveness:

Mutation score = (killed mutants) / (total mutants generated)

Test Oracle Reliability Score (TORS)

TORS measures the reliability of the test suite itself. In environments where 84% of test failures are flaky (Google data), treating every failure as a real regression wastes engineering time and erodes trust in CI.

TORS = (real failures) / (total failures)
If TORS < threshold → test is flagged as unreliable

Unreliable tests are quarantined: they still run, but their results are tracked separately and do not block the pipeline. This prevents the common failure mode where developers stop trusting CI and start merging despite failures.
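A minimal sketch of this quarantine gate in plain Java (the threshold value and names are illustrative, not part of the framework):

```java
public class TorsGate {

    /** TORS = real failures / total failures; a test that has never failed scores 1.0. */
    static double tors(int realFailures, int totalFailures) {
        return totalFailures == 0 ? 1.0 : (double) realFailures / totalFailures;
    }

    /** Below the threshold, the test still runs but no longer blocks the pipeline. */
    static boolean quarantine(int realFailures, int totalFailures, double threshold) {
        return tors(realFailures, totalFailures) < threshold;
    }

    public static void main(String[] args) {
        // With 84% of failures flaky (the Google figure), TORS = 0.16:
        System.out.println(tors(16, 100));            // 0.16
        System.out.println(quarantine(16, 100, 0.5)); // true -> quarantined, non-blocking
    }
}
```

The design choice is the non-blocking part: quarantined tests keep producing signal for the dashboard while their verdicts stop gating merges.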

Testing Strategy Selection

The classic test pyramid is not wrong; it is incomplete. The right testing shape depends on the architecture and risk profile of the system. Visdom Testing supports multiple strategies:

| Strategy  | Shape                                     | Best For                        | Emphasis |
|-----------|-------------------------------------------|---------------------------------|----------|
| Pyramid   | Many unit, fewer integration, minimal E2E | Monoliths, well-bounded domains | L0 + L1 heavy, L3 light |
| Trophy    | Integration-heavy (Kent C. Dodds model)   | Modern frontend, full-stack apps | L1 + L2 heavy, static analysis + integration |
| Honeycomb | Contract-focused (Spotify model)          | Microservices                   | L3 heavy, L0 for shared conventions |
| Diamond   | Wide integration layer                    | Domain services, data pipelines | L1 + L2 heavy, wide property coverage |

Start with the pyramid, adapt to your architecture

If you don't know which strategy fits, start with the pyramid (maximize L0 and L1, add L2 for critical paths, add L3 when you have service boundaries). The dashboard will show where defects escape, telling you which layer needs investment.

Visdom SDLC Integration

Visdom Testing connects to the broader Visdom AI-Native SDLC metrics framework:

| Metric | What it measures | Visdom Testing's role |
|--------|------------------|-----------------------|
| ITS (Iterations-to-Success) | Iterations from task to passing CI | Reduces ITS by catching defects earlier (L0 in <10s vs L3 in ~10 min). Fast feedback = fewer iterations. |
| CPI (Cost-per-Iteration) | Tokens + compute + CI + review per iteration | L0 and L1 are near-zero cost. L2 runs only on changed code (incremental). Total testing cost tracked per layer. |
| TORS (Test Oracle Reliability Score) | % of test failures that are real regressions | Directly measured. Flaky tests quarantined. Prevents the "lying oracle" problem where agents "fix" tests that aren't broken. |

Full metrics framework →