Mental Model
Every code change passes through four testing layers. Each layer catches a different class of defect at a different cost and speed. Lower layers are deterministic and fast; upper layers are thorough and expensive. Together they form a complete testing strategy that addresses the specific failure modes of AI-generated code.
The key insight: no single testing technique catches all defect classes. Architecture tests catch structural violations. Property-based tests catch computation bugs. Mutation testing measures whether tests would actually catch a bug. Contract tests verify cross-service correctness. Each layer compensates for the blind spots of the others.
Layer Diagram
┌──────────────────────────────────────────────────────────┐
│ Code Change (PR / Commit) │
└──────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 0: Architecture Testing (<10s) │
│ ArchUnit rules. Deterministic. Zero AI dependency. │
│ Layer boundaries, banned APIs, naming conventions. │
└──────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 1: Property-Based Testing (~2s) │
│ jqwik properties. 1000+ generated inputs/property. │
│ Computation correctness. Breaks circular test problem. │
└──────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 2: Mutation Testing (~5 min) │
│ PIT + AI-augmented mutants. Mutation score as the │
│ true measure of test effectiveness. Kill the mutants. │
└──────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 3: Contract Testing (~10 min) │
│ Pact consumer-driven contracts. Cross-service API │
│ compatibility. No full integration env required. │
└──────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ DASHBOARD: Quality Intelligence │
│ Mutation score trends, TORS, defect escape rate, │
│ flake rate, per-layer health. CI feedback loop. │
└──────────────────────────────────────────────────────────┘
Layer Summary
| Layer | Type | Time | Defect Class | Purpose |
|---|---|---|---|---|
| L0 | Deterministic | <10s | Structural | Enforce architecture constraints: layer boundaries, banned APIs, naming, dependency rules |
| L1 | Generative | ~2s | Computation | Property-based testing with 1000+ inputs. Catch logic errors, rounding bugs, edge cases |
| L2 | Evaluative | ~5 min | Test quality | Mutation testing. Measure whether tests would actually catch a bug if one existed |
| L3 | Contractual | ~10 min | Integration | Consumer-driven contracts. Verify API compatibility across services without E2E environments |
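To give a taste of what the upper layer looks like in practice, here is a minimal L3 consumer-driven contract sketch using Pact JVM's JUnit 5 support. The service names, provider state, and payload fields are hypothetical illustrations, not part of Visdom Testing:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslJsonBody;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;

import static org.junit.jupiter.api.Assertions.assertEquals;

// The consumer declares the exact API shape it depends on. Pact later
// replays this contract against the real provider in the provider's CI,
// so a breaking change fails there without a shared E2E environment.
@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "order-service")
class OrderClientPactTest {

    @Pact(consumer = "billing-service")
    RequestResponsePact orderById(PactDslWithProvider builder) {
        return builder
            .given("order 42 exists")                 // provider state
            .uponReceiving("a request for order 42")
            .path("/orders/42")
            .method("GET")
            .willRespondWith()
            .status(200)
            .body(new PactDslJsonBody()
                .integerType("id", 42L)
                .decimalType("total", 19.99))         // type matchers, not literals
            .toPact();
    }

    @Test
    void fetchesOrder(MockServer mockServer) throws Exception {
        // The consumer's real HTTP call runs against Pact's mock server.
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(mockServer.getUrl() + "/orders/42")).build(),
            HttpResponse.BodyHandlers.ofString());
        assertEquals(200, resp.statusCode());
    }
}
```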
Key Design Decisions
Why Layers
Each layer catches a fundamentally different class of defect. Architecture tests catch structural violations (controller calling repository directly). Property-based tests catch computation bugs (wrong rounding mode on VAT calculation). Mutation testing catches weak tests (tests that pass regardless of code correctness). Contract tests catch integration breaks (provider changed the API shape).
No single layer covers all four. Running all four layers costs less than one production incident.
💡 Defect class coverage
In our controlled experiment, ArchUnit (L0) caught 10/10 structural violations that compiled successfully. Property-based testing (L1) found 2 computation bugs that 90% line coverage and 16 hand-written tests missed. These are orthogonal defect classes: running one does not reduce the need for the other.
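The VAT rounding bug is exactly the kind of defect a few lines of jqwik can pin down. A minimal property sketch, assuming a hypothetical `VatCalculator` with a 25% rate and banker's rounding as its spec (neither is from the case study code):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

import net.jqwik.api.ForAll;
import net.jqwik.api.Property;
import net.jqwik.api.constraints.BigRange;

class VatProperties {

    // Property: for any net amount, VAT equals net * 25% rounded HALF_EVEN
    // to 2 decimals. jqwik generates 1000 amounts per run and shrinks any
    // failure to a minimal counterexample.
    @Property(tries = 1000)
    boolean vatUsesBankersRounding(
            @ForAll @BigRange(min = "0.01", max = "100000.00") BigDecimal net) {
        BigDecimal expected = net.multiply(new BigDecimal("0.25"))
                                 .setScale(2, RoundingMode.HALF_EVEN);
        return VatCalculator.vat(net).equals(expected);  // hypothetical SUT
    }
}
```

A hand-written test checks the one amount its author thought of; the property states the rule itself, so any generated amount where the implementation rounds early or with the wrong mode becomes a failing counterexample.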
Deterministic Backstop
Layer 0 (ArchUnit) is immune to AI regression. It requires no AI model, no LLM call, no network access. The rules are deterministic Java code that runs in the test suite. Even if every AI model degrades, hallucinates, or becomes unavailable, L0 continues to enforce architecture constraints.
This is the floor, the minimum guarantee. In a world where AI-generated code is the majority of new code, having a deterministic backstop that prevents the most common structural violations is non-negotiable.
- 10/10 AI generations violated layer boundaries without ArchUnit rules
- 0/10 AI generations violated layer boundaries with ArchUnit in the test loop
- Violations compile successfully, so only a test gate catches them
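Such a gate is ordinary JUnit-executed Java. A minimal ArchUnit sketch (package names are illustrative):

```java
import com.tngtech.archunit.junit.AnalyzeClasses;
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

// Runs as part of the normal test suite: no AI model, no network.
// A controller that reaches past the service layer into a repository
// compiles fine but fails this rule in seconds.
@AnalyzeClasses(packages = "com.example.app")
class ArchitectureTest {

    @ArchTest
    static final ArchRule controllersMustNotTouchRepositories =
        noClasses().that().resideInAPackage("..controller..")
            .should().dependOnClassesThat().resideInAPackage("..repository..");

    @ArchTest
    static final ArchRule noJavaUtilLogging =
        noClasses().should().dependOnClassesThat()
            .resideInAPackage("java.util.logging..");   // banned-API example
}
```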
Mutation Score Over Coverage
Line coverage measures which lines executed. Mutation testing measures which lines are actually verified. The difference is critical: a test that runs a function but asserts nothing increases coverage but catches zero bugs.
⚠️ The coverage trap
In the PBT case study, traditional tests achieved 90% line coverage and 70% branch coverage while missing both computation bugs (early rounding of discount rate, wrong rounding mode on VAT). By every coverage metric, the test suite looked excellent. Mutation testing would have revealed these as survived mutants: change the rounding mode, tests still pass.
Mutation score is the true metric for test effectiveness:
- Killed mutant: at least one test failed when the code was changed. The test is doing its job.
- Survived mutant: no test failed. Either a missing test or a weak assertion.
- Target: mutation score >80% for business-critical code, >60% for utilities.
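The contrast is easy to see in code. Both tests below execute a hypothetical `Prices.vat` (net × 25%, rounded HALF_EVEN to 2 decimals), so both earn identical line coverage; only the second kills a mutant that flips the rounding mode, the kind of domain-aware mutation L2's AI-augmented mutants add on top of PIT's built-in operators:

```java
import java.math.BigDecimal;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;

class VatTest {

    // Survives mutation: executes vat() (full line coverage) but pins no
    // value. Flip HALF_EVEN to HALF_UP, or 0.25 to 0.26, and it still passes.
    @Test
    void weak_vatIsNotNull() {
        assertNotNull(Prices.vat(new BigDecimal("10.10")));
    }

    // Kills the mutant: 10.10 * 0.25 = 2.525, which HALF_EVEN rounds to
    // 2.52 and HALF_UP rounds to 2.53, so the mutated code fails here.
    @Test
    void strong_vatRoundsHalfEven() {
        assertEquals(new BigDecimal("2.52"), Prices.vat(new BigDecimal("10.10")));
    }
}
```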
Test Oracle Reliability Score (TORS)
TORS measures the reliability of the test suite itself. In environments where 84% of test failures are flaky (Google data), treating every failure as a real regression wastes engineering time and erodes trust in CI.
TORS = (real failures) / (total failures)
If TORS < threshold → test is flagged as unreliable
Unreliable tests are quarantined: they still run, but their results are tracked separately and do not block the pipeline. This prevents the common failure mode where developers stop trusting CI and start merging despite failures.
- TORS > 85%: healthy test suite, failures are real signals
- TORS 50-85%: flake remediation needed, prioritize top offenders
- TORS < 50%: test suite is a liability, not an asset
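A sketch of the mechanics (the record and threshold names are illustrative, not a real Visdom API):

```java
import java.util.List;

// Illustrative only: computes TORS per test from a labeled failure history
// and flags unreliable tests for quarantine (tracked, non-blocking).
record FailureRecord(String testId, boolean wasRealRegression) {}

class TorsGate {
    static final double QUARANTINE_THRESHOLD = 0.85;

    static double tors(List<FailureRecord> failures) {
        long real = failures.stream().filter(FailureRecord::wasRealRegression).count();
        return failures.isEmpty() ? 1.0 : (double) real / failures.size();
    }

    static boolean shouldQuarantine(List<FailureRecord> failuresForTest) {
        return tors(failuresForTest) < QUARANTINE_THRESHOLD;
    }
}
```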
Testing Strategy Selection
The classic test pyramid is not wrong; it is incomplete. The right testing shape depends on the architecture and risk profile of the system. Visdom Testing supports multiple strategies:
| Strategy | Shape | Best For | Emphasis |
|---|---|---|---|
| Pyramid | Many unit, fewer integration, minimal E2E | Monoliths, well-bounded domains | L0 + L1 heavy, L3 light |
| Trophy | Integration-heavy (Kent C. Dodds model) | Modern frontend, full-stack apps | L1 + L2 heavy, static analysis + integration |
| Honeycomb | Contract-focused (Spotify model) | Microservices | L3 heavy, L0 for shared conventions |
| Diamond | Wide integration layer | Domain services, data pipelines | L1 + L2 heavy, wide property coverage |
✅ Start with the pyramid, adapt to your architecture
If you don't know which strategy fits, start with the pyramid (maximize L0 and L1, add L2 for critical paths, add L3 when you have service boundaries). The dashboard will show where defects escape, telling you which layer needs investment.
Visdom SDLC Integration
Visdom Testing connects to the broader Visdom AI-Native SDLC metrics framework:
| Metric | What it measures | Visdom Testing's role |
|---|---|---|
| ITS (Iterations-to-Success) | Iterations from task to passing CI | Reduces ITS by catching defects earlier (L0 in <10s vs L3 in ~10min). Fast feedback = fewer iterations. |
| CPI (Cost-per-Iteration) | Tokens + compute + CI + review per iteration | L0 and L1 are near-zero cost. L2 runs only on changed code (incremental). Total testing cost tracked per layer. |
| TORS (Test Oracle Reliability Score) | % of test failures that are real regressions | Directly measured. Flaky tests quarantined. Prevents the lying oracle problem where agents "fix" tests that aren't broken. |