Why your tests aren't catching bugs
Coverage measures which lines executed, not whether your tests would catch a bug on those lines. This creates the 90%/0 paradox: 90% line coverage, 0 computation bugs caught.
In our case study, 16 hand-written tests achieved 90% line coverage on a pricing module. They tested specific examples: "price of item A is $10.00." But they never tested the edge cases where rounding and discount interactions produced wrong results. The tests executed the code without actually verifying its correctness.
When AI generates both code and tests, this problem gets worse. The AI encodes the same assumptions in both. If the implementation rounds too early, the test expects the too-early-rounded result. Everything passes. Everything is wrong.
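Here is a hypothetical sketch of that failure mode (the method, the test, and the values are invented for illustration; they mirror the discount-rate bug discussed in the case study below):

// Hypothetical AI-generated implementation: rounds the discount rate
// to 2 decimal places BEFORE applying it -- a subtle computation bug.
BigDecimal applyDiscount(BigDecimal subtotal, BigDecimal discountPercent) {
    BigDecimal rate = discountPercent
            .divide(BigDecimal.valueOf(100), 2, RoundingMode.HALF_UP); // rounds too early
    return subtotal.subtract(subtotal.multiply(rate))
            .setScale(2, RoundingMode.HALF_UP);
}

// Hypothetical AI-generated test: the expected value bakes in the same
// early rounding. A 1.05% discount on 100.00 should give 98.95, but the
// rate rounds 0.0105 down to 0.01, so the test expects the wrong 99.00.
@Test
void appliesDiscount() {
    assertThat(applyDiscount(new BigDecimal("100.00"), new BigDecimal("1.05")))
            .isEqualByComparingTo(new BigDecimal("99.00")); // wrong, but green
}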
ArchUnit in 5 minutes
ArchUnit lets you write architecture rules as executable tests. They run in your normal test suite, take less than 10 seconds, and catch structural violations that code review misses.
Step 1: Add the dependency
<!-- Maven -->
<dependency>
    <groupId>com.tngtech.archunit</groupId>
    <artifactId>archunit-junit5</artifactId>
    <version>1.3.0</version>
    <scope>test</scope>
</dependency>

Step 2: Write your first rule
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

@ArchTest
static final ArchRule controllers_should_not_access_repositories =
    noClasses()
        .that().resideInAPackage("..controller..")
        .should().accessClassesThat()
        .resideInAPackage("..repository..");
Step 3: See it catch AI violations
Ask Copilot to generate a Spring controller. Without this rule, it will call the repository directly and bypass the service layer (10 out of 10 generations did in our trials). With this rule in your test suite, the AI gets a build failure and fixes the violation on retry.
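For context, a hypothetical example of the violating code this rule catches (class and endpoint names are invented):

@RestController
class OrderController {

    private final OrderRepository orderRepository; // skips the service layer

    OrderController(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    @GetMapping("/orders/{id}")
    Order getOrder(@PathVariable Long id) {
        // Direct repository access from a controller: this is the
        // dependency the ArchUnit rule above turns into a build failure.
        return orderRepository.findById(id).orElseThrow();
    }
}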
✅ Start with these 3 rules
1. Controllers must not access repositories directly.
2. No field injection (use constructor injection).
3. No generic exceptions (throw specific types).
These three rules catch the most common AI-generated architecture violations.
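Rules 2 and 3 do not even need custom predicates: ArchUnit ships them as predefined rules in its GeneralCodingRules library. A minimal sketch (the root package is a placeholder):

import com.tngtech.archunit.junit.AnalyzeClasses;
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.library.GeneralCodingRules.NO_CLASSES_SHOULD_THROW_GENERIC_EXCEPTIONS;
import static com.tngtech.archunit.library.GeneralCodingRules.NO_CLASSES_SHOULD_USE_FIELD_INJECTION;

@AnalyzeClasses(packages = "com.example.shop") // placeholder: your root package
class CodingRulesTest {

    @ArchTest
    static final ArchRule no_field_injection =
        NO_CLASSES_SHOULD_USE_FIELD_INJECTION;

    @ArchTest
    static final ArchRule no_generic_exceptions =
        NO_CLASSES_SHOULD_THROW_GENERIC_EXCEPTIONS;
}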
Writing your first property
Property-based testing flips the script: instead of writing specific examples, you declare properties that must hold for all valid inputs. The framework generates thousands of inputs and checks each one.
Start with invariants
import java.math.BigDecimal;

import net.jqwik.api.ForAll;
import net.jqwik.api.Property;
import net.jqwik.api.constraints.BigRange;
import net.jqwik.api.constraints.IntRange;
import net.jqwik.api.constraints.Positive;

import static org.assertj.core.api.Assertions.assertThat;

@Property
void totalShouldNeverBeNegative(
        @ForAll @Positive BigDecimal unitPrice,
        @ForAll @IntRange(min = 1, max = 10000) int quantity,
        @ForAll @BigRange(min = "0", max = "50") BigDecimal discountPercent
) {
    BigDecimal total = pricingService.calculateTotal(
            unitPrice, quantity, discountPercent
    );
    assertThat(total).isGreaterThanOrEqualTo(BigDecimal.ZERO);
}

By default, jqwik runs this single property against 1,000 randomly generated input combinations. It catches edge cases you would never think to write by hand: tiny unit prices with large quantities, discounts that round to exactly zero, boundary interactions.
Then add oracle properties
An oracle is a simpler reference implementation that you compare against. This is where the real bugs are caught.
The oracle pattern
Compare your optimized production function against a naive but obviously correct reference. Any disagreement is a bug in one of them.
@Property
void pricingShouldMatchReferenceImplementation(
        @ForAll @Positive BigDecimal unitPrice,
        @ForAll @IntRange(min = 1, max = 10000) int quantity,
        @ForAll @BigRange(min = "0", max = "50") BigDecimal discountPercent
) {
    // Production implementation (optimized)
    BigDecimal actual = pricingService.calculateTotal(
            unitPrice, quantity, discountPercent
    );

    // Reference oracle (naive but correct)
    BigDecimal subtotal = unitPrice.multiply(BigDecimal.valueOf(quantity));
    BigDecimal discountAmount = subtotal.multiply(
            discountPercent.divide(BigDecimal.valueOf(100), 10, RoundingMode.HALF_UP)
    );
    BigDecimal expected = subtotal.subtract(discountAmount)
            .setScale(2, RoundingMode.HALF_UP);

    assertThat(actual).isEqualByComparingTo(expected);
}

💡 Case study: pricing module
This exact pattern found 2 bugs that 90% line coverage missed: early rounding of the discount rate, and the wrong rounding mode on the VAT calculation. The framework shrank the failing input down to a minimal counterexample: unitPrice=0.01, qty=48, discount=1.05%.
When NOT to use PBT
- Integration tests — PBT generates thousands of inputs. If each input hits a database or external service, your tests will be too slow. Use PBT for pure computation logic.
- Simple CRUD — If your method just saves an entity and returns it, there is no interesting property to declare. Example-based tests are fine here.
- UI rendering — There are rarely meaningful properties to declare about visual output. Use snapshot tests or visual regression tools instead.
Reading mutation reports
Mutation testing modifies your code (creates "mutants") and checks if your tests catch the change. Here is what the results mean:
- Killed — Your tests detected the mutant. Good. This means the test would catch a real bug in that code.
- Survived — Your tests did NOT detect the mutant. Bad. This means a real bug in that code would go unnoticed. Focus your testing effort here.
- Equivalent — The mutant is functionally identical to the original (e.g., replacing x > 0 with x >= 1 for integers). Not a test gap: no test can ever kill it.
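For intuition, here is roughly what mutants look like (the code is hypothetical; the exact operators depend on your tool):

// Original production code:
if (quantity > 100) { applyBulkDiscount(); }

// Typical mutants, each applied one at a time in a separate run:
if (quantity >= 100) { applyBulkDiscount(); }   // boundary mutant
if (quantity < 100) { applyBulkDiscount(); }    // negated conditional
// applyBulkDiscount() call removed entirely    // removed-call mutant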
⚠️ Focus on critical code
Do not chase 100% mutation score across the entire codebase. Focus on survived mutants in critical code paths: pricing, authentication, authorization, financial calculations. A survived mutant in a toString() method is not worth your time.
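For Java, PIT (pitest.org) is the standard mutation testing tool. A minimal Maven sketch, assuming JUnit 5 (package names are placeholders and version numbers are illustrative; check pitest.org for current ones):

<plugin>
    <groupId>org.pitest</groupId>
    <artifactId>pitest-maven</artifactId>
    <version>1.16.1</version>
    <dependencies>
        <dependency>
            <groupId>org.pitest</groupId>
            <artifactId>pitest-junit5-plugin</artifactId>
            <version>1.2.1</version>
        </dependency>
    </dependencies>
    <configuration>
        <!-- Placeholder packages: point these at your critical code -->
        <targetClasses>
            <param>com.example.pricing.*</param>
        </targetClasses>
        <targetTests>
            <param>com.example.pricing.*</param>
        </targetTests>
    </configuration>
</plugin>

Run it with mvn test-compile org.pitest:pitest-maven:mutationCoverage; the HTML report lands under target/pit-reports.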
Anti-patterns to avoid
- Testing implementation, not behavior — If your test breaks when you refactor internals without changing behavior, it is testing the wrong thing. Test what the code does, not how it does it.
- Mocking everything — When every dependency is mocked, your test only verifies that you called the mocks in the expected order. It does not verify that the system actually works. Use real dependencies where possible; mock only external services.
- Ignoring flaky tests — A flaky test is not "probably fine." Either the test itself is nondeterministic (fix the test) or it is exposing a real race condition in the code (that is a real bug).
- 100% coverage as a goal — Coverage is a minimum bar, not a target. Chasing 100% leads to trivial tests on getters/setters that catch nothing. Mutation score is a better measure of test effectiveness.
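A quick sketch contrasting the first two anti-patterns (service, collaborators, and values are hypothetical):

// Testing implementation: verifies mock interactions, breaks on refactor
verify(discountCalculator).computeRate(any());
verify(rounder).round(any(), eq(2));

// Testing behavior: verifies the observable result, survives refactors
assertThat(pricingService.calculateTotal(unitPrice, quantity, discountPercent))
        .isEqualByComparingTo(new BigDecimal("98.95"));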
TDD in the AI era
Test-driven development becomes more important, not less, when AI generates code. The workflow:
- Write a failing test first — Define what the code should do before any code exists. This is your specification (see the sketch after this list).
- Let AI implement — Give the AI the failing test and let it write the implementation. The test constrains what the AI can produce.
- Verify with PBT — Add property-based tests to catch edge cases the AI did not consider. The properties are your safety net.
- Check with mutation testing — Confirm that your tests actually verify behavior, not just execute code.
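A minimal sketch of step 1, assuming a hypothetical PricingService.applyTieredDiscount method that does not exist yet:

// Written BEFORE any implementation exists. applyTieredDiscount is
// invented here: this failing test is the specification handed to the AI.
@Test
void ordersOver100EurosGetTenPercentDiscount() {
    PricingService service = new PricingService();

    BigDecimal total = service.applyTieredDiscount(new BigDecimal("150.00"));

    // 150.00 minus 10% = 135.00
    assertThat(total).isEqualByComparingTo(new BigDecimal("135.00"));
}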
The tests are your specification. When AI generates code, the tests tell it what "correct" means. Without tests-first, you are asking the AI to guess your requirements.