
For Developers

Practical guide to writing better tests for AI-generated code.

Why your tests aren't catching bugs

Coverage measures which lines executed, not whether your tests would catch a bug on those lines. This creates the 90%/0 paradox: 90% line coverage, 0 computation bugs caught.

In our case study, 16 hand-written tests achieved 90% line coverage on a pricing module. They tested specific examples: "price of item A is $10.00." But they never tested the edge cases where rounding and discount interactions produced wrong results. The tests executed the code without actually verifying its correctness.

When AI generates both code and tests, this problem gets worse. The AI encodes the same assumptions in both. If the implementation rounds too early, the test expects the too-early-rounded result. Everything passes. Everything is wrong.
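A minimal sketch of this failure mode, using a hypothetical bug (floating-point money rather than the rounding bug from the case study): the implementation and the test generated from it share the same wrong representation, so the test passes while the business rule fails.

```java
// Hypothetical example: money represented as double, an assumption an
// AI-generated test would inherit from the implementation it was built from.
class DoubleMoney {
    // Sums two line-item prices using double arithmetic.
    static double total(double a, double b) {
        return a + b;   // 0.1 + 0.2 yields 0.30000000000000004, not 0.30
    }
}
```

A generated assertion like `assertEquals(0.30000000000000004, total(0.1, 0.2))` passes, while a human-written expectation of 0.30 would expose the representation bug immediately.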

ArchUnit in 5 minutes

ArchUnit lets you write architecture rules as executable tests. They run in your normal test suite, take less than 10 seconds, and catch structural violations that code review misses.

Step 1: Add the dependency

<!-- Maven -->
<dependency>
  <groupId>com.tngtech.archunit</groupId>
  <artifactId>archunit-junit5</artifactId>
  <version>1.3.0</version>
  <scope>test</scope>
</dependency>

Step 2: Write your first rule

// Inside a test class annotated with @AnalyzeClasses(packages = "com.example")
// (use your own base package), with a static import of ArchRuleDefinition.noClasses:
@ArchTest
static final ArchRule controllers_should_not_access_repositories =
    noClasses()
        .that().resideInAPackage("..controller..")
        .should().accessClassesThat()
        .resideInAPackage("..repository..");

Step 3: See it catch AI violations

Ask Copilot to generate a Spring controller. Without this rule, it called the repository directly in 10 out of 10 of our trials, bypassing the service layer. With this rule in your test suite, the AI gets a build failure and fixes the violation on retry.

Start with these 3 rules

1. Controllers must not access repositories directly.
2. No field injection (use constructor injection).
3. No generic exceptions (throw specific types).

These three rules catch the most common AI-generated architecture violations.
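A sketch of all three rules in one ArchUnit test class. The base package com.example and the class name are placeholders; rules 2 and 3 ship with ArchUnit's GeneralCodingRules, so you only write rule 1 yourself.

```java
import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;
import static com.tngtech.archunit.library.GeneralCodingRules.NO_CLASSES_SHOULD_THROW_GENERIC_EXCEPTIONS;
import static com.tngtech.archunit.library.GeneralCodingRules.NO_CLASSES_SHOULD_USE_FIELD_INJECTION;

import com.tngtech.archunit.junit.AnalyzeClasses;
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;

@AnalyzeClasses(packages = "com.example")   // placeholder: your base package
class StarterArchitectureTest {

    // Rule 1: controllers must go through the service layer.
    @ArchTest
    static final ArchRule controllersDoNotTouchRepositories =
        noClasses()
            .that().resideInAPackage("..controller..")
            .should().accessClassesThat()
            .resideInAPackage("..repository..");

    // Rule 2: no field injection — predefined rule shipped with ArchUnit.
    @ArchTest
    static final ArchRule noFieldInjection = NO_CLASSES_SHOULD_USE_FIELD_INJECTION;

    // Rule 3: no generic exceptions (Throwable, Exception, RuntimeException) — also predefined.
    @ArchTest
    static final ArchRule noGenericExceptions = NO_CLASSES_SHOULD_THROW_GENERIC_EXCEPTIONS;
}
```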

Writing your first property

Property-based testing flips the script: instead of writing specific examples, you declare properties that must hold for all valid inputs. The framework generates thousands of inputs and checks each one.

Start with invariants

@Property
void totalShouldNeverBeNegative(
    @ForAll @Positive BigDecimal unitPrice,
    @ForAll @IntRange(min = 1, max = 10000) int quantity,
    @ForAll @BigRange(min = "0", max = "50") BigDecimal discountPercent
) {
    BigDecimal total = pricingService.calculateTotal(
        unitPrice, quantity, discountPercent
    );
    assertThat(total).isGreaterThanOrEqualTo(BigDecimal.ZERO);
}

This single property tests with 1000+ random combinations. It catches edge cases you would never think to write by hand: tiny unit prices with large quantities, discounts that round to exactly zero, boundary interactions.

Then add oracle properties

An oracle is a simpler reference implementation that you compare against. This is where the real bugs are caught.

The oracle pattern

Compare your optimized production function against a naive but obviously correct reference. Any disagreement is a bug in one of them.

@Property
void pricingShouldMatchReferenceImplementation(
    @ForAll @Positive BigDecimal unitPrice,
    @ForAll @IntRange(min = 1, max = 10000) int quantity,
    @ForAll @BigRange(min = "0", max = "50") BigDecimal discountPercent
) {
    // Production implementation (optimized)
    BigDecimal actual = pricingService.calculateTotal(
        unitPrice, quantity, discountPercent
    );

    // Reference oracle (naive but correct)
    BigDecimal subtotal = unitPrice.multiply(BigDecimal.valueOf(quantity));
    BigDecimal discountAmount = subtotal.multiply(
        discountPercent.divide(BigDecimal.valueOf(100), 10, RoundingMode.HALF_UP)
    );
    BigDecimal expected = subtotal.subtract(discountAmount)
        .setScale(2, RoundingMode.HALF_UP);

    assertThat(actual).isEqualByComparingTo(expected);
}

💡 Case study: pricing module

This exact pattern found 2 bugs that 90% coverage missed: early rounding of the discount rate and wrong rounding mode on VAT calculation. The minimal counterexample was: unitPrice=0.01, qty=48, discount=1.05%.
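You can replay that counterexample with plain BigDecimal arithmetic. The sketch below contrasts the reference calculation with one plausible form of the early-rounding bug (class and method names are illustrative, not the case study's actual code):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class CounterexampleReplay {
    // Reference: keep the discount rate at full precision, round once at the end.
    static BigDecimal correctTotal(BigDecimal unitPrice, int qty, BigDecimal discountPct) {
        BigDecimal subtotal = unitPrice.multiply(BigDecimal.valueOf(qty));
        BigDecimal rate = discountPct.divide(BigDecimal.valueOf(100), 10, RoundingMode.HALF_UP);
        return subtotal.subtract(subtotal.multiply(rate)).setScale(2, RoundingMode.HALF_UP);
    }

    // Bug: rounding the rate to 2 decimal places turns 1.05% into 1%.
    static BigDecimal buggyTotal(BigDecimal unitPrice, int qty, BigDecimal discountPct) {
        BigDecimal subtotal = unitPrice.multiply(BigDecimal.valueOf(qty));
        BigDecimal rate = discountPct.divide(BigDecimal.valueOf(100), 2, RoundingMode.HALF_UP);
        return subtotal.subtract(subtotal.multiply(rate)).setScale(2, RoundingMode.HALF_UP);
    }
}
```

For unitPrice=0.01, qty=48, discount=1.05%, the reference yields 0.47 while the early-rounding version yields 0.48: a one-cent disagreement the oracle property flags immediately.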

When NOT to use PBT

Property-based testing is not a universal tool. Skip it when you cannot state a property stronger than "it does not crash" (thin glue code, simple mappers), when the code is dominated by I/O or calls to external services, or when a handful of example tests already pins down the full behavior. It pays off on computational logic with many interacting inputs, like the pricing code above.

Reading mutation reports

Mutation testing modifies your code (creates "mutants") and checks whether your tests notice the change. In the report, a killed mutant means at least one test failed when the code changed (good); a survived mutant means every test still passed, so nothing verifies that behavior; no coverage means the mutated line was never executed at all. The mutation score is the percentage of mutants killed.
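To see why a survived mutant matters, here is a minimal sketch (names are illustrative) of the kind of change a tool like PIT makes: it flips subtract to add, and only a test that asserts on the result can tell the two versions apart.

```java
import java.math.BigDecimal;

class MutantSketch {
    // Production code.
    static BigDecimal applyDiscount(BigDecimal subtotal, BigDecimal discountAmount) {
        return subtotal.subtract(discountAmount);
    }

    // What the mutation tool effectively runs: "-" mutated to "+".
    static BigDecimal mutant(BigDecimal subtotal, BigDecimal discountAmount) {
        return subtotal.add(discountAmount);
    }
}
```

A test that merely calls applyDiscount executes both versions without failing, so the mutant survives. A test asserting that 10.00 minus a 1.00 discount is 9.00 kills it, because the mutant returns 11.00.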

⚠️ Focus on critical code

Do not chase 100% mutation score across the entire codebase. Focus on survived mutants in critical code paths: pricing, authentication, authorization, financial calculations. A survived mutant in a toString() method is not worth your time.

Anti-patterns to avoid

The most common trap is letting the AI generate tests from the code it just wrote: as noted earlier, both then encode the same assumptions, and the tests can only confirm the implementation, never check it. Close behind: asserting only that no exception is thrown, copying the implementation's arithmetic into the expected value, and chasing coverage numbers instead of killed mutants.

TDD in the AI era

Test-driven development becomes more important, not less, when AI generates code. The workflow:

  1. Write a failing test first — Define what the code should do before any code exists. This is your specification.
  2. Let AI implement — Give the AI the failing test and let it write the implementation. The test constrains what the AI can produce.
  3. Verify with PBT — Add property-based tests to catch edge cases the AI did not consider. The properties are your safety net.
  4. Check with mutation testing — Confirm that your tests actually verify behavior, not just execute code.

The tests are your specification. When AI generates code, the tests tell it what "correct" means. Without tests-first, you are asking the AI to guess your requirements.
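A minimal sketch of step 1, with hypothetical names: the specification test exists before the implementation, so a throwing stub makes it fail for the right reason until the AI supplies real logic.

```java
import java.math.BigDecimal;

// The implementation does not exist yet; the stub fails every test honestly.
class PricingServiceStub {
    static BigDecimal calculateTotal(BigDecimal unitPrice, int qty, BigDecimal discountPct) {
        throw new UnsupportedOperationException("not implemented yet");
    }
}

// The failing specification test handed to the AI would look like:
//
// @Test
// void appliesDiscountBeforeRounding() {
//     assertThat(PricingServiceStub.calculateTotal(
//             new BigDecimal("10.00"), 3, new BigDecimal("10")))
//         .isEqualByComparingTo("27.00");   // 30.00 minus 10% = 27.00
// }
```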