Layer 1: Property-Based Testing

The mismatch

Traditional tests are specific input-output pairs written by humans for human-written code. A developer thinks of edge cases based on their understanding of the implementation. They pick 5, maybe 10 examples, and assert expected outputs.

AI code generation is stochastic: the same prompt can produce a different implementation on every run. An implementation may handle the 10 hand-picked examples perfectly while silently mishandling the millions of inputs nobody thought to test. Checking code that varies from run to run against a fixed handful of examples is a fundamental mismatch.

Property-based testing narrows this gap. Instead of specifying individual examples, you declare properties (rules that must hold for all valid inputs) and the framework generates hundreds or thousands of test cases automatically.

What is property-based testing?

In property-based testing, you declare invariants about your code's behavior, and the framework generates random valid inputs to verify those invariants. If any generated input violates a property, the framework reports the counterexample.

Consider a pricing module that calculates order totals with discounts and VAT:

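import java.math.BigDecimal;
import net.jqwik.api.*;
import net.jqwik.api.constraints.*;
import static org.assertj.core.api.Assertions.assertThat;

@Property
void totalShouldNeverBeNegative(
    @ForAll @DoubleRange(min = 0.01, max = 10_000) double unitPrice,
    @ForAll @IntRange(min = 1, max = 1000) int quantity,
    // jqwik limits generated doubles to 2 decimal places by default;
    // @Scale(4) also allows rates such as 0.4731.
    @ForAll @DoubleRange(min = 0, max = 0.5) @Scale(4) double discountRate,
    @ForAll @DoubleRange(min = 0, max = 0.3) @Scale(4) double vatRate
) {
    // pricingService is the system under test, assumed to be a field of the test class.
    Money total = pricingService.calculateTotal(unitPrice, quantity, discountRate, vatRate);
    assertThat(total.amount()).isGreaterThanOrEqualTo(BigDecimal.ZERO);
}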

With jqwik's default of 1,000 tries, this single property exercises the pricing function against 1,000 randomly generated combinations of prices, quantities, discounts, and VAT rates. It can surface edge cases that rarely make it into hand-written examples: extreme discount-price-quantity interactions, floating-point rounding surprises, boundary overflows.
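To make the mechanics concrete, here is a minimal sketch of what such a framework does at its core, written with only the JDK rather than jqwik. The `calculateTotal` method is a hypothetical stand-in for `pricingService.calculateTotal` (the real implementation is not shown in this article), and the class and method names are invented for illustration. Real frameworks add edge-case biasing, richer generators, and shrinking on top of this loop.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Optional;
import java.util.Random;

final class PropertyLoopSketch {

    // Hypothetical stand-in for pricingService.calculateTotal; not the article's implementation.
    static BigDecimal calculateTotal(BigDecimal unitPrice, int quantity,
                                     BigDecimal discountRate, BigDecimal vatRate) {
        BigDecimal subtotal = unitPrice.multiply(BigDecimal.valueOf(quantity));
        BigDecimal discount = subtotal.multiply(discountRate).setScale(2, RoundingMode.HALF_UP);
        BigDecimal afterDiscount = subtotal.subtract(discount);
        BigDecimal vat = afterDiscount.multiply(vatRate).setScale(2, RoundingMode.HALF_UP);
        return afterDiscount.add(vat);
    }

    // Runs the "total is never negative" property against `tries` random inputs.
    // Returns a description of the first counterexample, or empty if none was found.
    static Optional<String> checkTotalNeverNegative(long seed, int tries) {
        Random random = new Random(seed);
        for (int i = 0; i < tries; i++) {
            BigDecimal unitPrice = BigDecimal.valueOf(1 + random.nextInt(1_000_000), 2); // 0.01 .. 10000.00
            int quantity = 1 + random.nextInt(1000);                                     // 1 .. 1000
            BigDecimal discountRate = BigDecimal.valueOf(random.nextInt(5001), 4);       // 0.0000 .. 0.5000
            BigDecimal vatRate = BigDecimal.valueOf(random.nextInt(3001), 4);            // 0.0000 .. 0.3000

            BigDecimal total = calculateTotal(unitPrice, quantity, discountRate, vatRate);
            if (total.signum() < 0) {
                return Optional.of("unitPrice=" + unitPrice + ", quantity=" + quantity
                        + ", discountRate=" + discountRate + ", vatRate=" + vatRate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String result = checkTotalNeverNegative(42L, 1_000)
                .map(counterexample -> "Falsified by " + counterexample)
                .orElse("Property held for all tries");
        System.out.println(result);
    }
}
```

The loop makes the trade-off visible: the property says nothing about which total is correct, only that no generated input may push it below zero.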

Three core patterns

Invariant properties

A condition that must hold for all valid inputs, regardless of the specific values. The simplest and most common pattern.

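// Parameter lists are elided below; each property declares the @ForAll inputs it uses, as in the first example.
@Property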
void totalShouldNeverBeNegative(...) {
    Money total = pricingService.calculateTotal(unitPrice, quantity, discountRate, vatRate);
    assertThat(total.amount()).isGreaterThanOrEqualTo(BigDecimal.ZERO);
}

@Property
void discountShouldNeverExceedSubtotal(...) {
    PriceBreakdown breakdown = pricingService.breakdown(unitPrice, quantity, discountRate);
    assertThat(breakdown.discount()).isLessThanOrEqualTo(breakdown.subtotal());
}

Oracle properties

Compare the system under test to a known-correct reference implementation. The most powerful pattern for catching computation bugs.

@Property
void totalShouldMatchReferenceCalculation(...) {
    Money actual = pricingService.calculateTotal(unitPrice, quantity, discountRate, vatRate);

    // Reference: step-by-step calculation with explicit rounding
    BigDecimal subtotal = BigDecimal.valueOf(unitPrice)
        .multiply(BigDecimal.valueOf(quantity));
    BigDecimal discount = subtotal.multiply(BigDecimal.valueOf(discountRate))
        .setScale(2, RoundingMode.HALF_UP);
    BigDecimal afterDiscount = subtotal.subtract(discount);
    BigDecimal vat = afterDiscount.multiply(BigDecimal.valueOf(vatRate))
        .setScale(2, RoundingMode.HALF_UP);
    BigDecimal expected = afterDiscount.add(vat);

    assertThat(actual.amount()).isEqualByComparingTo(expected);
}

Round-trip properties

Apply an operation and its inverse, verifying that you get back the original value. Ideal for serialization, encoding, and data transformation.

// "validOrders" names a @Provide method (not shown) that returns an Arbitrary<Order>.
@Property
void serializeDeserializeRoundTrip(
    @ForAll("validOrders") Order original
) {
    String json = serializer.toJson(original);
    Order restored = serializer.fromJson(json, Order.class);
    assertThat(restored).isEqualTo(original);
}
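The same round-trip idea can be tried without any test framework. The sketch below is an illustration that uses the JDK's `Base64` encoder in place of the article's `serializer`, and random byte arrays in place of `Order` objects; the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.Random;

final class RoundTripSketch {

    // Property: decode(encode(x)) equals x for every byte array x.
    static boolean roundTripHolds(byte[] original) {
        String encoded = Base64.getEncoder().encodeToString(original);
        byte[] restored = Base64.getDecoder().decode(encoded);
        return Arrays.equals(original, restored);
    }

    // Checks the property against `tries` random arrays of length 0..63
    // and returns how many of them violated it.
    static int countFailures(long seed, int tries) {
        Random random = new Random(seed);
        int failures = 0;
        for (int i = 0; i < tries; i++) {
            byte[] input = new byte[random.nextInt(64)];
            random.nextBytes(input);
            if (!roundTripHolds(input)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("Round-trip failures: " + countFailures(7L, 500));
    }
}
```

Note that the property never states what the encoded form should look like. It only demands that nothing is lost on the way there and back, which is why round-trip properties are cheap to write.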

Shrinking

When a property fails, the generated input is often large and complex: a list of 47 items, a string of 200 characters, a deeply nested object. Shrinking is the process by which the framework automatically reduces the failing input toward the smallest counterexample it can find that still triggers the failure.

Instead of reporting "fails for quantity=847, unitPrice=3291.44, discountRate=0.4731," the framework shrinks to something like "fails for quantity=1, unitPrice=0.01, discountRate=0.005": a much smaller reproduction case that points far more directly at where the logic breaks.
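As a rough illustration of the idea, the following JDK-only sketch shrinks a single failing integer. It is a simplified greedy strategy invented for this example, not how jqwik or any other framework implements shrinking, and the "fails for quantities of 13 or more" defect is hypothetical.

```java
import java.util.function.IntPredicate;

final class ShrinkSketch {

    // Greedily shrinks a failing value toward `min`.
    // `property` returns true when the property holds, false when it is violated.
    // The result is a value that still fails and for which none of the tried
    // smaller candidates fails.
    static int shrink(int failing, int min, IntPredicate property) {
        int current = failing;
        boolean improved = true;
        while (improved) {
            improved = false;
            // Candidates from most to least aggressive: lower bound, midpoint, one step down.
            int[] candidates = { min, min + (current - min) / 2, current - 1 };
            for (int candidate : candidates) {
                if (candidate >= min && candidate < current && !property.test(candidate)) {
                    current = candidate;
                    improved = true;
                    break;
                }
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical defect: the property is violated for every quantity of 13 or more.
        IntPredicate holds = quantity -> quantity < 13;
        System.out.println("Shrunk counterexample: quantity=" + shrink(847, 1, holds)); // quantity=13
    }
}
```

For a threshold-style defect like this one, the loop ends exactly at the boundary. For less regular failures a greedy search can stop at a local minimum, which is one reason production frameworks invest heavily in their shrinking strategies.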

Case Study 1: Pricing Module

We gave an AI agent a pricing module specification and asked it to implement the module and write tests. Then we applied property-based testing to the same module.

Metric	AI-generated example tests	Property-based tests
Number of tests	10 example-based	8 properties
Line coverage	90%	80%
Bug 1: early rounding of discount rate	Not detected	Detected
Bug 2: wrong rounding mode on VAT	Not detected	Detected
Bugs found	0/2	2/2

⚠️ Coverage is not correctness

The AI-generated test suite achieved 90% line coverage while missing both computation bugs. High coverage creates false confidence: it shows which lines executed, not whether their results were right. The property-based tests, at 80% coverage, found both bugs because they checked the correctness of the computation.

Bug 1 was an early rounding error: the discount rate was rounded to 2 decimal places before being applied to the subtotal, so the error scaled with the subtotal and became significant on large orders. The AI-written examples used clean discount rates (10%, 20%, 50%) that are unaffected by that rounding, which masked the issue.
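A small worked example shows why clean rates hide this bug. The sketch below is a reconstruction for illustration only: the method names are invented, and the use of `HALF_UP` for the premature rounding is an assumption, since the article does not say which mode the buggy code used.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class EarlyRoundingSketch {

    // Intended behaviour: apply the full-precision rate, then round the money amount once.
    static BigDecimal discountCorrect(BigDecimal subtotal, BigDecimal rate) {
        return subtotal.multiply(rate).setScale(2, RoundingMode.HALF_UP);
    }

    // Assumed shape of the bug: the rate is rounded to 2 decimals before it is applied.
    static BigDecimal discountEarlyRounded(BigDecimal subtotal, BigDecimal rate) {
        BigDecimal roundedRate = rate.setScale(2, RoundingMode.HALF_UP);
        return subtotal.multiply(roundedRate).setScale(2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        BigDecimal subtotal = new BigDecimal("10000.00");

        // A clean rate survives the premature rounding, so both versions agree.
        BigDecimal clean = new BigDecimal("0.10");
        System.out.println(discountCorrect(subtotal, clean) + " vs " + discountEarlyRounded(subtotal, clean));
        // 1000.00 vs 1000.00

        // A rate with a third decimal does not: 0.125 becomes 0.13 before it is applied.
        BigDecimal fractional = new BigDecimal("0.125");
        System.out.println(discountCorrect(subtotal, fractional) + " vs " + discountEarlyRounded(subtotal, fractional));
        // 1250.00 vs 1300.00
    }
}
```

On a subtotal of 10,000.00 the half-percent rounding error in the rate becomes a 50.00 error in the discount, and it grows linearly with the order size.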

Bug 2 was a wrong rounding mode: the VAT calculation used RoundingMode.FLOOR instead of RoundingMode.HALF_UP, producing amounts 1 cent too low whenever the discarded fraction was half a cent or more. The AI-written examples happened to produce values where both modes agreed.
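The difference between the two modes is easy to reproduce in isolation. In this sketch the `vat` helper and the sample amounts are made up for illustration; only the two `RoundingMode` constants come from the case study.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class RoundingModeSketch {

    // Computes VAT on an amount and rounds it to cents with the given mode.
    static BigDecimal vat(String amount, String rate, RoundingMode mode) {
        return new BigDecimal(amount).multiply(new BigDecimal(rate)).setScale(2, mode);
    }

    public static void main(String[] args) {
        // 100.00 * 0.20 = 20.0000: nothing to discard, so the modes agree.
        System.out.println(vat("100.00", "0.20", RoundingMode.HALF_UP)
                + " vs " + vat("100.00", "0.20", RoundingMode.FLOOR)); // 20.00 vs 20.00

        // 12.38 * 0.21 = 2.5998: the discarded part exceeds half a cent, so FLOOR is 1 cent low.
        System.out.println(vat("12.38", "0.21", RoundingMode.HALF_UP)
                + " vs " + vat("12.38", "0.21", RoundingMode.FLOOR)); // 2.60 vs 2.59
    }
}
```

Example tests built from round prices and round rates tend to land in the first case, which is consistent with the example suite passing on the buggy code.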

Case Study 2: Promotion Engine

A more complex module: a promotion engine with stacking rules, minimum order thresholds, and category-specific discounts. The same pattern appeared, with a wider gap.

Metric	AI-generated example tests	Property-based tests
Number of tests	16 example-based	8 properties
Line coverage	100%	85%
Interaction bugs found	0/3	3/3

All three bugs were interaction effects: promotions combining in ways that violated business rules. The example tests exercised each promotion in isolation. The property-based tests generated random combinations and found the interactions that broke invariants.
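The following sketch illustrates the kind of interaction bug described here. Everything in it is hypothetical: the promotion catalogue, the 50% cap, and the buggy additive stacking are invented for the example and are not the rules or defects of the promotion engine in the case study.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class PromotionStackingSketch {

    static final int MAX_TOTAL_PERCENT = 50;              // hypothetical business rule
    static final int[] PROMOTION_PERCENTS = {10, 20, 30}; // hypothetical catalogue

    // Hypothetical buggy engine: adds up percentages without enforcing the cap.
    static int combinedPercent(List<Integer> promotions) {
        int sum = 0;
        for (int percent : promotions) {
            sum += percent;
        }
        return sum;
    }

    // Invariant: the combined discount never exceeds the cap.
    static boolean respectsCap(List<Integer> promotions) {
        return combinedPercent(promotions) <= MAX_TOTAL_PERCENT;
    }

    // Generates random subsets of the catalogue and returns the first one that
    // violates the invariant, or an empty list if none was found.
    static List<Integer> findViolation(long seed, int tries) {
        Random random = new Random(seed);
        for (int i = 0; i < tries; i++) {
            List<Integer> picked = new ArrayList<>();
            for (int percent : PROMOTION_PERCENTS) {
                if (random.nextBoolean()) {
                    picked.add(percent);
                }
            }
            if (!respectsCap(picked)) {
                return picked;
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println("Violating combination: " + findViolation(1L, 1_000));
    }
}
```

Each promotion passes the invariant on its own, so tests that check them one at a time stay green. Only a generated combination exposes the missing cap.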

Not all properties are equal

In the pricing case study, we wrote 8 properties in total. Of those, 6 were structural properties, that is, invariants such as non-negative total, discount not exceeding subtotal, and non-negative VAT, and 2 were oracle properties (total matches a reference calculation).

All 6 structural properties passed on the buggy code. Only the 2 oracle properties failed, and they exposed both bugs.

Structural properties verify shape ("is this value in the right range?"). Oracle properties verify correctness ("is this the right value?"). Both matter, but for critical business logic, you need oracle properties.
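The contrast can be reproduced with a few lines of plain Java. The sketch below assumes a buggy VAT function that rounds with `FLOOR`, in the spirit of Bug 2; the method names, input ranges, and the specific structural check are illustrative choices, not the properties used in the case study.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Random;

final class StructuralVsOracleSketch {

    // Assumed buggy implementation under test: rounds down instead of half-up.
    static BigDecimal vatUnderTest(BigDecimal amount, BigDecimal rate) {
        return amount.multiply(rate).setScale(2, RoundingMode.FLOOR);
    }

    // Reference implementation used as the oracle.
    static BigDecimal vatReference(BigDecimal amount, BigDecimal rate) {
        return amount.multiply(rate).setScale(2, RoundingMode.HALF_UP);
    }

    // Returns {structural failures, oracle failures} over `tries` random inputs.
    static int[] countFailures(long seed, int tries) {
        Random random = new Random(seed);
        int structuralFailures = 0;
        int oracleFailures = 0;
        for (int i = 0; i < tries; i++) {
            BigDecimal amount = BigDecimal.valueOf(1 + random.nextInt(1_000_000), 2); // 0.01 .. 10000.00
            BigDecimal rate = BigDecimal.valueOf(random.nextInt(3001), 4);            // 0.0000 .. 0.3000
            BigDecimal vat = vatUnderTest(amount, rate);

            // Structural property: VAT is non-negative and never exceeds the amount it is charged on.
            if (vat.signum() < 0 || vat.compareTo(amount) > 0) {
                structuralFailures++;
            }
            // Oracle property: VAT equals the reference calculation.
            if (vat.compareTo(vatReference(amount, rate)) != 0) {
                oracleFailures++;
            }
        }
        return new int[] {structuralFailures, oracleFailures};
    }

    public static void main(String[] args) {
        int[] failures = countFailures(2024L, 1_000);
        System.out.println("Structural failures: " + failures[0] + ", oracle failures: " + failures[1]);
    }
}
```

The structural check cannot fail here, because a value that is one cent too low is still in the valid range. Only the comparison against the reference notices it.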

✅ Key takeaway

Write at least one oracle property per critical formula. An oracle can be a simplified reference implementation, a lookup table for known values, or a mathematical identity that the result must satisfy.

Where PBT adds most value

High-value targets

Pure functions — pricing calculations, tax engines, financial computations
Rule engines — promotion stacking, eligibility checks, validation rules
Serialization / parsing — JSON, XML, protocol buffers, CSV round-trips
Data transformations — mapping, aggregation, filtering pipelines
Cryptographic / hashing — encoding round-trips, hash distribution

Low-value targets

Integration tests — external service calls, database interactions (too slow)
Simple CRUD — trivial mapping with no computation
UI rendering — hard to express meaningful properties
Configuration loading — typically tested once with a known file

CI integration

Property-based tests are parameterized by the number of tries. More tries means more input coverage but longer execution. Use a tiered approach:

Stage	Tries	Purpose
Pre-commit / local	100	Fast feedback during development
CI (pull request)	1,000	Thorough validation before merge
Nightly	10,000	Deep exploration for rare edge cases
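One way to wire the tiers above into a build is to derive the number of tries from a setting that each pipeline stage supplies. The sketch below is only an illustration: the `pbt.stage` system property and the stage names are hypothetical, and how the resulting number is handed to the test framework depends on the framework (in jqwik, for instance, the `@Property` annotation has a `tries` attribute).

```java
final class TriesConfigSketch {

    // Maps a pipeline stage to the number of tries from the tiered approach.
    static int triesFor(String stage) {
        switch (stage) {
            case "local":
                return 100;
            case "ci":
                return 1_000;
            case "nightly":
                return 10_000;
            default:
                throw new IllegalArgumentException("Unknown stage: " + stage);
        }
    }

    // Reads the hypothetical "pbt.stage" system property, defaulting to the fast local tier.
    static int configuredTries() {
        return triesFor(System.getProperty("pbt.stage", "local"));
    }

    public static void main(String[] args) {
        System.out.println("Running with " + configuredTries() + " tries");
    }
}
```

Launching the JVM with `-Dpbt.stage=nightly` would then select the 10,000-try tier, while a developer who sets nothing gets the fast default.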

💡 PBT failures are NOT flaky tests

When a property-based test fails, it has found a real problem: a bug in the code or, occasionally, a mistake in the property itself. The input is random, but the failure is deterministic, because you can replay the exact seed. If a PBT test fails in CI, do not re-run it hoping it passes. Investigate the counterexample.
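The reason a seed makes failures reproducible is that a seeded pseudo-random generator always yields the same sequence. The JDK-only sketch below demonstrates this; the "multiples of 97" defect and the method names are hypothetical. In jqwik, the `@Property` annotation has a `seed` attribute that can be used to pin a run in a similar way.

```java
import java.util.Arrays;
import java.util.Random;

final class SeedReplaySketch {

    // Generates `count` quantities in 1..1000; the same seed always yields the same inputs.
    static int[] generateQuantities(long seed, int count) {
        Random random = new Random(seed);
        int[] quantities = new int[count];
        for (int i = 0; i < count; i++) {
            quantities[i] = 1 + random.nextInt(1000);
        }
        return quantities;
    }

    // Hypothetical defect: the code under test breaks for quantities that are multiples of 97.
    // Returns the first generated quantity that triggers it, or -1 if none does.
    static int firstFailingQuantity(long seed, int tries) {
        for (int quantity : generateQuantities(seed, tries)) {
            if (quantity % 97 == 0) {
                return quantity;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long seed = 123L;
        boolean sameInputs = Arrays.equals(generateQuantities(seed, 50), generateQuantities(seed, 50));
        System.out.println("Same inputs on replay: " + sameInputs); // true
        System.out.println("First run:  " + firstFailingQuantity(seed, 1_000));
        System.out.println("Replay run: " + firstFailingQuantity(seed, 1_000)); // same value as the first run
    }
}
```

Re-running with a fresh seed may well pass, but that only means the failing input was not generated this time. The defect is still there.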

Framework landscape

Framework	Language	Notes
QuickCheck	Haskell	The original; introduced property-based testing. First released in 1999.
Hypothesis	Python	Best-in-class shrinking, extensive type strategy library.
jqwik	Java	JUnit 5 integration, excellent arbitrary generators, recommended for JVM.
fast-check	JavaScript / TypeScript	Feature-rich, supports async properties, good shrinking.
Hedgehog	Haskell / Scala / F#	Integrated shrinking — generators and shrinkers defined together.
PropEr	Erlang	QuickCheck-inspired, stateful testing support.
Rapid	Go	Lightweight, integrates with Go's testing package.