CI integration
Each testing layer runs at a different point in the pipeline. The goal is fast feedback: cheap checks first, expensive checks later.
| Layer | Where it runs | Time | Trigger |
|---|---|---|---|
| L0 ArchUnit | Pre-push hook + CI | <10s | Every commit |
| L1 PBT | Unit test phase in CI | ~2s per property | Every commit |
| L2 Mutation | PR check (changed files only) | ~5 min | PR open/update |
| L3 Contracts | PR check + nightly full verification | ~10 min | PR + scheduled |
⚠️ Layer ordering matters
Run L0 and L1 first. They are deterministic and fast. If architecture rules or properties fail, there is no point running mutation analysis or contract verification. Fail fast, save CI minutes.
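The fail-fast ordering can be sketched as a small driver that runs each layer in cost order and stops at the first failure. The stage names and check callables below are illustrative placeholders, not real build commands; an actual pipeline would shell out to the relevant tools (ArchUnit, the property runner, Stryker/PIT, Pact).

```python
# Minimal sketch of fail-fast layer ordering. Checks are placeholder
# callables standing in for real build-tool invocations.
def run_pipeline(stages):
    """Run (name, check) pairs in cost order; stop at the first failure
    so expensive layers never run after a cheap layer has failed."""
    for name, check in stages:
        if not check():
            return f"failed at {name}"
    return "all stages passed"

# Cheap, deterministic layers first; expensive ones later.
stages = [
    ("L0 architecture", lambda: True),
    ("L1 properties", lambda: True),
    ("L2 mutation", lambda: True),
    ("L3 contracts", lambda: True),
]
```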
Test Impact Analysis
As your test suite grows, running everything on every PR becomes impractical. Test Impact Analysis (TIA) identifies which tests are affected by a change and runs only those.
- Spotify model — With 50K+ tests, Spotify runs only tests affected by the changed code paths. Their honeycomb testing model prioritizes integration tests with contract verification.
- Predictive test selection — Tools like Launchable use ML to predict which tests are most likely to fail for a given change, running the high-risk subset first for faster feedback.
- Parallelization strategies — Split test suites by module, by layer, or by estimated duration. Run L0+L1 in parallel with L2 on changed files. Run L3 contracts asynchronously.
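A naive form of test impact analysis can be sketched as an intersection between changed modules and each test's dependency set. The dependency map below is hand-written for illustration; real TIA tools derive it from per-test coverage data or static import analysis.

```python
# Hypothetical test-to-module dependency map; a real tool would
# compute this from coverage traces or the import graph.
TEST_DEPS = {
    "test_orders": {"orders", "pricing"},
    "test_users": {"users"},
    "test_checkout": {"orders", "payments"},
}

def affected_tests(changed_modules, deps=TEST_DEPS):
    """Select only the tests whose dependencies intersect the change."""
    changed = set(changed_modules)
    return sorted(t for t, mods in deps.items() if mods & changed)
```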
Flaky test management
Flaky tests are the #1 reason developers stop trusting CI. The goal is not zero flakiness (impossible at scale) but a managed budget.
Detection
Track per-test pass rates over a rolling window (e.g., last 100 runs). Any test with a pass rate below 98% gets flagged. Use the Test Observability and Reliability Score (TORS) metric to measure suite-level health.
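The rolling-window detection rule can be sketched directly: compute each test's pass rate over its most recent runs and flag anything under the threshold. The test names and history data here are made up for illustration.

```python
WINDOW = 100        # rolling window of recent runs, per the text
THRESHOLD = 0.98    # flag tests whose pass rate drops below 98%

def flag_flaky(history, window=WINDOW, threshold=THRESHOLD):
    """history maps test name -> list of booleans (True = pass), newest
    last. Returns (name, rate) pairs for tests below the threshold."""
    flagged = []
    for name, results in history.items():
        recent = results[-window:]
        rate = sum(recent) / len(recent)
        if rate < threshold:
            flagged.append((name, rate))
    return sorted(flagged)

# Illustrative history: one stable test, one that failed 3 of 100 runs.
history = {
    "stable_test": [True] * 100,
    "flaky_test": [True] * 97 + [False] * 3,
}
```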
Quarantine
Move flaky tests to a quarantine suite that runs but does not block the pipeline. Flaky tests still execute (so you see when they stabilize) but do not erode developer trust.
Ownership assignment
Every quarantined test gets an owner and a fix-by date. Unowned flaky tests accumulate indefinitely. Owned flaky tests get fixed or deleted.
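The quarantine and ownership policy can be sketched as a registry plus a gate check: quarantined failures are reported but do not block, and every registry entry carries an owner and a fix-by date. All names and dates here are hypothetical.

```python
from datetime import date

# Illustrative quarantine registry; every entry needs an owner and a
# fix-by date, per the policy above.
QUARANTINE = {
    "test_payment_retry": {"owner": "team-payments", "fix_by": date(2025, 7, 1)},
}

def gate_result(failures, quarantine=QUARANTINE):
    """Split failures into blocking (not quarantined) and quarantined.
    Quarantined failures are visible but do not fail the pipeline."""
    blocking = [t for t in failures if t not in quarantine]
    quarantined = [t for t in failures if t in quarantine]
    return {
        "blocking": blocking,
        "quarantined": quarantined,
        "pipeline_passes": not blocking,
    }
```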
💡 Flakiness budget
Target: <2% per-run flakiness. Google manages to this budget across millions of tests. It is achievable with discipline.
Quality gates
Quality gates are merge requirements that go beyond "tests pass." They measure whether your tests actually verify behavior.
- Mutation score threshold — Require a minimum mutation score (e.g., 60%) for changed files. Surviving-mutant reports highlight where tests are weakest.
- Contract verification — All consumer-driven contracts must pass before merge. No breaking API changes without consumer coordination.
- Architecture compliance — Zero ArchUnit violations on any PR. This is a hard gate, not a warning.
- Flakiness budget — PR cannot introduce new flaky tests. If a newly added test flakes in CI, the PR is blocked.
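The four gates above can be sketched as one merge-check function. The thresholds match the text; the shape of the input metrics dict is an assumption for illustration, not any real tool's API.

```python
# Sketch of the four quality gates as a single merge check.
def evaluate_gates(metrics):
    """Return the list of gate failures; an empty list means mergeable."""
    failures = []
    if metrics["mutation_score_changed_files"] < 0.60:
        failures.append("mutation score below 60% on changed files")
    if not metrics["contracts_pass"]:
        failures.append("consumer contract verification failed")
    if metrics["archunit_violations"] > 0:
        failures.append("architecture rule violations")
    if metrics["new_flaky_tests"] > 0:
        failures.append("PR introduces new flaky tests")
    return failures
```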
Metrics dashboard
A quality dashboard surfaces the metrics that matter. Here is what to track and display:
| Metric | What it tells you | Target |
|---|---|---|
| Mutation score trend | Whether tests are getting better at catching real bugs | Rising toward 70%+ |
| TORS | Overall test suite reliability | >98% |
| Defect escape rate | Bugs reaching production that tests should have caught | Declining quarter over quarter |
| Test execution time | CI feedback speed | <10 min for PR suite |
| Flake rate | Percentage of test runs with at least one flaky failure | <2% |
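The per-run flake rate in the table can be computed as the fraction of CI runs containing at least one flaky failure; this sketch assumes each run is summarized as the set of tests that failed and then passed on retry.

```python
def flake_rate(runs):
    """runs: one set per CI run of test names that flaked (failed, then
    passed on retry). Returns the fraction of runs with any flake,
    which the dashboard targets at < 2%."""
    if not runs:
        return 0.0
    return sum(1 for flaked in runs if flaked) / len(runs)
```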
Tool selection
| Layer | Java / Kotlin | JavaScript / TypeScript | Python | .NET |
|---|---|---|---|---|
| L0 Architecture | ArchUnit | dependency-cruiser | import-linter | NetArchTest |
| L1 PBT | jqwik | fast-check | Hypothesis | FsCheck |
| L2 Mutation | PIT (pitest) | Stryker | mutmut | Stryker.NET |
| L3 Contracts | Pact JVM | Pact JS | Pact Python | Pact .NET |