CI integration
Each testing layer runs at a different point in the pipeline. The goal is fast feedback: cheap checks first, expensive checks later.
| Layer | Where it runs | Time | Trigger |
|---|---|---|---|
| L0 ArchUnit | Pre-push hook + CI | <10s | Every commit |
| L1 PBT | Unit test phase in CI | ~2s per property | Every commit |
| L2 Mutation | PR check (changed files only) | ~5 min | PR open/update |
| L3 Contracts | PR check + nightly full verification | ~10 min | PR + scheduled |
⚠️ Layer ordering matters
Run L0 and L1 first. They are deterministic and fast. If architecture rules or properties fail, there is no point running mutation analysis or contract verification. Fail fast, save CI minutes.
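The fail-fast ordering can be sketched as a small driver that runs each layer in cost order and stops at the first failure. The stage names and check callables below are illustrative placeholders, not real build commands; an actual pipeline would shell out to the relevant tools (ArchUnit, the property runner, Stryker/PIT, Pact).

```python
# Minimal sketch of fail-fast layer ordering. Checks are placeholder
# callables standing in for real build-tool invocations.
def run_pipeline(stages):
    """Run (name, check) pairs in cost order; stop at the first failure
    so expensive layers never run after a cheap layer has failed."""
    for name, check in stages:
        if not check():
            return f"failed at {name}"
    return "all stages passed"

# Cheap, deterministic layers first; expensive ones later.
stages = [
    ("L0 architecture", lambda: True),
    ("L1 properties", lambda: True),
    ("L2 mutation", lambda: True),
    ("L3 contracts", lambda: True),
]
```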
Test Impact Analysis
As your test suite grows, running everything on every PR becomes impractical. Test Impact Analysis (TIA) identifies which tests are affected by a change and runs only those.
- Spotify model — With 50K+ tests, Spotify runs only tests affected by the changed code paths. Their honeycomb testing model prioritizes integration tests with contract verification.
- Predictive test selection — Tools like Launchable use ML to predict which tests are most likely to fail for a given change, running the high-risk subset first for faster feedback.
- Parallelization strategies — Split test suites by module, by layer, or by estimated duration. Run L0+L1 in parallel with L2 on changed files. Run L3 contracts asynchronously.
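A naive form of test impact analysis can be sketched as an intersection between changed modules and each test's dependency set. The dependency map below is hand-written for illustration; real TIA tools derive it from per-test coverage data or static import analysis.

```python
# Hypothetical test-to-module dependency map; a real tool would
# compute this from coverage traces or the import graph.
TEST_DEPS = {
    "test_orders": {"orders", "pricing"},
    "test_users": {"users"},
    "test_checkout": {"orders", "payments"},
}

def affected_tests(changed_modules, deps=TEST_DEPS):
    """Select only the tests whose dependencies intersect the change."""
    changed = set(changed_modules)
    return sorted(t for t, mods in deps.items() if mods & changed)
```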
Flaky test management
Flaky tests are the #1 reason developers stop trusting CI. The goal is not zero flakiness (impossible at scale) but a managed budget.
Detection
Track per-test pass rates over a rolling window (e.g., last 100 runs). Any test with a pass rate below 98% gets flagged. Use the Test Observability and Reliability Score (TORS) metric to measure suite-level health.
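The rolling-window detection rule can be sketched directly: compute each test's pass rate over its most recent runs and flag anything under the threshold. The test names and history data here are made up for illustration.

```python
WINDOW = 100        # rolling window of recent runs, per the text
THRESHOLD = 0.98    # flag tests whose pass rate drops below 98%

def flag_flaky(history, window=WINDOW, threshold=THRESHOLD):
    """history maps test name -> list of booleans (True = pass), newest
    last. Returns (name, rate) pairs for tests below the threshold."""
    flagged = []
    for name, results in history.items():
        recent = results[-window:]
        rate = sum(recent) / len(recent)
        if rate < threshold:
            flagged.append((name, rate))
    return sorted(flagged)

# Illustrative history: one stable test, one that failed 3 of 100 runs.
history = {
    "stable_test": [True] * 100,
    "flaky_test": [True] * 97 + [False] * 3,
}
```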
Quarantine
Move flaky tests to a quarantine suite that runs but does not block the pipeline. Flaky tests still execute (so you see when they stabilize) but do not erode developer trust.
Ownership assignment
Every quarantined test gets an owner and a fix-by date. Unowned flaky tests accumulate indefinitely. Owned flaky tests get fixed or deleted.
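The quarantine and ownership policy can be sketched as a registry plus a gate check: quarantined failures are reported but do not block, and every registry entry carries an owner and a fix-by date. All names and dates here are hypothetical.

```python
from datetime import date

# Illustrative quarantine registry; every entry needs an owner and a
# fix-by date, per the policy above.
QUARANTINE = {
    "test_payment_retry": {"owner": "team-payments", "fix_by": date(2025, 7, 1)},
}

def gate_result(failures, quarantine=QUARANTINE):
    """Split failures into blocking (not quarantined) and quarantined.
    Quarantined failures are visible but do not fail the pipeline."""
    blocking = [t for t in failures if t not in quarantine]
    quarantined = [t for t in failures if t in quarantine]
    return {
        "blocking": blocking,
        "quarantined": quarantined,
        "pipeline_passes": not blocking,
    }
```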
💡 Flakiness budget
Target: <2% per-run flakiness. Google manages to this budget across millions of tests. It is achievable with discipline.
Quality gates
Quality gates are merge requirements that go beyond "tests pass." They measure whether your tests actually verify behavior.
- Mutation score threshold — Require a minimum mutation score (e.g., 60%) for changed files. Surviving-mutant reports highlight where tests are weakest.
- Contract verification — All consumer-driven contracts must pass before merge. No breaking API changes without consumer coordination.
- Architecture compliance — Zero ArchUnit violations on any PR. This is a hard gate, not a warning.
- Flakiness budget — PR cannot introduce new flaky tests. If a newly added test flakes in CI, the PR is blocked.
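The four gates above can be sketched as one merge-check function. The thresholds match the text; the shape of the input metrics dict is an assumption for illustration, not any real tool's API.

```python
# Sketch of the four quality gates as a single merge check.
def evaluate_gates(metrics):
    """Return the list of gate failures; an empty list means mergeable."""
    failures = []
    if metrics["mutation_score_changed_files"] < 0.60:
        failures.append("mutation score below 60% on changed files")
    if not metrics["contracts_pass"]:
        failures.append("consumer contract verification failed")
    if metrics["archunit_violations"] > 0:
        failures.append("architecture rule violations")
    if metrics["new_flaky_tests"] > 0:
        failures.append("PR introduces new flaky tests")
    return failures
```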
Metrics dashboard
A quality dashboard surfaces the metrics that matter. Here is what to track and display:
| Metric | What it tells you | Target |
|---|---|---|
| Mutation score trend | Whether tests are getting better at catching real bugs | Rising toward 70%+ |
| TORS | Overall test suite reliability | >98% |
| Defect escape rate | Bugs reaching production that tests should have caught | Declining quarter over quarter |
| Test execution time | CI feedback speed | <10 min for PR suite |
| Flake rate | Percentage of test runs with at least one flaky failure | <2% |
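The per-run flake rate in the table can be computed as the fraction of CI runs containing at least one flaky failure; this sketch assumes each run is summarized as the set of tests that failed and then passed on retry.

```python
def flake_rate(runs):
    """runs: one set per CI run of test names that flaked (failed, then
    passed on retry). Returns the fraction of runs with any flake,
    which the dashboard targets at < 2%."""
    if not runs:
        return 0.0
    return sum(1 for flaked in runs if flaked) / len(runs)
```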
Tool selection
| Layer | Java / Kotlin | JavaScript / TypeScript | Python | .NET |
|---|---|---|---|---|
| L0 Architecture | ArchUnit | dependency-cruiser | import-linter | NetArchTest |
| L1 PBT | jqwik | fast-check | Hypothesis | FsCheck |
| L2 Mutation | PIT (pitest) | Stryker | mutmut | Stryker.NET |
| L3 Contracts | Pact JVM | Pact JS | Pact Python | Pact .NET |