Back to Guide
GuideAdoption

Adopting AI Code Review

A staged path from pilot to org-wide policy — with the noise-control knobs that keep trust intact.

The most common adoption failure is a big-bang org-wide rollout with a hand-written prompt. This page describes the opposite: six steps, each a small, reversible config change, each with a measurable checkpoint. It is a distillation of the full practices document, which carries the evidence behind every claim here.

The staged path

1. Pilot one repo on tool defaults — zero config

VCR ships with a versioned defaults file: five review lenses (security, correctness, test-quality, and performance on by default; maintainability opt-in), 16 out-of-the-box pattern rules, a findings cap, and confidence thresholds. A repo with no .visdom.yaml gets exactly these defaults. The point is to observe baseline behavior on your code before you tune anything — tuning against a baseline you never measured is guesswork.

2. Make opt-in explicit

The first line of repo config is consent, not customization. The inverse also works: enabled: false skips the entire review before any layer instantiates — zero AI calls, zero findings, immediate exit. Teams that know they can turn the tool off in one line resist it less.

3. Treat review policy as code

.visdom.yaml lives in the repo, is versioned, and goes through code review like any other change. The merge semantics fit in one sentence: tool defaults merge with the repo file — the repo wins on scalars, rule lists concatenate, and inherited rules can be disabled by id. A platform team owns the defaults; each repo owns its deltas; git blame on the config answers "who decided this lens runs here, and when."

# .visdom.yaml
enabled: true           # consent first — set false to skip entirely, zero AI calls
lenses:                 # policy as code
  maintainability:
    enabled: true
    min_severity: medium
rules:
  - .visdom/rules/org.rules.yaml
standards:
  sources: ["docs/standards/*.md"]

4. Add org rules — pattern first, then llm

Both rule types share one YAML envelope, discriminated by type. Start with a pattern rule: it is mechanical, free, and has zero variance — the same input produces the same finding every run. Only after the team trusts that loop, add an llm rule for a policy that needs judgment — and give it explicit success criteria, failure criteria, and paired good/bad examples. An llm rule without failure criteria produces run-to-run variance. Scope org rules to org-specific policy and leave generic code quality to the built-in lenses; a rule that restates a lens produces doubled findings.

5. Inject standards you already have

standards.sources takes globs over existing repo documents — no forced duplication into a vendor format. Matched documents are injected into deep-review prompts, and findings cite them by path: in the controlled scenario, a broad-exception-catch finding quoted the org's own docs/standards/error-handling.md as the violated rule.

6. Watch the metrics before widening

The numbers to watch: findings per PR against a baseline, the noise rate developers actually report, and cost per PR. On a bulk run over 38 real PRs, the average was $0.078 per PR at roughly 22 seconds each. If the pilot's finding volume or noise rate looks worse than expected, fix the config on one repo — do not export the problem to fifty.

💡 The whole path in one pull request

The full adoption path was executed end-to-end on a real repository in a single PR — llama3-java-hat #50 — where the review produced 4 findings in 18.6 seconds for $0.0476, and the org regex rule, the org LLM rule, and a standards citation all fired in one run.

Controlling noise

The doctrine, stated once and applied everywhere: better to miss a LOW finding than erode trust with a false positive. Developer trust behaves like a threshold function — past a certain noise rate, developers stop reading the comments and eventually disable the tool. Noise compounds; trust does not recover. The constraint is expressed through four knobs, all in config rather than code:

Knob Default What it does
limits.max_findings 5 Ceiling on findings per lens; forces ranking over enumeration. The default (5) sits above built-in per-lens caps (2–3), so lowering it is the relevant use.
lenses.<name>.min_severity varies per lens Per-lens severity floor; performance ships at medium, maintainability at high.
limits.confidence_buckets high ≥ 0.8, medium ≥ 0.5 Findings below 0.5 are labeled low-confidence — "requires verification" — rather than silently dropped or silently trusted.
lenses.maintainability.enabled false The noisiest finding family is opt-in by design.

The low-confidence bucket exists because deleting uncertain findings hides information while presenting them as facts destroys trust; labeling them "requires verification" is the honest middle. And tuning is not free in either direction — raising a confidence threshold cuts noise but costs recall, a trade you have to measure to see. Precision-over-recall means choosing your point on the curve deliberately, not pretending the curve isn't there.

What not to do

Four anti-patterns, each measured the expensive way during VCR's own iteration history:

Go deeper