
For Platform Engineers

How to evaluate, pilot, and operate Visdom Code Review on your infrastructure.

Architecture in 60 seconds

Every pull request passes through layers of increasing depth and cost. A risk classifier at Layer 2 gates whether the expensive Layer 3 runs. The Proactive Scanner operates independently on a cron schedule.

| Layer | What it does | Time | AI? |
| --- | --- | --- | --- |
| Layer 0: Context Collection | Collects diff, metadata, coverage, file classifications, repo knowledge, test reliability data | <10s | No |
| Layer 1: Deterministic Gate | Linters, SAST, secret scan, coverage delta, TORS filtering. Cannot be prompt-injected. | <60s | No |
| Layer 2: AI Quick Scan | Fast AI pass over diff. Risk classification (LOW→CRITICAL). Max 5 quick findings. AI-code detection. | <2 min | Yes (Haiku-class) |
| Layer 3: AI Deep Review | Full analysis with repo context, history, conventions. Multiple review lenses in parallel. MEDIUM+ risk only. | <10 min | Yes (Sonnet/Opus-class) |
| Reporter | Aggregates all layers into structured PR comment, inline comments, GitHub Check, optional Slack. | <30s | No |
| Proactive Scanner | Cron-based repo analysis: coverage trends, tech debt, convention drift, security baseline. | Scheduled | Yes |
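
The gating described above can be sketched as a small function. This is a minimal illustration, not VCR's implementation: the `Risk` enum, the `plan_layers` name, and the stage labels are hypothetical, and the intermediate HIGH level is an assumption (the guide only names LOW, MEDIUM, and CRITICAL).

```python
from enum import IntEnum


class Risk(IntEnum):
    """Risk levels emitted by Layer 2. HIGH is an assumed intermediate level."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def plan_layers(risk: Risk) -> list[str]:
    """Return the PR-flow stages that run for a given Layer 2 risk level."""
    # Layers 0-2 run on every pull request.
    stages = ["layer0_context", "layer1_deterministic", "layer2_quick_scan"]
    # The expensive deep review is gated on MEDIUM risk or above.
    if risk >= Risk.MEDIUM:
        stages.append("layer3_deep_review")
    # The Reporter always aggregates whatever ran.
    stages.append("reporter")
    return stages
```

The Proactive Scanner is deliberately absent from this sketch because it runs on its own schedule rather than per pull request.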

📦 Full architecture reference

For the complete layer diagram, data flows, and output schemas, see the Architecture Reference.

What you need before starting

The following prerequisites are required for a pilot deployment. The first reference implementation targets GitHub; other platforms follow the same process with different adapters.

| Prerequisite | Details | Required? |
| --- | --- | --- |
| GitHub repository with PRs | The v1 reference implementation is GitHub-only. Active PR flow needed for meaningful pilot data. | Yes |
| CI pipeline with test coverage reports | VCR reads coverage deltas to assess risk. Any format supported by your coverage tool. | Yes |
| AI API key | Anthropic (default). OpenAI and Azure OpenAI are configurable alternatives. | Yes |
| 30 days of test history | Required for TORS (Test Oracle Reliability Score) bootstrap. Without it, start with TORS disabled and build up data over the first month. | Recommended |
| .visdom.yaml in repo root | Repo-level config file. Deep-merged over tool defaults; lists concatenate. Only the knobs you set differ from defaults. | Yes (created during setup) |

✅ No test history?

You can start without TORS and build up reliability data during the pilot. Set layer1.tors.enabled: false in your config, then enable it after 30 days of CI data have been collected.

Running a pilot

A pilot typically runs on 1-2 teams over 4-6 weeks. The steps below assume GitHub Actions as the CI platform.

Step 1: Install with zero config

Add the GitHub Actions workflow to your repository. On day one, run with tool defaults; no .visdom.yaml is required. The tool ships opinionated defaults; you only override what you need to change.

Reference cost at defaults: $0.078/PR, ~22 s/PR on a representative sample of 38 PRs. Layer 3 ran on 27 of those 38 PRs; the triage gate skipped it for the remaining 11, which kept spend down.

Step 2: Opt the repo in explicitly

Create .visdom.yaml at the repo root and set enabled: true. This is the explicit opt-in; without it the tool operates in observation mode only.

# .visdom.yaml: minimal opt-in
enabled: true

From here, every key you add deep-merges over tool defaults. Lists (e.g. ignore.paths) concatenate; scalar values override.
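
The merge rule can be illustrated with a short sketch. The `deep_merge` function is a hypothetical stand-in for whatever the tool does internally, and the default values shown (`enabled: False`, the `node_modules/**` ignore path) are assumptions made up for the example; only `max_findings: 5` is a default stated in this guide.

```python
from typing import Any


def deep_merge(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Merge repo config over tool defaults without mutating either input."""
    merged = dict(defaults)
    for key, value in overrides.items():
        base = merged.get(key)
        if isinstance(base, dict) and isinstance(value, dict):
            # Nested mappings merge recursively.
            merged[key] = deep_merge(base, value)
        elif isinstance(base, list) and isinstance(value, list):
            # Lists concatenate: defaults first, then repo entries.
            merged[key] = base + value
        else:
            # Scalars (and mismatched types) are overridden by the repo value.
            merged[key] = value
    return merged


# Hypothetical tool defaults, for illustration only.
tool_defaults = {
    "enabled": False,
    "ignore": {"paths": ["node_modules/**"]},
    "limits": {"max_findings": 5},
}
# What a repo's .visdom.yaml might contain once parsed.
repo_config = {
    "enabled": True,
    "ignore": {"paths": ["generated/**"]},
}

effective = deep_merge(tool_defaults, repo_config)
```

In this sketch `effective` keeps the untouched `limits.max_findings`, flips `enabled`, and ends up with both ignore paths, which is why a repo config only needs to declare what differs.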

Step 3: Understand automatic path classification

File risk classification is an engine heuristic; there is no user-configurable classification map in .visdom.yaml. The engine classifies files automatically: paths matching /auth/, /middleware/, security, or crypto patterns are classified as critical; test files (.test.*, .spec.*, test/) and config files (.env, *.config.ts, *.json) are classified by name or extension. Everything else is standard. Two user-facing knobs interact with classification: ignore.paths (excludes paths from all review layers entirely) and per-lens min_severity (raises the bar for what gets reported).
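
A rough approximation of this heuristic, useful for predicting how your own paths would be classified, might look like the following. The function name, the exact regular expression, and the precedence order (critical before test before config) are assumptions; the engine's real rules may differ.

```python
import re
from pathlib import PurePosixPath

# Assumed reading of the critical patterns listed in this guide.
CRITICAL_PATTERN = re.compile(r"/(auth|middleware)/|security|crypto", re.IGNORECASE)


def classify_path(path: str) -> str:
    """Return 'critical', 'test', 'config', or 'standard' for a repo-relative path."""
    # Normalize so directory patterns such as /auth/ also match at the repo root.
    normalized = "/" + path.lstrip("/")
    name = PurePosixPath(normalized).name
    if CRITICAL_PATTERN.search(normalized):
        return "critical"
    if ".test." in name or ".spec." in name or "/test/" in normalized:
        return "test"
    if name == ".env" or name.endswith(".config.ts") or name.endswith(".json"):
        return "config"
    return "standard"
```

Because critical patterns are checked first in this sketch, a file such as `src/auth/login.test.ts` would be treated as critical rather than as a test file; whether the engine resolves that overlap the same way is not stated here.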

Step 4: Start conservative, Layer 2 only

For the first week, run Layer 2 (AI Quick Scan) only. Observe findings, check false positive rates, and calibrate risk classification against your team's expectations. Layer 3 stays disabled.

Step 5: Enable Layer 3

After the first week, enable Layer 3 for MEDIUM-risk and above. Monitor finding quality, acceptance rates, and cost. Tune risk thresholds based on actual data.

Step 6: Add custom rules

If your domain has specific review needs (compliance, regulatory, domain-specific patterns), add rule files under .visdom/rules/. Each file follows the unified *.rules.yaml schema (pattern-match or LLM-judge).

Step 7: Enable the Proactive Scanner

Set up a weekly cron job for convention drift detection, coverage trends, and security baseline scanning. This runs independently of the PR flow and creates GitHub Issues for critical findings.

Key configuration decisions

All knobs live in .visdom.yaml. Every key deep-merges over tool defaults; you only need to declare what differs. The following decisions have the most impact on effectiveness and cost.

| Knob (.visdom.yaml) | What it controls | Guidance |
| --- | --- | --- |
| lenses.<name>.enabled | Which review lenses run. Five lenses ship: security, correctness, test-quality, performance, maintainability. | The first four are on by default. Maintainability is opt-in: it raises more findings, precision-over-recall. Enable it when the team is ready for that volume. |
| lenses.<name>.min_severity | Minimum severity threshold per lens before a finding is reported. | Start at medium for all lenses. Lower to low for security and correctness once false-positive rates are understood. |
| limits.max_findings | A ceiling that can only lower built-in per-lens caps, never raise them. Each built-in lens has its own default cap (2–3 findings per PR); the config value (default 5) sits above all of them and has no effect at default. | Lower it to throttle a noisy pilot; the default has no effect on built-in lenses. |
| disable_rules | List of rule IDs to suppress globally. Applies to OOTB rules that do not fit your codebase. | Prefer per-finding dismissal first. Add to disable_rules only for rules that are structurally wrong for your stack (e.g., a correctness rule that conflicts with your framework's idiom). |
| standards.sources | Glob patterns over existing repo docs that the AI reads as standards context (800-line cap per source). | Point at your existing ADRs, style guides, or API conventions. The tool reads files already in your repo, so no duplication is needed. |
| instructions | Free-text reviewer steering appended to every AI prompt. | Use sparingly for team-specific context the AI consistently misses. Example: "This codebase uses event-sourcing; mutations outside aggregate roots are intentional." |
| ignore.paths | Path globs excluded from all review layers. Concatenates with tool defaults. | Add generated code directories, vendored third-party code, and migration files. |
| confidence_buckets | Thresholds separating high / medium / low confidence bands (high: 0.8, medium: 0.5 by default). | Adjust if your team finds the default banding too aggressive or too permissive. |
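
Two of these knobs are easier to reason about as arithmetic. The sketch below shows why limits.max_findings can only lower a lens cap and how confidence_buckets thresholds would band a score. Both function names are hypothetical, and treating the thresholds as inclusive lower bounds is an assumption.

```python
def effective_cap(lens_cap: int, max_findings: int = 5) -> int:
    """The config ceiling can lower a built-in lens cap but never raise it."""
    return min(lens_cap, max_findings)


def confidence_band(score: float, high: float = 0.8, medium: float = 0.5) -> str:
    """Map a confidence score to a band using the default thresholds from this guide."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"
```

With a built-in cap of 3, the default ceiling of 5 leaves the cap at 3, a ceiling of 2 lowers it to 2, and a ceiling of 10 still leaves it at 3.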

📦 Full configuration reference

For the complete .visdom.yaml schema and all configurable options, see the Configuration Reference.

Metrics to set up

Track these metrics from day one of the pilot. They provide the minimum signal needed to evaluate whether the tool is working and where to tune.

| Metric | Why it matters for a pilot | Target |
| --- | --- | --- |
| Time to first comment | Measures whether developers get feedback before context-switching. The primary developer experience metric. | <5 min (Layer 2 only), <15 min (Layer 2 + Layer 3) |
| Finding acceptance rate | Are findings useful? Low acceptance means prompts or risk classification need tuning. | >60% |
| Layer 3 trigger rate | What percentage of PRs trigger the expensive deep review? Too low means you are missing risk. Too high means you are overspending. | 30–50% of PRs |
| Cost per PR | Total AI cost per pull request across all layers. Validates budget assumptions. Reference run: $0.078/PR across 38 PRs. | $0.05–2.00 depending on risk level |
| Per-finding rule attribution | Every finding in the per-PR result JSON carries its rule ID and lens. Use this to spot which rules generate the most noise; these are candidates for disable_rules. | Available from day one in result artifacts |
| Cross-layer confirmation rate | Findings flagged by more than one layer independently (e.g., Layer 1 SAST + Layer 3 correctness). A higher rate signals real issues. Reference run: 7 of 154 findings confirmed cross-layer. | Use as a trust signal; no hard target |
| TORS | Test Oracle Reliability Score: what percentage of test failures are real. If TORS is low, your agents and developers are wasting time on flaky tests. | >85% |
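
Most of these metrics are simple ratios over counts you can pull from result artifacts and CI history. The helper names below are hypothetical, and the snippet is only a sketch of the arithmetic, applied to the reference-run figures quoted in this guide.

```python
def rate(part: float, whole: float) -> float:
    """Return part / whole, or 0.0 when there is nothing to divide by."""
    return part / whole if whole else 0.0


def acceptance_rate(accepted: int, total_findings: int) -> float:
    """Share of findings that developers accepted."""
    return rate(accepted, total_findings)


def tors(real_failures: int, total_failures: int) -> float:
    """Share of test failures that were real rather than flaky."""
    return rate(real_failures, total_failures)


def in_band(value: float, low: float, high: float) -> bool:
    """True when value falls inside an inclusive target band."""
    return low <= value <= high


# Figures quoted in this guide's reference run.
layer3_trigger_rate = rate(27, 38)   # Layer 3 ran on 27 of 38 PRs, about 0.71
cross_layer_rate = rate(7, 154)      # 7 of 154 findings, about 0.045
total_cost_usd = 0.078 * 38          # $0.078/PR across 38 PRs

# Compare against the 30-50% Layer 3 trigger-rate target.
trigger_rate_on_target = in_band(layer3_trigger_rate, 0.30, 0.50)
```

On the reference-run figures the trigger rate lands above the target band, which is the kind of signal that would prompt a look at risk thresholds during a pilot.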

Pilot setup checklist

Two setup actions to complete in the first week, each with a concrete observable outcome:

📦 Full metrics framework

For the complete per-layer metrics, end-to-end SDLC integration, and feedback mechanism, see the Metrics Framework reference.

Known risks

The following risks are inherent to any AI-assisted review system. VCR mitigates each through its layered architecture, but you should be aware of them when evaluating the system.

| Risk | Mitigation in VCR |
| --- | --- |
| LLM hallucination (false findings) | Layer 1 is fully deterministic. Layer 2 has confidence thresholds. Layer 3 findings require concrete file/line references. |
| Prompt injection via PR content | Layer 1 cannot be injected (no AI). AI layers use structured prompts with diff isolation. |
| Over-reliance on AI review | VCR explicitly directs human reviewers to focus areas. It supplements, not replaces. |
| Cost runaway on high-risk PRs | Daily budget caps. Risk-based gating. Layer 3 only triggers for MEDIUM+ risk. |
| Circular Test Trap (AI tests verify AI code) | Layer 2 detects AI-generated code. Layer 3 Test Quality lens identifies circular tests. |
| Flaky test noise (Lying Oracle) | TORS filters unreliable tests from feedback signal. Agents do not iterate on flaky failures. |
| Convention drift across teams | Proactive Scanner detects diverging patterns weekly. Convention enforcement in the PR flow uses org-defined rules (type: llm or type: pattern in .visdom/rules/) combined with instructions steering; there is no separate Conventions lens in v8. |
| Model degradation over time | Feedback mechanism (developer reactions) detects declining finding quality. Model selection is configurable. |
| LLM-judge variance | Identical findings can be scored differently across runs due to LLM non-determinism. Do not compare F1 or precision metrics across different judges, model versions, or evaluation scopes; the results are not comparable. See the practices doc for measurement guidance. |

📦 Detailed risk analysis

For the full risk analysis including mitigation strategies and monitoring guidance, see the AI Quick Scan layer reference.

Reference implementations

VCR is a process framework. VirtusLab provides reference implementations for each component, but every piece is substitutable with equivalent tooling that your organization already operates.

| Component | Reference implementation | Alternatives |
| --- | --- | --- |
| Repository knowledge layer | Context Fabric (VirtusLab, MIT) | Sourcegraph, custom DuckDB/SQLite, GitHub CODEOWNERS + scripts |
| CI infrastructure | Visdom Machine-Speed CI | Bazel + EngFlow, Nx, Turborepo, Gradle remote cache |
| SAST | Semgrep (open source) | CodeQL, SonarQube, Snyk Code |
| Secret scanning | gitleaks (open source) | truffleHog, GitHub secret scanning |
| AI provider | Anthropic (Claude Haiku/Sonnet/Opus) | OpenAI GPT-4o, Azure OpenAI, Google Gemini |
| CI/CD platform | GitHub Actions | GitLab CI, Azure Pipelines, Jenkins |

📦 Full reference implementations

For detailed component descriptions and integration guidance, see the Reference Implementations page.