For Platform Engineers | Visdom Code Review

Architecture in 60 seconds

Every pull request passes through layers of increasing depth and cost. A risk classifier at Layer 2 gates whether the expensive Layer 3 runs. The Proactive Scanner operates independently on a cron schedule.

Layer	What it does	Time	AI?
Layer 0: Context Collection	Collects diff, metadata, coverage, file classifications, repo knowledge, test reliability data	<10s	No
Layer 1: Deterministic Gate	Linters, SAST, secret scan, coverage delta, TORS filtering. Cannot be prompt-injected.	<60s	No
Layer 2: AI Quick Scan	Fast AI pass over diff. Risk classification (LOW→CRITICAL). Max 5 quick findings. AI-code detection.	<2 min	Yes (Haiku-class)
Layer 3: AI Deep Review	Full analysis with repo context, history, conventions. Multiple review lenses in parallel. MEDIUM+ risk only.	<10 min	Yes (Sonnet/Opus-class)
Reporter	Aggregates all layers into structured PR comment, inline comments, GitHub Check, optional Slack.	<30s	No
Proactive Scanner	Cron-based repo analysis: coverage trends, tech debt, convention drift, security baseline.	Scheduled	Yes
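
The gate between Layer 2 and Layer 3 can be sketched in a few lines of Python. This is an illustration only, not the engine's code: the `Risk` enum and `should_run_layer3` are hypothetical names, and the `HIGH` level is an assumption (the table above names only LOW, MEDIUM, and CRITICAL).

```python
from enum import IntEnum


class Risk(IntEnum):
    """Risk levels ordered from cheapest to most expensive to review."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2       # assumed intermediate level, not named in the table
    CRITICAL = 3


def should_run_layer3(risk: Risk, threshold: Risk = Risk.MEDIUM) -> bool:
    """Return True when the Layer 2 classification is at or above the threshold."""
    return risk >= threshold
```

With the default threshold, a LOW-risk PR stops after Layer 2, while MEDIUM and above proceed to the deep review.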

📦 Full architecture reference

For the complete layer diagram, data flows, and output schemas, see the Architecture Reference.

What you need before starting

The following prerequisites apply to a pilot deployment; the table marks which are required and which are recommended. The first reference implementation targets GitHub; other platforms follow the same process with different adapters.

Prerequisite	Details	Required?
GitHub repository with PRs	The v1 reference implementation is GitHub-only. Active PR flow needed for meaningful pilot data.	Yes
CI pipeline with test coverage reports	Visdom Code Review (VCR) reads coverage deltas to assess risk. Any format supported by your coverage tool.	Yes
AI API key	Anthropic (default). OpenAI and Azure OpenAI are configurable alternatives.	Yes
30 days of test history	Needed to bootstrap TORS (Test Oracle Reliability Score). Without it, start with TORS disabled and build up data over the first month.	Recommended
`.visdom.yaml` in repo root	Repo-level config file. Deep-merged over tool defaults; lists concatenate. Only the knobs you set differ from defaults.	Yes (created during setup)

✅ No test history?

You can start without TORS and build up reliability data during the pilot. Set layer1.tors.enabled to false in your config, then enable it once 30 days of CI data have been collected.

Running a pilot

A pilot typically runs on 1–2 teams over 4–6 weeks. The steps below assume GitHub Actions as the CI platform.

Step 1: Install with zero config

Add the GitHub Actions workflow to your repository. On day one, run with tool defaults — no .visdom.yaml required. The tool ships opinionated defaults; you only override what you need to change.

The workflow file lives at .github/workflows/visdom-review.yaml and triggers on PR open and update.

Reference cost at defaults: $0.078/PR, ~22 s/PR on a representative sample of 38 PRs. Layer 3 ran on 27 of those 38 PRs; the triage gate skipped it on the other 11, which is where the spend was saved.
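
To turn the reference figures into a rough budget, the arithmetic can be sketched as below. The per-PR cost and sample size come from the reference run above; the monthly volume of 500 PRs is a made-up number for illustration, and `projected_cost` is a hypothetical helper. Your own cost per PR will depend on how often Layer 3 triggers.

```python
REFERENCE_COST_PER_PR = 0.078   # USD, from the reference run
REFERENCE_SAMPLE_SIZE = 38      # PRs in the reference run


def projected_cost(prs: int, cost_per_pr: float = REFERENCE_COST_PER_PR) -> float:
    """Return the projected AI spend in USD, rounded to cents."""
    return round(prs * cost_per_pr, 2)


# Total spend across the reference sample.
sample_total = projected_cost(REFERENCE_SAMPLE_SIZE)

# Hypothetical monthly volume; replace 500 with your own PR count.
monthly_estimate = projected_cost(500)
```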

Step 2: Opt the repo in explicitly

Create .visdom.yaml at the repo root and set enabled: true. This is the explicit opt-in; without it the tool operates in observation mode only.

# .visdom.yaml — minimal opt-in
enabled: true

From here, every key you add deep-merges over tool defaults. Lists (e.g. ignore.paths) concatenate; scalar values override.
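
The merge rule can be sketched as follows. This is a minimal illustration of the stated semantics (nested maps merge, lists concatenate, scalars override), not the tool's actual loader; `deep_merge` is a hypothetical name and the values in `tool_defaults` are placeholders apart from the documented `limits.max_findings` default of 5.

```python
from typing import Any


def deep_merge(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Merge overrides onto defaults without mutating either input."""
    merged = dict(defaults)
    for key, value in overrides.items():
        base = merged.get(key)
        if isinstance(base, dict) and isinstance(value, dict):
            merged[key] = deep_merge(base, value)   # nested maps merge recursively
        elif isinstance(base, list) and isinstance(value, list):
            merged[key] = base + value              # lists concatenate
        else:
            merged[key] = value                     # scalars (and new keys) override
    return merged


# Placeholder defaults for illustration only.
tool_defaults = {
    "enabled": False,
    "ignore": {"paths": ["node_modules/**"]},
    "limits": {"max_findings": 5},
}
repo_config = {"enabled": True, "ignore": {"paths": ["vendor/**"]}}

effective = deep_merge(tool_defaults, repo_config)
```

Here `effective` keeps `limits.max_findings` from the defaults, flips `enabled` to true, and ends up with both entries in `ignore.paths`.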

Step 3: Understand automatic path classification

File risk classification is an engine heuristic; there is no user-configurable classification map in .visdom.yaml. The engine classifies files automatically: paths matching /auth/, /middleware/, security, or crypto patterns become critical; test files (.test.*, .spec.*, test/) and config files (.env, *.config.ts, *.json) are identified by file name, extension, or directory. Everything else is standard. The user-facing knobs that interact with classification are ignore.paths (excludes paths from all review layers entirely) and per-lens min_severity (raises the bar for what gets reported).
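
As a rough mental model, the heuristic behaves like the sketch below. The regular expressions, the `classify` function, and the precedence (critical before test before config) are all assumptions made for illustration; the engine's real patterns and ordering may differ.

```python
import re
from pathlib import PurePosixPath

# Assumed patterns, derived from the examples in the paragraph above.
CRITICAL = re.compile(r"/auth/|/middleware/|security|crypto")
TEST = re.compile(r"\.(test|spec)\.[^/]+$|(^|/)test/")


def classify(path: str) -> str:
    """Classify a repo-relative path as critical, test, config, or standard."""
    normalized = "/" + path.lstrip("/")        # so top-level dirs match "/auth/"
    name = PurePosixPath(normalized).name
    if CRITICAL.search(normalized):
        return "critical"
    if TEST.search(normalized):
        return "test"
    if name == ".env" or name.endswith(".config.ts") or name.endswith(".json"):
        return "config"
    return "standard"
```

Under these assumptions `classify("src/auth/login.test.ts")` returns `"critical"`, because the critical check runs first.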

Step 4: Start conservative, Layer 2 only

For the first week, run Layer 2 (AI Quick Scan) only. Observe findings, check false positive rates, and calibrate risk classification against your team's expectations. Layer 3 stays disabled.

Step 5: Enable Layer 3

After the first week, enable Layer 3 for MEDIUM-risk and above. Monitor finding quality, acceptance rates, and cost. Tune risk thresholds based on actual data.

Step 6: Add custom rules

If your domain has specific review needs (compliance, regulatory, domain-specific patterns), add rule files under .visdom/rules/. Each file follows the unified *.rules.yaml schema (pattern-match or LLM-judge).

Step 7: Enable the Proactive Scanner

Set up a weekly cron job for convention drift detection, coverage trends, and security baseline scanning. This runs independently of the PR flow and creates GitHub Issues for critical findings.

Key configuration decisions

All knobs live in .visdom.yaml. Every key deep-merges over tool defaults; you only need to declare what differs. The following decisions have the most impact on effectiveness and cost.

Knob (`.visdom.yaml`)	What it controls	Guidance
`lenses.<name>.enabled`	Which review lenses run. Five lenses ship: security, correctness, test-quality, performance, maintainability.	The first four are on by default. Maintainability is opt-in because it raises more findings, whereas the defaults favor precision over recall. Enable it when the team is ready for that volume.
`lenses.<name>.min_severity`	Minimum severity threshold per lens before a finding is reported.	Start at `medium` for all lenses. Lower to `low` for security and correctness once false-positive rates are understood.
`limits.max_findings`	A ceiling on the built-in per-lens caps: it can only lower them, never raise them. Each built-in lens has its own default cap (2–3 findings per PR), and the config default of 5 sits above all of them.	Lower it to throttle a noisy pilot. At the default value it has no effect on built-in lenses.
`disable_rules`	List of rule IDs to suppress globally. Intended for out-of-the-box (OOTB) rules that do not fit your codebase.	Prefer per-finding dismissal first. Add to `disable_rules` only for rules that are structurally wrong for your stack (e.g., a correctness rule that conflicts with your framework's idiom).
`standards.sources`	Glob patterns over existing repo docs that the AI reads as standards context (800-line cap per source).	Point at your existing ADRs, style guides, or API conventions. The tool reads files already in your repo — no duplication needed.
`instructions`	Free-text reviewer steering appended to every AI prompt.	Use sparingly for team-specific context the AI consistently misses. Example: "This codebase uses event-sourcing; mutations outside aggregate roots are intentional."
`ignore.paths`	Path globs excluded from all review layers. Concatenates with tool defaults.	Add generated code directories, vendored third-party code, and migration files.
`confidence_buckets`	Thresholds separating high / medium / low confidence bands (`high: 0.8`, `medium: 0.5` by default).	Adjust if your team finds the default banding too aggressive or too permissive.
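
Two of the knobs above are easiest to reason about as small functions. The sketch below is illustrative only: `effective_cap` and `confidence_band` are hypothetical names, and treating the thresholds as inclusive lower bounds is an assumption. The defaults (5, 0.8, 0.5) are the ones listed in the table.

```python
def effective_cap(lens_cap: int, max_findings: int = 5) -> int:
    """limits.max_findings can lower a lens's built-in cap but never raise it."""
    return min(lens_cap, max_findings)


def confidence_band(score: float, high: float = 0.8, medium: float = 0.5) -> str:
    """Map a confidence score to a band using confidence_buckets thresholds."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"
```

For a lens with a built-in cap of 3, `effective_cap(3)` stays at 3 under the default, while `effective_cap(3, max_findings=2)` drops to 2. Setting `max_findings` to 10 would still leave the cap at 3.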

📦 Full configuration reference

For the complete .visdom.yaml schema and all configurable options, see the Configuration Reference.

Metrics to set up

Track these metrics from day one of the pilot. They provide the minimum signal needed to evaluate whether the tool is working and where to tune.

Metric	Why it matters for a pilot	Target
Time to first comment	Measures whether developers get feedback before context-switching. The primary developer experience metric.	<5 min (Layer 2 only), <15 min (Layer 2 + Layer 3)
Finding acceptance rate	Are findings useful? Low acceptance means prompts or risk classification need tuning.	>60%
Layer 3 trigger rate	What percentage of PRs trigger the expensive deep review? Too low means you are missing risk. Too high means you are overspending.	30–50% of PRs
Cost per PR	Total AI cost per pull request across all layers. Validates budget assumptions. Reference run: $0.078/PR across 38 PRs.	$0.05–2.00 depending on risk level
Per-finding rule attribution	Every finding in the per-PR result JSON carries its rule ID and lens. Use this to spot which rules generate the most noise — candidates for `disable_rules`.	Available from day one in result artifacts
Cross-layer confirmation rate	Findings flagged by more than one layer independently (e.g., Layer 1 SAST + Layer 3 correctness). A higher rate signals real issues. Reference run: 7 of 154 findings confirmed cross-layer.	Use as a trust signal; no hard target
TORS	Test Oracle Reliability Score: what percentage of test failures are real. If TORS is low, your agents and developers are wasting time on flaky tests.	>85%
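
Most of these metrics are simple ratios. The sketch below recomputes two of them from the reference-run counts quoted in this guide and shows TORS as the share of test failures that are real. The helper names are hypothetical, and the TORS counts (90 real failures out of 100) are invented purely for illustration.

```python
def rate(part: int, whole: int) -> float:
    """Return part/whole as a fraction, or 0.0 when there is no data yet."""
    return part / whole if whole else 0.0


def within(value: float, low: float, high: float) -> bool:
    """Check whether a metric falls inside an inclusive target band."""
    return low <= value <= high


# Reference-run counts quoted in this guide.
layer3_trigger_rate = rate(27, 38)     # Layer 3 ran on 27 of 38 PRs
cross_layer_rate = rate(7, 154)        # 7 of 154 findings confirmed cross-layer

# Compare against the 30-50% trigger-rate target from the table.
trigger_rate_on_target = within(layer3_trigger_rate, 0.30, 0.50)

# TORS with invented counts: 90 real failures out of 100 observed failures.
tors = rate(90, 100)
tors_meets_target = tors > 0.85
```

On the reference numbers the trigger rate works out to roughly 71%, which sits above the 30–50% band, and the cross-layer confirmation rate to roughly 4.5%.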

Pilot setup checklist

Two setup actions to complete in the first week — each has a concrete observable outcome:

Wire SARIF into code-scanning. Run with --format=sarif to emit SARIF 2.1.0 output; feed it into GitHub Advanced Security or your code-scanning dashboard. Observable outcome: findings trend visible in the security tab after a week of PRs.
Establish a bench baseline. Run npm run demo:bench against a curated set of PRs where findings are known. Record the F1 score. Observable outcome: you have a regression baseline to compare against after model upgrades.
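
For reference when recording the baseline, F1 is the harmonic mean of precision and recall. The sketch below uses the standard definition; it is not taken from the bench script, `f1_score` is a hypothetical helper, and the counts in the usage note are invented.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1 from raw counts; returns 0.0 when nothing was found correctly."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)   # share of reported findings that are real
    recall = true_pos / (true_pos + false_neg)      # share of known findings that were reported
    return 2 * precision * recall / (precision + recall)
```

For example, a run that reports 8 known findings correctly, raises 2 false findings, and misses 2 known findings has precision 0.8, recall 0.8, and therefore F1 of 0.8.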

📦 Full metrics framework

For the complete per-layer metrics, end-to-end SDLC integration, and feedback mechanism, see the Metrics Framework reference.

Known risks

The following risks are inherent to any AI-assisted review system. VCR mitigates each through its layered architecture, but you should be aware of them when evaluating the system.

Risk	Mitigation in VCR
LLM hallucination (false findings)	Layer 1 is fully deterministic. Layer 2 has confidence thresholds. Layer 3 findings require concrete file/line references.
Prompt injection via PR content	Layer 1 cannot be injected (no AI). AI layers use structured prompts with diff isolation.
Over-reliance on AI review	VCR explicitly directs human reviewers to focus areas. It supplements human review rather than replacing it.
Cost runaway on high-risk PRs	Daily budget caps. Risk-based gating. Layer 3 only triggers for MEDIUM+ risk.
Circular Test Trap (AI tests verify AI code)	Layer 2 detects AI-generated code. Layer 3 Test Quality lens identifies circular tests.
Flaky test noise (Lying Oracle)	TORS filters unreliable tests from feedback signal. Agents do not iterate on flaky failures.
Convention drift across teams	Proactive Scanner detects diverging patterns weekly. Convention enforcement in the PR flow uses org-defined rules (`type: llm` or `type: pattern` in `.visdom/rules/`) combined with `instructions` steering — there is no separate Conventions lens in v8.
Model degradation over time	Feedback mechanism (developer reactions) detects declining finding quality. Model selection is configurable.
LLM-judge variance	Identical findings can be scored differently across runs due to LLM non-determinism. F1 and precision metrics are not comparable across different judges, model versions, or evaluation scopes, so do not compare them. See the practices doc for measurement guidance.

📦 Detailed risk analysis

For the full risk analysis including mitigation strategies and monitoring guidance, see the AI Quick Scan layer reference.

Reference implementations

VCR is a process framework. VirtusLab provides reference implementations for each component, but every piece is substitutable with equivalent tooling that your organization already operates.

Component	Reference implementation	Alternatives
Repository knowledge layer	Context Fabric (VirtusLab, MIT)	Sourcegraph, custom DuckDB/SQLite, GitHub CODEOWNERS + scripts
CI infrastructure	Visdom Machine-Speed CI	Bazel + EngFlow, Nx, Turborepo, Gradle remote cache
SAST	Semgrep (open source)	CodeQL, SonarQube, Snyk Code
Secret scanning	gitleaks (open source)	truffleHog, GitHub secret scanning
AI provider	Anthropic (Claude Haiku/Sonnet/Opus)	OpenAI GPT-4o, Azure OpenAI, Google Gemini
CI/CD platform	GitHub Actions	GitLab CI, Azure Pipelines, Jenkins

📦 Full reference implementations

For detailed component descriptions and integration guidance, see the Reference Implementations page.