Architecture in 60 seconds
Every pull request passes through layers of increasing depth and cost. A risk classifier at Layer 2 gates whether the expensive Layer 3 runs. The Proactive Scanner operates independently on a cron schedule.
| Layer | What it does | Time | AI? |
|---|---|---|---|
| Layer 0: Context Collection | Collects diff, metadata, coverage, file classifications, repo knowledge, test reliability data | <10s | No |
| Layer 1: Deterministic Gate | Linters, SAST, secret scan, coverage delta, TORS filtering. Cannot be prompt-injected. | <60s | No |
| Layer 2: AI Quick Scan | Fast AI pass over diff. Risk classification (LOW→CRITICAL). Max 5 quick findings. AI-code detection. | <2 min | Yes (Haiku-class) |
| Layer 3: AI Deep Review | Full analysis with repo context, history, conventions. Multiple review lenses in parallel. MEDIUM+ risk only. | <10 min | Yes (Sonnet/Opus-class) |
| Reporter | Aggregates all layers into structured PR comment, inline comments, GitHub Check, optional Slack. | <30s | No |
| Proactive Scanner | Cron-based repo analysis: coverage trends, tech debt, convention drift, security baseline. | Scheduled | Yes |
Full architecture reference
For the complete layer diagram, data flows, and output schemas, see the Architecture Reference.
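The gating described above can be sketched in a few lines. This is an illustrative model only, not the engine's code: the layer names are hypothetical labels, and the `HIGH` level is an assumption (the table only names LOW, MEDIUM, and CRITICAL explicitly).

```python
from enum import IntEnum


class Risk(IntEnum):
    """Risk levels in ascending order. HIGH is assumed; the guide names LOW, MEDIUM, CRITICAL."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def layers_to_run(risk: Risk, layer3_threshold: Risk = Risk.MEDIUM) -> list[str]:
    """Return the ordered layers a PR passes through for the risk Layer 2 assigned."""
    # Layers 0-2 always run; Layer 2 is what produces the risk level.
    layers = ["layer0_context", "layer1_deterministic", "layer2_quick_scan"]
    # Layer 3 is the expensive step, so it only runs at or above the threshold.
    if risk >= layer3_threshold:
        layers.append("layer3_deep_review")
    # The Reporter aggregates whatever layers actually ran.
    layers.append("reporter")
    return layers


print(layers_to_run(Risk.LOW))
print(layers_to_run(Risk.CRITICAL))
```

A LOW-risk PR skips the deep review entirely, which is how the classifier keeps spend proportional to risk.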
What you need before starting
The following prerequisites are required for a pilot deployment. The first reference implementation targets GitHub; other platforms follow the same process with different adapters.
| Prerequisite | Details | Required? |
|---|---|---|
| GitHub repository with PRs | The v1 reference implementation is GitHub-only. Active PR flow needed for meaningful pilot data. | Yes |
| CI pipeline with test coverage reports | VCR reads coverage deltas to assess risk. Any format supported by your coverage tool. | Yes |
| AI API key | Anthropic (default). OpenAI and Azure OpenAI are configurable alternatives. | Yes |
| 30 days of test history | Required for TORS (Test Oracle Reliability Score) bootstrap. Without it, start with TORS disabled and build up data over the first month. | Recommended |
| .visdom.yaml in repo root | Repo-level config file. Deep-merged over tool defaults; lists concatenate. Only the knobs you set differ from defaults. | Yes (created during setup) |
No test history?
You can start without TORS and build up reliability data during the pilot. Set
layer1.tors.enabled: false in your config, then enable it after 30 days of CI data
have been collected.
Running a pilot
A pilot typically runs on 1-2 teams over 4-6 weeks. The steps below assume GitHub Actions as the CI platform.
Step 1: Install with zero config
Add the GitHub Actions workflow to your repository. On day one, run with tool defaults: no
.visdom.yaml is required. The tool ships opinionated defaults; you only override what
you need to change.
.github/workflows/visdom-review.yaml, triggers on PR open/update
Reference cost at defaults: $0.078/PR, ~22 s/PR on a representative sample of 38 PRs. Layer 3 ran on 27 of those 38 PRs; the triage gate skipped it on the remaining 11.
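For budgeting, the reference figure can be turned into a rough pilot estimate. The sketch below is arithmetic only. The PR volume and pilot length are hypothetical inputs you would replace with your own, and the $2.00 worst case is the upper end of the cost-per-PR target range given in the metrics table.

```python
REFERENCE_COST_PER_PR = 0.078   # USD per PR, from the reference run at defaults
WORST_CASE_COST_PER_PR = 2.00   # USD per PR, upper end of the stated target range


def pilot_budget(prs_per_week: int, weeks: int, cost_per_pr: float) -> float:
    """Estimate total AI spend for a pilot, rounded to cents."""
    return round(prs_per_week * weeks * cost_per_pr, 2)


# Hypothetical pilot: 120 PRs per week for 6 weeks.
expected = pilot_budget(120, 6, REFERENCE_COST_PER_PR)
ceiling = pilot_budget(120, 6, WORST_CASE_COST_PER_PR)
print(f"expected ~${expected:.2f}, worst case ~${ceiling:.2f}")
```

Actual spend depends on how often Layer 3 triggers on your repositories, so treat both numbers as planning bounds rather than predictions.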
Step 2: Opt the repo in explicitly
Create .visdom.yaml at the repo root and set enabled: true. This is
the explicit opt-in; without it the tool operates in observation mode only.
# .visdom.yaml: minimal opt-in
enabled: true
From here, every key you add deep-merges over tool defaults. Lists (e.g.
ignore.paths) concatenate; scalar values override.
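The merge semantics can be modeled as follows. This is a minimal sketch of "deep-merge, lists concatenate, scalars override", not the tool's actual implementation. The default values shown (including the default ignore path) are hypothetical placeholders.

```python
from copy import deepcopy
from typing import Any


def deep_merge(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Merge overrides into defaults: dicts recurse, lists concatenate, scalars override."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            merged[key] = deep_merge(current, value)   # nested section: recurse
        elif isinstance(current, list) and isinstance(value, list):
            merged[key] = current + value              # lists: defaults first, then yours
        else:
            merged[key] = deepcopy(value)              # scalars and new keys: override
    return merged


# Hypothetical tool defaults and a small repo-level override.
tool_defaults = {
    "enabled": False,
    "ignore": {"paths": ["node_modules/**"]},
    "limits": {"max_findings": 5},
}
repo_config = {"enabled": True, "ignore": {"paths": ["vendor/**"]}}

effective = deep_merge(tool_defaults, repo_config)
print(effective)
```

Keys the repo config never mentions, such as `limits.max_findings` here, keep their default values, which is why a minimal config file is enough.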
Step 3: Understand automatic path classification
File risk classification is an engine heuristic; there is no user-configurable classification
map in .visdom.yaml. The engine classifies files automatically: paths matching
/auth/, /middleware/, security, or crypto
patterns become critical; test files (.test.*, .spec.*,
test/) and config files (.env, *.config.ts,
*.json) are classified by extension. Everything else is standard.
The user-facing knobs that interact with classification are ignore.paths (exclude
paths from all review layers entirely) and per-lens min_severity (raise the bar for
what gets reported).
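To reason about how a given file is likely to be classified, the heuristic can be approximated like this. It is a sketch built only from the patterns listed above. The engine's real matching rules and precedence are not documented here, so the ordering (critical first, then test, then config) and the substring matching are assumptions.

```python
from pathlib import PurePosixPath

# Substrings that mark a path as critical, taken from the patterns listed above.
CRITICAL_MARKERS = ("/auth/", "/middleware/", "security", "crypto")


def classify_path(path: str) -> str:
    """Approximate the automatic classification: critical, test, config, or standard."""
    normalized = "/" + path.lstrip("/")       # leading slash so top-level dirs match
    lowered = normalized.lower()
    name = PurePosixPath(normalized).name

    # Assumed precedence: a critical marker wins over every other category.
    if any(marker in lowered for marker in CRITICAL_MARKERS):
        return "critical"
    if ".test." in name or ".spec." in name or "/test/" in lowered:
        return "test"
    if name == ".env" or name.endswith(".config.ts") or name.endswith(".json"):
        return "config"
    return "standard"


for example in ("src/auth/login.ts", "src/utils/format.test.ts", "package.json", "src/app.ts"):
    print(example, "->", classify_path(example))
```

If the result for a path surprises you, the available levers are `ignore.paths` and per-lens `min_severity`, since the classification itself cannot be overridden.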
Step 4: Start conservative, Layer 2 only
For the first week, run Layer 2 (AI Quick Scan) only. Observe findings, check false positive rates, and calibrate risk classification against your team's expectations. Layer 3 stays disabled.
Step 5: Enable Layer 3
After the first week, enable Layer 3 for MEDIUM-risk and above. Monitor finding quality, acceptance rates, and cost. Tune risk thresholds based on actual data.
Step 6: Add custom rules
If your domain has specific review needs (compliance, regulatory, domain-specific patterns),
add rule files under .visdom/rules/. Each file follows the unified
*.rules.yaml schema (pattern-match or LLM-judge).
Step 7: Enable the Proactive Scanner
Set up a weekly cron job for convention drift detection, coverage trends, and security baseline scanning. This runs independently of the PR flow and creates GitHub Issues for critical findings.
Key configuration decisions
All knobs live in .visdom.yaml. Every key deep-merges over tool defaults; you only
need to declare what differs. The following decisions have the most impact on effectiveness and
cost.
| Knob (.visdom.yaml) | What it controls | Guidance |
|---|---|---|
| lenses.<name>.enabled | Which review lenses run. Five lenses ship: security, correctness, test-quality, performance, maintainability. | The first four are on by default. Maintainability is opt-in; it raises more findings (precision-over-recall). Enable it when the team is ready for that volume. |
| lenses.<name>.min_severity | Minimum severity threshold per lens before a finding is reported. | Start at medium for all lenses. Lower to low for security and correctness once false-positive rates are understood. |
| limits.max_findings | A ceiling that can only lower built-in per-lens caps, never raise them. Each built-in lens has its own default cap (2-3 findings per PR); the config value (default 5) sits above all of them and has no effect at default. | Lower it to throttle a noisy pilot; the default has no effect on built-in lenses. |
| disable_rules | List of rule IDs to suppress globally. Applies to OOTB rules that do not fit your codebase. | Prefer per-finding dismissal first. Add to disable_rules only for rules that are structurally wrong for your stack (e.g., a correctness rule that conflicts with your framework's idiom). |
| standards.sources | Glob patterns over existing repo docs that the AI reads as standards context (800-line cap per source). | Point at your existing ADRs, style guides, or API conventions. The tool reads files already in your repo; no duplication is needed. |
| instructions | Free-text reviewer steering appended to every AI prompt. | Use sparingly for team-specific context the AI consistently misses. Example: "This codebase uses event-sourcing; mutations outside aggregate roots are intentional." |
| ignore.paths | Path globs excluded from all review layers. Concatenates with tool defaults. | Add generated code directories, vendored third-party code, and migration files. |
| confidence_buckets | Thresholds separating high / medium / low confidence bands (high: 0.8, medium: 0.5 by default). | Adjust if your team finds the default banding too aggressive or too permissive. |
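Two of these knobs are easier to understand as small functions. The sketch below models `limits.max_findings` as a ceiling that can only lower a per-lens cap, and `confidence_buckets` as two thresholds. Treating the thresholds as inclusive lower bounds is an assumption, and the function names are hypothetical.

```python
def effective_cap(lens_cap: int, max_findings: int = 5) -> int:
    """The configured ceiling can lower a built-in lens cap but never raise it."""
    return min(lens_cap, max_findings)


def confidence_band(score: float, high: float = 0.8, medium: float = 0.5) -> str:
    """Map a confidence score to a band, assuming thresholds are inclusive lower bounds."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


# At the default ceiling of 5, a lens capped at 3 findings is unaffected.
print(effective_cap(3))         # 3
# Lowering the ceiling to 1 throttles every lens to a single finding.
print(effective_cap(3, 1))      # 1
# Raising the ceiling does not lift the built-in cap.
print(effective_cap(2, 10))     # 2
print(confidence_band(0.65))    # medium
```

This is why the default `max_findings` has no visible effect: every built-in lens cap already sits below it.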
Full configuration reference
For the complete .visdom.yaml schema and all configurable options, see the
Configuration Reference.
Metrics to set up
Track these metrics from day one of the pilot. They provide the minimum signal needed to evaluate whether the tool is working and where to tune.
| Metric | Why it matters for a pilot | Target |
|---|---|---|
| Time to first comment | Measures whether developers get feedback before context-switching. The primary developer experience metric. | <5 min (Layer 2 only), <15 min (Layer 2 + Layer 3) |
| Finding acceptance rate | Are findings useful? Low acceptance means prompts or risk classification need tuning. | >60% |
| Layer 3 trigger rate | What percentage of PRs trigger the expensive deep review? Too low means you are missing risk. Too high means you are overspending. | 30-50% of PRs |
| Cost per PR | Total AI cost per pull request across all layers. Validates budget assumptions. Reference run: $0.078/PR across 38 PRs. | $0.05-2.00 depending on risk level |
| Per-finding rule attribution | Every finding in the per-PR result JSON carries its rule ID and lens. Use this to spot which rules generate the most noise; those are candidates for disable_rules. | Available from day one in result artifacts |
| Cross-layer confirmation rate | Findings flagged by more than one layer independently (e.g., Layer 1 SAST + Layer 3 correctness). A higher rate signals real issues. Reference run: 7 of 154 findings confirmed cross-layer. | Use as a trust signal; no hard target |
| TORS | Test Oracle Reliability Score: what percentage of test failures are real. If TORS is low, your agents and developers are wasting time on flaky tests. | >85% |
Pilot setup checklist
Two setup actions to complete in the first week, each with a concrete observable outcome:
- Wire SARIF into code-scanning. Run with --format=sarif to emit SARIF 2.1.0 output; feed it into GitHub Advanced Security or your code-scanning dashboard. Observable outcome: findings trend visible in the security tab after a week of PRs.
- Establish a bench baseline. Run npm run demo:bench against a curated set of PRs where findings are known. Record the F1 score. Observable outcome: you have a regression baseline to compare against after model upgrades.
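Both checklist items produce artifacts that are simple to post-process. The sketch below counts findings per rule in a SARIF 2.1.0 document (using the standard `runs[].results[].ruleId` layout) and computes F1 from true-positive, false-positive, and false-negative counts. The rule IDs and bench counts are invented for illustration, and the exact output of the bench command is not assumed here.

```python
from collections import Counter
from typing import Any


def findings_per_rule(sarif: dict[str, Any]) -> Counter[str]:
    """Count SARIF results by ruleId across all runs."""
    counts: Counter[str] = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # ruleId is optional in SARIF, so fall back to a placeholder.
            counts[result.get("ruleId", "<no-rule-id>")] += 1
    return counts


def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 is the harmonic mean of precision and recall; 0.0 when nothing was found correctly."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Minimal SARIF-shaped document with hypothetical rule IDs.
sample = {
    "version": "2.1.0",
    "runs": [{"results": [
        {"ruleId": "security/sql-injection", "level": "error"},
        {"ruleId": "correctness/null-check", "level": "warning"},
        {"ruleId": "security/sql-injection", "level": "error"},
    ]}],
}
per_rule = findings_per_rule(sample)
print(per_rule.most_common())
print(round(f1_score(true_pos=8, false_pos=2, false_neg=2), 3))
```

Rules that dominate the per-rule count without being accepted are the natural candidates for `disable_rules`, and the recorded F1 only stays meaningful if later runs use the same PR set and judge.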
Full metrics framework
For the complete per-layer metrics, end-to-end SDLC integration, and feedback mechanism, see the Metrics Framework reference.
Known risks
The following risks are inherent to any AI-assisted review system. VCR mitigates each through its layered architecture, but you should be aware of them when evaluating the system.
| Risk | Mitigation in VCR |
|---|---|
| LLM hallucination (false findings) | Layer 1 is fully deterministic. Layer 2 has confidence thresholds. Layer 3 findings require concrete file/line references. |
| Prompt injection via PR content | Layer 1 cannot be injected (no AI). AI layers use structured prompts with diff isolation. |
| Over-reliance on AI review | VCR explicitly directs human reviewers to focus areas. It supplements human review; it does not replace it. |
| Cost runaway on high-risk PRs | Daily budget caps. Risk-based gating. Layer 3 only triggers for MEDIUM+ risk. |
| Circular Test Trap (AI tests verify AI code) | Layer 2 detects AI-generated code. Layer 3 Test Quality lens identifies circular tests. |
| Flaky test noise (Lying Oracle) | TORS filters unreliable tests from feedback signal. Agents do not iterate on flaky failures. |
| Convention drift across teams | Proactive Scanner detects diverging patterns weekly. Convention enforcement in the PR flow uses org-defined rules (type: llm or type: pattern in .visdom/rules/) combined with instructions steering; there is no separate Conventions lens in v8. |
| Model degradation over time | Feedback mechanism (developer reactions) detects declining finding quality. Model selection is configurable. |
| LLM-judge variance | Identical findings can be scored differently across runs due to LLM non-determinism. Do not compare F1 or precision metrics across different judges, model versions, or evaluation scopes; the results are not comparable. See the practices doc for measurement guidance. |
Detailed risk analysis
For the full risk analysis including mitigation strategies and monitoring guidance, see the AI Quick Scan layer reference.
Reference implementations
VCR is a process framework. VirtusLab provides reference implementations for each component, but every piece is substitutable with equivalent tooling that your organization already operates.
| Component | Reference implementation | Alternatives |
|---|---|---|
| Repository knowledge layer | Context Fabric (VirtusLab, MIT) | Sourcegraph, custom DuckDB/SQLite, GitHub CODEOWNERS + scripts |
| CI infrastructure | Visdom Machine-Speed CI | Bazel + EngFlow, Nx, Turborepo, Gradle remote cache |
| SAST | Semgrep (open source) | CodeQL, SonarQube, Snyk Code |
| Secret scanning | gitleaks (open source) | truffleHog, GitHub secret scanning |
| AI provider | Anthropic (Claude Haiku/Sonnet/Opus) | OpenAI GPT-4o, Azure OpenAI, Google Gemini |
| CI/CD platform | GitHub Actions | GitLab CI, Azure Pipelines, Jenkins |
Full reference implementations
For detailed component descriptions and integration guidance, see the Reference Implementations page.