For Developers | Visdom Code Review

You've seen bots on PRs before. They post 40 comments, half are wrong, and you ignore all of them by the second week. Visdom Code Review (VCR) is built to be the opposite of that. Here's what actually changes in your day-to-day.

What you'll see on your PRs

VCR posts inline comments directly on the relevant lines of your diff. Each comment carries structured metadata so you can decide immediately how much weight to give it.

The example below is illustrative — it shows the anatomy of a v8 inline comment, not actual tool output.

VCR · HIGH · Security · Confidence: high · CWE-89 · OWASP A03 · confirmed by layers 1+3

SQL query built by string concatenation with unsanitized user input. An attacker can manipulate the query to read or delete arbitrary rows.

Suggestion: Use a parameterized query instead of string concatenation.

```suggestion
const query = "SELECT * FROM users WHERE id = ?";
db.execute(query, [userId]);
```

Each finding includes:

Severity — HIGH, MEDIUM, or LOW, matching the risk level of the change
Category — which lens flagged it (security, correctness, test-quality, performance, maintainability)
Confidence bucket — high (≥0.8), medium (≥0.5), or requires verification (below 0.5)
CWE / OWASP reference — included when the rule carries that metadata
Cross-layer confirmation — "confirmed by layers 1+3" means two independent analysis passes found the same issue; these are the most reliable findings
Suggestion prose — a plain-language description of what to fix and why
One-click GitHub suggestion block — when the engine can produce an exact patch, it renders as a GitHub suggestion block you can apply directly from the PR review UI

What to do with each tier:

High confidence and cross-layer confirmed: address before merge.
Medium: use your judgment.
Requires verification (below 0.5): treat the finding as a question for the author or reviewer, not a verdict.
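
The bucket thresholds and triage advice above can be sketched in a few lines. This is a minimal illustration, not VCR's implementation: the `Finding` shape and the function names are hypothetical, and only the 0.8 and 0.5 thresholds and the three pieces of advice come from the text.

```typescript
// Hypothetical shape of a finding; the field names are assumptions for this sketch.
interface Finding {
  confidence: number; // score between 0 and 1
  confirmedByLayers: number[]; // e.g. [1, 3] for "confirmed by layers 1+3"
}

type Bucket = "high" | "medium" | "requires verification";

// Thresholds as stated in the list above: high is >= 0.8, medium is >= 0.5.
function confidenceBucket(score: number): Bucket {
  if (score >= 0.8) return "high";
  if (score >= 0.5) return "medium";
  return "requires verification";
}

function triage(finding: Finding): string {
  const bucket = confidenceBucket(finding.confidence);
  const crossLayer = finding.confirmedByLayers.length >= 2;
  if (bucket === "high" && crossLayer) return "address before merge";
  if (bucket === "requires verification") return "treat as a question";
  // Assumption: a high-confidence finding without cross-layer confirmation
  // is handled like a medium one. The text does not spell this case out.
  return "use your judgment";
}
```

For example, `triage({ confidence: 0.9, confirmedByLayers: [1, 3] })` returns `"address before merge"`, while a 0.4 finding returns `"treat as a question"`.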

✅ Silence is the default

Most PRs are fine. VCR is designed to stay quiet when there's nothing worth saying. If you're not hearing from it, that's working as intended.

How the risk levels work

Every PR gets a risk level. This determines how deep VCR looks and how much of your (and your reviewer's) time it asks for.

Risk	What triggers it	What happens
LOW	Small change, safe paths, tests pass	Fast scan only, done in ~2 min. No deep review.
MEDIUM	Sensitive path or coverage drop	Deep review kicks in. More thorough analysis.
HIGH	Critical path (auth, payments, infra)	Full multi-lens analysis. ~10 min.
CRITICAL	All the above + AI-generated code on a critical path	Full analysis + your senior gets pinged directly.

The mental model is simple: if you touch auth, payments, or infra, expect HIGH, or CRITICAL when the change is AI-generated. If you only touch docs, expect LOW. That's it.
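
Read as a decision procedure, the table looks roughly like the sketch below. The `PrSignals` shape and the `riskLevel` function are hypothetical names for illustration; the real classifier is not documented here. Change size and test status are left out because the table does not say which level they escalate to.

```typescript
type RiskLevel = "LOW" | "MEDIUM" | "HIGH" | "CRITICAL";

// Hypothetical input: one flag per trigger named in the table.
interface PrSignals {
  touchesCriticalPath: boolean; // auth, payments, infra
  touchesSensitivePath: boolean;
  coverageDropped: boolean;
  aiGenerated: boolean;
}

function riskLevel(pr: PrSignals): RiskLevel {
  // A critical path is HIGH on its own; AI-generated code there raises it to CRITICAL.
  if (pr.touchesCriticalPath) return pr.aiGenerated ? "CRITICAL" : "HIGH";
  if (pr.touchesSensitivePath || pr.coverageDropped) return "MEDIUM";
  // Assumption: with none of the triggers above, the PR is treated as LOW.
  return "LOW";
}
```

Note that AI-generated code on a non-critical path does not raise the level in this sketch, which matches the table as written.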

What VCR catches that you might miss

VCR reviews code through five lenses. Each lens is focused on a distinct failure mode that tends to slip through human review, especially on AI-assisted code.

Security

Injection flaws, authentication gaps, insecure deserialization, missing authorization checks. VCR maps findings to CWE and OWASP references where the rule carries that metadata, so you can look up the full vulnerability class rather than just reading a one-line description.

Correctness

Logic that compiles and passes type checks but is wrong at runtime. One example caught in the v8 llama3 run: "Double-checked locking pattern without proper volatile semantics". This is a concurrency bug that is invisible to the type system and produces intermittent failures under load.

Test quality

Circular tests pass but prove nothing. When an assistant such as Copilot generates both the code and the test, the test ends up mirroring the implementation, so it passes even when the implementation is wrong. VCR flags this explicitly: "This test verifies implementation, not specification." It also checks for missing coverage on new entry points and for tests that assert on internal state rather than observable behavior.
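
To see why a circular test proves nothing, consider the small sketch below. Everything in it is hypothetical (`applyDiscount` and both test functions are made up for illustration, and the bug is deliberate); it is not taken from VCR output.

```typescript
// Hypothetical function under test. Intended behavior: 10% off 200 is 180.
// Deliberate bug: it divides by 10 instead of 100.
function applyDiscount(price: number, percent: number): number {
  return price - (price * percent) / 10;
}

// Circular test: the expected value is recomputed with the same (buggy)
// formula, so the test mirrors the implementation and passes anyway.
function circularTestPasses(): boolean {
  const expected = 200 - (200 * 10) / 10;
  return applyDiscount(200, 10) === expected;
}

// Specification test: the expected value comes from the requirement,
// not from the code, so the bug is caught.
function specificationTestPasses(): boolean {
  return applyDiscount(200, 10) === 180;
}
```

Here the circular test passes and the specification test fails, which is the gap the "verifies implementation, not specification" finding points at.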

Performance

N+1 queries, blocking I/O in async paths, accidentally-quadratic patterns, and unbounded caches. This lens is on by default at medium severity and above. In the llama3 run the lens flagged, for example: "Shared mutable `rowBuf`/`vecBuf` fields cause data corruption under concurrent inference". That finding sits on the performance/correctness boundary, because the shared mutable state corrupts results under concurrent load rather than merely slowing things down. The overlap between lenses is real.
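
As an illustration of the first pattern in that list, here is a minimal N+1 sketch against an in-memory stand-in for a database. `FakeDb`, its methods, and the sample data are all hypothetical; the point is only the difference in query count between the two functions.

```typescript
type User = { id: number; name: string };
type Order = { id: number; userId: number };

// Hypothetical in-memory database that counts how many queries it serves.
class FakeDb {
  queryCount = 0;
  private users: User[] = [
    { id: 1, name: "Ada" },
    { id: 2, name: "Lin" },
  ];

  findUser(id: number): User | undefined {
    this.queryCount++;
    return this.users.find((u) => u.id === id);
  }

  findUsers(ids: number[]): User[] {
    this.queryCount++;
    return this.users.filter((u) => ids.includes(u.id));
  }
}

// N+1 shape: one lookup per order, so query count grows with the list.
function customerNamesNPlusOne(db: FakeDb, orders: Order[]): string[] {
  return orders.map((o) => db.findUser(o.userId)?.name ?? "unknown");
}

// Batched shape: collect the ids, fetch once, then join in memory.
function customerNamesBatched(db: FakeDb, orders: Order[]): string[] {
  const ids = Array.from(new Set(orders.map((o) => o.userId)));
  const nameById = new Map<number, string>();
  for (const u of db.findUsers(ids)) nameById.set(u.id, u.name);
  return orders.map((o) => nameById.get(o.userId) ?? "unknown");
}
```

With three orders, the first function issues three user lookups and the second issues one, while both return the same names.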

Maintainability (opt-in)

Duplication, God objects, and dead code. This lens is off by default — enable it in .visdom.yaml (lenses: maintainability: { enabled: true }) if your team wants it. When on, it flags patterns like three wrapper classes for one function and tells you: "This could be a single function call."

💡 It's not a linter

VCR doesn't care about your semicolons, import order, or variable names. That's what ESLint, Prettier, and your IDE are for. VCR looks at the stuff that actually causes incidents.

How to give feedback

VCR learns from your reactions. Every finding has reaction buttons on the PR comment. Use them. It takes one click and directly shapes what VCR comments on next time.

Reaction	Meaning	What happens
👍	Helpful, you fixed it	VCR gains confidence in this category for your codebase
👎	False positive, not relevant	VCR reduces weight for this pattern in your context
🤔	Not sure, needs discussion	Flagged for team review, helps calibrate edge cases

✅ Your feedback matters

Thumbs-down a bad finding and VCR learns. The more your team reacts, the fewer false positives you'll see over time. This is how VCR avoids becoming another bot you ignore.

What VCR does NOT do

Let's be clear about the boundaries so you don't have wrong expectations.

Won't auto-fix your code. v8 reports findings and can render one-click GitHub suggestion blocks where an exact patch is possible, but you apply the change — VCR never commits to your branch.
Won't replace human review. Your senior still approves the PR. VCR tells them where to look, not what to decide.
Won't block your PR unless your team explicitly configures it to. By default, VCR advises.
Won't comment on formatting, naming, or import order. That's what your linter is for. VCR focuses on things that cause actual problems.

Low-confidence findings are labeled "requires verification". Treat them as questions worth investigating, not as verdicts. High-confidence findings and cross-layer confirmations are where you should focus first.

For senior developers

If you're the person who approves PRs and sets standards for your team, here's what changes for you specifically.

You'll review fewer PRs, but with better context. VCR pre-annotates every PR with risk level, specific findings, and exactly which files need your attention. Instead of reading every line of a 400-line diff, you focus on the 3 files that matter.

VCR tells you WHERE to look, not what to decide. It's a guide, not a replacement. "Focus your review on auth.ts (security findings) and test coverage gap." That's the kind of guidance you get.

Customization you'll care about

Org rules: add domain-specific checks as rule files in .visdom/rules/. If your team has specific patterns for database migrations or event sourcing, you can teach VCR to check for them.
Standards and instructions: point standards.sources in .visdom.yaml at your existing docs and steer the review with instructions. "We use direct instantiation, not Factory pattern." That kind of thing. VCR will flag deviations.

Full details on lenses and configuration: Configuration Reference

What developers are saying

Junior in Bangalore, Frontend Developer

VCR taught me about circular tests. I didn't know my tests were just mirrors of my implementation. They passed, so I thought they were fine. Now I actually write tests that check behavior, not just that the code runs.

Senior in Krakow, Tech Lead

I review 40% fewer PRs but catch more real issues. VCR caught a hallucinated API last week that I would have missed. It looked completely correct at first glance. VCR told me exactly which line to look at and why.

## AI code review: state of the market

Two independent benchmarks now measure what AI review tools actually catch on real pull requests. The data below is drawn from published, reproducible evaluations, not vendor self-reports.

Tool	Precision	Recall	F1	Source
Propel	68%	61%	64%	Propel Benchmark
Cubic	56%	69%	62%	Martian Bench
Qodo	—	57%	60%	Martian Bench
Augment	65%	55%	59%	Propel Benchmark
CodeRabbit	36–48%	43–55%	39–51%	Martian + Propel
Baz	#1 on Martian	—	~50%	Martian Bench
Claude Code	23%	51%	31%	Propel Benchmark
GitHub Copilot	20%	34%	25%	Propel Benchmark

About these benchmarks

Martian Code Review Bench: 50 curated PRs + 200k online PRs across Sentry, Grafana, Cal.com, Discourse, Keycloak. Human-verified golden comments. LLM-as-judge (Claude/GPT). MIT licensed, open source. Created by researchers from DeepMind, Anthropic, and Meta.

Propel Benchmark: 50 PRs from production open-source repos. Externally authored: Propel did not influence repo selection, PR selection, or labeling. Tools tested with default settings, no customization.

F1 50–65% is current state of the art. No tool exceeds 70% F1 on real-world PRs. Precision measures how often a tool's comments lead to a code change. Recall measures how many real issues the tool catches.
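
F1 is the harmonic mean of precision and recall, so a tool cannot score well by excelling at only one of them. The sketch below shows the standard formula applied to the Propel row of the table; the function name is our own, and published F1 values may differ by a point because they are computed from unrounded precision and recall.

```typescript
// Standard F1: harmonic mean of precision and recall (both between 0 and 1).
function f1Score(precision: number, recall: number): number {
  if (precision + recall === 0) return 0; // avoid dividing by zero
  return (2 * precision * recall) / (precision + recall);
}

// Propel row from the table: precision 68%, recall 61%.
const propelF1 = f1Score(0.68, 0.61);
console.log((propelF1 * 100).toFixed(0) + "%"); // prints "64%"
```

The same calculation on the GitHub Copilot row (20% precision, 34% recall) gives about 25%, matching the table.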