
For Developers

What changes in your daily PR workflow, and why it's not another noisy bot.

You've seen bots on PRs before. They post 40 comments, half are wrong, and you ignore all of them by the second week. VCR is built to be the opposite of that. Here's what actually changes in your day-to-day.

What you'll see on your PRs

VCR posts inline comments directly on the relevant lines of your diff. Each comment carries structured metadata so you can decide immediately how much weight to give it.

The example below is illustrative — it shows the anatomy of a v8 inline comment, not actual tool output.

VCR · HIGH · Security · Confidence: high · CWE-89 · OWASP A03 · confirmed by layers 1+3

SQL query built by string concatenation with unsanitized user input. An attacker can manipulate the query to read or delete arbitrary rows.

Suggestion: Use a parameterized query instead of string concatenation.

```suggestion
- const query = "SELECT * FROM users WHERE id = " + userId;
+ const query = "SELECT * FROM users WHERE id = ?";
+ db.execute(query, [userId]);
```

Each finding includes:

Silence is the default

Most PRs are fine. VCR is designed to stay quiet when there's nothing worth saying. If you're not hearing from it, that's working as intended.

How the risk levels work

Every PR gets a risk level. This determines how deep VCR looks and how much of your (and your reviewer's) time it asks for.

| Risk | What triggers it | What happens |
| --- | --- | --- |
| LOW | Small change, safe paths, tests pass | Fast scan only, done in ~2 min. No deep review. |
| MEDIUM | Sensitive path or coverage drop | Deep review kicks in. More thorough analysis. |
| HIGH | Critical path (auth, payments, infra) | Full multi-lens analysis. ~10 min. |
| CRITICAL | All the above + AI-generated code on a critical path | Full analysis + your senior gets pinged directly. |

The mental model is simple: if you touch auth, expect at least HIGH, and CRITICAL if the change is AI-generated. If you touch docs, expect LOW. That's it.
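To make the table concrete, here is a minimal sketch of how those triggers could map to a risk level. It is an illustration only: the `PrSignals` fields and the `classifyRisk` function are hypothetical names invented for this example, not VCR's actual schema or logic, and the real classifier may weigh more signals than these four.

```typescript
// Hypothetical sketch of the risk table. Names are illustrative, not VCR's API.
type Risk = "LOW" | "MEDIUM" | "HIGH" | "CRITICAL";

interface PrSignals {
  touchesCriticalPath: boolean;  // auth, payments, infra
  touchesSensitivePath: boolean; // a path the team has marked as sensitive
  coverageDropped: boolean;      // test coverage went down in this PR
  aiGenerated: boolean;          // the change was written by an AI assistant
}

function classifyRisk(pr: PrSignals): Risk {
  // Check the most severe condition first so it wins over the weaker ones.
  if (pr.touchesCriticalPath && pr.aiGenerated) return "CRITICAL";
  if (pr.touchesCriticalPath) return "HIGH";
  if (pr.touchesSensitivePath || pr.coverageDropped) return "MEDIUM";
  return "LOW";
}

// A hand-written change to auth code lands on HIGH under these assumptions.
const authChange = classifyRisk({
  touchesCriticalPath: true,
  touchesSensitivePath: false,
  coverageDropped: false,
  aiGenerated: false,
});
```

Ordering the checks from most to least severe keeps the function a direct reading of the table, one row per branch.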

What VCR catches that you might miss

VCR reviews code through five lenses. Each lens is focused on a distinct failure mode that tends to slip through human review, especially on AI-assisted code.

Security

Injection flaws, authentication gaps, insecure deserialization, missing authorization checks. VCR maps findings to CWE and OWASP references where the rule carries that metadata, so you can look up the full vulnerability class rather than just reading a one-line description.

Correctness

Logic that compiles and passes type checks but is wrong at runtime. One example caught in the v8 llama3 run: "Double-checked locking pattern without proper volatile semantics". This is a concurrency bug that's invisible to the type system and produces intermittent failures under load.

Test quality

Circular tests pass but prove nothing. When Copilot generates both the code and the test, the test is a mirror of the implementation. VCR flags this explicitly: "This test verifies implementation, not specification." It also checks for missing coverage on new entry points and tests that assert on internal state rather than observable behavior.
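The difference is easier to see side by side. The sketch below uses a hypothetical `applyDiscount` function, invented for this example, and contrasts a circular check with one derived from the requirement. It shows the pattern itself, not how VCR detects it.

```typescript
// Hypothetical function under test: applies a percentage discount to a price.
function applyDiscount(price: number, percent: number): number {
  return price - (price * percent) / 100;
}

// Circular: the expected value is recomputed with the implementation's own
// formula, so the check still passes if that formula is wrong.
function circularTest(): boolean {
  const price = 200;
  const percent = 25;
  const expected = price - (price * percent) / 100;
  return applyDiscount(price, percent) === expected;
}

// Specification-based: the expected values come from the requirement
// ("25% off 200 is 150", "0% off leaves the price unchanged"),
// written down independently of the code.
function specificationTest(): boolean {
  return applyDiscount(200, 25) === 150 && applyDiscount(80, 0) === 80;
}
```

Both checks pass against the correct implementation. Only the second one would fail if the formula were changed to something wrong, which is the property a test is supposed to have.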

Performance

N+1 queries, blocking I/O in async paths, accidentally quadratic patterns, and unbounded caches. This lens is on by default at medium severity and above. In the llama3 run the lens flagged, for example: "Shared mutable `rowBuf`/`vecBuf` fields cause data corruption under concurrent inference". That finding sits on the performance/correctness boundary: the shared mutable state corrupts results under concurrent load rather than merely slowing things down, so the overlap between lenses is real.

Maintainability (opt-in)

Duplication, God objects, and dead code. This lens is off by default. Enable it in `.visdom.yaml` (`lenses: maintainability: { enabled: true }`) if your team wants it. When on, it flags patterns like three wrapper classes for one function and tells you: "This could be a single function call."

💡 It's not a linter

VCR doesn't care about your semicolons, import order, or variable names. That's what ESLint, Prettier, and your IDE are for. VCR looks at the stuff that actually causes incidents.

How to give feedback

VCR learns from your reactions. Every finding has reaction buttons on the PR comment. Use them. It takes one click and directly shapes what VCR comments on next time.

| Reaction | Meaning | What happens |
| --- | --- | --- |
| 👍 | Helpful, you fixed it | VCR gains confidence in this category for your codebase |
| 👎 | False positive, not relevant | VCR reduces weight for this pattern in your context |
| 🤔 | Not sure, needs discussion | Flagged for team review, helps calibrate edge cases |

Your feedback matters

Thumbs-down a bad finding and VCR learns. The more your team reacts, the fewer false positives you'll see over time. This is how VCR avoids becoming another bot you ignore.

What VCR does NOT do

Let's be clear about the boundaries so you don't have wrong expectations.

For senior developers

If you're the person who approves PRs and sets standards for your team, here's what changes for you specifically.

You'll review fewer PRs, but with better context. VCR pre-annotates every PR with risk level, specific findings, and exactly which files need your attention. Instead of reading every line of a 400-line diff, you focus on the 3 files that matter.

VCR tells you WHERE to look, not what to decide. It's a guide, not a replacement. "Focus your review on auth.ts (security findings) and test coverage gap." That's the kind of guidance you get.

Customization you'll care about

Full details on lenses and configuration: Configuration Reference

What developers are saying

Junior in Bangalore, Frontend Developer

VCR taught me about circular tests. I didn't know my tests were just mirrors of my implementation. They passed, so I thought they were fine. Now I actually write tests that check behavior, not just that the code runs.

Senior in Krakow, Tech Lead

I review 40% fewer PRs but catch more real issues. VCR caught a hallucinated API last week that I would have missed. It looked completely correct at first glance. VCR told me exactly which line to look at and why.

## AI code review: state of the market

Two independent benchmarks now measure what AI review tools actually catch on real pull requests. The data below is drawn from published, reproducible evaluations, not vendor self-reports.

Tool Precision Recall F1 Source
Propel 68% 61% 64% Propel Benchmark
Cubic 56% 69% 62% Martian Bench
Qodo 57% 60% Martian Bench
Augment 65% 55% 59% Propel Benchmark
CodeRabbit 36–48% 43–55% 39–51% Martian + Propel
Baz #1 on Martian ~50% Martian Bench
Claude Code 23% 51% 31% Propel Benchmark
GitHub Copilot 20% 34% 25% Propel Benchmark

About these benchmarks

Martian Code Review Bench: 50 curated PRs + 200k online PRs across Sentry, Grafana, Cal.com, Discourse, Keycloak. Human-verified golden comments. LLM-as-judge (Claude/GPT). MIT licensed, open source. Created by researchers from DeepMind, Anthropic, and Meta.

Propel Benchmark: 50 PRs from production open-source repos. Externally authored: Propel did not influence repo selection, PR selection, or labeling. Tools tested with default settings, no customization.

An F1 of 50–65% is the current state of the art. No tool exceeds 70% F1 on real-world PRs. Precision measures how often a tool's comments lead to a code change. Recall measures how many real issues the tool catches. F1 is the harmonic mean of the two.
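For readers who want to check the numbers, F1 is the harmonic mean of precision and recall. A short sketch, using the Propel row from the table as input (the `f1` helper is written for this example and is not part of any benchmark's tooling):

```typescript
// F1 is the harmonic mean of precision and recall, both given as fractions in [0, 1].
function f1(precision: number, recall: number): number {
  if (precision + recall === 0) return 0; // avoid dividing by zero
  return (2 * precision * recall) / (precision + recall);
}

// Propel: precision 68%, recall 61%. The result is about 0.643,
// which rounds to the 64% shown in the table.
const propelF1 = f1(0.68, 0.61);
```

Because the harmonic mean is pulled toward the lower of the two inputs, a tool cannot reach a high F1 by being strong on precision or recall alone.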