Layer 2 is the first AI contact with the code. It has two goals: (1) fast feedback on obvious problems, and (2) risk classification of the PR to determine whether the expensive Layer 3 runs.
Input
- `review-context.json` from Layer 0
- `layer1-results.json` from Layer 1
- Diff only (not full files) + ~50 lines of surrounding context per hunk
Prompt Structure
You are a code reviewer. Review this diff.
## Context
- PR: {title}, author: {author}
- Affected paths: {paths} (classification: {critical/sensitive/standard})
- Layer 1 findings: {summary}
- Repository knowledge: {ownership, recent changes, module stability}
## Tasks
1. RISK CLASSIFICATION: Assess PR risk (LOW/MEDIUM/HIGH/CRITICAL)
Signals: size, affected paths, complexity, test coverage delta, module stability
2. QUICK FINDINGS: Report obvious problems (max 5):
- Obvious bugs, missing error handling
- Missing tests for new code paths
- Copy-paste / dead code
- DO NOT comment on: naming style, missing docs, import order, formatting
(these are handled by Layer 1 linters)
3. AI-CODE DETECTION: Does this code appear AI-generated?
Signals: over-engineering, unnecessary abstractions, generic variable names
4. CIRCULAR TEST DETECTION: Do new tests mirror implementation logic
rather than testing against a specification?
## Output format
Respond with JSON matching the Layer 2 output schema:
- risk_classification: LOW | MEDIUM | HIGH | CRITICAL
- risk_signals: array of { signal, value, weight }
- findings: array of { severity, file, line, category, description,
suggestion, confidence }
- ai_generated: { detected: bool, confidence: float, signals: array }
- circular_tests: array of { test_file, test_name, reason }
Risk Classification Logic
Risk classification is primarily deterministic, with AI judgment as one input among several:
| Signal | Source | Weight |
|---|---|---|
| Path classification | Config (deterministic) | High |
| Diff size | Git (deterministic) | Medium |
| Coverage delta | CI (deterministic) | Medium |
| Module stability | Repository knowledge layer (deterministic) | Medium |
| AI-generated flag | Layer 2 AI detection | Medium |
| AI complexity assessment | Layer 2 AI judgment | Low |
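These weighted signals feed simple threshold rules. A minimal sketch of such a deterministic-first classifier follows; the signal names, parameter types, and exact thresholds are illustrative, not the production logic:

```python
def classify_risk(path_class: str, diff_lines: int,
                  coverage_delta: float, ai_generated: bool) -> str:
    """Deterministic-first risk classification.

    path_class: "critical" | "sensitive" | "standard" (from config)
    diff_lines: changed-line count (from git)
    coverage_delta: coverage change in percentage points (from CI)
    ai_generated: AI-detection flag (from Layer 2)
    """
    large_diff = diff_lines > 500          # threshold is illustrative
    coverage_drop = coverage_delta < -5.0  # ">5% drop" rule

    if path_class == "critical" and large_diff and coverage_drop:
        return "CRITICAL"
    if path_class == "critical" or (large_diff and path_class == "sensitive"):
        return "HIGH"
    if path_class == "sensitive" or ai_generated or coverage_drop:
        return "MEDIUM"
    return "LOW"
```

Because the AI judgment enters only as one boolean among several deterministic inputs, two runs over the same PR can disagree on the AI flag without flipping the classification in most cases.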
CRITICAL: critical path + large diff + coverage drop
HIGH: critical path OR (large diff + sensitive path)
MEDIUM: sensitive path OR AI-generated flag OR coverage drop >5%
LOW: small diff + standard/low_risk paths + coverage stable
Gate Decision
| Risk | Layer 3? | Estimated cost |
|---|---|---|
| LOW | Skip | ~$0.01–0.05 per PR |
| MEDIUM | Yes, standard depth | ~$0.10–0.50 per PR |
| HIGH | Yes, full depth + extra lenses | ~$0.50–2.00 per PR |
| CRITICAL | Yes, full depth + mandatory senior review flag | ~$0.50–2.00 per PR |
Client can override: "always run Layer 3 on all PRs" or "skip Layer 3 for docs/**".
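One way such overrides might be expressed in a client configuration file; the file path and key names here are hypothetical:

```yaml
# .vcr/config.yml (hypothetical) — gate overrides
layer3:
  always_run: false        # set true to run Layer 3 on every PR
  skip_paths:
    - "docs/**"            # never escalate documentation-only PRs
  max_daily_budget_usd: 50
```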
Risk Analysis: What Can Go Wrong
Layer 2 is the hinge of the system. It determines cost, depth, and trust. The following risks have been researched and mitigated in the design.
Risk 1: Non-determinism
⚠️ Risk: Non-determinism
LLMs produce different outputs for identical inputs, even at temperature=0. The same PR reviewed twice may receive different risk scores.
Evidence: Research measuring consistency across 5 identical runs found Claude Sonnet at 0.85 correlation, GPT-4o at 0.79. Subjective assessments (e.g., "maintainability") dropped to 0.53 correlation.
Mitigation: Risk classification uses deterministic signals as primary inputs (path classification, diff size, coverage delta). AI judgment is one signal among several, not the sole decider. For borderline cases near risk thresholds, optional consensus (2–3 runs, majority vote).
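The borderline-case consensus could be sketched as follows; `classify_once` is a hypothetical callable wrapping a single non-deterministic model run:

```python
from collections import Counter

def consensus_classify(classify_once, runs: int = 3) -> str:
    """Run the classifier several times and take a majority vote.
    Ties resolve to the most conservative (highest) risk level."""
    order = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]
    votes = Counter(classify_once() for _ in range(runs))
    best = max(votes.values())
    winners = [label for label, n in votes.items() if n == best]
    # Prefer the highest risk level among tied winners.
    return max(winners, key=order.index)
```

Resolving ties upward is deliberate: an occasional unnecessary Layer 3 run is cheaper than a missed escalation.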
Risk 2: Cry Wolf Effect, Developer Alert Fatigue
⚠️ Risk: Cry Wolf Effect
Too many comments lead developers to auto-dismiss everything, including real findings.
Evidence: CodeRabbit produces 8–20 comments per PR. After ~10 days, teammates auto-dismissed all of them. GitHub Copilot intentionally limits to 2–5 comments with 71% actionable rate and stays silent in 29% of cases. Industry rule: <30–40% action rate = noise.
Mitigation:
- Hard cap: Layer 2 max 5 findings, prioritized by severity
- Confidence threshold: do not report findings below 0.8 confidence
- Silence is OK: if nothing important, report "VCR: No issues found (risk: LOW)"
- Precision over recall: better to miss a LOW finding than report a false positive
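The cap and threshold reduce to a small post-filter over the model's raw findings. A minimal sketch, assuming findings are dicts matching the Layer 2 output schema:

```python
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def filter_findings(findings, max_findings=5, min_confidence=0.8):
    """Drop low-confidence findings, then keep at most `max_findings`,
    most severe first. An empty result means VCR stays silent."""
    confident = [f for f in findings if f["confidence"] >= min_confidence]
    confident.sort(key=lambda f: (SEVERITY_ORDER[f["severity"]],
                                  -f["confidence"]))
    return confident[:max_findings]
```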
Risk 3: Large Diff Degradation
⚠️ Risk: Large Diff Degradation
AI accuracy degrades on large PRs (>500 changed lines). The "lost in the middle" phenomenon means content at the beginning and end of context gets 85–95% accuracy, while the middle drops to 76–82%.
Evidence: Models with claimed 200K token context become unreliable around 130K tokens, with sudden performance drops.
Mitigation:
- Chunk strategy: for PRs >500 lines, Layer 2 analyzes per-file and aggregates at the end
- Repository knowledge layer provides targeted context (relevant files only, not everything)
- PR size warning: PRs >1000 lines get automatic recommendation to split
- Layer 3 uses selective context informed by dependency graph, not "entire module"
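The per-file chunking step is mechanical: a unified diff is split on its `diff --git` headers so each file can be reviewed in its own prompt and the findings aggregated afterwards. A minimal sketch:

```python
def split_diff_by_file(diff: str):
    """Split a unified diff into per-file chunks, keyed by file path,
    so each chunk fits comfortably inside the model's reliable window."""
    chunks, current = {}, None
    for line in diff.splitlines():
        if line.startswith("diff --git"):
            # e.g. "diff --git a/src/app.py b/src/app.py"
            current = line.split(" b/")[-1]
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return {path: "\n".join(lines) for path, lines in chunks.items()}
```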
Risk 4: Prompt Injection via PR Content
⚠️ Risk: Prompt Injection
Malicious or "creative" code comments, PR descriptions, or commit messages can contain instructions that manipulate the AI reviewer.
Evidence: Anthropic's own Claude Code Security Review action warns it is "not hardened against prompt injection." OWASP ranks prompt injection as #1 risk for LLMs. Every file, comment, and PR description is a potential injection surface.
Mitigation:
- Input sanitization: strip suspicious instruction patterns before passing to AI
- Segregated prompts: system prompt with hard rules ("NEVER skip security analysis") separated from user content
- Canary checks: include known-bad patterns in test. If AI misses them, injection likely occurred.
- Layer 1 as backstop: deterministic SAST/secret scan is immune to prompt injection. Even if Layer 2 AI is manipulated, Layer 1 still blocks.
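The canary check can be illustrated with a sketch: a deliberately vulnerable snippet is planted alongside the real diff, and if the reviewer fails to flag it, its instructions may have been overridden by injected content. The snippet and matching logic here are illustrative:

```python
# A known-bad snippet planted alongside the real diff (illustrative).
CANARY = 'query = "SELECT * FROM users WHERE id = " + user_id'

def canary_caught(reported_findings) -> bool:
    """True if the reviewer flagged the planted vulnerability.
    False suggests the review was manipulated and should be discarded."""
    return any("sql injection" in f["description"].lower()
               for f in reported_findings)
```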
Risk 5: Cost Explosion at Scale
⚠️ Risk: Cost Explosion
Enterprise with 200 developers at 1–2 PRs/day = 200–400 PRs/day. Deep review at $0.50–2.00/PR scales to $2,000–16,000/month.
Evidence: Claude Code Review averages $15–25 per PR (full agentic review). At 100 PRs/day, monthly cost reaches $45,000–75,000.
Mitigation:
- Layered cost model: Layer 2 (small/fast model) at $0.01–0.05. Layer 3 (capable model) only for MEDIUM+ risk
- Budget caps in configuration: `max_daily_layer3_budget: $50`
- Model tiering: fast model for Layer 2, capable model for Layer 3 standard, most capable model for CRITICAL
- Prompt caching reduces cost of repeated context
- Track cost-per-useful-finding, not cost-per-PR
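Enforcing the daily cap amounts to a small reservation check before each Layer 3 run. A minimal sketch, with the default limit and class name chosen for illustration:

```python
class Layer3Budget:
    """Daily spending cap for Layer 3: once exhausted, the gate falls
    back to Layer 2-only review and flags the PR for human attention."""

    def __init__(self, daily_limit_usd: float = 50.0):
        self.daily_limit = daily_limit_usd
        self.spent_today = 0.0  # reset by a daily scheduler (not shown)

    def try_reserve(self, estimated_cost: float) -> bool:
        """Reserve budget for one Layer 3 run; False means skip it."""
        if self.spent_today + estimated_cost > self.daily_limit:
            return False
        self.spent_today += estimated_cost
        return True
```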
Risk 6: Generic / Surface-Level Feedback
⚠️ Risk: Generic Feedback
Without repo context, AI falls back to generic "code reviewer" mode, commenting on naming, suggesting docstrings, flagging missing type hints. Nothing a senior would not see in 5 seconds.
Evidence: Augment Code found early versions using "pattern-based grep-search" for context produced generic findings. Quality improved only after semantic retrieval + organizational context.
Mitigation:
- Repository knowledge layer provides repo-specific conventions, patterns, ownership
- Anti-patterns in prompts: "DO NOT comment on naming style, missing docs, import order, formatting"
- Conventions file (`.vcr/conventions.md`): client defines "in our repo we do X, not Y"
- Minimum severity in Layer 2: Quick Scan does not report below MEDIUM
Risk 7: AI Reviewing AI, Blind Spots
⚠️ Risk: AI Reviewing AI
AI-generated code has patterns that another AI model may not catch because it produces similar patterns itself. Over-engineering, unnecessary abstractions, and hallucinated APIs look "clean" to an AI reviewer.
Evidence: Veracode 2025 tested 100+ LLMs: 45% of AI-generated code contains OWASP vulnerabilities. Models' own tests caught none of them. Spotify's LLM-as-judge vetoes 25% of agent output, meaning 1 in 4 passes CI but is wrong.
Mitigation:
- Dedicated "AI-Code Safety" lens targeting specific AI patterns: hallucinated APIs, unnecessary abstractions, over-engineering, Factory/Builder where direct instantiation is convention
- Repository knowledge layer can compare new code complexity against module baseline
- Optional cross-model review: if code was generated by model A, review with model B
Risk 8: Cross-Cultural Interpretation
⚠️ Risk: Cross-Cultural Interpretation
"This code needs refactoring" lands differently for a senior in Krakow, a mid in London, and a junior in Bangalore. AI comments without cultural sensitivity can be demotivating or ignored.
Evidence: Shopify research found that feedback must be constructive, not evaluative, for distributed teams.
Mitigation:
- Structured finding format: always Problem → Why it matters → Concrete fix suggestion
- Configurable tone: `tone: direct | constructive | educational`. Default: `constructive`
- No blame attribution: VCR says "this code has..." never "you wrote..."
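The structured format can be rendered mechanically, which also guarantees the no-blame rule. A sketch, with the function name and tone prefixes as illustrative choices:

```python
def format_finding(problem: str, why: str, fix: str,
                   tone: str = "constructive") -> str:
    """Render a finding as Problem -> Why it matters -> Concrete fix,
    always phrased about the code, never about its author."""
    prefix = {"direct": "",
              "constructive": "Consider: ",
              "educational": "Background: "}[tone]
    return (f"**Problem:** this code has {problem}\n"
            f"**Why it matters:** {why}\n"
            f"**Suggested fix:** {prefix}{fix}")
```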
Risk Priority Matrix
| # | Risk | Severity | Likelihood | Primary mitigation |
|---|---|---|---|---|
| 2 | Cry Wolf | CRITICAL | HIGH | Hard cap, confidence threshold, silence is OK |
| 6 | Generic feedback | HIGH | HIGH | Repo context, anti-patterns in prompts, min severity |
| 1 | Non-determinism | HIGH | HIGH | Deterministic signals primary in risk classifier |
| 7 | AI reviewing AI | HIGH | MEDIUM | Dedicated AI-Code Safety lens |
| 3 | Large diff degradation | HIGH | MEDIUM | Chunk per-file, selective context, PR size warning |
| 4 | Prompt injection | CRITICAL | LOW–MED | Input sanitization, Layer 1 backstop |
| 5 | Cost explosion | MEDIUM | MEDIUM | Layered cost model, budget caps |
| 8 | Cross-cultural | MEDIUM | MEDIUM | Structured format, configurable tone |
⚠️ Most Dangerous Risks
Risks #2 (Cry Wolf) and #6 (Generic feedback) are the most dangerous. They lead to abandonment. If developers stop reading VCR comments, all other mitigations are irrelevant.