Layer 2 is the first AI contact with the code. It has two goals: (1) fast feedback on obvious problems, and (2) risk classification of the PR to determine whether the expensive Layer 3 runs.
Input
- `review-context.json` from Layer 0
- `layer1-results.json` from Layer 1
- Diff only (not full files) + ~50 lines of surrounding context per hunk
Prompt Structure
You are a code reviewer. Review this diff.
## Context
- PR: {title}, author: {author}
- Affected paths: {paths} (classification: {critical/sensitive/standard})
- Layer 1 findings: {summary}
- Repository knowledge: {ownership, recent changes, module stability}
## Tasks
1. RISK CLASSIFICATION: Assess PR risk (LOW/MEDIUM/HIGH/CRITICAL)
Signals: size, affected paths, complexity, test coverage delta, module stability
2. QUICK FINDINGS: Report obvious problems (max 5):
- Obvious bugs, missing error handling
- Missing tests for new code paths
- Copy-paste / dead code
- DO NOT comment on: naming style, missing docs, import order, formatting
(these are handled by Layer 1 linters)
3. AI-CODE DETECTION: Does this code appear AI-generated?
Signals: over-engineering, unnecessary abstractions, generic variable names
4. CIRCULAR TEST DETECTION: Do new tests mirror implementation logic
rather than testing against a specification?
## Output format
Respond with JSON matching the Layer 2 output schema:
- risk_classification: LOW | MEDIUM | HIGH | CRITICAL
- risk_signals: array of { signal, value, weight }
- findings: array of { severity, file, line, category, description,
suggestion, confidence }
- ai_generated: { detected: bool, confidence: float, signals: array }
- circular_tests: array of { test_file, test_name, reason }
Risk Classification Logic
Risk classification is primarily deterministic, with AI judgment as one input among several:
| Signal | Source | Weight |
|---|---|---|
| Path classification | Config (deterministic) | High |
| Diff size | Git (deterministic) | Medium |
| Coverage delta | CI (deterministic) | Medium |
| Module stability | Repository knowledge layer (deterministic) | Medium |
| AI-generated flag | Layer 2 AI detection | Medium |
| AI complexity assessment | Layer 2 AI judgment | Low |
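These weighted signals feed simple threshold rules. A minimal sketch of such a deterministic-first classifier follows; the signal names, parameter types, and exact thresholds are illustrative, not the production logic:

```python
def classify_risk(path_class: str, diff_lines: int,
                  coverage_delta: float, ai_generated: bool) -> str:
    """Deterministic-first risk classification.

    path_class: "critical" | "sensitive" | "standard" (from config)
    diff_lines: changed-line count (from git)
    coverage_delta: coverage change in percentage points (from CI)
    ai_generated: AI-detection flag (from Layer 2)
    """
    large_diff = diff_lines > 500          # threshold is illustrative
    coverage_drop = coverage_delta < -5.0  # ">5% drop" rule

    if path_class == "critical" and large_diff and coverage_drop:
        return "CRITICAL"
    if path_class == "critical" or (large_diff and path_class == "sensitive"):
        return "HIGH"
    if path_class == "sensitive" or ai_generated or coverage_drop:
        return "MEDIUM"
    return "LOW"
```

Because the AI judgment enters only as one boolean among several deterministic inputs, two runs over the same PR can disagree on the AI flag without flipping the classification in most cases.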
CRITICAL: critical path + large diff + coverage drop
HIGH: critical path OR (large diff + sensitive path)
MEDIUM: sensitive path OR AI-generated flag OR coverage drop >5%
LOW: small diff + standard/low_risk paths + coverage stable
Gate Decision
| Risk | Layer 3? | Estimated cost |
|---|---|---|
| LOW | Skip | ~$0.01–0.05 per PR |
| MEDIUM | Yes, standard depth | ~$0.10–0.50 per PR |
| HIGH | Yes, full depth + extra lenses | ~$0.50–2.00 per PR |
| CRITICAL | Yes, full depth + mandatory senior review flag | ~$0.50–2.00 per PR |
Client can override: "always run Layer 3 on all PRs" or "skip Layer 3 for docs/**".
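One way such overrides might be expressed in a client configuration file; the file path and key names here are hypothetical:

```yaml
# .vcr/config.yml (hypothetical) — gate overrides
layer3:
  always_run: false        # set true to run Layer 3 on every PR
  skip_paths:
    - "docs/**"            # never escalate documentation-only PRs
  max_daily_budget_usd: 50
```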
Risk Analysis: What Can Go Wrong
Layer 2 is the hinge of the system. It determines cost, depth, and trust. The following risks have been researched and mitigated in the design.
Risk 1: Non-determinism
⚠️ Risk: Non-determinism
LLMs produce different outputs for identical inputs, even at temperature=0. The same PR reviewed twice may receive different risk scores.
Evidence: Research measuring consistency across 5 identical runs found Claude Sonnet at 0.85 correlation, GPT-4o at 0.79. Subjective assessments (e.g., "maintainability") dropped to 0.53 correlation.
Mitigation: Risk classification uses deterministic signals as primary inputs (path classification, diff size, coverage delta). AI judgment is one signal among several, not the sole decider. For borderline cases near risk thresholds, optional consensus (2–3 runs, majority vote).
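The borderline-case consensus could be sketched as follows; `classify_once` is a hypothetical callable wrapping a single non-deterministic model run:

```python
from collections import Counter

def consensus_classify(classify_once, runs: int = 3) -> str:
    """Run the classifier several times and take a majority vote.
    Ties resolve to the most conservative (highest) risk level."""
    order = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]
    votes = Counter(classify_once() for _ in range(runs))
    best = max(votes.values())
    winners = [label for label, n in votes.items() if n == best]
    # Prefer the highest risk level among tied winners.
    return max(winners, key=order.index)
```

Resolving ties upward is deliberate: an occasional unnecessary Layer 3 run is cheaper than a missed escalation.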
Risk 2: Cry Wolf Effect, Developer Alert Fatigue
⚠️ Risk: Cry Wolf Effect
Too many comments lead developers to auto-dismiss everything, including real findings.
Evidence: CodeRabbit produces 8–20 comments per PR. After ~10 days, teammates auto-dismissed all of them. GitHub Copilot intentionally limits to 2–5 comments with 71% actionable rate and stays silent in 29% of cases. Industry rule: <30–40% action rate = noise.
Mitigation:
- Hard cap: Layer 2 max 5 findings, prioritized by severity
- Confidence threshold: do not report findings below 0.8 confidence
- Silence is OK: if nothing important, report "VCR: No issues found (risk: LOW)"
- Precision over recall: better to miss a LOW finding than report a false positive
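The cap and threshold reduce to a small post-filter over the model's raw findings. A minimal sketch, assuming findings are dicts matching the Layer 2 output schema:

```python
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def filter_findings(findings, max_findings=5, min_confidence=0.8):
    """Drop low-confidence findings, then keep at most `max_findings`,
    most severe first. An empty result means VCR stays silent."""
    confident = [f for f in findings if f["confidence"] >= min_confidence]
    confident.sort(key=lambda f: (SEVERITY_ORDER[f["severity"]],
                                  -f["confidence"]))
    return confident[:max_findings]
```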
Risk 3: Large Diff Degradation
⚠️ Risk: Large Diff Degradation
AI accuracy degrades on large PRs (>500 changed lines). The "lost in the middle" phenomenon means content at the beginning and end of context gets 85–95% accuracy, while the middle drops to 76–82%.
Evidence: Models with claimed 200K token context become unreliable around 130K tokens, with sudden performance drops.
Mitigation:
- Chunk strategy: for PRs >500 lines, Layer 2 analyzes per-file and aggregates at the end
- Repository knowledge layer provides targeted context (relevant files only, not everything)
- PR size warning: PRs >1000 lines get automatic recommendation to split
- Layer 3 uses selective context informed by dependency graph, not "entire module"
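The per-file chunking step is mechanical: a unified diff is split on its `diff --git` headers so each file can be reviewed in its own prompt and the findings aggregated afterwards. A minimal sketch:

```python
def split_diff_by_file(diff: str):
    """Split a unified diff into per-file chunks, keyed by file path,
    so each chunk fits comfortably inside the model's reliable window."""
    chunks, current = {}, None
    for line in diff.splitlines():
        if line.startswith("diff --git"):
            # e.g. "diff --git a/src/app.py b/src/app.py"
            current = line.split(" b/")[-1]
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return {path: "\n".join(lines) for path, lines in chunks.items()}
```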
Risk 4: Prompt Injection via PR Content
⚠️ Risk: Prompt Injection
Malicious or "creative" code comments, PR descriptions, or commit messages can contain instructions that manipulate the AI reviewer.
Evidence: Anthropic's own Claude Code Security Review action warns it is "not hardened against prompt injection." OWASP ranks prompt injection as #1 risk for LLMs. Every file, comment, and PR description is a potential injection surface.
Mitigation:
- Input sanitization: strip suspicious instruction patterns before passing to AI
- Segregated prompts: system prompt with hard rules ("NEVER skip security analysis") separated from user content
- Canary checks: include known-bad patterns in test. If AI misses them, injection likely occurred.
- Layer 1 as backstop: deterministic SAST/secret scan is immune to prompt injection. Even if Layer 2 AI is manipulated, Layer 1 still blocks.
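The canary check can be illustrated with a sketch: a deliberately vulnerable snippet is planted alongside the real diff, and if the reviewer fails to flag it, its instructions may have been overridden by injected content. The snippet and matching logic here are illustrative:

```python
# A known-bad snippet planted alongside the real diff (illustrative).
CANARY = 'query = "SELECT * FROM users WHERE id = " + user_id'

def canary_caught(reported_findings) -> bool:
    """True if the reviewer flagged the planted vulnerability.
    False suggests the review was manipulated and should be discarded."""
    return any("sql injection" in f["description"].lower()
               for f in reported_findings)
```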
Risk 5: Cost Explosion at Scale
⚠️ Risk: Cost Explosion
Enterprise with 200 developers at 1–2 PRs/day = 200–400 PRs/day. Deep review at $0.50–2.00/PR scales to $2,000–16,000/month.
Evidence: Claude Code Review averages $15–25 per PR (full agentic review). At 100 PRs/day, monthly cost reaches $45,000–75,000.
Mitigation:
- Layered cost model: Layer 2 (small/fast model) at $0.01–0.05. Layer 3 (capable model) only for MEDIUM+ risk
- Budget caps in configuration: `max_daily_layer3_budget: $50`
- Model tiering: fast model for Layer 2, capable model for Layer 3 standard, most capable model for CRITICAL
- Prompt caching reduces cost of repeated context
- Track cost-per-useful-finding, not cost-per-PR
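Enforcing the daily cap amounts to a small reservation check before each Layer 3 run. A minimal sketch, with the default limit and class name chosen for illustration:

```python
class Layer3Budget:
    """Daily spending cap for Layer 3: once exhausted, the gate falls
    back to Layer 2-only review and flags the PR for human attention."""

    def __init__(self, daily_limit_usd: float = 50.0):
        self.daily_limit = daily_limit_usd
        self.spent_today = 0.0  # reset by a daily scheduler (not shown)

    def try_reserve(self, estimated_cost: float) -> bool:
        """Reserve budget for one Layer 3 run; False means skip it."""
        if self.spent_today + estimated_cost > self.daily_limit:
            return False
        self.spent_today += estimated_cost
        return True
```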
Risk 6: Generic / Surface-Level Feedback
⚠️ Risk: Generic Feedback
Without repo context, AI falls back to generic "code reviewer" mode, commenting on naming, suggesting docstrings, flagging missing type hints. Nothing a senior would not see in 5 seconds.
Evidence: Augment Code found early versions using "pattern-based grep-search" for context produced generic findings. Quality improved only after semantic retrieval + organizational context.
Mitigation:
- Repository knowledge layer provides repo-specific conventions, patterns, ownership
- Anti-patterns in prompts: "DO NOT comment on naming style, missing docs, import order, formatting"
- Conventions file (`.vcr/conventions.md`): client defines "in our repo we do X, not Y"
- Minimum severity in Layer 2: Quick Scan does not report below MEDIUM
Risk 7: AI Reviewing AI, Blind Spots
⚠️ Risk: AI Reviewing AI
AI-generated code has patterns that another AI model may not catch because it produces similar patterns itself. Over-engineering, unnecessary abstractions, and hallucinated APIs look "clean" to an AI reviewer.
Evidence: Veracode 2025 tested 100+ LLMs: 45% of AI-generated code contains OWASP vulnerabilities. Models' own tests caught none of them. Spotify's LLM-as-judge vetoes 25% of agent output, meaning 1 in 4 passes CI but is wrong.
Mitigation:
- Dedicated "AI-Code Safety" lens targeting specific AI patterns: hallucinated APIs, unnecessary abstractions, over-engineering, Factory/Builder where direct instantiation is convention
- Repository knowledge layer can compare new code complexity against module baseline
- Optional cross-model review: if code was generated by model A, review with model B
Risk 8: Cross-Cultural Interpretation
⚠️ Risk: Cross-Cultural Interpretation
"This code needs refactoring" lands differently for a senior in Krakow, a mid in London, and a junior in Bangalore. AI comments without cultural sensitivity can be demotivating or ignored.
Evidence: Shopify research found that feedback must be constructive, not evaluative, for distributed teams.
Mitigation:
- Structured finding format: always Problem → Why it matters → Concrete fix suggestion
- Configurable tone: `tone: direct | constructive | educational`. Default: `constructive`
- No blame attribution: VCR says "this code has..." never "you wrote..."
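The structured format can be rendered mechanically, which also guarantees the no-blame rule. A sketch, with the function name and tone prefixes as illustrative choices:

```python
def format_finding(problem: str, why: str, fix: str,
                   tone: str = "constructive") -> str:
    """Render a finding as Problem -> Why it matters -> Concrete fix,
    always phrased about the code, never about its author."""
    prefix = {"direct": "",
              "constructive": "Consider: ",
              "educational": "Background: "}[tone]
    return (f"**Problem:** this code has {problem}\n"
            f"**Why it matters:** {why}\n"
            f"**Suggested fix:** {prefix}{fix}")
```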
Risk Priority Matrix
| # | Risk | Severity | Likelihood | Primary mitigation |
|---|---|---|---|---|
| 2 | Cry Wolf | CRITICAL | HIGH | Hard cap, confidence threshold, silence is OK |
| 6 | Generic feedback | HIGH | HIGH | Repo context, anti-patterns in prompts, min severity |
| 1 | Non-determinism | HIGH | HIGH | Deterministic signals primary in risk classifier |
| 7 | AI reviewing AI | HIGH | MEDIUM | Dedicated AI-Code Safety lens |
| 3 | Large diff degradation | HIGH | MEDIUM | Chunk per-file, selective context, PR size warning |
| 4 | Prompt injection | CRITICAL | LOW–MED | Input sanitization, Layer 1 backstop |
| 5 | Cost explosion | MEDIUM | MEDIUM | Layered cost model, budget caps |
| 8 | Cross-cultural | MEDIUM | MEDIUM | Structured format, configurable tone |
⚠️ Most Dangerous Risks
Risks #2 (Cry Wolf) and #6 (Generic feedback) are the most dangerous. They lead to abandonment. If developers stop reading VCR comments, all other mitigations are irrelevant.