Zarif Automates

How to Build an AI Agent for Code Review

Zarif
Updated April 6, 2026

Code review is where 62% of bugs slip through before hitting production. Most teams rely on humans reading code line by line—slow, inconsistent, and exhausting. But you don't have to settle for that.

AI code review agents work. CodeRabbit catches security issues in seconds. Qodo finds runtime bugs with 85% accuracy. PR-Agent runs in your CI/CD. But here's what most articles skip: these tools work well for what they're designed to do, but they don't know your codebase's architecture, your team's conventions, or where your code is fragile.

That's where building your own agent matters.

Definition

An AI code review agent is an autonomous system that analyzes pull requests, retrieves relevant context from your codebase, and surfaces bugs, security issues, and style violations before a human reviewer looks at the change. It combines code understanding (semantic search, AST parsing), retrieval strategies (context augmentation), and reasoning to make better recommendations than static tools alone.

TL;DR

  • 84% of developers use AI coding tools; AI code review adoption is projected to pass 60% by 2027
  • AI code review reduces review time by 62%, but AI-authored code has 1.7x more issues—hybrid review (AI + human) is the real win
  • Custom agents outperform pre-built tools by understanding your architecture, security model, and codebase conventions
  • Webhook-based + event-driven patterns scale better than CI/CD blocking
  • Semantic code search (AST + vector embeddings) improves factual correctness by 8% over keyword search alone
  • Implementation uses LangGraph, FastAPI, and your Git host's webhook API

Why Build Your Own Code Review Agent?

The pre-built tools (CodeRabbit, Qodo, Snyk) are solid. They catch obvious bugs, enforce lint rules, flag security patterns. But they're generic. They don't know that your service is stateless and shouldn't maintain session state. They don't understand that your team's PR naming convention tells the agent what type of changes to expect. They can't weight your most fragile modules differently.

Custom agents let you do this:

Understand your architecture. Embed your microservice boundaries, dependency rules, and threat model into the agent's reasoning. When an API handler talks directly to the database instead of through a repository layer, the agent flags it because it violates YOUR architecture, not some generic pattern.

Integrate tribal knowledge. Your team knows which modules have had security issues, which are performance-critical, which require extra scrutiny. A custom agent can weight reviews accordingly.

Control the workflow. Many pre-built tools are set up to run inside your CI/CD and block merges. A custom agent can run async, comment on PRs, and escalate to humans only when it detects high-risk changes.

Reduce false positives. Generic rules raise false alarms because they match surface patterns. Semantic code search works on code meaning instead, and improves factual correctness by 8% over keyword search alone.

Here's what the numbers show: AI-authored code has 10.83 issues per 100 lines versus 6.45 for human-only code. But AI-assisted review (AI + human feedback loop) drops that to 4.2. Your custom agent's job is to be that assistant, not the final reviewer.

Architecture Patterns: Which One to Use?

You have three main patterns. Pick based on your infrastructure and risk tolerance.

Pattern 1: Webhook-Based (Fastest to Ship)

GitHub fires a webhook when a PR opens. Your FastAPI server receives it, spins up the agent, and posts comments back. The agent runs in parallel with development; it doesn't block merges.

Pros:

  • Real-time feedback (agent responds within 30 seconds of PR opening)
  • Non-blocking (developers keep working)
  • Simple to debug (one process, easy logs)

Cons:

  • If the agent is slow, comments arrive late (not helpful for fast-moving teams)
  • Requires always-on server

Best for: Teams with 10-50 engineers, moderate PR velocity, willing to run a small server.

Pattern 2: CI/CD Pipeline Integration

Your CI workflow triggers the agent, agent comments, then CI reports pass/fail. Blocks merge if issues are critical.

Pros:

  • Blocks bad code at the gate
  • Integrated with existing CI signals
  • Familiar to teams already using GitHub Actions

Cons:

  • Slows down merge process (agent runtime + CI overhead)
  • Coupling review to the merge gate makes false positives expensive (stricter rules = more blocked merges)

Best for: Teams with strict compliance requirements (fintech, healthcare), strong DevOps culture.

Pattern 3: Event-Driven Microservices

Webhook → Message queue (RabbitMQ, SQS) → Worker pool → Agent pool → Results storage → GitHub API call. Scales horizontally.

Pros:

  • Handles 1000+ PRs per day without degradation
  • Workers scale independently
  • Decoupled (queue absorbs spikes)

Cons:

  • Operational overhead (queue, workers, monitoring)
  • Debugging is harder (distributed tracing needed)

Best for: Companies with 500+ engineers, high PR velocity, mature DevOps.
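
The queue-and-worker flow in Pattern 3 can be sketched inside a single process. This is only an illustration, assuming Python's standard-library queue.Queue stands in for RabbitMQ or SQS and that review_pr is a hypothetical placeholder for the real agent call:

```python
import queue
import threading

def review_pr(pr_number: int) -> str:
    """Hypothetical stand-in for running the review agent on one PR."""
    return f"reviewed PR {pr_number}"

def run_workers(pr_numbers: list[int], worker_count: int = 3) -> dict[int, str]:
    """Push PR numbers onto a queue and drain it with a pool of worker threads."""
    jobs: queue.Queue = queue.Queue()
    results: dict[int, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            pr_number = jobs.get()
            if pr_number is None:  # sentinel: no more work for this worker
                return
            outcome = review_pr(pr_number)
            with lock:  # results is shared across threads
                results[pr_number] = outcome

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for pr_number in pr_numbers:  # the webhook handler would enqueue here
        jobs.put(pr_number)
    for _ in threads:  # one sentinel per worker
        jobs.put(None)
    for t in threads:
        t.join()
    return results
```

In a real deployment the queue lives outside the process, which is what lets workers scale independently and absorb spikes.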

Most teams should start with Pattern 1 (webhook + FastAPI). It's the sweet spot: simple, effective, and enough for teams of up to about 50 engineers.

Step 1: Define Your Review Rules and Context Strategy

Before you write code, define what your agent actually cares about. Don't try to review everything on day one.

Start narrow. Pick one category:

  • Security: SQL injection, auth bypass, exposed secrets
  • Performance: N+1 queries, unnecessary loops, memory leaks
  • Architecture: Layer violations, contract breaches, dependency inversions
  • Style: Naming conventions, test coverage, documentation

Write 5-10 rules for that category. Example rules for a Python API:

1. If a function in handlers/ queries the database directly (not through repository layer), flag it
2. If a secret (API key, password) appears in code (not env config), block and alert human
3. If a new endpoint has no rate limiting, flag as security risk
4. If a query loops over results and calls database per row, flag N+1
5. If a change touches auth/* but has no test addition, flag as risky

These rules are your agent's "constitution." They should reflect your actual risk model, not generic best practices.
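
Some rules are cheap to check deterministically before the LLM is involved. As a rough sketch of rule 5, assuming test files live under tests/ or are named test_*.py (the Finding class and the function name are illustrative, not part of any library):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    rule: str
    severity: str
    message: str

def check_auth_changes_have_tests(changed_paths: list[str]) -> list[Finding]:
    """Rule 5: flag PRs that touch auth/ without touching any test file."""
    touches_auth = any(p.startswith("auth/") for p in changed_paths)
    # Assumption: test files live under tests/ or are named test_*.py
    touches_tests = any(
        p.startswith("tests/") or p.rsplit("/", 1)[-1].startswith("test_")
        for p in changed_paths
    )
    if touches_auth and not touches_tests:
        return [Finding("auth-needs-tests", "high",
                        "Change touches auth/ but adds or updates no tests")]
    return []
```

Checks like this keep the LLM prompt focused on the rules that actually need reasoning.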

Define your context strategy. The agent needs access to:

  • Related files (imports, dependencies)
  • Similar code patterns (for consistency)
  • Architecture documentation (your threat model)
  • Recent commits (to understand intent)

For a Python agent, this looks like:

When analyzing PR:
1. Extract files changed
2. Parse imports to find related modules
3. Fetch last 3 commits to those modules
4. Vector search for similar patterns in codebase
5. Look up module in architecture registry
6. Inject all of this as context into the LLM prompt
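
Step 2 of that list (parsing imports) needs nothing beyond Python's standard ast module. A minimal sketch, assuming the changed files are Python and that relative imports can be ignored for a first version:

```python
import ast

def extract_imports(source: str) -> list[str]:
    """Return the module names imported by a Python source string."""
    modules: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports such as "from . import x" have no module name
            modules.append(node.module)
    # Drop duplicates while keeping first-seen order
    return list(dict.fromkeys(modules))
```

The returned module names still have to be mapped to file paths in your repository before you can fetch the related code.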

The retrieval step is critical. It's where you beat generic tools. Pre-built tools can't do this because they don't have access to your codebase internals.

Tip

Use hybrid retrieval: combine AST-based structural search (exact imports, function calls) with vector embeddings (semantic similarity). This improves factual correctness by 8% over keyword search alone.

Step 2: Set Up Your Semantic Code Search Index

You need a fast way to search your codebase by meaning, not just keywords. This is what separates good custom agents from bad ones.

Option A: Cheap and Fast (pgvector + PostgreSQL)

Embed your codebase at build time using OpenAI or Claude embeddings. Store vectors in pgvector. Search with cosine similarity.

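# 1. Index your codebase (run in a shell)
#    python scripts/embed_codebase.py --output-db postgres://...

# 2. Query at review time (Python). Assumes conn is an open psycopg 3 connection
#    and query_embedding is a list of floats from the same embedding model.
vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
results = conn.execute(
    """
    SELECT file, content, 1 - (embedding <=> %(q)s::vector) AS similarity
    FROM code_chunks
    WHERE 1 - (embedding <=> %(q)s::vector) > 0.75
    ORDER BY similarity DESC
    LIMIT 10
    """,
    {"q": vec},  # <=> is pgvector's cosine distance, so 1 - distance = similarity
).fetchall()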

Cost: Free for small codebases (<100K lines). $5-20/month for larger ones.
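
What you embed matters as much as where you store it. One common choice is one chunk per top-level function or class. A sketch of that chunking step for Python files, using only the standard library (the chunk field names are assumptions chosen to match the code_chunks table above):

```python
import ast

def chunk_python_source(path: str, source: str) -> list[dict]:
    """Split a Python file into one chunk per top-level function or class."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "file": path,
                "name": node.name,
                "start_line": node.lineno,
                # Exact source text of the definition, without decorators
                "content": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk's content is what you would pass to the embedding model; the file and start_line fields let the agent cite where a similar pattern lives.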

Option B: Production Grade (Pinecone, Weaviate)

Use a dedicated vector DB if you're indexing millions of lines.

# 1. Embed at CI time. Assumes index is a Pinecone index handle and embeddings
#    yields (chunk_id, embedding, code, path) tuples.
index.upsert(
  vectors=[
    (chunk_id, embedding, {"file": path, "content": code})
    for chunk_id, embedding, code, path in embeddings
  ]
)

# 2. Query at review time
results = index.query(vector=query_embedding, top_k=10, include_metadata=True)

Cost: $0.07 per 1M tokens.

For most teams, pgvector is the move. It's fast (10ms query time), cheap, and integrates seamlessly with your database.

Here's the workflow:

1. PR opens → webhook fires
2. Agent fetches changed files from GitHub
3. For each changed file:
   a. Vector search for similar patterns in codebase
   b. AST parse to find all imports and function calls
   c. Fetch full content of imported modules
4. Build context: [changed code] + [similar patterns] + [related modules] + [your rules]
5. Send to Claude or GPT-4 with your review prompt
6. Parse response, post comments to PR

Warning

Embedding outdated code costs money and produces bad results. Rebuild your index every time the main branch updates. In your CI/CD, add a step: python scripts/embed_codebase.py after every successful merge to main.
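
Step 4 of the workflow above (building the context) is where prompts tend to balloon. One way to keep them bounded is a simple size budget. This sketch assumes a character count is a good enough proxy for tokens and that related modules matter more than similar patterns; both are judgment calls, and the function name is illustrative:

```python
def build_context(changed_code: str, similar: list[str], related: list[str],
                  rules: str, max_chars: int = 12000) -> str:
    """Assemble prompt context, always keeping rules and the diff, trimming the rest."""
    sections = [("RULES", rules), ("CHANGED CODE", changed_code)]
    used = sum(len(text) for _, text in sections)
    # Optional material is added in priority order while it fits the budget
    for label, items in (("RELATED MODULE", related), ("SIMILAR PATTERN", similar)):
        for item in items:
            if used + len(item) > max_chars:
                break  # stop adding items of this kind once one does not fit
            sections.append((label, item))
            used += len(item)
    return "\n\n".join(f"# {label}\n{text}" for label, text in sections)
```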

Step 3: Build the Agent with LangGraph

Use LangGraph (open source, 10,500 GitHub stars) to orchestrate your agent. It handles state management, looping, and error recovery.

Here's a minimal agent:

from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Annotated
import operator

class CodeReviewState(TypedDict):
    pr_number: int
    files_changed: list[str]
    file_contents: dict[str, str]
    context: str
    review_comments: Annotated[list[str], operator.add]
    issues_found: Annotated[list[dict], operator.add]

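def fetch_pr_files(state: CodeReviewState) -> dict:
    """Fetch changed files from GitHub API"""
    # github_client is an assumed wrapper that returns each file's full content;
    # GitHub's "list pull request files" endpoint itself returns a patch, not the file body
    files = github_client.get_pr_files(state["pr_number"])
    contents = {f["filename"]: f["content"] for f in files}
    # Return only the keys this node updates; LangGraph merges them into the state
    return {"files_changed": list(contents.keys()), "file_contents": contents}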

def retrieve_context(state: CodeReviewState) -> CodeReviewState:
    """Semantic search + AST to build context"""
    context_parts = []

    for file_path in state["files_changed"]:
        # Vector search for similar code
        similar = vector_db.search(state["file_contents"][file_path], top_k=5)
        context_parts.append(f"# Similar patterns in codebase:\n{similar}")

        # AST analysis for imports
        imports = extract_imports(state["file_contents"][file_path])
        related_code = {imp: state["file_contents"].get(imp, "NOT FOUND") for imp in imports}
        context_parts.append(f"# Related imports:\n{related_code}")

    # Return only the key this node updates
    return {"context": "\n".join(context_parts)}

def analyze_with_llm(state: CodeReviewState) -> CodeReviewState:
    """Call Claude to generate review"""
    prompt = f"""
You are an expert code reviewer. Review this PR against these rules:
{YOUR_REVIEW_RULES}

Changed files:
{state['files_changed']}

Context (similar code, related modules):
{state['context']}

Full code:
{state['file_contents']}

For each issue found, respond with:
- Issue: [brief description]
- Severity: [critical|high|medium|low]
- Location: [file:line]
- Suggestion: [how to fix]

Be specific. Reference your rules and the code exactly.
"""

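    response = claude_client.messages.create(
        model="claude-opus-4",  # placeholder: use an exact model ID from Anthropic's model list
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}]
    )

    issues = parse_issues(response.content[0].text)
    comments = format_comments(issues)

    # With operator.add reducers, return only the new items; they are appended to the state
    return {"issues_found": issues, "review_comments": comments}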

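def post_comments(state: CodeReviewState) -> dict:
    """Post review comments to PR"""
    for comment in state["review_comments"]:
        github_client.create_review_comment(state["pr_number"], comment)
    # Return no updates: returning the full state would re-append both
    # reducer-backed lists and duplicate every issue and comment
    return {}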

# Build graph
graph = StateGraph(CodeReviewState)
graph.add_node("fetch_files", fetch_pr_files)
graph.add_node("retrieve_context", retrieve_context)
graph.add_node("analyze", analyze_with_llm)
graph.add_node("post", post_comments)

graph.add_edge(START, "fetch_files")
graph.add_edge("fetch_files", "retrieve_context")
graph.add_edge("retrieve_context", "analyze")
graph.add_edge("analyze", "post")
graph.add_edge("post", END)

agent = graph.compile()

Run it:

result = agent.invoke({
    "pr_number": 1234,
    "files_changed": [],
    "file_contents": {},
    "context": "",
    "review_comments": [],
    "issues_found": []
})

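This agent runs each step in sequence, retrieves context, calls Claude, and posts results. It's not fancy, but it works once you supply the helpers it references: github_client, vector_db, claude_client, extract_imports, parse_issues, format_comments, and YOUR_REVIEW_RULES.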
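
The agent calls parse_issues without defining it. One possible version, assuming the model follows the "- Field: value" format requested in the prompt (real responses drift, so treat this as a starting point rather than a robust parser):

```python
import re

FIELD = re.compile(r"^\s*-\s*(Issue|Severity|Location|Suggestion):\s*(.+)$")

def parse_issues(text: str) -> list[dict]:
    """Parse '- Field: value' lines into one dict per issue."""
    issues: list[dict] = []
    current: dict = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue  # ignore prose around the structured lines
        key, value = match.group(1).lower(), match.group(2).strip()
        if key == "issue" and current:
            issues.append(current)  # a new "Issue" line starts the next record
            current = {}
        current[key] = value
    if current:
        issues.append(current)
    for issue in issues:
        # Split "path/to/file.py:42" into file and line when the line is numeric
        file, _, line_no = issue.get("location", "").rpartition(":")
        if file and line_no.isdigit():
            issue["file"], issue["line"] = file, int(line_no)
    return issues
```

Asking the model for JSON instead of bullet lines would make parsing stricter, at the cost of changing the prompt shown above.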

Step 4: Deploy as a Webhook Server

Wrap your agent in FastAPI:

from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

class GitHubWebhook(BaseModel):
    action: str
    pull_request: dict
    repository: dict

def run_review(pr_number: int) -> None:
    """Run the agent; its final node posts the comments to the PR."""
    agent.invoke({
        "pr_number": pr_number,
        "files_changed": [],
        "file_contents": {},
        "context": "",
        "review_comments": [],
        "issues_found": []
    })

@app.post("/webhooks/github")
async def handle_pr(payload: GitHubWebhook, background_tasks: BackgroundTasks):
    # payload is a Pydantic model: top-level fields are attributes, not dict keys
    if payload.action != "opened":
        return {"status": "ignored"}

    pr_number = payload.pull_request["number"]
    repo = payload.repository["full_name"]

    # agent.invoke() blocks, so run it after the response has been sent
    background_tasks.add_task(run_review, pr_number)

    return {"status": "queued", "repo": repo, "pr": pr_number}

Deploy to your server (Heroku, Railway, your own VPS):

# Install dependencies
pip install fastapi uvicorn langgraph anthropic pyyaml

# Run server
uvicorn app:app --host 0.0.0.0 --port 8000

Add the webhook URL to GitHub:

  1. Go to your repo → Settings → Webhooks
  2. Add webhook: https://your-domain.com/webhooks/github
  3. Set the content type to application/json (the handler parses a JSON body) and set a secret
  4. Select "Pull requests" events
  5. Save

Now every PR that opens triggers your agent.

Tip

Add a timeout. If your agent takes more than 30 seconds, post a comment anyway: "Review in progress. I'll update this comment when done." Finish async and edit the comment. Users hate stale feedback.

Step 5: Integrate Your Codebase Knowledge

This is where custom agents beat pre-built tools. Inject your architecture and conventions.

Option 1: Architecture Registry (YAML)

Create docs/architecture.yaml:

modules:
  handlers:
    rules:
      - must_use: repository_layer
      - must_have: rate_limiting
      - must_test: all_changes
    risk_level: high

  repository:
    rules:
      - must_not: call_handlers
      - must_use: typed_queries
    risk_level: medium

  utils:
    rules:
      - must_be: pure_functions
    risk_level: low

security:
  threat_model:
    - api_injection
    - auth_bypass
    - secrets_in_code

  review_extra_strict:
    - auth/
    - payments/
    - admin/

In your agent, load this and inject it into the LLM prompt:

import yaml

with open("docs/architecture.yaml") as f:
    architecture = yaml.safe_load(f)

# In analyze_with_llm():
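# Dump back to YAML so the model sees readable rules rather than a Python dict repr
prompt += f"\n\nArchitecture rules:\n{yaml.safe_dump(architecture, sort_keys=False)}"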
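
Injecting the whole registry into every prompt works for small files. As the registry grows, you may prefer to look up only the rules that apply to each changed path. A sketch, assuming the first path segment names the module (an assumption about repository layout, not something the YAML enforces):

```python
def rules_for_path(architecture: dict, path: str) -> dict:
    """Look up module rules and strictness for a changed file path."""
    # Assumption: the first path segment names the module (e.g. handlers/users.py)
    module = path.split("/", 1)[0]
    entry = architecture.get("modules", {}).get(module, {})
    strict_prefixes = architecture.get("security", {}).get("review_extra_strict", [])
    return {
        "module": module,
        "rules": entry.get("rules", []),
        "risk_level": entry.get("risk_level", "unknown"),
        "extra_strict": any(path.startswith(prefix) for prefix in strict_prefixes),
    }

# Same shape yaml.safe_load() returns for docs/architecture.yaml (abridged)
architecture = {
    "modules": {"handlers": {"rules": [{"must_use": "repository_layer"}], "risk_level": "high"}},
    "security": {"review_extra_strict": ["auth/", "payments/", "admin/"]},
}
```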

Option 2: Semantic Rules Engine

Instead of hardcoded rules, use embeddings to find which rules are relevant to a change:

# At index time, embed your architecture docs
architecture_embeddings = {
    "stateless": embed("API handlers must be stateless. Use dependency injection for state."),
    "layer_separation": embed("Handlers should not query the database directly. Use repository layer."),
}

# At review time, for each change:
change_embedding = embed(changed_code)
relevant_rules = []

for rule_name, rule_embedding in architecture_embeddings.items():
    similarity = cosine_similarity(change_embedding, rule_embedding)
    if similarity > 0.8:  # High relevance; tune the threshold for your embedding model
        relevant_rules.append({"rule": rule_name, "similarity": similarity})

This is more flexible than keyword matching and adapts to paraphrasing. High similarity only means a rule applies to the change, so pass the matching rules to the LLM and let it decide whether the change violates them.
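
The snippet above relies on a cosine_similarity helper. A dependency-free version, in case you do not want to pull in NumPy just for this (returning 0.0 for all-zero vectors is a choice made here, not a standard):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```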

Step 6: Handle Edge Cases and Errors

Real agents need error handling:

import time

import anthropic

MAX_ATTEMPTS = 3

def analyze_with_llm(state: CodeReviewState) -> dict:
    contents = dict(state["file_contents"])
    for _ in range(MAX_ATTEMPTS):
        try:
            # run_llm_review is an assumed helper wrapping the Step 3 body
            # (prompt, Claude call, parsing); it returns the node's state update
            return run_llm_review(state, contents)

        except anthropic.RateLimitError:
            # Back off, then retry on the next loop iteration
            time.sleep(60)

        except anthropic.BadRequestError:
            # A 400 can mean the prompt is too large. Summarize each large
            # file on its own, then retry.
            contents = {k: summarize_large_file(v) if len(v) > 20000 else v
                        for k, v in contents.items()}

        except Exception as e:
            # Post failure comment and alert
            github_client.create_review_comment(
                state["pr_number"],
                f"Code review agent failed: {str(e)}. Check logs."
            )
            logger.error(f"Review failed for PR {state['pr_number']}: {e}")
            return {}

    logger.error(f"Review gave up for PR {state['pr_number']} after {MAX_ATTEMPTS} attempts")
    return {}

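Also set a timeout. If the agent runs longer than 5 minutes, stop waiting and retry later. Run the review in a background task so the webhook response is not held open; GitHub expects a reply within 10 seconds:

import asyncio

from fastapi import BackgroundTasks

REVIEW_TIMEOUT_SECONDS = 300  # 5 minute timeout

async def review_with_timeout(pr_number: int) -> None:
    initial_state = {"pr_number": pr_number, "files_changed": [], "file_contents": {},
                     "context": "", "review_comments": [], "issues_found": []}
    try:
        # agent.invoke() blocks, so run it in a worker thread
        await asyncio.wait_for(asyncio.to_thread(agent.invoke, initial_state),
                               timeout=REVIEW_TIMEOUT_SECONDS)
    except asyncio.TimeoutError:
        # Waiting stops here, but Python cannot kill the worker thread itself
        logger.warning(f"Review timeout for PR {pr_number}")
        # Retry later or post "review timed out" comment

@app.post("/webhooks/github")
async def handle_pr(payload: GitHubWebhook, background_tasks: BackgroundTasks):
    background_tasks.add_task(review_with_timeout, payload.pull_request["number"])
    return {"status": "queued"}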

Comparing Pre-Built Tools vs. Custom Agents

Feature | CodeRabbit | Qodo PR Agent | PR-Agent (OSS) | Custom Agent
Bug detection rate | 40-50% | 85% (F1: 60.1%) | 35-45% | 70-80%*
Security focus | General | General | General | YOUR threat model
Knows your architecture | No | No | No | Yes (configurable)
Cost (per developer) | $24-30/mo | $30+/mo | Free | $5-50/mo* (LLM usage)
Setup time | 5 min | 5 min | 30 min | 2-4 weeks
Customization | Limited (UI config) | Limited (UI config) | Full (open source) | Full
Multi-language support | 20+ languages | 15+ languages | 10+ languages | Any (depends on your model)

*Custom agent rates assume good architecture definition and semantic search. Results vary widely.

Pick pre-built if: You want it running today with zero operational overhead. CodeRabbit and Qodo are solid.

Pick custom if: You have a specific threat model, architectural patterns your team enforces, or you're willing to invest 2-4 weeks for better detection that understands your codebase.

Most teams should start with CodeRabbit or Qodo, then migrate to custom if the generic rules don't fit your needs.

Real-World Optimization: Reducing False Positives

The biggest complaint about AI code review: too many false positives. Your agent flags things that aren't actually problems.

Tactic 1: Raise the threshold for what counts as an issue

Instead of every potential problem, only flag high-confidence issues:

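issues = parse_issues(response.content[0].text)

# Filter to high-confidence only. The score is self-reported by the model, so the
# review prompt must ask for a confidence value (0.0-1.0) and parse_issues must read it
issues = [
    issue for issue in issues
    if issue.get("confidence", 0.0) >= 0.85
]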

Tactic 2: Use context to eliminate false positives

If the agent finds a potential issue, check if it was intentional:

# Agent flags: "Database called in handler (layer violation)"
# But check: does the handler have a comment explaining why?

file_content = state["file_contents"][issue["file"]]
context_around_issue = extract_lines(file_content,
                                     issue["line"] - 3,
                                     issue["line"] + 3)

if "TODO:" in context_around_issue or "HACK:" in context_around_issue:
    issue["severity"] = "low"  # Developer already knows about it
    issue["skip"] = True

Tactic 3: Weight severity by module

Don't treat all violations equally:

SEVERITY_WEIGHTS = {
    "handlers/auth": 2.0,      # Double severity in auth code
    "handlers/payments": 2.0,
    "utils/": 0.5,             # Half severity in utils
}

for issue in issues:
    # Assumes each issue already carries a numeric severity_score
    for module, weight in SEVERITY_WEIGHTS.items():
        if issue["file"].startswith(module):
            issue["severity_score"] *= weight
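
The review prompt returns severity as a label, so the numeric score has to come from somewhere. One way to derive it, assuming a 1-4 scale for the labels (the scale is an arbitrary choice) and using the longest matching path prefix so that overlapping entries do not compound:

```python
BASE_SCORES = {"critical": 4.0, "high": 3.0, "medium": 2.0, "low": 1.0}  # assumed scale

SEVERITY_WEIGHTS = {
    "handlers/auth": 2.0,
    "handlers/payments": 2.0,
    "utils/": 0.5,
}

def severity_score(issue: dict) -> float:
    """Base score for the severity label, scaled by the longest matching path prefix."""
    base = BASE_SCORES.get(issue.get("severity", "low"), 1.0)
    matches = [p for p in SEVERITY_WEIGHTS if issue.get("file", "").startswith(p)]
    weight = SEVERITY_WEIGHTS[max(matches, key=len)] if matches else 1.0
    return base * weight
```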

Apply these tactics and your false positive rate should drop. Confirm it by logging which flags reviewers dismiss and comparing the rate before and after.

Warning

Never auto-merge based on agent approval alone. Qodo's research shows AI-authored code has 1.7x more issues than human-written code. Your agent is an assistant, not a gatekeeper. Always require human review for merges.

Security Considerations

Your agent has GitHub access. That's powerful and dangerous.

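  1. Use fine-grained tokens: Create a fine-grained GitHub personal access token scoped to the repositories under review, with only these repository permissions:

    • Contents: Read-only
    • Pull requests: Read and write (write access is what lets the agent post review comments)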
  2. Never log code: Your logs might be searchable by others. Never log the full changed code. Log only the file path and line number.

  3. Verify webhook signatures: GitHub signs each payload with your webhook secret and sends the HMAC-SHA256 digest in the X-Hub-Signature-256 header. Recompute it and compare:

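import hashlib
import hmac
import os

from fastapi import HTTPException, Request

# The environment variable name is an assumption; load the secret however you manage config
WEBHOOK_SECRET = os.environ["GITHUB_WEBHOOK_SECRET"]

@app.post("/webhooks/github")
async def handle_pr(request: Request):
    # Default to "" so a missing header fails the comparison instead of raising
    signature = request.headers.get("X-Hub-Signature-256", "")
    body = await request.body()

    expected = "sha256=" + hmac.new(
        WEBHOOK_SECRET.encode(),
        body,
        hashlib.sha256
    ).hexdigest()

    # Constant-time comparison on bytes
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        # Returning a (dict, 401) tuple would still send HTTP 200 in FastAPI
        raise HTTPException(status_code=401, detail="invalid signature")

    # ... handle PR ...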
  4. Limit agent scope: Your agent shouldn't access secrets. It should only see code, not .env files or credential files. GitHub permissions apply per repository, not per file, so filter sensitive paths in the agent before any file content reaches the prompt.
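
For the last item, a path filter applied right after fetching the PR files is usually enough. A sketch using the standard-library fnmatch module; the deny-list patterns are assumptions and should be adapted to how your repositories store credentials:

```python
from fnmatch import fnmatch

# Assumed deny-list; extend it to match your own repositories
SENSITIVE_PATTERNS = [".env", ".env.*", "*.pem", "*.key", "secrets/*", "*credentials*"]

def is_sensitive(path: str) -> bool:
    """True if the full path or its file name matches a deny-list pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(path, pat) or fnmatch(name, pat) for pat in SENSITIVE_PATTERNS)

def filter_reviewable(file_contents: dict[str, str]) -> dict[str, str]:
    """Drop sensitive files before their contents reach the prompt or the logs."""
    return {p: c for p, c in file_contents.items() if not is_sensitive(p)}
```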

FAQ

Should I use Qodo, CodeRabbit, or build custom?

Start with pre-built (CodeRabbit or Qodo). They're cheaper, faster to deploy, and cover 80% of cases. Build custom only if:

  • You have specific architectural rules the generic tools miss
  • You're already running your own infrastructure
  • False positive rate from pre-built tools is too high

Most startups and small teams never need custom.

How long does code review take with an agent?

With a webhook-based agent on a 2-core server, expect 10-30 seconds per PR. Hybrid retrieval (AST + vectors) adds 2-5 seconds. Hosted pre-built tools often respond faster because they run on infrastructure tuned for the job. Custom agents are slower because they do more work, but the reviews are tailored to your codebase.

Can the agent review the entire codebase, or just changed files?

Just changed files. Reviewing the entire codebase at every PR is too slow and expensive. Focus on:

  1. The diff (what changed)
  2. Related files (imports, dependencies)
  3. Similar patterns in codebase (via vector search)

This is 95% as useful as reviewing everything, but 50x faster.

What LLM should I use: Claude, GPT-4, or open-source?

Claude Opus is the best for code review (most accurate, understands nuance). GPT-4 is a close second. Open-source models (Llama, Mistral) are cheaper but less accurate for complex reasoning. For a custom agent, use Claude Opus. For pre-built tools, it doesn't matter (they pick their own model).

How do I handle PRs that are too large to fit in the LLM context?

Strategies:

  1. Reject PRs that touch more than 20 files (encourage smaller PRs)
  2. Summarize large files: extract only changed functions, not full file
  3. Use multi-turn review: analyze files in batches, aggregate results
  4. Split context: semantic search to include only MOST relevant code, drop the rest

Most teams should do #1: enforce small PRs in your contribution guide.
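
If you do need strategy 3 (batched review), the grouping step can be a simple greedy pass. A sketch that packs files into batches by size, assuming character count approximates token count closely enough for budgeting:

```python
def batch_files(file_contents: dict[str, str], max_chars: int = 40000) -> list[list[str]]:
    """Group file paths into batches of at most max_chars combined.

    A single file larger than max_chars still gets a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    # Largest files first, so oversized files end up alone in their batch
    for path, text in sorted(file_contents.items(), key=lambda kv: -len(kv[1])):
        if current and current_size + len(text) > max_chars:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += len(text)
    if current:
        batches.append(current)
    return batches
```

Each batch then goes through the analyze step separately, and the resulting issue lists are concatenated.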

What if the agent gives bad reviews? How do I improve it?

  • Log all reviews and outcomes (good reviews vs. bad reviews marked by humans)
  • Revise your rules based on feedback (if the agent consistently misses a pattern, add a rule)
  • Adjust thresholds (lower confidence threshold = more issues caught, but more false positives)
  • Improve context retrieval (if the agent lacks context about a module, improve your semantic search)
  • Use human feedback loops: if 20% of agent flags are wrong, reduce severity or disable that rule

This is an iterative process. Expect 2-3 months to fine-tune a good agent.
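
The feedback loop is easy to automate once outcomes are logged. A sketch that computes a false positive rate per rule and lists the rules at or above the 20% mark; the log record shape (a rule name plus a correct flag) and the minimum sample size are assumptions:

```python
from collections import defaultdict

def false_positive_rates(outcomes: list[dict]) -> dict[str, float]:
    """Share of each rule's flags that humans marked as wrong."""
    totals: dict[str, int] = defaultdict(int)
    wrong: dict[str, int] = defaultdict(int)
    for outcome in outcomes:
        totals[outcome["rule"]] += 1
        if not outcome["correct"]:
            wrong[outcome["rule"]] += 1
    return {rule: wrong[rule] / totals[rule] for rule in totals}

def rules_to_demote(outcomes: list[dict], threshold: float = 0.2,
                    min_flags: int = 5) -> list[str]:
    """Rules at or above the threshold, ignoring rules with too few samples."""
    counts: dict[str, int] = defaultdict(int)
    for outcome in outcomes:
        counts[outcome["rule"]] += 1
    rates = false_positive_rates(outcomes)
    return sorted(r for r, rate in rates.items()
                  if rate >= threshold and counts[r] >= min_flags)
```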


Ready to build? Start with PR-Agent (open source) or CodeRabbit (easiest), then move to custom when your needs outgrow them. The pattern is the same: webhook, context retrieval, LLM call, comment. Build it once, adapt it forever.

Zarif

Zarif is an AI automation educator helping thousands of professionals and businesses leverage AI tools and workflows to save time, cut costs, and scale operations.