Zarif Automates

How to Build an AI Agent That Learns from Feedback

Zarif

Most AI agents in production fail at the same thing: they are good on day one and worse by week four because nothing about their behavior gets better when users correct them. A static prompt is not learning. A vector store of past chats is not learning. Learning means the agent identifies what went wrong, stores that lesson in a way that influences future decisions, and produces measurably better outputs over time. In 2026 there are two viable patterns to make this real, plus a hybrid that combines them. This is the practitioner's walkthrough.

Definition

A feedback-learning AI agent is an autonomous system that captures explicit or implicit signals about the quality of its outputs, either stores those signals as structured memory or uses them to train a reward model and update the model's weights, and adjusts future behavior without manual prompt changes.

TL;DR

  • Two viable architectures dominate in 2026: Reflexion-style linguistic feedback (no weight updates) and RLHF-style reward-model fine-tuning (weight updates)
  • Reflexion agents store textual reflections in memory and read them as context — fast to ship, no infrastructure beyond a vector store
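  • RLHF requires a base model, a reward model trained on preference data, and a policy optimization loop (typically PPO); DPO is a simpler alternative that trains directly on preference pairs with no separate reward model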
  • Three memory layers matter: working memory for the current task, episodic memory for past attempts, semantic memory for learned rules
  • LangGraph's interrupt and checkpoint primitives are the cleanest way to ship a human-in-the-loop feedback agent today
  • For 90% of business use cases, Reflexion plus a structured feedback log beats trying to run your own RLHF pipeline

The Three Ways an Agent Can "Learn"

Before any architecture decisions, get the vocabulary straight. There are three fundamentally different mechanisms by which an AI agent can incorporate feedback, and confusing them produces broken systems.

The first mechanism is in-context learning — the agent reads examples or rules inside its prompt and behaves accordingly. There is no persistent change. Restart the agent without those examples and the behavior reverts. This is what most "learning" agents actually do.

The second mechanism is memory-based learning — the agent stores feedback, reflections, or lessons in an external store (vector database, structured database, or file system) and retrieves them when relevant tasks come up. Persistent across restarts. Behavior compounds. This is the Reflexion pattern.

The third mechanism is parameter-based learning — the model's weights themselves are updated based on feedback signals. This is RLHF, DPO, or fine-tuning. Most expensive to implement, most powerful when done right, and rarely the right choice for a business application.

The pattern you build with depends on which of these mechanisms you actually need. The most common mistake: builders reach for RLHF when memory-based learning would have solved the problem at 1% of the cost.

Pattern 1: Reflexion — The Pragmatic Default

Reflexion is the architecture I reach for first in 95% of agent projects. The idea is simple: after every task attempt, ask the agent to critique its own work, store that critique in memory, and feed relevant critiques back into the prompt the next time a similar task comes up. No model weights are updated. The agent gets better by reading its past mistakes.

The canonical Reflexion architecture has three components. The Actor is the agent that attempts the task. The Evaluator measures whether the attempt succeeded — this can be a programmatic check (test passed, output validated against schema) or a model-as-judge. The Self-Reflection module takes the failed attempt plus the evaluator's signal and produces a textual reflection explaining what to do differently next time. Those reflections are written to episodic memory and surfaced as additional context on the next attempt.

The reason this works is that LLMs are good at critiquing a concrete failure when prompted explicitly. Given an input, an output, and a signal saying "this output was wrong because X," the model can produce a useful rule like "when the user asks for currency conversion, always ask which currency before guessing." Stored and retrieved on similar tasks, that rule makes the same mistake far less likely to recur.

Tip

The biggest mistake builders make with Reflexion is letting the agent generate reflections after every task. You want reflections only on tasks that failed or scored below threshold. Reflecting on successful tasks pollutes memory with platitudes and slows retrieval.
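The Actor, Evaluator, and Self-Reflection loop can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the class name, the stub functions, and the 0.7 threshold are all assumptions, and each stub stands in for what would be an LLM call or a real check in practice.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReflexionLoop:
    actor: Callable[[str, list[str]], str]       # (task, lessons) -> output
    evaluator: Callable[[str, str], float]       # (task, output) -> score in [0, 1]
    reflector: Callable[[str, str, float], str]  # (task, output, score) -> lesson
    threshold: float = 0.7                       # assumed cutoff
    memory: list[str] = field(default_factory=list)  # stored reflections

    def run(self, task: str) -> tuple[str, float]:
        output = self.actor(task, list(self.memory))
        score = self.evaluator(task, output)
        # Reflect only on failures so memory is not filled with platitudes.
        if score < self.threshold:
            self.memory.append(self.reflector(task, output, score))
        return output, score


# Hypothetical stubs standing in for LLM calls.
def stub_actor(task, lessons):
    asks_first = any("ask which currency" in lesson for lesson in lessons)
    return "Which currency do you mean?" if asks_first else "That is 100 USD."


def stub_evaluator(task, output):
    return 1.0 if output.endswith("?") else 0.0


def stub_reflector(task, output, score):
    return "When the user asks for currency conversion, ask which currency before guessing."


loop = ReflexionLoop(stub_actor, stub_evaluator, stub_reflector)
first = loop.run("Convert 100 to euros")   # fails, so a reflection is stored
second = loop.run("Convert 50 to yen")     # reads the reflection and asks first
```

The second call succeeds without any weight update: the only thing that changed between attempts is the text in memory.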

Pattern 2: RLHF — When the Output Space Is Too Big for Memory

Reinforcement Learning from Human Feedback updates the model's parameters using a learned reward model that approximates human preference. It is the technique behind Claude, GPT-4o, Llama 3, and most modern instruction-tuned models. It is also overkill for almost every business agent project.

The RLHF pipeline has three stages. Stage one is the starting policy: a pre-trained language model, usually after supervised fine-tuning on demonstrations. Stage two is reward model training: collect preference data (humans rank pairs of outputs as better or worse) and train a model that predicts which output a human would prefer. Stage three is policy optimization: fine-tune the policy (the agent) using reinforcement learning with the reward model as the objective, typically with PPO (Proximal Policy Optimization). DPO (Direct Preference Optimization) is a newer alternative that folds stages two and three into a single supervised step on the preference pairs, with no separate reward model.

KL regularization — penalizing the policy for drifting too far from the original model — is what keeps the agent from collapsing into reward hacking. Without it, the model will discover degenerate strategies that score high on the reward model but produce gibberish to humans.
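For reference, the KL-regularized objective is usually written as below. The symbols are not from this article and are assumed here: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is the frozen starting model, r_\phi is the reward model, \mathcal{D} is the prompt distribution, and \beta sets how hard drift is penalized.

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\; \beta \,
\mathbb{E}_{x \sim \mathcal{D}}
\Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
```

With \beta = 0 the policy is free to chase any quirk of the reward model; a larger \beta keeps it closer to the reference at the cost of smaller improvements.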

When does RLHF actually pay off? When the output space is too large to enumerate as memory rules. Coding assistants that improve over millions of code-review signals. Creative writing assistants where "good" cannot be reduced to retrievable rules. Voice agents where prosody and timing matter. For most business agents — extracting fields from invoices, drafting emails, scheduling appointments — RLHF is the wrong tool.

Pattern 3: Hybrid (Reflexion + Lightweight Fine-Tuning)

The pattern I increasingly use for serious agents in 2026 is a hybrid. Reflexion handles short-term and medium-term learning — anything where a textual rule will fix the behavior. Lightweight fine-tuning (LoRA on a smaller open model, or DPO on a few hundred preference pairs for a managed model) handles long-term drift — adjusting the agent's overall style or domain knowledge once you have enough preference data to justify it.

The trigger for switching from pure Reflexion to hybrid: when your memory store hits roughly 500-1,000 high-quality reflections and retrieval starts thrashing. That is the signal that the lessons should be baked into the model rather than read from a database every call.

Memory Architecture: The Three Layers That Actually Matter

Whichever pattern you pick, the agent's memory needs to be designed in three layers. Conflating them is the #1 reason agent memory systems break.

Working memory is what the agent has in context for the current task — the user message, the tool results so far, the most recent reasoning steps. It is wiped after the task. Working memory exists to handle the task at hand, not to learn.

Episodic memory is the log of past attempts: the input, the output, the success/failure signal, and the reflection. Stored in a vector database with metadata for retrieval (task type, user, time, outcome). The agent retrieves the most relevant 3-10 episodes when starting a new task.

Semantic memory is the distilled knowledge — rules, facts, preferences — extracted from episodic memory through a periodic consolidation step. Stored in a structured store (often a key-value database or graph) where the agent can look up "what do I know about how this user wants invoices formatted" without scanning through 200 episodes.

The consolidation step matters. Every N reflections, or every X days, a background process reviews recent episodic memory and extracts stable patterns into semantic memory. Without consolidation, your agent's retrieval becomes a junk drawer.
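A consolidation pass can be very simple to start with. The sketch below assumes each episode carries a structured failure reason, and promotes a reason to a semantic rule once it has been seen a few times. The `Episode` fields, the `min_count` default, and the choice to keep the latest reflection are all illustrative assumptions; a production job would more likely ask a model to merge the reflections into one rule.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Episode:
    task_type: str
    reason: str       # structured failure reason from the evaluator
    reflection: str   # free-text lesson written after the failure


def consolidate(episodes: list[Episode], min_count: int = 3) -> dict[tuple[str, str], str]:
    """Promote failure reasons seen at least `min_count` times into semantic rules."""
    counts = Counter((e.task_type, e.reason) for e in episodes)
    rules: dict[tuple[str, str], str] = {}
    for e in episodes:
        key = (e.task_type, e.reason)
        if counts[key] >= min_count:
            rules[key] = e.reflection  # later episodes overwrite earlier ones
    return rules


episodes = [
    Episode("support_reply", "wrong-tone", "Avoid jokes in billing replies."),
    Episode("support_reply", "missed-greeting", "Open with the customer's name."),
    Episode("support_reply", "wrong-tone", "Keep billing replies formal."),
    Episode("support_reply", "wrong-tone", "Keep billing replies formal and brief."),
]
semantic_rules = consolidate(episodes)
```

The result is keyed by task type and reason, so the agent can look a rule up directly instead of running a similarity search over every episode.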

Step-by-Step: Building a Reflexion Agent in LangGraph

Here is the minimum viable architecture. The stack is LangGraph (orchestration), Postgres or SQLite (episodic memory with vector extension), and any modern LLM (Claude, GPT-4o, Gemini). The example task is an agent that drafts customer support replies and learns from human edits.

Step 1: Define the Graph

The graph has four nodes: attempt, evaluate, await_feedback, and reflect. Edges run sequentially with a conditional branch from evaluate — if the score is above threshold, skip reflection.

LangGraph's interrupt primitive is the key piece. When the graph reaches await_feedback, it pauses execution and writes a checkpoint. A human reviews the agent's output via UI, submits an edit or a thumbs up/down, and the graph resumes from the checkpoint with that feedback in state. This is how you build human-in-the-loop without polling, timeouts, or fragile webhook chains.
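To make the control flow concrete without depending on any particular LangGraph version, here is a library-free sketch of the same four-node graph. It is not LangGraph code: the JSON checkpoint stands in for what the checkpointer persists, and returning "paused" stands in for the interrupt. It also assumes one reading of the conditional branch, namely that attempts scoring at or above the threshold finish without pausing. Node bodies are stubs.

```python
import json

THRESHOLD = 0.7  # assumed cutoff


def attempt(state):
    state["output"] = f"Draft reply to: {state['input']}"  # stand-in for an LLM call
    return "evaluate"


def evaluate(state):
    state["score"] = 0.9 if "refund" in state["input"] else 0.4  # stand-in evaluator
    return "end" if state["score"] >= THRESHOLD else "await_feedback"


def reflect(state):
    state["reflection"] = f"Next time, apply this edit: {state['feedback']}"
    return "end"


NODES = {"attempt": attempt, "evaluate": evaluate, "reflect": reflect}


def run(state, node="attempt"):
    """Advance until the graph ends or pauses; return (status, checkpoint_json)."""
    while node != "end":
        if node == "await_feedback":
            # Pause: persist everything needed to resume later.
            return "paused", json.dumps({"state": state, "next": "reflect"})
        node = NODES[node](state)
    return "done", json.dumps({"state": state, "next": "end"})


def resume(checkpoint_json, feedback):
    """Reload a checkpoint, add the human's feedback, and continue."""
    checkpoint = json.loads(checkpoint_json)
    checkpoint["state"]["feedback"] = feedback
    return run(checkpoint["state"], checkpoint["next"])


status, ckpt = run({"input": "Where is my order?"})
final_status, final = resume(ckpt, "Include the tracking link.")
```

Because the checkpoint is plain serialized state, the process that resumes does not have to be the one that paused, which is the property that removes the need for polling.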

Step 2: Build the Episodic Memory Store

Schema (Postgres + pgvector):

The columns you need are:

  • id (UUID primary key)
  • task_type (string, e.g. "support_reply")
  • user_id (FK)
  • input (text)
  • output (text)
  • score (numeric, normalized 0-1)
  • reflection (text, nullable)
  • embedding (vector(1536) of the input)
  • created_at (timestamp)

On every completed task, write a row. On every new task, retrieve the top 5 rows by cosine similarity of the input embedding, filtered to the same task_type and user_id.
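The write and retrieve paths can be sketched as follows. To keep the example self-contained it swaps in SQLite and ranks in Python, storing embeddings as JSON text; with Postgres and pgvector the embedding column would be a vector type and the ranking would happen in the query. The function names and the toy 3-dimensional embeddings are assumptions for illustration.

```python
import json
import math
import sqlite3
import uuid


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE episodes (
        id TEXT PRIMARY KEY,
        task_type TEXT NOT NULL,
        user_id TEXT NOT NULL,
        input TEXT NOT NULL,
        output TEXT NOT NULL,
        score REAL NOT NULL,
        reflection TEXT,
        embedding TEXT NOT NULL,  -- JSON list here; a vector column in pgvector
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")


def write_episode(task_type, user_id, input_text, output, score, embedding, reflection=None):
    conn.execute(
        "INSERT INTO episodes (id, task_type, user_id, input, output, score, reflection, embedding) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), task_type, user_id, input_text, output, score,
         reflection, json.dumps(embedding)),
    )


def retrieve(task_type, user_id, query_embedding, k=5):
    """Filter by task type and user first, then rank by cosine similarity."""
    rows = conn.execute(
        "SELECT input, reflection, embedding FROM episodes WHERE task_type = ? AND user_id = ?",
        (task_type, user_id),
    ).fetchall()
    ranked = sorted(rows, key=lambda r: cosine(query_embedding, json.loads(r[2])), reverse=True)
    return [(inp, refl) for inp, refl, _ in ranked[:k]]


write_episode("support_reply", "u1", "refund for late order", "...", 0.4,
              [1.0, 0.0, 0.0], "Apologize before explaining the refund policy.")
write_episode("support_reply", "u1", "password reset", "...", 0.9, [0.0, 1.0, 0.0])
write_episode("support_reply", "u2", "refund request", "...", 0.3,
              [1.0, 0.0, 0.0], "Lesson belonging to another user.")
hits = retrieve("support_reply", "u1", [0.9, 0.1, 0.0])
```

The metadata filter runs before the similarity ranking, so another user's episode never reaches the prompt even when its embedding is an exact match.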

Step 3: Wire the Reflect Node

The reflect node fires only when the evaluator's score is below threshold (say, 0.7). Its prompt looks like this in concept: take the input, the output, the human edit or downvote signal, and the past 3 reflections for similar tasks. Produce a single-paragraph rule the agent should follow next time.

Critically, store the reflection as a free-text rule in the reflection column, and also run a consolidation job that periodically extracts stable rules into a separate semantic_rules table, where they can be retrieved without similarity search.
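One way to assemble the reflect node's prompt is sketched below. The wording of the prompt and the function name are assumptions; the threshold gate and the limit of three past reflections follow the description above.

```python
THRESHOLD = 0.7  # example cutoff from the text


def build_reflection_prompt(task_input, output, feedback, past_reflections, score):
    """Return the reflection prompt, or None when the score clears the threshold."""
    if score >= THRESHOLD:
        return None
    recent = past_reflections[-3:]  # at most the 3 most recent similar reflections
    lessons = "\n".join(f"- {r}" for r in recent) or "- (none yet)"
    return (
        "You drafted a reply that needs improvement.\n"
        f"Input:\n{task_input}\n\n"
        f"Your output:\n{output}\n\n"
        f"Human feedback:\n{feedback}\n\n"
        f"Earlier lessons for similar tasks:\n{lessons}\n\n"
        "Write one short paragraph stating the rule to follow next time."
    )


prompt = build_reflection_prompt(
    "Where is my order?",
    "It will arrive soon.",
    "Edited to add the tracking link.",
    ["r1", "r2", "r3", "r4"],
    score=0.4,
)
```

Returning None for passing scores keeps the gating decision in one place, so the graph node only has to check whether there is a prompt to send.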

Step 4: Inject Memory Into the Attempt Node

When the attempt node fires, it retrieves the top 5 episodic memories and the top 3 semantic rules for the task type, then injects them into the system prompt under a header like "Past lessons for tasks like this one." The prompt explicitly instructs the model to apply these lessons.

This is where most implementations fail: people dump all retrieved memories into context without ranking, summarizing, or deduping. The agent's working memory gets polluted and the model regresses to ignoring it. Keep retrieved context tight — never more than 1,000 tokens of past lessons in any single attempt.
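A small helper can enforce the dedupe and budget discipline before anything reaches the prompt. This is a sketch under two assumptions: token counts are estimated with a rough four-characters-per-token heuristic rather than a real tokenizer, and semantic rules are given priority over episodic reflections when the budget runs out.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)


def build_lessons_block(semantic_rules, episodic_reflections, budget_tokens=1000):
    """Dedupe lessons, put semantic rules first, and stop at the token budget."""
    seen, kept, used = set(), [], 0
    for lesson in list(semantic_rules) + list(episodic_reflections):
        key = " ".join(lesson.lower().split())  # normalize case and whitespace
        if key in seen:
            continue
        cost = estimate_tokens(lesson)
        if used + cost > budget_tokens:
            break
        seen.add(key)
        kept.append(lesson)
        used += cost
    if not kept:
        return ""
    header = "Past lessons for tasks like this one. Apply them where relevant:"
    return "\n".join([header] + [f"- {lesson}" for lesson in kept])


rules = ["Always confirm the timezone."]
reflections = ["always confirm the timezone.", "Include the ticket number."]
block = build_lessons_block(rules, reflections)
```

Exact-match dedupe only catches trivially repeated lessons; near-duplicates would need an embedding comparison or a summarization pass, which is left out here.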

Step 5: Build the Evaluator

The evaluator can be programmatic, LLM-as-judge, or human. For customer support replies, a hybrid works best: programmatic checks for hard rules (no profanity, a ticket number present, a reply under 300 words), LLM-as-judge for soft quality (tone, helpfulness, accuracy against context), and human override on a sampled subset.

The output of the evaluator is a single score 0-1 plus a structured reason ("missed-greeting", "wrong-tone", "factually-incorrect"). The structured reason is what drives the reflection — the model writes much better reflections when given a specific failure mode to address.
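Here is a sketch of that layered evaluator. The hard-rule checks run first and short-circuit; only replies that pass them reach the judge. The ticket format ("#" followed by digits), the reason strings for hard-rule failures, and the stub judge are assumptions, and the profanity check is omitted for brevity.

```python
import re


def programmatic_checks(reply):
    """Hard rules from the example: ticket number present, under 300 words."""
    if not re.search(r"#\d+", reply):  # assumed ticket format like "#12345"
        return "missing-ticket-number"
    if len(reply.split()) >= 300:
        return "too-long"
    return None


def evaluate(reply, judge):
    """Return (score, reason). `judge` is any callable returning (score, reason)."""
    failure = programmatic_checks(reply)
    if failure:
        return 0.0, failure  # hard-rule failures skip the judge entirely
    score, reason = judge(reply)
    return max(0.0, min(1.0, score)), reason  # clamp to the 0-1 range


def stub_judge(reply):
    # Stand-in for an LLM-as-judge call.
    return 0.55, "wrong-tone"


result_missing = evaluate("Thanks for reaching out.", stub_judge)
result_judged = evaluate("Thanks for reaching out about ticket #4821.", stub_judge)
```

Running the cheap checks first also saves a judge call on every reply that was going to fail anyway.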

Step 6: Ship and Monitor

The first version goes into production with logging on every retrieval — which memories were pulled, what the score was, whether the agent actually applied the rules. Without this telemetry you cannot tell whether memory is helping or whether you have built an expensive vector lookup that influences nothing.

After two weeks, check: are scores trending up? Are the same failure modes reappearing? Is retrieval surfacing the rules that should be helping? If the trends are flat, the loop is broken somewhere — usually in retrieval relevance or in the reflection prompt being too vague.
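The two-week check can start as a small report over the task log. The sketch below assumes each log entry records a week number, a score, and the evaluator's failure reason (None on success); those field names are hypothetical, and the retrieval telemetry mentioned above would need its own columns.

```python
from collections import Counter
from statistics import mean


def weekly_report(log):
    """`log` is a list of dicts with keys: week, score, reason (None on success)."""
    weeks = sorted({entry["week"] for entry in log})
    avg = {w: mean(e["score"] for e in log if e["week"] == w) for w in weeks}
    reasons = {
        w: Counter(e["reason"] for e in log if e["week"] == w and e["reason"])
        for w in weeks
    }
    first, last = weeks[0], weeks[-1]
    # Failure modes present in both the first and the last week.
    recurring = sorted(set(reasons[first]) & set(reasons[last]))
    return {"avg_score": avg, "trend": avg[last] - avg[first], "recurring_reasons": recurring}


log = [
    {"week": 1, "score": 0.5, "reason": "wrong-tone"},
    {"week": 1, "score": 0.7, "reason": "missed-greeting"},
    {"week": 2, "score": 0.8, "reason": "wrong-tone"},
    {"week": 2, "score": 0.9, "reason": None},
]
report = weekly_report(log)
```

A positive trend with a non-empty recurring list is the interesting case: scores are improving overall, but at least one lesson is not being retrieved or applied.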

Common Pitfalls and How to Avoid Them

Pitfall 1: Reflecting on success. Every successful task generates platitudes ("the agent did well by responding clearly") that pollute memory. Reflect only when the score is below threshold or when a human explicitly edits the output.

Pitfall 2: Treating memory retrieval as search. Cosine similarity on raw input is too noisy on its own. Add metadata filters (task type, user, time window) and rerank the top-K results with a smaller model before injecting them. Quality of retrieved context matters more than quantity.

Pitfall 3: No consolidation. Episodic memory grows unboundedly. Without a consolidation step distilling stable rules into semantic memory, retrieval performance degrades as the store grows.

Pitfall 4: Reward hacking with LLM-as-judge. If your evaluator and your actor share the same base model, the judge tends to favor outputs written in its own style, and the lessons the agent accumulates drift toward pleasing the judge rather than the user. Use a different model family for judging, or use programmatic checks where possible.

Pitfall 5: Conflating session state with learning. Session memory (this conversation) is not the same as learning memory (lessons across users and time). Build them as separate systems with separate retention policies.

Warning

Do not store personally identifiable information from user interactions in episodic memory without explicit data handling policies. Reflections can inadvertently capture names, emails, and confidential business details. Sanitize inputs and outputs before writing to memory, or scope episodic memory strictly per-user with hard tenant isolation.
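As a starting point, a sanitizer can run on every input, output, and reflection before the write. The sketch below only masks email addresses and phone-like digit runs with regular expressions; the patterns and placeholder tokens are assumptions, and names or confidential business details would need something stronger, such as entity recognition or a human review step.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def sanitize(text):
    """Mask emails and phone-like numbers before text is written to memory."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


clean = sanitize("Contact jane.doe@example.com or +1 (555) 123-4567.")
```

The phone pattern is deliberately broad, so it can also mask long order or invoice numbers. Whether that is acceptable depends on how much the reflections rely on those identifiers.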

When Each Pattern Wins

Use Reflexion (memory-based learning) when feedback is sparse (under 1,000 labeled examples per month), failure modes are diverse (lots of different things go wrong), and rules can be expressed in natural language ("always confirm the timezone before scheduling").

Use RLHF or DPO when you have dense preference data (10,000+ ranked pairs), the output space is high-dimensional (code, long-form writing, creative work), and you control the base model or have access to fine-tune a managed model.

Use the hybrid when you start with Reflexion, accumulate clean preference data over months, and want to bake stable behaviors into the model without losing the flexibility of the memory layer.

What This Costs in Production

A Reflexion agent running on Claude 3.5 Sonnet or GPT-4o with 10,000 tasks per month and standard memory injection runs roughly $400-$1,200 per month in inference plus minimal infra cost for the vector store. The hybrid adds $200-$800 per month for occasional fine-tuning runs and the LoRA serving overhead.

Pure RLHF — running your own preference data collection, reward model training, and PPO loop — starts at $5,000-$15,000 per month in compute alone for a meaningful pipeline, plus the engineering time to maintain it. This is why I tell most clients to start with Reflexion and only move further if the evidence demands it.

What is the difference between RLHF and DPO?

RLHF uses a reward model and reinforcement learning (typically PPO) to optimize the policy against the reward model. DPO (Direct Preference Optimization) skips the reward model and directly optimizes the policy on preference pairs using a contrastive objective. DPO is simpler to implement, more stable, and increasingly preferred in 2026 for most preference-tuning use cases — though RLHF still has the edge on certain complex reasoning tasks.
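To make the contrastive objective concrete, here is the DPO loss for a single preference pair, computed from sequence log-probabilities. The argument names and the default beta of 0.1 are assumptions for illustration; in practice a training library computes this over batches with automatic differentiation.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # Equals -log(sigmoid(beta * margin)).
    return math.log1p(math.exp(-beta * margin))


no_change = dpo_loss(-11.0, -11.0, -11.0, -11.0)  # policy identical to the reference
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)   # policy now favors the chosen answer
```

The loss falls as the policy shifts probability toward the preferred answer relative to the reference model, which is how DPO gets a KL-style anchor without training a separate reward model.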

Can I build a learning AI agent without fine-tuning?

Yes — the Reflexion pattern produces measurable behavior improvement using only memory and prompt engineering, without ever updating model weights. For most business agents (customer support, scheduling, data extraction, content drafting), Reflexion is the right approach. Fine-tuning becomes worthwhile only when you have thousands of high-quality preference signals and the cost of inference plus memory retrieval exceeds the cost of training and serving a fine-tuned model.

How much human feedback do I need to train an agent?

For Reflexion-style learning, you can ship a useful agent with as few as 50-100 corrected outputs — the model generates rules from each correction and stores them. For DPO, you generally need 500-2,000 preference pairs to see meaningful policy shifts. For full RLHF with a custom reward model, plan for 5,000-50,000 ranked pairs depending on task complexity.

What does human-in-the-loop mean for an AI agent?

Human-in-the-loop means the agent's workflow includes one or more points where execution pauses, a human reviews the agent's proposed action or output, and the agent resumes based on the human's decision (approve, edit, reject). In LangGraph, this is implemented with the interrupt primitive and checkpointing — the graph pauses at a designated node, persists state, and resumes when the human submits feedback through a UI.

How do I prevent my AI agent from forgetting past lessons?

Three protections. First, separate episodic memory (raw past attempts) from semantic memory (distilled rules) — losing one does not destroy the other. Second, run a consolidation job on a schedule that promotes stable patterns from episodic to semantic memory. Third, version your prompts and your semantic rule store so you can roll back if a deployment breaks behavior the agent had previously learned.

Is Reflexion better than fine-tuning?

Reflexion is better than fine-tuning when feedback volume is low (under a few thousand examples), when failure modes are diverse and best expressed as rules, and when you need fast iteration without retraining cycles. Fine-tuning wins when you have dense feedback data, a stable task definition, and need the lower per-call latency that comes from baking behavior into the model rather than retrieving it from memory each call.

Bottom Line

An AI agent that learns from feedback is not a fundamentally harder system than one that does not — it is a different architecture. Reflexion plus disciplined memory design will get you 90% of the practical benefit at 1% of the cost of running an RLHF pipeline. Start there. Move to fine-tuning only when the data depth and the business case make it the obvious next step. The mistake is not picking the wrong pattern — the mistake is reaching for RLHF before you have proven that memory alone could not have solved the problem.

Zarif

Zarif is an AI automation educator helping thousands of professionals and businesses leverage AI tools and workflows to save time, cut costs, and scale operations.