How to Monitor and Debug AI Agents
Your agent works perfectly in dev. Then it ships, and three days later a customer asks why it booked the same meeting four times in a row. You open your logs and find a wall of JSON with no idea where the loop started or why.
AI agent observability is the practice of capturing every reasoning step, tool call, prompt, and response an agent produces, then organizing that data into traces, metrics, and evaluations you can search, alert on, and replay.
TL;DR
- Treat every agent run as a distributed trace — parent run, child LLM calls, and tool calls all need spans
- Log six things at minimum: prompts, responses, tool inputs, tool outputs, token counts, and latency per step
- Most production agent failures fall into four buckets: infinite loops, tool misuse, hallucinated arguments, and silent context drift
- Use OpenTelemetry GenAI semantic conventions as your schema so you are not locked into one vendor
- Eval-driven debugging beats log-diving — replay failed traces against fixes before you ship them
Agents are nondeterministic. The same prompt can produce different outputs on different runs, and most failures do not raise exceptions — the agent just does the wrong thing confidently. That is why standard APM tools fall flat. You cannot debug an agent the way you debug a REST API. You need a trace of every decision, not just every HTTP call.
Here is the playbook I use to monitor and debug agents in production.
Step 1: Instrument Every Agent Run as a Trace
What to do: wrap each agent invocation in a parent span, then emit child spans for every LLM call, every tool call, and every retrieval step.
Why it matters: a single user request can fan out into 30+ model calls and tool invocations. Without a parent-child trace hierarchy, you cannot answer "what did the agent actually do?" You will be reading flat log lines and trying to reconstruct causality by timestamp, which never works.
The minimum data per span:
- Span name (e.g., agent.run, llm.completion, tool.call:search_crm)
- Start time, end time, duration in milliseconds
- Parent span ID so you can rebuild the tree
- Input payload (prompt, tool args)
- Output payload (completion, tool result)
- Token counts (prompt, completion, total)
- Model name and provider
- Cost in USD if you can compute it inline
- Status (success, error, truncated)
If you are using LangChain, LangGraph, or the OpenAI Agents SDK, you get most of this automatically by setting one environment variable. If you are rolling your own, use the OpenTelemetry GenAI semantic conventions — they standardize attribute names like gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.tool.name so any backend can read your traces. The conventions hit stable status earlier in 2026, which means you can build on them without rework.
Tag every trace with a session_id, user_id, and agent_version at the parent span. When a customer reports a bug, you want to filter to their exact runs in seconds, not scroll through 50,000 traces hunting for theirs.
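To make the parent-child structure concrete, here is a minimal sketch of a hand-rolled tracer that uses only the Python standard library. The `Span` class, the `span` helper, and the in-memory `FINISHED` list are hypothetical stand-ins for a real SDK and exporter; the model name and tag values are made up. Only the span names and the `gen_ai.*` attribute names come from the text above.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    name: str
    span_id: str
    parent_id: Optional[str]          # None for the root span
    attributes: Dict[str, Any] = field(default_factory=dict)
    start_ms: float = 0.0
    end_ms: float = 0.0
    status: str = "success"


FINISHED: List[Span] = []             # stand-in for a real exporter
_current = ContextVar("current_span", default=None)


@contextmanager
def span(name: str, attributes: Optional[Dict[str, Any]] = None) -> Iterator[Span]:
    """Open a span whose parent is whatever span is currently active."""
    parent = _current.get()
    s = Span(name, uuid.uuid4().hex[:16],
             parent.span_id if parent else None, dict(attributes or {}))
    s.start_ms = time.monotonic() * 1000
    token = _current.set(s)
    try:
        yield s
    except Exception:
        s.status = "error"
        raise
    finally:
        s.end_ms = time.monotonic() * 1000
        _current.reset(token)
        FINISHED.append(s)


# One agent run: a tagged parent span with an LLM call and a tool call under it.
with span("agent.run", {"session_id": "s-1", "user_id": "u-42",
                        "agent_version": "1.3.0"}) as root:
    with span("llm.completion", {"gen_ai.request.model": "example-model",
                                 "gen_ai.usage.input_tokens": 120}):
        pass  # call the model here
    with span("tool.call:search_crm", {"gen_ai.tool.name": "search_crm"}):
        pass  # call the tool here
```

Because every child records its `parent_id`, the tree can be rebuilt from the flat `FINISHED` list, and the tags on the root span are what you filter on when a customer reports a bug.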
Step 2: Define the Metrics That Actually Matter
What to do: pick five to seven metrics that map directly to user experience and cost, and put them on a dashboard you check daily.
Why it matters: most teams either log nothing or log everything. Logging everything is the same as logging nothing — you cannot find signal. The metrics below are the ones that catch real problems before customers do.
The core seven for any production agent:
- Task success rate — did the agent finish the goal? Score it via an automated judge or a sample of human reviews.
- Tool call success rate — what percentage of tool invocations return a non-error response? A drop here usually points to schema drift or a flaky API.
- Steps per run — average number of LLM and tool calls per agent invocation. A sudden spike means the agent is looping.
- Tokens per run — directly tied to cost. Watch the p95, not just the mean — outliers eat budgets.
- Latency per run (p50, p95, p99) — agents feel slow at the tail. Tracking the 99th percentile catches the bad sessions users actually remember.
- Cost per resolved task — divide total spend by successful runs. This is the only metric that tells you if the agent is economically viable.
- Hallucination rate on tool args — count tool calls that fail because the model invented a parameter or a tool that does not exist.
You do not need all seven on day one. Start with task success, steps per run, and cost per resolved task. Those three catch the majority of regressions.
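As a rough illustration, the three starter metrics plus the p95 token count can be computed from per-run records like this. The `RunRecord` fields and the sample numbers are invented for the example; in practice they would be derived from your trace data.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    succeeded: bool
    steps: int        # LLM calls plus tool calls in the run
    tokens: int
    cost_usd: float


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for a dashboard sketch."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs: List[RunRecord]) -> dict:
    successes = sum(r.succeeded for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "task_success_rate": successes / len(runs),
        "avg_steps_per_run": sum(r.steps for r in runs) / len(runs),
        "p95_tokens_per_run": percentile([r.tokens for r in runs], 95),
        # Total spend divided by successful runs only.
        "cost_per_resolved_task": total_cost / successes if successes else None,
    }


runs = [
    RunRecord(True, 6, 4200, 0.04),
    RunRecord(True, 8, 5100, 0.05),
    RunRecord(False, 25, 31000, 0.31),   # a looping run
    RunRecord(True, 5, 3900, 0.04),
]
summary = summarize(runs)
```

Note how the single looping run barely moves the success rate but dominates the p95 token count and inflates the cost per resolved task, which is why the tail matters more than the mean.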
Step 3: Set Up Alerts for the Failures You Cannot See
What to do: configure alerts on metric thresholds and on specific failure patterns, and route them to the same channel your engineers already check.
Why it matters: agent failures are silent by default. The HTTP request returns a 200, the customer is unhappy, and nobody knows for two weeks. Alerts close that gap.
The four alerts every production agent needs:
- Loop alert: any single run exceeds your max-steps ceiling (e.g., 25 steps). This catches runaway loops before they burn through your monthly token budget in an afternoon.
- Tool error spike: tool error rate jumps more than 3x over the rolling 1-hour baseline. Usually means an upstream API changed its schema or started rate-limiting you.
- Latency regression: p95 latency increases by more than 50% week-over-week. Often a sign the model rolled to a slower variant or your prompt grew too long.
- Cost anomaly: daily spend exceeds 1.5x the trailing 7-day average. Catches both runaway loops and prompt-injection attempts that try to drain your account.
Push alerts into Slack or PagerDuty. Email gets ignored. And include the offending trace ID in the alert payload so the on-call engineer can click straight into the failed run.
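The four rules above can be expressed as plain threshold checks. This is a sketch under assumptions: the `Snapshot` fields are hypothetical names for values your metrics backend would supply, and the thresholds are the example numbers from the list, not recommendations for every workload.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Snapshot:
    max_steps_in_run: int
    tool_error_rate: float
    baseline_tool_error_rate: float    # rolling 1-hour baseline
    p95_latency_s: float
    p95_latency_last_week_s: float
    daily_spend_usd: float
    trailing_7d_avg_spend_usd: float
    trace_id: str                      # offending trace, included in every alert


def check_alerts(s: Snapshot, max_steps: int = 25) -> List[Dict[str, str]]:
    fired = []
    if s.max_steps_in_run > max_steps:
        fired.append("loop")
    # Skip the ratio check when the baseline is zero (assumption: handle that case separately).
    if s.baseline_tool_error_rate > 0 and s.tool_error_rate > 3 * s.baseline_tool_error_rate:
        fired.append("tool_error_spike")
    if s.p95_latency_s > 1.5 * s.p95_latency_last_week_s:
        fired.append("latency_regression")
    if s.daily_spend_usd > 1.5 * s.trailing_7d_avg_spend_usd:
        fired.append("cost_anomaly")
    return [{"alert": name, "trace_id": s.trace_id} for name in fired]


snapshot = Snapshot(
    max_steps_in_run=31,
    tool_error_rate=0.09, baseline_tool_error_rate=0.02,
    p95_latency_s=12.0, p95_latency_last_week_s=9.0,
    daily_spend_usd=140.0, trailing_7d_avg_spend_usd=100.0,
    trace_id="trace-abc123",
)
alerts = check_alerts(snapshot)
```

Here the loop and tool-error rules fire, while latency (12.0 s against a 13.5 s limit) and spend ($140 against a $150 limit) stay under their thresholds. Each alert dict carries the trace ID, so whatever posts it to Slack or PagerDuty can link straight to the run.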
Step 4: Build a Debug Surface You Will Actually Use
What to do: pick one observability platform as your primary debug surface, and make sure every engineer on the team can open a trace in under 10 seconds.
Why it matters: the speed of your debug loop determines how fast you can ship fixes. If opening a trace requires three logins, two queries, and a JSON viewer, nobody will look at traces until something is on fire.
A good debug surface shows, on one screen:
- The full input that triggered the run
- The agent's reasoning at each step (if you log chain-of-thought)
- Every tool call with its arguments and response, expandable inline
- Token and cost totals per step
- Errors highlighted in red with the stack trace inline
- A "rerun this trace" button
LangSmith, Langfuse, Braintrust, Helicone, Arize Phoenix, and Datadog LLM Observability all give you this surface to varying degrees. The exact pick matters less than committing to one and making it the team's source of truth. I cover when to pick which in the comparison table below.
Step 5: Run Replay and Eval-Driven Debugging
What to do: when a trace fails, save it as a test case. Build a regression suite from real production failures and run it on every prompt or model change.
Why it matters: this is the single highest-leverage practice in agent engineering. Without it, every prompt change is a coin flip — you fix one bug and silently introduce three others. With it, you turn debugging from a hope-based activity into a scientific one.
The replay loop:
- Catch a failed trace in production.
- Export the inputs (user message, conversation history, tool outputs) into your eval set.
- Write an assertion describing what success looks like — could be exact match, regex, an LLM-as-judge rubric, or a function check on the final tool call.
- Add the case to your regression suite.
- When you change the prompt, swap models, or update a tool, run the full suite and compare scores.
- Block the deploy if scores regress.
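The loop above might look like the following in miniature. Everything here is illustrative: `EvalCase`, `run_suite`, and `should_block_deploy` are hypothetical helpers, and the two "agents" are stub functions standing in for a real agent before and after a prompt change.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    name: str
    user_message: str                  # input exported from the failed trace
    check: Callable[[str], bool]       # assertion on the agent's final output


def run_suite(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose assertion passes."""
    passed = sum(1 for c in cases if c.check(agent(c.user_message)))
    return passed / len(cases)


def should_block_deploy(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    return candidate < baseline - tolerance


cases = [
    EvalCase("books exactly once", "Book a meeting with Dana at 3pm",
             lambda out: len(re.findall(r"create_meeting", out)) == 1),
    EvalCase("asks for a missing time", "Book a meeting with Dana",
             lambda out: "what time" in out.lower()),
]


def baseline_agent(message: str) -> str:
    if "3pm" in message:
        return "create_meeting(dana, 15:00)"
    return "What time works for you?"


def candidate_agent(message: str) -> str:
    # Simulated regression: the new prompt books the meeting twice.
    if "3pm" in message:
        return "create_meeting(dana, 15:00) create_meeting(dana, 15:00)"
    return "What time works for you?"


baseline_score = run_suite(baseline_agent, cases)
candidate_score = run_suite(candidate_agent, cases)
block = should_block_deploy(baseline_score, candidate_score)
```

The regex check is the simplest kind of assertion; an LLM-as-judge rubric or a function check on the final tool call would slot into the same `check` field.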
The teams that ship reliable agents in 2026 all run eval-driven development. The ones that do not are still debugging in production with print statements.
Do not feed PII or customer data into a public eval platform without redacting it first. Most observability tools support automatic PII scrubbing — turn it on before you point production traffic at them, not after.
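If your tooling has no built-in scrubbing, a redaction pass before export might start like this. It is a deliberately narrow sketch: the two regular expressions are assumptions that catch only common email and phone formats, and real PII scrubbing needs broader coverage (names, addresses, account numbers) than this shows.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders before export."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


clean = redact("Reach Dana at dana@example.com or +1 (555) 010-9999.")
```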
Step 6: Iterate on the Failure Patterns You See Most
What to do: every two weeks, look at the top three failure categories in your traces and fix the systemic cause, not just the individual bug.
Why it matters: agent bugs cluster. If you fix them one at a time, you will be playing whack-a-mole forever. Fix the pattern and you remove a whole class of failures at once.
The four failure patterns I see in nearly every production agent:
Infinite loops. The agent calls the same tool with the same arguments because it keeps getting an ambiguous response and re-tries instead of escalating. Fix: add a max-steps ceiling, detect duplicate consecutive tool calls, and force the agent to summarize and exit if it loops twice.
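A guard for this pattern can be sketched in a few lines. `LoopGuard` is a hypothetical helper, and the defaults (25 steps, stop after two repeats of the same call) mirror the numbers used in this article rather than any library's behavior.

```python
import json
from typing import Any, Dict, Optional


class LoopGuard:
    """Stops a run on a step ceiling or on repeated identical tool calls."""

    def __init__(self, max_steps: int = 25, max_repeats: int = 2) -> None:
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.last_call = None
        self.repeats = 0

    def check(self, tool_name: str, args: Dict[str, Any]) -> Optional[str]:
        """Return None to continue, or a reason string to stop and summarize."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "max_steps_exceeded"
        # Serialize with sorted keys so argument order does not hide a duplicate.
        call = (tool_name, json.dumps(args, sort_keys=True))
        if call == self.last_call:
            self.repeats += 1
        else:
            self.last_call, self.repeats = call, 0
        if self.repeats >= self.max_repeats:
            return "duplicate_tool_calls"
        return None


guard = LoopGuard()
results = [guard.check("book_meeting", {"attendee": "dana", "time": "15:00"})
           for _ in range(3)]
```

The first call and its first repeat pass; the second repeat returns `"duplicate_tool_calls"`, at which point the agent loop should force a summary and exit instead of calling the tool again.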
Tool misuse. The agent calls the right tool with the wrong arguments — a string where an integer goes, the wrong enum value, an out-of-range date. Fix: tighten your tool schemas using strict JSON Schema validation, and return descriptive error messages that tell the agent how to fix the call rather than just "invalid input."
Hallucinated arguments and tools. The agent invents a tool name or parameter that does not exist. Fix: validate all tool calls against an allowlist before execution. Return a structured error listing the available tools and their schemas so the agent can self-correct on the next step.
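One way to sketch that pre-execution check, covering both invented tools and wrong argument types. The `TOOLS` registry and its two tools are made up for illustration, and the type check is a simplified stand-in for full JSON Schema validation.

```python
from typing import Any, Dict

# Hypothetical allowlist: tool name -> parameter name -> expected Python type.
TOOLS: Dict[str, Dict[str, type]] = {
    "search_crm": {"query": str, "limit": int},
    "book_meeting": {"attendee": str, "start_iso": str},
}


def describe_tools() -> Dict[str, Dict[str, str]]:
    return {tool: {p: t.__name__ for p, t in params.items()}
            for tool, params in TOOLS.items()}


def validate_tool_call(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Return a structured result the agent can read and self-correct from."""
    if name not in TOOLS:
        return {"ok": False, "errors": [f"Unknown tool '{name}'."],
                "available_tools": describe_tools()}
    schema = TOOLS[name]
    errors = []
    for param in args:
        if param not in schema:
            errors.append(f"Unknown parameter '{param}'; allowed: {sorted(schema)}.")
    for param, expected in schema.items():
        if param not in args:
            errors.append(f"Missing parameter '{param}' (expected {expected.__name__}).")
        elif type(args[param]) is not expected:
            errors.append(f"Parameter '{param}' must be {expected.__name__}, "
                          f"got {type(args[param]).__name__}.")
    if errors:
        return {"ok": False, "errors": errors,
                "available_tools": {name: describe_tools()[name]}}
    return {"ok": True, "errors": []}


unknown = validate_tool_call("delete_crm", {})
bad_args = validate_tool_call("search_crm", {"query": "acme", "limit": "5"})
```

Feeding the returned `errors` and `available_tools` back to the model as the tool result gives it what it needs to retry with a valid call on the next step.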
Silent context drift. The agent's behavior degrades as the conversation gets longer because the system prompt is buried under 40 turns of history. Fix: monitor input token count per step. When it crosses a threshold, summarize older turns and reset.
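A minimal sketch of that threshold check follows. The token estimate is a crude heuristic (roughly four characters per token), not a real tokenizer, and the default `summarize` is a placeholder for an actual summarization call; both are assumptions made for the example.

```python
from typing import Callable, List, Optional


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def compact_history(
    system_prompt: str,
    turns: List[str],
    max_tokens: int,
    keep_recent: int = 4,
    summarize: Optional[Callable[[List[str]], str]] = None,
) -> List[str]:
    """Keep the system prompt first; collapse older turns once over budget."""
    total = estimate_tokens(system_prompt) + sum(estimate_tokens(t) for t in turns)
    if total <= max_tokens or len(turns) <= keep_recent:
        return [system_prompt, *turns]
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summarize = summarize or (lambda ts: f"[Summary of {len(ts)} earlier turns]")
    return [system_prompt, summarize(older), *recent]


turns = [f"turn {i}: " + "x" * 200 for i in range(40)]
compacted = compact_history("You are a scheduling agent.", turns, max_tokens=1000)
```

With 40 long turns the history exceeds the budget, so the result is the system prompt, one summary entry, and the four most recent turns. The system prompt stays at the front rather than sinking under the history.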
Each of these patterns is easier to spot in a tracing UI than in raw logs — which is why Step 1 matters so much. The instrumentation is what makes the diagnosis possible.
How the Top Agent Observability Platforms Compare
Pick one based on where your stack lives today and how much eval automation you need.
| Platform | Best For | Hosting | Starting Price |
|---|---|---|---|
| LangSmith | LangChain and LangGraph teams that want zero-config tracing | Cloud or self-hosted (Enterprise) | Free tier, paid from $39/mo |
| Langfuse | Open-source teams who want full data ownership | Self-hosted or Cloud | Free self-hosted |
| Braintrust | Teams that want eval-blocking in CI/CD | Cloud, BYO compute option | Free tier, paid from $249/mo |
| Helicone | Simplest install via proxy, OpenAI-heavy stacks | Cloud or self-hosted | Free tier, paid from $20/mo |
| Arize Phoenix | ML-grade rigor with embeddings and drift detection | Open-source, Arize AX for cloud | Free open-source |
| Datadog LLM Obs | Shops already on Datadog APM | Cloud | Add-on to Datadog plan |
The pattern most mature teams settle on: one tool for tracing and operational monitoring (LangSmith, Langfuse, or Datadog) and one for evaluation and quality scoring (Braintrust or Arize). If you only pick one, pick Langfuse — it is open source, vendor-neutral, supports OpenTelemetry, and covers 80% of what you need for free.
What is the difference between AI agent monitoring and AI agent observability?
Monitoring tells you that something is wrong — a metric crossed a threshold, a tool started erroring. Observability lets you ask why it went wrong without shipping new code, by giving you traces, prompts, and tool calls you can search after the fact. You need both. Monitoring fires the alert, observability lets you fix the root cause in minutes instead of days.
Which AI agent observability tool should I use in 2026?
For most teams, start with Langfuse if you want open source and self-hosting, or LangSmith if your stack is built on LangChain or LangGraph. Add Braintrust on top once you start running formal evals in CI. Helicone is the fastest to install if your only goal is tracking OpenAI calls and costs. Datadog LLM Observability is the default for teams already paying for Datadog APM.
What should I log for every AI agent run?
At minimum, log the full input prompt, the model response, every tool call with its arguments and result, token counts per step, latency per step, and the final outcome of the run. Tag the trace with session ID, user ID, and agent version so you can filter to specific cases later. Use the OpenTelemetry GenAI semantic conventions as your schema so you are not locked into one vendor.
How do I debug an AI agent that gets stuck in an infinite loop?
First, add a hard max-steps ceiling — usually 20 to 30 steps — so a runaway loop cannot drain your token budget. Then open the trace and look for repeated tool calls with identical arguments. The fix is almost always one of three things: an ambiguous tool response the agent does not know how to handle, a missing exit condition in the prompt, or two sub-goals that depend on each other. Add explicit "if you have tried this twice, stop and summarize" instructions to your system prompt.
What is eval-driven debugging for AI agents?
Eval-driven debugging is the practice of turning every production failure into a test case in a regression suite, then running that suite every time you change the prompt, swap the model, or update a tool. It replaces hope-based prompt engineering with measurable iteration. You catch regressions before they ship instead of finding them in customer complaints. Tools like Braintrust, Langfuse, and LangSmith all support this workflow natively in 2026.
Do I need OpenTelemetry to monitor AI agents?
You do not strictly need it, but you should use it. OpenTelemetry's GenAI semantic conventions hit stable status in early 2026 and give you a vendor-neutral schema for traces, prompts, and tool calls. That means you can switch from LangSmith to Langfuse to Datadog without rewriting your instrumentation. Using a proprietary SDK locks you in — using OTel keeps your options open.
