DERKONLINE

See Exactly What Your Agent Did When It Goes Off the Rails

Trace every tool call, prompt, and decision with OpenTelemetry so you can replay and root-cause an agent failure in minutes, not an afternoon.

Derrick S. K. Siawor · 8 min read

An AI agent that works in your demo and breaks in production is the hardest kind of bug to chase. The model called the wrong tool, or passed a malformed argument, or looped on a retrieval step, or quietly hallucinated a value three steps before the visible failure. By the time a user reports "it gave me the wrong answer," the conversation is gone and you are left reconstructing a non-deterministic decision from a screenshot. Tracing is what turns that guessing game into a replay you can scrub through frame by frame.

The shift in thinking is simple. A traditional service has one request and one response, so a log line per request is usually enough. An agent is a tree of decisions: it reasons, calls a tool, reads the result, reasons again, calls another tool, and the final answer is the leaf of a branching path. To debug it you need the whole tree, with timing, inputs, and outputs at every node. That is exactly what distributed tracing was built for, and the ecosystem has now standardized on how to record it.

What a good agent trace actually contains

OpenTelemetry's GenAI working group, formed under the Semantic Conventions group in 2024, has spent the last two years defining how AI operations get recorded. The model that emerged maps cleanly onto how agents behave. A top-level invoke_agent span wraps the whole run. Inside it, every model call becomes a chat span and every tool invocation becomes an execute_tool span. Retrieval steps get their own spans too. The result is a trace tree where the parent-child structure mirrors the agent's actual reasoning chain.

Figure: Agent trace span tree, with invoke_agent wrapping chat, execute_tool, and retrieval spans and their attributes
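To make the parent-child structure concrete, here is a minimal sketch in plain Python, with no OpenTelemetry dependency, that rebuilds the tree from a flat list of finished spans. The record layout (span_id, parent_id, start_ms, end_ms), the model name example-model, and the tool name lookup_order are illustrative assumptions, not a real export format.

```python
from collections import defaultdict

def render_tree(spans):
    """Turn a flat list of span records into indented lines, one per span."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Siblings are shown in start order so the tree reads like the run did.
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            duration = span["end_ms"] - span["start_ms"]
            lines.append("  " * depth + f"{span['name']} ({duration} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # the root span is the one with no parent
    return lines

# Hypothetical run: one model call, one tool call, one final model call.
example_spans = [
    {"span_id": "a", "parent_id": None, "name": "invoke_agent support-router", "start_ms": 0, "end_ms": 4200},
    {"span_id": "b", "parent_id": "a", "name": "chat example-model", "start_ms": 10, "end_ms": 900},
    {"span_id": "c", "parent_id": "a", "name": "execute_tool lookup_order", "start_ms": 910, "end_ms": 3100},
    {"span_id": "d", "parent_id": "a", "name": "chat example-model", "start_ms": 3110, "end_ms": 4190},
]

tree = render_tree(example_spans)
print("\n".join(tree))
```

A tracing backend does this reconstruction for you; the point of the sketch is only that the parent ID on each span is what lets the flat stream become a readable tree.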

Each span carries attributes that tell you what happened, not just that something happened. The conventions define fields like gen_ai.request.model (which model was called), gen_ai.usage.input_tokens and gen_ai.usage.output_tokens (the token cost of that step), and gen_ai.response.finish_reasons (whether the model stopped naturally, hit a length limit, or was cut off). That last field is also the early-warning sign behind streaming LLM responses that never cut off mid-sentence: a length finish reason in a trace tells you the token budget was too tight, ideally before any user sees a truncated answer. When you opt into content capture, spans also record gen_ai.input.messages, gen_ai.output.messages, and the system instructions, so you can read the exact prompt and exact completion at the point of failure.
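As a rough illustration of how those attributes get used, the sketch below totals token usage across chat spans and flags any step that stopped on a length limit. It works on plain dictionaries rather than a real exporter's objects, the span names are hypothetical, and the literal value "length" is an assumption: the exact finish-reason string depends on the provider and instrumentation.

```python
def summarize_chat_spans(spans):
    """Total token usage and list the steps whose output was cut by a length limit."""
    total_in = 0
    total_out = 0
    truncated = []
    for span in spans:
        attrs = span["attributes"]
        total_in += attrs.get("gen_ai.usage.input_tokens", 0)
        total_out += attrs.get("gen_ai.usage.output_tokens", 0)
        # finish_reasons is a list because a response can carry several choices.
        if "length" in attrs.get("gen_ai.response.finish_reasons", ()):
            truncated.append(span["name"])
    return {"input_tokens": total_in, "output_tokens": total_out, "truncated": truncated}

# Hypothetical attribute sets for two model calls in one agent run.
chat_spans = [
    {"name": "chat step 1", "attributes": {
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 1200,
        "gen_ai.usage.output_tokens": 150,
        "gen_ai.response.finish_reasons": ["stop"],
    }},
    {"name": "chat step 2", "attributes": {
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 3400,
        "gen_ai.usage.output_tokens": 512,
        "gen_ai.response.finish_reasons": ["length"],
    }},
]

summary = summarize_chat_spans(chat_spans)
print(summary)
```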

That last part is the difference between a metric and a root cause. A token count tells you a step was expensive. The captured messages tell you the model received a tool result that was already wrong, so the failure was upstream and the model behaved correctly given bad input. Without the content, you are still guessing. The same captured outputs are what let you force LLM output into schemas your code can trust once you see exactly where the model wandered off-format, and what feed the golden cases in an eval harness that catches regressions before your users do.

Wiring it up without rewriting your stack

You do not need to instrument every call by hand. Auto-instrumentation packages exist for OpenAI, Anthropic, LangChain, and LlamaIndex, and they emit OpenTelemetry spans that any compatible backend can ingest. The conventional pattern in a Node or Python service is to start one span around the agent loop yourself, let the auto-instrumentation create child spans for each model and tool call, and ship the result to a collector.

from opentelemetry import trace

tracer = trace.get_tracer("agent")

def handle_request(query, user_id):
    # Root span for the whole run; spans started inside it become its children.
    with tracer.start_as_current_span("invoke_agent") as span:
        span.set_attribute("agent.name", "support-router")
        span.set_attribute("agent.user_id", user_id)
        result = run_agent(query)   # your own agent loop; tool + chat spans nest automatically
        span.set_attribute("agent.outcome", result.status)
        return result

Because OpenTelemetry is vendor-neutral, the same spans flow into Jaeger, Zipkin, Datadog, New Relic, or Traceloop without changing your code. You pick the backend that fits your budget and switch later if you outgrow it. That portability matters when an agent is early and you do not yet know how much observability you will need.

The trap that bites teams in 2025: disconnected MCP traces

The Model Context Protocol spread quickly through 2025 as the standard way for agents to talk to external tools and data sources. It introduced an observability gap that is worth knowing about before it costs you a debugging session. The agent process and the MCP server process each produce their own traces, and by default those traces are disconnected. You see the agent decide to call a tool, and separately you see the MCP server handle a request, but nothing links the two. When a tool call goes wrong, you cannot follow it across the boundary.

The fix is trace context propagation: the agent must pass its trace and span identifiers to the MCP server so the server's spans attach as children of the agent's span. If you run agents against MCP servers, verify this end to end early. Otherwise the first time you need it is during an incident, and that is the worst time to discover the trace stops at your process boundary. The same context-handoff discipline matters when you orchestrate multiple agents without losing control of the flow, where a dropped trace context turns a handoff into a black box. This is one of several things worth nailing down before you ship an MCP server that survives real agent traffic.
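To show what actually crosses the boundary, here is a plain-Python sketch of the W3C Trace Context traceparent value: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte. In a real service you would normally let OpenTelemetry's propagation API (inject and extract in opentelemetry.propagate) do this rather than hand-roll it. The span dictionaries and the carrier are illustrative, and how the carrier reaches the MCP server (an HTTP header, a request metadata field) depends on your transport and is not specified here.

```python
import re
import secrets

TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_span(trace_id=None, parent_id=None):
    """Create a toy span record; a missing trace_id starts a brand-new trace."""
    return {
        "trace_id": trace_id or secrets.token_hex(16),  # 32 hex characters
        "span_id": secrets.token_hex(8),                # 16 hex characters
        "parent_id": parent_id,
    }

def inject(span, carrier):
    """Agent side: write the current span's identity into the outgoing request."""
    carrier["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"

def extract(carrier):
    """Server side: return (trace_id, parent_id), or None if nothing usable arrived."""
    match = TRACEPARENT.match(carrier.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    return trace_id, parent_id

# Agent process: the span for the tool call, and the request it sends.
tool_call_span = new_span()
carrier = {}
inject(tool_call_span, carrier)

# MCP server process: continue the trace if context arrived, else start a new one.
context = extract(carrier)
server_span = new_span(*context) if context else new_span()

print(server_span["trace_id"] == tool_call_span["trace_id"])
```

The else branch is the disconnected case: when no context arrives, the server still produces a span, but under a new trace ID that nothing links back to the agent.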

Why this is an SRE problem, not just an ML problem

The instinct is to treat agent failures as a model quality issue and reach for better prompts. Often the real cause is operational: a tool timed out, an API returned a rate-limit error the agent swallowed, a retrieval index was stale, or a downstream service changed its response shape. Those are reliability failures wearing an AI costume, and they respond to the same discipline you would apply to any production system. Trace the request, find the span that broke, read its inputs and outputs, fix the root cause. It is the same move as instrumenting a conventional service so you find the root cause in minutes, not hours.

This is the same principle that drives LadenX, the AI site-reliability engineer we built. Every action it takes against a server is classified and recorded before anything runs, so when something behaves unexpectedly you can see exactly which command it chose and why, and it refuses destructive operations without a human signing off. That is the same discipline behind teaching an AI SRE to diagnose root cause rather than just restart the service and behind the guardrails that keep autonomous fixes from touching production unchecked. Visibility before action is the whole point. An agent you cannot replay is an agent you cannot trust in production, and the same logic applies whether the agent is answering support tickets or touching infrastructure.

Making traces useful at 3am

Capturing spans is half the job. The other half is being able to find the one trace that matters among thousands. A few practices make the difference.

  • Tag the business context. Attach the user ID, session ID, conversation ID, and the agent's task name as span attributes. When a customer reports a problem, you search by their ID and the trace appears. The same attributes are what let an agent that has lost the thread recover, which is the heart of stopping your agent from forgetting what it was doing.
  • Record the outcome explicitly. Set an attribute like agent.outcome to success, refused, or error on the root span so you can filter for failures without reading every trace.
  • Sample with intent. Full content capture on every request is expensive in storage and can include sensitive data. Sample a fraction of successful runs and capture every error in full, so the cheap-to-store traces are the boring ones and the expensive ones are the ones you actually need.
  • Watch the token and latency budget per step. A trace that shows one tool span taking eight seconds points you straight at the slow dependency, which is usually faster than reading prompts. The same per-step token view is the raw material for cutting your LLM bill in half without touching answer quality.
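The practices above can be sketched as three small helpers. This is an illustration on plain dictionaries, not a backend query language: the attribute names agent.user_id and agent.outcome follow the snippet earlier in this article, while the helper names, the 5% default rate, and the duration_ms field are assumptions. Deciding what to keep after the outcome is known is a tail-based decision, and where it runs in your pipeline is a deployment choice this sketch does not cover.

```python
import random

def find_runs(runs, user_id, outcome=None):
    """Return root-span records for one user, optionally filtered by outcome."""
    return [
        run for run in runs
        if run["agent.user_id"] == user_id
        and (outcome is None or run["agent.outcome"] == outcome)
    ]

def keep_full_content(outcome, success_rate=0.05, rng=random.random):
    """Keep every non-success run in full; keep only a fraction of successes."""
    if outcome != "success":
        return True
    return rng() < success_rate

def slowest_step(steps):
    """Return the child span that consumed the most wall-clock time."""
    return max(steps, key=lambda step: step["duration_ms"])

# Hypothetical root-span attributes for three runs.
runs = [
    {"agent.user_id": "u-17", "agent.outcome": "success"},
    {"agent.user_id": "u-17", "agent.outcome": "error"},
    {"agent.user_id": "u-42", "agent.outcome": "refused"},
]
failures_for_user = find_runs(runs, "u-17", outcome="error")

# Hypothetical child spans of one run.
steps = [
    {"name": "chat example-model", "duration_ms": 900},
    {"name": "execute_tool lookup_order", "duration_ms": 8000},
]
bottleneck = slowest_step(steps)
print(len(failures_for_user), bottleneck["name"])
```

Passing rng as a parameter keeps the sampling decision testable, since a test can supply a fixed value instead of a random one.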

What you get back

A well-traced agent changes the texture of your on-call life. Instead of "the agent gave a bad answer, let me try to reproduce it," you open the trace, see the model called the pricing tool with a currency the tool did not support, watch the tool return an empty result, and watch the model confidently fill the gap with a made-up number. Five minutes, not an afternoon. You fix the tool to validate currency, add a guard so the model cannot proceed on an empty result, and you have a regression test waiting in the captured trace.
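A minimal sketch of that fix, under stated assumptions: the tool name lookup_price, the supported-currency set, the catalog shape, and the captured arguments are all hypothetical stand-ins for whatever your trace actually recorded. The idea is that the tool rejects bad input loudly and refuses to return an empty result, so the model never gets a gap to fill, and the arguments from the failing trace become a regression case.

```python
SUPPORTED_CURRENCIES = {"USD", "EUR", "GBP"}  # hypothetical list

class ToolInputError(ValueError):
    """Raised when the model passes an argument the tool cannot honor."""

class EmptyToolResult(RuntimeError):
    """Raised so the agent loop stops instead of letting the model guess."""

def lookup_price(sku, currency, catalog):
    """Return a price record, or raise rather than hand back nothing."""
    if currency not in SUPPORTED_CURRENCIES:
        raise ToolInputError(f"unsupported currency: {currency!r}")
    amount = catalog.get((sku, currency))
    if amount is None:
        raise EmptyToolResult(f"no price for {sku!r} in {currency}")
    return {"sku": sku, "currency": currency, "amount": amount}

# Regression case: the tool arguments as they appeared in the failing trace.
captured_call = {"sku": "PLAN-PRO", "currency": "JPY"}
catalog = {("PLAN-PRO", "USD"): 49.0}

try:
    lookup_price(catalog=catalog, **captured_call)
    outcome = "returned"
except ToolInputError:
    outcome = "rejected"

print(outcome)
```

How the agent loop reacts to these exceptions (retry with a corrected argument, ask the user, or end the run with a refused outcome) is a separate design choice.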

That is the real value. Tracing does not make your agent smarter. It makes your agent debuggable, and a debuggable agent is the only kind you can responsibly run where customers can feel it. If you are putting agents in front of real users or pointing them at real systems, build the trace tree first. The failures are coming either way. The only question is whether you can see them.