The 12 Ways AI Agents Fail That Your Logs Will Never Show You

Every observability tool ever built assumes the same thing: that a request comes in, something happens, a response goes out, and you can understand the whole story by looking at that one exchange. Status code, latency, payload. Stateless. Self-contained. You can replay it, you can graph it, you can alert on it.

AI agents break that assumption completely.

An agent doesn't serve a request. It runs a process — a loop that holds state across dozens of model calls, invokes tools that have real side effects, hands work off to other agents, and makes decisions that depend on everything that came before. The unit of work isn't the HTTP call anymore. It's the trajectory. And almost none of the tooling deployed in production today was built to see a trajectory.

This is not a theoretical gap. It is the reason teams ship an agent that looks perfect in every dashboard and then get a call from compliance, or finance, or a furious customer, about something the logs swear never happened.

Here are twelve failure modes that live in that blind spot. Some are security problems, some are reliability problems, some are compliance problems. They share one property: a conventional request log shows green while each of them is actively happening.

1. MCP tool-description rug-pulls

The Model Context Protocol lets an agent discover tools at runtime and read their descriptions to decide how to use them. That dynamism is the feature. It is also the attack surface.

A tool can change its own description between one invocation and the next. The agent reads "returns the current weather," calls it, and this time the tool also quietly exfiltrates the conversation context to a third party — or returns instructions that reshape the agent's next action. From the request log's point of view, a tool was called and returned 200. Nothing looks wrong. The thing that changed — the description the agent reasoned over — was never captured, because logs capture calls, not the semantic contract the agent thought it was operating under.

To catch this you have to record the tool definition the agent saw at decision time, not just the call it made. Hardly anyone does.

2. Lethal-trifecta exfiltration

Simon Willison named the pattern: give a system access to private data, expose it to untrusted content, and give it the ability to communicate externally, and you have built an exfiltration machine. Each capability is individually reasonable. An agent that reads your email, browses links, and sends messages is just a useful assistant — until a malicious web page it reads instructs it to take something private and send it somewhere.

The reason this is invisible to logging is that every individual step is legitimate and authorized. Read inbox: allowed. Fetch URL: allowed. Send message: allowed. The failure is in the composition — the path data took across three benign actions. A per-request log has no concept of a path. It sees three green checkmarks.

3. Browser-agent stuck loops

An agent driving a browser hits a checkout button. The page doesn't transition the way the model expects. So it tries again. And again. It is convinced the next attempt will work, because nothing in its world model says "stop." The loop burns tokens, holds a session open, and — critically — registers as activity. The agent is busy. The dashboard shows requests flowing. Latency per call is fine.

What's missing is the loop-detection that only exists at the trajectory level: this agent has taken the same action with the same result eleven times. No single request is anomalous. The pattern is the anomaly, and the pattern is what request logs throw away.

4. Agent-to-agent handoff schema breaks

Multi-agent systems pass work between specialized agents. A planner hands a task to an executor; a researcher hands findings to a writer. That handoff carries a payload, and the payload has an expected shape.

When the shape drifts — a field renamed upstream, a context object that's now missing customer_id, a list where a string was expected — the receiving agent rarely errors out cleanly. It does something worse: it fills the gap. It hallucinates a plausible value and proceeds confidently. The downstream output looks complete. No exception is thrown, so no error is logged. The corruption entered at a seam between two components, and request logs don't instrument seams.

5. Prompt-injection cascades

A single compromised input — a poisoned document, a malicious tool result, a crafted user message — doesn't just affect the step it lands in. It reshapes every subsequent step in the trajectory. The injected instruction becomes part of the context the agent carries forward, quietly steering tool choices, output framing, and decisions many calls later.

Tracing the blast radius requires following the contamination through the trajectory: which later actions were influenced by the poisoned context. A request log captures each downstream call as an independent, healthy event. The causal chain — the thing you actually need for an incident review — isn't there.

6. Hallucinated function calls

The model decides to call a tool. The tool doesn't exist. Or it exists, and the model invents arguments that were never valid — a user_id it fabricated, a date format it guessed, a parameter that isn't in the schema.

Sometimes this fails loudly and you see an error. Often it doesn't: the framework coerces the bad call into something that runs, or the tool is forgiving, or the hallucinated arguments happen to be structurally valid and semantically wrong. The request that results looks like every other tool call. The failure was in the gap between what the model intended and what was real, and intent is exactly the thing logs don't store.

7. Cost drift on long-running agents

This one is purely economic and it is brutal precisely because it is gradual. A long-running agent on a retry loop, or one that keeps expanding its context window, or one that re-reads the same large document on every iteration, spends money in a way that no single call reveals. Each individual model call costs a fraction of a cent. There are just a lot of them, and the count grows quietly.

By the time it shows up — on the monthly bill, not the dashboard — it's a number with a lot of zeros. Per-request cost monitoring is structurally blind to this. You need cumulative spend per trajectory, with anomaly detection on the rate of accumulation. Most teams discover the problem from their provider invoice.

8. Jailbreak attempts

A jailbreak is a prompt pattern designed to bypass the model's safety layer. The naive mental model is that these arrive as a single obviously-malicious message you could filter. In agent systems they rarely do. They arrive buried mid-conversation, assembled across multiple turns, or smuggled inside tool output that the agent treats as trusted.

Detecting them means inspecting the content of the request path, not its shape — and inspecting it in context, because the same phrase is benign in one position and an attack in another. A request log records that a message was processed. Whether that message was an attempt to subvert the system is a question it was never designed to ask.

9. PII leakage into spans

Here is a failure mode the observability tooling actively causes. To debug an agent you instrument it, capturing inputs, outputs, and intermediate state as trace attributes. Then you ship those traces to a third-party observability vendor. And somewhere in those captured payloads is a social security number, a medical record, a customer's full conversation — now sitting in a SaaS platform that was never in scope for your data-processing agreements.

The agent worked perfectly. The leak is in the telemetry. For anyone operating under GDPR, HIPAA, or DPDP, this is not a debugging convenience; it's an incident. Redaction has to happen before the span leaves your boundary, and it has to be built into the tracing layer itself. Bolted-on logging doesn't do this, because the people who built it were optimizing for visibility, not for data residency.

10. Eval regression on prompt updates

Someone improves a prompt. They tighten the instructions, add an example, fix a typo. They deploy. Latency is unchanged. Error rate is unchanged. Every operational metric is flat and green.

And the quality of the output silently degrades — the new phrasing made the model more verbose, or less likely to call a needed tool, or subtly worse at the edge cases the old wording happened to handle. Nothing failed. The system did exactly what it was told. It was just told something slightly worse, and no infrastructure metric measures "worse." Catching this requires running evaluations against every prompt change and comparing behavioral quality, not system health. The dashboard that watches latency will never see it.

11. Model-version drift

You call a model by its API name. The provider, doing their job, rolls out an improved version behind that same name. Your code didn't change. Your prompt didn't change. But the thing on the other end of the wire behaves differently now — different formatting, different tool-calling tendencies, different handling of an instruction it used to obey.

Your monitoring sees identical requests to an identical endpoint, returning 200s at normal latency. The drift is in behavior, and it entered through a dependency you don't control and didn't deploy. The only way to see it is to track output characteristics over time and notice the distribution shift. Request logs have no memory of yesterday's behavior to compare against.

12. Region-pinned compliance violations

An agent's request needs to stay in a particular jurisdiction — EU data in EU regions, for residency reasons that carry legal weight. Under normal operation it does. Then a provider has an outage in that region, the gateway fails over to keep the service up, and the request is served from somewhere it was never permitted to be.

The failover worked. Availability was preserved. The request succeeded, the user got their answer, the dashboard stayed green. And a compliance boundary was crossed without a single error being raised. Uptime monitoring is designed to celebrate exactly the event that caused the violation. Only something tracking the jurisdiction of each request against its policy — at the trajectory level — would catch it, and that something is almost never in the stack.

12 Ways AI Agents Fail.pdf

3.13 MB • PDF File

The pattern behind the pattern

Read the twelve back to back and the shared shape is unmistakable. Every one of them is invisible to per-request logging for the same structural reason: the failure lives between steps, across steps, or in the meaning of a step — never in the step itself. Conventional observability instruments the step. It was the right tool for a stateless world, and it is the wrong tool for agents, in the specific sense that a stopwatch is the wrong tool for measuring temperature. It's not bad. It's measuring something else.

The teams that will run AI agents safely inside banks, hospitals, and regulated enterprises are not the ones with the prettiest latency charts. They're the ones who treat the trajectory as the unit of observability — who capture the tool definitions an agent reasoned over, the path data took across actions, the cumulative cost of a run, the jurisdiction of every hop, and the behavioral quality of every output, and who can reconstruct all of it after the fact in a way an auditor will accept.

That's a different category of tool than what most of the industry shipped for the stateless era. Building it — in the open, as a reference implementation of what stateful-agent observability should look like — is the problem worth working on. The twelve failure modes above are the specification. Anything that calls itself agent observability should be able to see all of them.