DERKONLINE

Stop Your Agent From Forgetting What It Was Doing

Observation masking, summarization, and context curation that keep long-running agents coherent instead of forgetting the task.

Derrick S. K. Siawor · 7 min read

An agent starts a long task confident and capable, and twenty steps in it has forgotten what it was doing. It repeats an action it already took, contradicts a decision it made earlier, or loses track of the goal entirely and wanders. The instinct is to blame the model, but the model is usually fine. What broke is the context: the running record of everything the agent has seen and done has grown so large and so cluttered that the model can no longer find the signal in it. The agent did not get dumber. Its working memory filled with noise.

This is the central engineering problem of long-running agents. A single question fits comfortably in a model's context window. A task that takes fifty steps, each producing tool calls, observations, and reasoning, does not, and even when it technically fits, the model's ability to use it degrades long before the limit. Managing that context means deciding what to keep, what to compress, and what to throw away, and it is the difference between an agent that stays coherent across a long task and one that loses the plot halfway through. Two neighboring problems are closely related. One is keeping multiple agents handing off cleanly without losing control of the flow, where the same state has to survive being passed between them. The other is stopping prompt injection from untrusted content, since a context full of unvetted tool output is also a context an attacker can poison.

Context rot: the problem is quality, not size

The first thing to understand is counterintuitive. The limit on how much context an agent can use well is not the token limit. It is far below it. Research in 2025 converged on the finding that raw context size matters less than context quality, and that performance degrades measurably as context grows, an effect called context rot that begins well before the token ceiling. One study found degradation at every increment of context growth, not a clean cliff at the limit but a steady decline the whole way up.

There is a specific, well-documented version of this called the lost-in-the-middle effect. Information buried in the middle of a long context gets used poorly, with accuracy drops of over 30 percent reported for facts placed there, even though the model can see them. The model attends well to the start and the end of its context and loses track of the middle. So an agent that dumps its entire history into the context is not just paying for tokens; it is actively burying its own important information in the dead zone where the model will not reliably find it.

The implication reframes the whole task. The goal is not to fit as much history as possible into the window. The goal is to keep the context small, relevant, and well-organized, because a lean context the model can actually use beats a comprehensive one it cannot.

The two core compression strategies

Two main techniques have emerged for keeping context manageable as a task runs long, and they work differently.

Observation masking replaces older environment observations with placeholders while preserving the agent's reasoning and actions. The idea is that the bulky part of an agent's history is usually the raw observations: the full tool outputs, the long file contents, the verbose API responses. The valuable part is the reasoning and the decisions. Masking strips the bulk and keeps the thread of what the agent thought and did, so the agent remembers its own logic without carrying every byte of what it saw. The job is far easier when those outputs already arrive in a predictable shape because you forced LLM output into schemas your code can trust.
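
As a rough illustration of masking, here is a minimal Python sketch. The `Step` record, the `keep_last` parameter, and the placeholder wording are assumptions made for the example, not the format used in the research or in any particular framework.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Step:
    """One agent step (hypothetical shape, for illustration only)."""
    reasoning: str    # what the agent thought
    action: str       # the tool call it made
    observation: str  # raw tool output, usually the bulky part


def mask_observations(history: list[Step], keep_last: int = 3) -> list[Step]:
    """Replace observations older than the last `keep_last` steps with a placeholder.

    Reasoning and actions are left untouched for every step.
    """
    cutoff = max(len(history) - keep_last, 0)
    masked = []
    for i, step in enumerate(history):
        if i < cutoff:
            placeholder = f"[observation omitted: {len(step.observation)} chars]"
            step = replace(step, observation=placeholder)
        masked.append(step)
    return masked


# Five steps with 1,000-character observations; keep only the last two in full.
history = [Step(f"thought {i}", f"tool_call_{i}()", "x" * 1000) for i in range(5)]
for step in mask_observations(history, keep_last=2):
    print(step.reasoning, "|", step.action, "|", step.observation[:40])
```

The placeholder records how much was dropped, so the agent can still see that an observation existed at that step without carrying its contents.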

LLM summarization uses a separate model to compress historical interactions into a narrative summary. Instead of keeping the full transcript, you periodically summarize what has happened so far into a compact account and carry that forward. This is the more familiar approach and it works, but it has a subtle cost worth knowing.
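
A minimal sketch of periodic summarization might look like the following. The `summarize` callable stands in for a call to a separate model, and `fake_summarize` is a hypothetical stub so the example runs offline; both names, and the `keep_recent` parameter, are assumptions for illustration.

```python
from typing import Callable


def compact_history(
    messages: list[str],
    summarize: Callable[[str], str],
    keep_recent: int = 4,
) -> list[str]:
    """Fold everything except the last `keep_recent` messages into one summary entry."""
    split = len(messages) - keep_recent
    if split <= 0:
        return list(messages)  # nothing old enough to summarize
    older, recent = messages[:split], messages[split:]
    summary = summarize("\n".join(older))
    return [f"[summary of {len(older)} earlier messages] {summary}"] + recent


def fake_summarize(text: str) -> str:
    """Stand-in for a model call; a real summarizer would return a narrative."""
    return text.replace("\n", "; ")[:60]


messages = [f"step {i}: ran a tool and read its output" for i in range(6)]
for line in compact_history(messages, fake_summarize, keep_recent=2):
    print(line)
```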

Both strategies cut costs dramatically, by over 50 percent compared to running with unmanaged context, which makes context management one of the more direct levers for cutting your LLM bill in half without touching answer quality. The interesting finding is which one performs better. Observation masking often matched or beat summarization on actually solving the task, in one study achieving 2.6 percent higher solve rates while being 52 percent cheaper. The reason summarization underperformed is instructive: it inadvertently extended agent trajectories by 13 to 15 percent, because compressing history into a summary obscured the natural stopping signals the agent would otherwise have used to know it was done. The summary smoothed over the cues that said "you have finished," so the agent kept going.

The hybrid that works in practice

The research points to a clear recommendation rather than a single winner. The strongest approach uses observation masking as the primary mechanism, keeping reasoning and actions while shedding the heavy observations, and adds selective summarization only for genuinely complex historical state that needs to be preserved in compressed form. Masking handles the common case cheaply and preserves the stopping signals; summarization is reserved for the parts of history that are too intricate to simply mask away.
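
One way the hybrid could be sketched in Python is shown below. The `complex_state` flag, which marks a step whose output must survive in some form, is a hypothetical device for this example, as are the `Step` shape and the `summarize` stub; how a real system decides what counts as complex is left open.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Step:
    reasoning: str
    action: str
    observation: str
    complex_state: bool = False  # hypothetical flag: too intricate to simply mask


def curate(
    history: list[Step],
    summarize: Callable[[str], str],
    keep_last: int = 3,
) -> list[Step]:
    """Mask old observations by default; summarize only the flagged ones."""
    cutoff = max(len(history) - keep_last, 0)
    curated = []
    for i, step in enumerate(history):
        if i >= cutoff:
            curated.append(step)  # recent steps stay in full
        elif step.complex_state:
            short = "[summary] " + summarize(step.observation)
            curated.append(replace(step, observation=short))
        else:
            curated.append(replace(step, observation="[observation omitted]"))
    return curated


def first_line(text: str) -> str:
    """Stand-in summarizer so the sketch runs without a model."""
    return text.splitlines()[0]


history = [
    Step("list files", "ls()", "a.py\nb.py\nc.py"),
    Step("read schema", "cat('schema.sql')", "users(id, email)\n...many lines...", True),
    Step("run tests", "pytest()", "3 passed"),
]
for step in curate(history, first_line, keep_last=1):
    print(step.action, "->", step.observation)
```

Every step keeps its reasoning and action, so the sequence of decisions, including whatever signaled that the task was done, stays visible.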

[Figure: context curation, deciding what to keep, mask, or summarise into a lean working set]

There is also an architectural pattern worth knowing for more sophisticated systems. Rather than building context management into the agent itself, you can decouple it: an external manager, often a smaller and cheaper model, handles compression and curation while the main agent stays focused on the task. This is the same instinct behind training a small local classifier that outperforms a frontier model on your task: use a cheap, specialized model for the narrow job and reserve the expensive one for the reasoning. Decoupling also keeps the context logic separate and tunable without entangling it with the agent's reasoning.
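
The decoupling can be sketched as a small manager object that owns the history, with an agent loop that only records and reads. Everything here is an assumption for illustration: the `BudgetedContextManager` name, the character budget, and the `compress` callable, which stands in for a call to a smaller model.

```python
from typing import Callable


class BudgetedContextManager:
    """Owns the working set and compresses the oldest entries first.

    The budget is best-effort: the newest entry is never compressed.
    """

    def __init__(self, compress: Callable[[str], str], budget_chars: int = 2000):
        self._compress = compress
        self._budget = budget_chars
        self._entries: list[str] = []
        self._next = 0  # index of the oldest entry not yet compressed

    def record(self, entry: str) -> None:
        self._entries.append(entry)
        while self._size() > self._budget and self._next < len(self._entries) - 1:
            self._entries[self._next] = self._compress(self._entries[self._next])
            self._next += 1

    def view(self) -> list[str]:
        """Return the curated context the main agent should see."""
        return list(self._entries)

    def _size(self) -> int:
        return sum(len(e) for e in self._entries)


def run_agent(steps: list[str], manager: BudgetedContextManager) -> list[str]:
    """The agent loop knows nothing about compression; it records and reads."""
    for step in steps:
        manager.record(step)
    return manager.view()


manager = BudgetedContextManager(compress=lambda t: t[:10] + "...", budget_chars=100)
for entry in run_agent(["a" * 50, "b" * 50, "c" * 50, "d" * 50], manager):
    print(entry)
```

Because the agent only calls `record` and `view`, the compression policy can be tuned or swapped without touching the agent loop.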

What this means for anyone building agents

The practical takeaways translate directly into how you design a long-running agent.

  • Treat context as a curated working set, not an append-only log. Not every step earns a permanent place in the context. Decide deliberately what stays, what gets masked, and what gets summarized, and prune aggressively, because a lean context outperforms a complete one.
  • Protect the start and end of the context. Given the lost-in-the-middle effect, put the most important standing information, the goal, the key constraints, near the boundaries where the model attends best, not buried in the middle of accumulated history.
  • Strip observations, keep reasoning. The raw outputs are the bulk and rarely need to persist in full. The agent's decisions and logic are the thread that keeps it coherent. Mask the former, preserve the latter.
  • Watch for trajectory drift. If your agent runs longer than it should or fails to recognize when it is done, suspect that your context compression is hiding the stopping signals, which is exactly how summarization extended trajectories in the research.
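
Two of the takeaways above, protecting the boundaries and watching for drift, lend themselves to a short Python sketch. The function names, the label strings, and the 15 percent tolerance are assumptions chosen for the example; an expected step count in particular is something you would have to estimate for your own tasks.

```python
def assemble_context(goal: str, constraints: list[str], history: list[str]) -> str:
    """Put standing information at both ends and the history in the middle."""
    lines = [f"GOAL: {goal}"]
    lines += [f"CONSTRAINT: {c}" for c in constraints]
    lines += history
    lines.append(f"REMINDER, current goal: {goal}")  # restated at the end boundary
    return "\n".join(lines)


def is_drifting(steps_taken: int, expected_steps: int, tolerance: float = 0.15) -> bool:
    """Flag a run that exceeds its expected length by more than `tolerance`."""
    return steps_taken > expected_steps * (1 + tolerance)


context = assemble_context(
    goal="find why disk usage alert fired",
    constraints=["read-only commands only"],
    history=["step 1: checked df output", "step 2: listed largest directories"],
)
print(context)

if is_drifting(steps_taken=30, expected_steps=20):
    print("run is longer than expected; check whether compression hides stop cues")
```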

This kind of context discipline is part of what separates an agent that holds together across a real task from a demo that works for three steps and falls apart on the fourth. We apply it directly in the agents we build, including LadenX, our AI site-reliability engineer, where a coherent agent that remembers what it has already tried on a server is the difference between methodical diagnosis and a confused loop that repeats failed actions. Seeing exactly what it remembered and acted on is itself a job for agent observability that traces every span, so a context bug shows up as a replayable trace rather than a mystery. An agent operating on real infrastructure cannot afford to forget what it did two steps ago, because forgetting there means repeating a command that has already failed, or worse.

The coherent agent

The agent that stays coherent across a long task is not running a bigger model or a longer context window. It is running disciplined context management: a small, relevant, well-organized working set that keeps the goal and the recent reasoning front and center, sheds the bulky observations it no longer needs, and compresses only the complex history that genuinely must be retained. That agent remembers what it was doing because you engineered it to remember the right things and forget the rest.

The failure mode of the forgetful agent is not a model that cannot reason. It is a context so cluttered the reasoning has nothing clean to work with. Fix the context and the coherence follows, because the model was capable the whole time. It just could not see past the noise you let pile up around the task.