ACE prevents context collapse with “evolving playbooks” for self-improving AI agents

A new framework from Stanford University and SambaNova addresses a crucial challenge in building robust AI agents: context engineering. Called agentic context engineering (ACE), the framework automatically populates and modifies the context window of large language model (LLM) applications, treating it as an “evolving playbook” that creates and refines strategies as the agent gains experience in its environment.
ACE is designed to overcome key limitations of other context-engineering frameworks, preventing the model’s context from degrading as it accumulates more information. Experiments show that ACE works both for optimizing system prompts and for managing an agent’s memory, outperforming other methods while being significantly more efficient.
The challenge of context engineering
Advanced AI applications that use LLMs rely largely on context adaptation, or context engineering, to guide their behavior. Instead of the costly process of retraining or fine-tuning the model, developers use LLMs’ in-context learning abilities to steer a model’s behavior by modifying input prompts with specific instructions, reasoning steps, or domain-specific knowledge. This additional information is typically obtained as the agent interacts with its environment and gathers new data and experiences. The main goal of context engineering is to organize this new information in a way that improves model performance and avoids confusing it. This approach is becoming a central paradigm for building efficient, scalable, and self-improving AI systems.
Context engineering has several benefits for enterprise applications. Contexts are interpretable by both users and developers, can be updated with new knowledge at runtime, and can be shared across different models. Context engineering also benefits from ongoing hardware and software advances, such as growing LLM context windows and efficient inference techniques like prompt and context caching.
There are various automated context-engineering techniques, but most of them suffer from two key limitations. The first is “brevity bias,” in which prompt-optimization methods tend to favor concise, generic instructions over comprehensive, detailed ones. This can harm performance in complex domains.
The second, more serious problem is “context collapse.” When an LLM is tasked with repeatedly rewriting its entire accumulated context, it can suffer from a kind of digital amnesia.
“What we call ‘context collapse’ occurs when an AI attempts to rewrite or compress everything it has learned into a single new version of its prompt or memory,” the researchers said in written comments to VentureBeat. “Over time, this rewriting process erases important details, like rewriting a document so many times that key notes disappear. In customer-facing systems, this could mean that a support agent suddenly loses awareness of past interactions… causing erratic or inconsistent behavior.”
The researchers argue that “contexts should function not as concise summaries, but as comprehensive, evolving playbooks – detailed, inclusive, and rich in domain insights.” This approach builds on a strength of modern LLMs, which can effectively extract relevant information from long, detailed contexts.
How agentic context engineering (ACE) works
ACE is a comprehensive context-adaptation framework designed for both offline tasks, such as system prompt optimization, and online scenarios, such as real-time memory updates for agents. Rather than compressing information, ACE treats the context as a dynamic playbook that accumulates and organizes strategies over time.
The framework divides the work among three specialized roles: a Generator, a Reflector, and a Curator. This modular design draws inspiration from “the way humans learn – by experimenting, reflecting and consolidating – while avoiding the bottleneck of overloading a single model with all the responsibilities,” according to the paper.
The workflow begins with the Generator, which produces reasoning trajectories for input prompts, surfacing both effective strategies and common mistakes. The Reflector then critiques these trajectories to extract key lessons. Finally, the Curator synthesizes these lessons into compact updates and merges them into the existing playbook.
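In rough pseudocode, one round of this loop might look like the sketch below. The `call_llm` helper, the role prompts, and the playbook-as-list-of-strings representation are illustrative assumptions, not the paper’s actual implementation:

```python
# A minimal, hypothetical sketch of ACE's Generator -> Reflector -> Curator
# loop. `call_llm` stands in for any chat-completion client; the prompts and
# data shapes are illustrative assumptions, not the paper's exact design.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def ace_step(task: str, playbook: list[str]) -> list[str]:
    context = "\n".join(f"- {b}" for b in playbook)

    # Generator: attempt the task with the current playbook as context.
    trajectory = call_llm(
        f"Playbook:\n{context}\n\nTask: {task}\n"
        "Solve step by step, noting which playbook strategies you used."
    )

    # Reflector: critique the trajectory and distill concrete lessons.
    lessons = call_llm(
        f"Trajectory:\n{trajectory}\n\n"
        "List what worked, what failed, and why, as short actionable lessons."
    )

    # Curator: turn lessons into small delta updates that are merged into
    # the playbook, rather than rewriting the whole context.
    delta = call_llm(
        f"Playbook:\n{context}\n\nLessons:\n{lessons}\n\n"
        "Propose only new or revised bullets, one per line."
    )
    new_bullets = [ln.strip("- ").strip() for ln in delta.splitlines() if ln.strip()]
    return playbook + new_bullets
```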
To avoid context collapse and brevity bias, ACE incorporates two key design principles. First, it uses incremental updates: the context is represented as a collection of structured, detailed bullet points rather than a single block of text. This lets ACE make granular changes and retrieve the most relevant entries without rewriting the entire context.
Second, ACE uses a “grow-and-refine” mechanism. As new experiences are gathered, new bullets are added to the playbook and existing ones are updated. A periodic deduplication step removes redundant entries, ensuring that the context remains comprehensive yet relevant and compact over time.
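A minimal sketch of what such a bullet-based playbook could look like in code follows; the `Bullet` fields, the `Playbook` API, and the string-similarity deduplication heuristic are assumptions for illustration, not the paper’s implementation:

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

@dataclass
class Bullet:
    # One structured playbook entry; the fields are illustrative assumptions.
    id: int
    text: str

@dataclass
class Playbook:
    bullets: list[Bullet] = field(default_factory=list)
    next_id: int = 0

    def add(self, text: str) -> None:
        # Incremental update: append a new bullet instead of rewriting
        # the whole context.
        self.bullets.append(Bullet(self.next_id, text))
        self.next_id += 1

    def deduplicate(self, threshold: float = 0.9) -> None:
        # Grow-and-refine: periodically drop near-duplicate bullets so the
        # playbook stays comprehensive yet compact. A real system might use
        # embedding similarity instead of this simple string heuristic.
        kept: list[Bullet] = []
        for b in self.bullets:
            if all(SequenceMatcher(None, b.text, k.text).ratio() < threshold
                   for k in kept):
                kept.append(b)
        self.bullets = kept
```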
ACE in action
The researchers evaluated ACE on two types of tasks that benefit from an evolving context: agent benchmarks requiring multi-turn reasoning and tool use, and domain-specific financial analysis benchmarks requiring specialized knowledge. For high-stakes industries like finance, the benefits go beyond raw performance. As the researchers put it, the framework is “much more transparent: a compliance officer can literally read what the AI has learned, because it is stored in human-readable text rather than hidden in billions of parameters.”
Results showed that ACE consistently outperformed strong baselines such as GEPA and classic in-context learning, achieving average performance gains of 10.6% on agent tasks and 8.6% on domain-specific benchmarks, in both offline and online settings.
Importantly, ACE can build effective contexts by analyzing feedback from its own actions and environment rather than requiring manually labeled data. The researchers note that this ability is a “key ingredient for self-improving LLMs and agents.” On the public AppWorld benchmark, designed to evaluate agentic systems, an agent using ACE with a smaller open-source model (DeepSeek-V3.1) matched the average performance of the top-ranked GPT-4.1-powered agent and outperformed it on the most difficult test split.
The implications for enterprises are significant. “This means that companies do not need to rely on massive proprietary models to remain competitive,” the research team said. “They can deploy local models, protect sensitive data, and achieve best-in-class results by continually refining context instead of retraining weights.”
Beyond accuracy, ACE proved highly efficient. It adapts to new tasks with 86.9% lower average latency than existing methods and requires fewer steps and tokens. The researchers point out that this efficiency demonstrates that “scalable self-improvement can be achieved with both higher accuracy and lower overhead.”
For companies concerned about inference costs, the researchers point out that the longer contexts ACE produces do not translate into proportionally higher costs. Modern serving infrastructures are increasingly optimized for long-context workloads through techniques such as KV cache reuse, compression, and offloading, which amortize the cost of handling a large context.
Ultimately, ACE points to a future where AI systems are dynamic and continually improving. “Today, only AI engineers can update models, but context engineering opens the door for domain experts (lawyers, analysts, doctors) to directly shape what the AI knows by editing its contextual playbook,” the researchers said. It also makes governance more practical: “Selective unlearning becomes much simpler: if information is outdated or legally sensitive, it can simply be removed or replaced in context, without retraining the model.”
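In the spirit of the earlier playbook sketch, such selective unlearning would amount to a targeted edit of the stored context rather than a training run; the entries and the `unlearn` helper below are hypothetical examples:

```python
# Hypothetical illustration: "unlearning" here is a targeted context edit,
# not a training run. Playbook entries are plain dicts for simplicity.
def unlearn(playbook: list[dict], entry_id: int, replacement: str | None = None) -> list[dict]:
    updated = [e for e in playbook if e["id"] != entry_id]
    if replacement is not None:
        updated.append({"id": entry_id, "text": replacement})
    return updated

playbook = [
    {"id": 0, "text": "Quote the 2023 fee schedule for wire transfers."},
    {"id": 1, "text": "Escalate disputes over $10,000 to a human agent."},
]
# The 2023 fee schedule is outdated: replace it in context, no retraining needed.
playbook = unlearn(playbook, 0, "Quote the 2024 fee schedule for wire transfers.")
```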


