Alibaba’s AgentEvolver increases model performance in tool usage by approximately 30% with automatically generated synthetic tasks

Researchers at Alibaba’s Tongyi Lab have developed a new framework for self-evolving agents that create their own training data by exploring their application environments. The framework, called AgentEvolver, uses the knowledge and reasoning capabilities of large language models for autonomous learning, addressing the high costs and manual effort typically required to collect task-specific datasets.
Experiments show that, compared to traditional reinforcement learning (RL) frameworks, AgentEvolver explores its environment more efficiently, makes better use of data, and adapts to new application environments faster. For enterprises, this matters because it lowers the barrier to training agents for bespoke applications, making capable, personalized AI assistants accessible to a wider range of organizations.
The high cost of training AI agents
Reinforcement learning has become a major paradigm for training LLMs to act as agents capable of interacting with digital environments and learning from feedback. However, developing agents with RL faces fundamental challenges. First, collecting the necessary training datasets is often prohibitively expensive, requiring significant manual labor to create sample tasks, particularly in new or proprietary software environments where no commercially available datasets exist.
Second, commonly used RL techniques for LLMs require the model to go through a massive number of trial and error attempts to learn effectively. This process is computationally expensive and inefficient. As a result, training capable LLM agents via RL remains laborious and expensive, limiting their deployment in custom enterprise settings.
How AgentEvolver works
The main idea behind AgentEvolver is to give models greater autonomy in their own learning process. The researchers describe it as a “self-evolving agent system” designed to “achieve autonomous and efficient evolution of capabilities through environmental interaction.” It uses the reasoning power of an LLM to create a self-training loop, allowing the agent to continually improve by directly interacting with its target environment without the need for predefined tasks or reward functions.
“We envision an agent system in which the LLM actively guides exploration, task generation, and performance refinement,” the researchers wrote in their paper.
The process of self-evolution is driven by three fundamental mechanisms that work together.
The first is self-questioning, where the agent explores its environment to discover the limits of its functions and identify useful states, much like a new user clicking around an app to see what’s possible. Based on this exploration, the agent generates its own diverse set of tasks that match a user’s general preferences. This reduces the need for hand-crafted datasets and allows the agent and its tasks to co-evolve, gradually enabling it to handle more complex challenges.
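The paper does not publish its prompts, but the idea can be sketched roughly as follows. In this illustrative Python snippet, `explore`, `propose_tasks`, the `env` interface and the `llm_complete` client are hypothetical stand-ins, not AgentEvolver’s actual API:

```python
# Illustrative sketch of a self-questioning loop (hypothetical names, not the paper's code).
import json
import random

def explore(env, steps=20):
    """Randomly probe the environment and record what each action returns."""
    observations = []
    for _ in range(steps):
        action = random.choice(env.available_actions())
        result = env.call(action)
        observations.append({"action": action, "result": str(result)[:200]})
    return observations

def propose_tasks(llm_complete, observations, user_preference, n_tasks=10):
    """Ask the LLM to turn raw exploration traces into candidate training tasks."""
    prompt = (
        "You explored a software environment and saw these action/result pairs:\n"
        f"{json.dumps(observations, indent=2)}\n\n"
        f"The user cares about: {user_preference}\n"
        f"Propose {n_tasks} concrete, verifiable tasks an agent could practice here. "
        "Return a JSON list of task descriptions."
    )
    return json.loads(llm_complete(prompt))
```

The generated tasks can then be fed back into training, which is what lets the agent and its curriculum co-evolve without hand-labeled data.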
According to Yunpeng Zhai, a researcher at Alibaba and co-author of the paper, who spoke with VentureBeat, the self-questioning mechanism effectively transforms the model from “data consumer to data producer,” significantly reducing the time and cost required to deploy an agent in a proprietary environment.
The second mechanism is self-navigating, which improves exploration efficiency by reusing and generalizing past experience. AgentEvolver distills information from successful and unsuccessful attempts and uses it to guide future actions. For example, if an agent tries to call an API function that does not exist in an application, it records this as an experience and learns to check whether a function exists before attempting to use it in the future.
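A minimal sketch of this kind of experience reuse might look like the following; the `Experience` and `ExperienceBank` names and the keyword-overlap retrieval are assumptions for illustration, not the framework’s actual design:

```python
# Illustrative sketch of experience reuse: distill past attempts into short
# "lessons" and surface the most relevant ones before a new, similar task.
from dataclasses import dataclass, field

@dataclass
class Experience:
    task: str
    outcome: str   # "success" or "failure"
    lesson: str    # e.g. "check that an API function exists before calling it"

@dataclass
class ExperienceBank:
    items: list = field(default_factory=list)

    def add(self, exp: Experience):
        self.items.append(exp)

    def retrieve(self, task: str, k: int = 3):
        # Naive keyword overlap; a real system would likely use embedding similarity.
        def overlap(exp):
            return len(set(exp.task.lower().split()) & set(task.lower().split()))
        return sorted(self.items, key=overlap, reverse=True)[:k]

def build_context(task: str, bank: ExperienceBank) -> str:
    """Prepend retrieved lessons to the agent's prompt for the new task."""
    lessons = "\n".join(f"- {e.lesson}" for e in bank.retrieve(task))
    return f"Task: {task}\nLessons from past attempts:\n{lessons}"
```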
The third mechanism, self-attributing, improves learning efficiency by providing more granular feedback. Instead of a single final success-or-failure signal (a common practice in RL that results in sparse rewards), this mechanism uses an LLM to evaluate the contribution of each individual action in a multi-step task. It retrospectively determines whether each step contributed positively or negatively to the final outcome, giving the agent fine-grained feedback that accelerates learning.
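Conceptually, this amounts to asking an LLM judge to score each step of a finished trajectory and blending those scores with the final outcome. The sketch below is an assumption-laden illustration (the prompt, scoring scale, and blending weights are invented), not AgentEvolver’s actual reward scheme:

```python
# Illustrative sketch of step-level credit assignment with an LLM judge.
import json

def attribute_steps(llm_complete, task, steps, final_success):
    """Return a per-step reward for a completed multi-step trajectory."""
    prompt = (
        f"Task: {task}\n"
        f"Final outcome: {'success' if final_success else 'failure'}\n"
        "For each step below, judge whether it helped (+1), was neutral (0), "
        "or hurt (-1) progress toward the goal. Return a JSON list of numbers, "
        "one per step.\n\n"
        + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    )
    step_scores = json.loads(llm_complete(prompt))
    # Blend the dense step-level judgments with the sparse final signal.
    final_bonus = 1.0 if final_success else -1.0
    return [0.5 * s + 0.5 * final_bonus for s in step_scores]
```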
This is crucial for regulated industries, where how an agent solves a problem is as important as the outcome. “Instead of rewarding a student only for the final answer, we also assess the clarity and correctness of each step of their reasoning,” Zhai explained. This improves transparency and encourages the agent to adopt more robust and verifiable problem-solving patterns.
“By shifting the training initiative from human engineering pipelines to LLM-guided self-improvement, AgentEvolver establishes a new paradigm that paves the way for scalable, cost-effective, and constantly improving intelligent systems,” the researchers say.
The team also developed a practical end-to-end training framework that integrates these three mechanisms. A key element of this framework is the context manager, a component that controls the agent’s memory and interaction history. While current benchmarks test only a limited number of tools, real-world enterprise environments can involve thousands of APIs.
Zhai acknowledges that this is a major challenge for the field, but notes that AgentEvolver was designed to scale. “Retrieving across extremely large action spaces will always present computational challenges, but AgentEvolver’s architecture paves the way for scalable tool reasoning in enterprise environments,” he said.
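One common way to keep the context tractable when thousands of APIs are available is to expose only the tools most relevant to the current goal at each step. The snippet below illustrates that general idea under invented assumptions (the `embed` helper and `select_tools` function are hypothetical), and is not Tongyi Lab’s implementation:

```python
# Illustrative sketch: retrieve a small, relevant subset of a large tool catalog.
import numpy as np

def select_tools(embed, goal, tool_descriptions, k=8):
    """Pick the k tools whose descriptions are most similar to the current goal.
    `embed` maps a string to a vector, e.g. any sentence-embedding model."""
    q = np.asarray(embed(goal))
    scored = []
    for name, desc in tool_descriptions.items():
        v = np.asarray(embed(desc))
        sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8))
        scored.append((sim, name))
    return [name for _, name in sorted(scored, reverse=True)[:k]]
```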
A more effective path to agent training
To measure the effectiveness of the framework, the researchers tested it on AppWorld and BFCL v3, two benchmarks that require agents to complete long, multi-step tasks using external tools. They used models from Alibaba’s Qwen2.5 family (7B and 14B parameters) and compared their performance against a baseline trained with GRPO, a popular RL technique used to develop reasoning models such as DeepSeek-R1.
The results showed that combining all three mechanisms in AgentEvolver delivered substantial performance gains. The average score of the 7B model improved by 29.4%, and that of the 14B model by 27.8%, compared to the baseline. The framework consistently improved the models’ reasoning and task-execution abilities on both benchmarks. The largest gains came from the self-questioning module, which autonomously generates diverse training tasks and directly addresses the data-scarcity problem.
The experiments also demonstrated that AgentEvolver can efficiently synthesize a large volume of high-quality training data. The tasks generated by the self-questioning module were found to be sufficiently diverse to achieve good training efficiency even with a small amount of data.
For businesses, this allows agents to be created for bespoke applications and internal workflows while minimizing the need for manual data annotation. By providing high-level goals and letting the agent generate its own training experiences, organizations can develop personalized AI assistants more easily and cost-effectively.
“This combination of algorithmic design and engineering pragmatics positions AgentEvolver as both a research vehicle and a reusable foundation for creating adaptive, tool-augmented agents,” the researchers conclude.
Looking ahead, the ultimate goal is much larger. “A truly singular model that can integrate into any software environment and master it overnight is certainly the holy grail of agentic AI,” Zhai said. “We see AgentEvolver as a necessary step in this direction.” Although this future still requires breakthroughs in model reasoning and infrastructure, self-evolving approaches are leading the way.




