
2025 was supposed to be the year of the "AI agent," according to Nvidia CEO Jensen Huang and other AI industry insiders. In many ways it has been: major AI model providers, including OpenAI, Google, and Chinese competitors such as Alibaba, have released fine-tuned models and applications designed to focus on narrow sets of tasks, such as web search and report writing.
But one major hurdle remains for high-performance, reliable AI agents: keeping the agent on task when the work spans many steps. Third-party benchmarks show that even the most powerful models fail more often the more steps a task requires and the longer it stretches on (more than a few hours).
A new academic framework called EAGLET proposes a practical, efficient way to improve long-horizon task performance in LLM-based agents without the need for manual data labeling or retraining.
Developed by researchers at Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET provides a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.
The planner is itself a fine-tuned language model: it interprets the task instruction (usually provided as a prompt by the user or the agent's operating environment) and generates a high-level plan for the executor agent. The planner does not intervene during execution, but this upfront, proactive guidance reduces planning errors and improves task completion rates.
Addressing planning issues in long-horizon agents
Many LLM-based agents struggle with long-horizon tasks because they rely on reactive, step-by-step reasoning. This approach often leads to trial-and-error behavior, planning hallucinations, and inefficient trajectories.
EAGLET introduces a global planning module that works alongside the execution agent.
Rather than mixing planning and action generation in one model, EAGLET separates them, allowing for more consistent task-level strategies.
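To make that division of labor concrete, here is a minimal sketch of how a separate global planner can sit in front of a reactive executor. The function names, the environment interface, and the loop structure are illustrative assumptions; EAGLET's actual interfaces have not been published.

```python
# Illustrative only: a global planner is called once up front, while a
# reactive executor keeps its usual step-by-step loop. call_planner,
# call_executor, and env are placeholders for whatever LLM client and
# environment you use.

def call_planner(task_instruction: str) -> str:
    """Ask the planner model for a high-level, task-level plan (one call, up front)."""
    raise NotImplementedError

def call_executor(task_instruction: str, plan: str, history: list[str]) -> str:
    """Ask the executor model for the next action, conditioned on the plan and trajectory."""
    raise NotImplementedError

def run_agent(task_instruction: str, env, max_steps: int = 30) -> list[str]:
    plan = call_planner(task_instruction)        # global plan generated once
    history: list[str] = []
    for _ in range(max_steps):
        action = call_executor(task_instruction, plan, history)
        observation, done = env.step(action)     # the planner never intervenes here
        history.append(f"{action} -> {observation}")
        if done:
            break
    return history
```

The key design point is that the planner is invoked once per task, while the executor retains its existing step-by-step behavior.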
Two-stage training pipeline without human annotations
EAGLET’s planner is trained using a two-step process that requires no human-generated plans or annotations.
In the first stage, synthetic plans are generated using a powerful LLM such as GPT-5 or DeepSeek-V3.1-Think.
These plans are filtered using a new strategy called homologous consensus filtering to retain only those plans that improve task performance for both expert and novice execution agents.
In the second stage, a rule-based reinforcement learning process further refines the planner, using a custom-designed reward function to evaluate how well each plan helps multiple execution agents succeed.
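The filtering step can be pictured as follows. This is a hypothetical sketch based on the paper's description, keeping a synthetic plan only when it improves outcomes for both an expert and a novice executor; the function names, scoring interface, and strict-improvement test are assumptions, not the authors' code.

```python
# Hypothetical sketch of homologous consensus filtering as described in the
# paper: keep a synthetic plan only if it helps BOTH an expert and a novice
# executor perform better than they do without any plan.

def run_executor(executor, task, plan=None) -> float:
    """Run one executor on a task, optionally guided by a global plan,
    and return a task score (e.g., success or reward). Placeholder."""
    raise NotImplementedError  # depends on your agent environment

def homologous_consensus_filter(tasks, candidate_plans, expert, novice):
    kept = []
    for task, plan in zip(tasks, candidate_plans):
        # Baseline: each executor attempts the task without a plan.
        expert_base = run_executor(expert, task)
        novice_base = run_executor(novice, task)
        # Same task, but with the candidate plan provided up front.
        expert_with = run_executor(expert, task, plan=plan)
        novice_with = run_executor(novice, task, plan=plan)
        # Consensus rule: retain the plan only if both executors improve.
        if expert_with > expert_base and novice_with > novice_base:
            kept.append((task, plan))
    return kept
```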
Introducing Executor Capability Gain Reward (ECGR)
One of EAGLET’s key innovations is the Executor Capability Gain Reward (ECGR).
This reward measures the value of the generated plan by determining whether it helps both high- and low-ability agents complete the task more successfully with fewer steps.
It also includes a damping factor that favors shorter, more efficient task trajectories. This design avoids over-rewarding plans that only help already-competent agents and promotes more generalizable planning guidance.
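A minimal sketch of the idea, not the paper's exact formula: reward a plan by how much it lifts each executor's success, discounted by trajectory length so shorter runs score higher. The gamma value, the per-executor averaging, and the success/step representation below are assumptions.

```python
# Hypothetical sketch of an ECGR-style reward: a plan earns credit for raising
# each executor's success, with a damping factor that favors fewer steps.

def ecgr(plan_results, baseline_results, gamma=0.9):
    """plan_results / baseline_results: lists of (success, num_steps) tuples,
    one entry per executor (e.g., expert and novice)."""
    reward = 0.0
    for (succ_plan, steps_plan), (succ_base, _) in zip(plan_results, baseline_results):
        gain = float(succ_plan) - float(succ_base)   # capability gain from the plan
        reward += gain * (gamma ** steps_plan)       # damp long trajectories
    return reward / len(plan_results)

# Example: the plan turns a novice failure (12 steps) into a success in 8 steps,
# while an expert succeeds in 6 steps with or without the plan.
print(ecgr([(1, 8), (1, 6)], [(0, 12), (1, 6)]))  # small positive reward
```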
Compatibility with existing agents and models
The EAGLET planner is designed to be modular and "plug and play," meaning it can be inserted into existing agent pipelines without retraining the executor.
In the paper's evaluations, the planner improved the performance of various base models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.
It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches such as Reflexion.
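In practice, "plug and play" can be as simple as prepending the global plan to the executor's existing prompt. The template below is an illustrative assumption of how that might look with a ReAct-style prompt; it is not taken from the paper.

```python
# Illustrative sketch: injecting a global plan into a standard ReAct-style
# prompt, so the executor model itself needs no retraining.

REACT_TEMPLATE = """You are an agent operating in {environment}.
Task: {task}

Global plan (high-level guidance, not literal steps):
{plan}

Respond using this format:
Thought: reason about the next step
Action: the action to take
Observation: the result of the action
"""

def build_react_prompt(environment: str, task: str, plan: str) -> str:
    # The only change to an existing ReAct pipeline is the extra {plan} field.
    return REACT_TEMPLATE.format(environment=environment, task=task, plan=plan)
```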
State-of-the-art performance across benchmarks
EAGLET was tested on three widely used benchmarks for long-horizon agent tasks. ScienceWorld simulates scientific experiments in a text-based lab environment. ALFWorld challenges agents to complete household activities through natural language in a simulated home. WebShop measures goal-driven behavior in a realistic online shopping interface.
In all three, execution agents with EAGLET outperform non-planning agents and other planning baselines such as MPO and KnowAgent.
In experiments using the open source Llama-3.1-8B-Instruct model, EAGLET improves average performance from 39.5 to 59.4, an improvement of +19.9 points across tasks.
For ScienceWorld’s unseen scenario, performance improved from 42.2 to 61.6.
On ALFWorld's seen scenarios, EAGLET improved results from 22.9 to 54.3, a gain of more than 2.3x.
Higher-performance models saw even greater gains.
For example, EAGLET raised GPT-4.1's average score from 75.5 to 82.2, and GPT-5's from 84.5 to 88.1, despite both models already performing well.
In some benchmarks, performance improved by as much as +11.8 points, such as when combining EAGLET with the ETO executor method on ALFWorld's unseen task.
Compared to other planning baselines such as MPO, EAGLET consistently achieved higher task completion rates. For example, on the ALFWorld unseen task with GPT-4.1, MPO achieved 79.1, while EAGLET scored 83.6, giving it a +4.5 point advantage.
Furthermore, the paper reports that agents using EAGLET complete tasks in fewer steps on average. Using GPT-4.1 as the executor, the average step count decreased from 13.0 (no planner) to 11.1 (EAGLET). GPT-5 dropped from 11.4 to 9.4, supporting the claim of improved execution efficiency.
Increased efficiency in training and execution
Compared to RL-based methods such as GiGPO that require hundreds of training iterations, EAGLET achieved comparable or better results with roughly one-eighth of the training effort.
This efficiency carries over to execution. Agents that use EAGLET typically require fewer steps to complete their tasks. This leads to reduced inference time and computational cost in operational scenarios.
No official code release yet
As of the version submitted to arXiv, the authors have not released an open source implementation of EAGLET. It is unclear if and when the code will be released, under what license, and how it will be maintained, which may limit the short-term usefulness of the framework in enterprise deployments.
VentureBeat has reached out to the authors for clarification on these points and will update this article when we hear back.
Questions about enterprise deployment remain
Although the planner is described as plug-and-play, it remains unclear whether EAGLET can be easily integrated into popular enterprise agent frameworks such as LangChain or AutoGen, or whether a custom stack is required to support separation of planning and execution.
Similarly, the training setup relies on multiple execution agents, which may be difficult to reproduce in enterprise environments with limited model access. VentureBeat asked the researchers whether the homologous consensus filtering technique can work for teams that only have access to a single executor model or limited computing resources.
Although the authors report success across a variety of model types and sizes, the minimum model scale required for practical deployment is not yet known. For example, can enterprise teams effectively use the planner with an open model of fewer than 10B parameters in latency-sensitive environments? Additionally, while the framework may provide industry-specific value in areas such as customer support and IT automation, it remains to be seen how easily the planner can be fine-tuned or customized for those industries.
Real-time planning vs. pre-generated plans
Another open question is how best to deploy EAGLET in practice. Should the planner run in real time alongside the executor in a loop, or is it better used offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat has posed this question to the authors and will report any revealing insights.
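One way to picture that trade-off, purely as an assumption about how a deployment might be structured (the cache keying, mode switch, and planner call below are hypothetical):

```python
# Hypothetical sketch of the two deployment modes: pre-generating and caching
# plans per task type (offline) versus calling the planner per task (real time).

def call_planner(task_instruction: str) -> str:
    """Stand-in for a call to the planner model."""
    raise NotImplementedError

plan_cache: dict[str, str] = {}

def get_plan(task_type: str, task_instruction: str, mode: str = "offline") -> str:
    if mode == "offline":
        # One planner call per task type: predictable latency and cost,
        # but the plan is less tailored to the specific instruction.
        if task_type not in plan_cache:
            plan_cache[task_type] = call_planner(task_instruction)
        return plan_cache[task_type]
    # Real-time mode: one planner call per task; higher latency and cost,
    # but the plan reflects the exact instruction.
    return call_planner(task_instruction)
```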
Strategic tradeoffs for enterprise teams
For technology leaders in medium to large enterprises, EAGLET provides a compelling proof of concept for improving the reliability and efficiency of LLM agents. But without public tools or implementation guidelines, the framework still presents a build-or-wait decision. Companies must weigh the potential for improved task performance and efficiency against the cost of replicating or approximating the training process in-house.
Potential use cases in enterprise environments
For companies building agentic AI systems, especially in environments that require multi-step planning, such as IT automation, customer support, or online interactions, EAGLET offers a template for adding planning without retraining. Its ability to guide both open and closed source models, along with its efficient training method, could make it an attractive starting point for teams looking to improve agent performance with minimal overhead.
