
Researchers at the University of Illinois Urbana-Champaign and Google Cloud AI Research have developed a framework that allows large language model (LLM) agents to organize their experiences into a memory bank and get better at complex tasks over time.
The framework, called ReasoningBank, extracts "generalizable reasoning strategies" from the agent's successful and failed problem-solving attempts. The agent draws on this memory during reasoning to avoid repeating past mistakes and to make better decisions when facing new problems. The researchers found that when combined with test-time scaling techniques, ReasoningBank significantly improves the performance and efficiency of LLM agents that make multiple attempts at a problem.
Their findings show that ReasoningBank consistently outperforms traditional memory mechanisms across web browsing and software engineering benchmarks, providing a practical path to building more adaptive and reliable AI agents for enterprise applications.
LLM agent memory issues
LLM agents are deployed in long-running applications where they encounter a continuous stream of tasks. One of the main limitations of current LLM agents is their inability to learn from this accumulated experience. Approaching each task in isolation, an agent inevitably repeats past mistakes, discards valuable insights from related problems, and fails to develop skills that would improve its abilities over time.
The solution to this limitation is to give the agent some form of memory. Previous efforts have focused on storing past interactions for reuse, organizing the information in formats ranging from plain text to structured graphs. However, these approaches often fall short: many store raw interaction logs or save only examples of successful tasks. This means they cannot extract higher-level, transferable reasoning patterns and, importantly, cannot extract and use valuable information from the agent's failures. As the researchers point out in their paper, "existing memory designs are often limited to passive record-keeping rather than providing practical and generalizable guidance for future decision-making."
How ReasoningBank works
ReasoningBank is a memory framework designed to overcome these limitations. Its central idea is to distill useful strategies and reasoning cues from past experiences into structured memory items that can be stored and reused.
According to Google researcher and paper co-author Jun Yan, this marks a fundamental change in how agents operate. "Traditional agents operate statically, handling each task in isolation," Yan explained. "ReasoningBank changes this by turning every task experience, success or failure, into structured, reusable reasoning memory. As a result, agents don't start from scratch with each customer; they recall and adapt proven strategies from similar past cases."
The framework processes both successful and failed experiences and turns them into a collection of useful strategies and prevention lessons. An LLM-as-a-judge scheme determines success or failure, avoiding the need for human labeling.
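The memory items themselves can be pictured as small structured records. The field names and the distillation stub below are illustrative assumptions, not the paper's exact schema; in ReasoningBank the distillation is done by prompting an LLM:

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    title: str        # short name of the strategy, e.g. "refine search queries"
    description: str  # one-line summary of when the strategy applies
    content: str      # the distilled strategy or prevention lesson

def distill(trajectory_summary: str, succeeded: bool) -> MemoryItem:
    """Turn a judged trajectory into a reusable memory item.

    Toy stand-in for the LLM-based distillation step; it only
    illustrates the success-vs-failure framing.
    """
    if succeeded:
        return MemoryItem(
            title="winning strategy",
            description="a tactic that led to task success",
            content=f"Strategy that worked: {trajectory_summary}",
        )
    return MemoryItem(
        title="prevention lesson",
        description="a pitfall to avoid in similar tasks",
        content=f"Avoid repeating: {trajectory_summary}",
    )
```

The key design point is that failures are first-class citizens: they produce prevention lessons rather than being discarded.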
Yan offers a practical example of this process. An agent tasked with finding Sony headphones is likely to fail when a broad search query returns more than 4,000 unrelated products. "ReasoningBank first tries to figure out why this approach failed," Yan said. "It then extracts strategies such as 'optimizing search queries' and 'limiting products with category filters.' These strategies will be very useful for successfully completing similar tasks in the future."
This process works in a closed loop. When the agent faces a new task, it uses embedding-based search to retrieve relevant memories from ReasoningBank to guide its actions. These memories are inserted into the agent's system prompt, providing context for its decisions. Once the task is completed, the framework creates new memory items by extracting insights from both successes and failures. This new knowledge is distilled and merged back into ReasoningBank, allowing the agent to continually evolve and improve its capabilities.
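The retrieve-and-inject half of that loop can be sketched in a few lines. Here a bag-of-words cosine similarity stands in for a real embedding model, and the prompt format is an assumption:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(bank: list[str], task: str, k: int = 2) -> list[str]:
    """Return the k stored memories most similar to the new task."""
    q = embed(task)
    return sorted(bank, key=lambda m: cosine(embed(m), q), reverse=True)[:k]

def build_prompt(task: str, memories: list[str]) -> str:
    # Retrieved strategies are injected into the system prompt as guidance.
    guidance = "\n".join(f"- {m}" for m in memories)
    return f"Relevant past strategies:\n{guidance}\n\nTask: {task}"

bank = [
    "narrow product searches with category filters",
    "verify form fields before submitting",
    "paginate search results instead of scrolling",
]
prompt = build_prompt("find Sony headphones in an online store",
                      retrieve(bank, "search for headphones product"))
```

A production system would use a learned embedding model and a vector store, but the shape of the loop is the same: embed the task, rank memories, prepend the winners to the prompt.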
Supercharging memory with scaling
The researchers discovered a powerful synergy between memory and test-time scaling. Classic test-time scaling generates multiple independent answers to the same question, but the researchers argue that this format is "suboptimal because it does not take advantage of the unique contrasting signals that result from redundant exploration of the same question."
To address this, they propose Memory-aware Test-Time Scaling (MaTTS), which integrates scaling with ReasoningBank. MaTTS comes in two forms. With parallel scaling, the system generates multiple trajectories for the same query, then compares and contrasts them to identify consistent reasoning patterns. With sequential scaling, the agent iteratively refines its reasoning within a single attempt, with the intermediate notes and corrections also serving as valuable memory signals.
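Parallel scaling can be sketched as a self-contrast over several sampled rollouts. The majority-vote contrast below is a deliberate simplification of the LLM-based comparison, and `fake_rollout` is a hypothetical stand-in for a full agent trajectory:

```python
from collections import Counter

def parallel_matts(sample_trajectory, query: str, k: int = 5):
    """Memory-aware parallel scaling: roll out k trajectories for one
    query, then contrast them to pick the most consistent answer.

    The contrast step here is a simple majority vote rather than an
    LLM comparison of the trajectories.
    """
    rollouts = [sample_trajectory(query, seed=i) for i in range(k)]
    answers = [r["answer"] for r in rollouts]
    best, _ = Counter(answers).most_common(1)[0]
    # All rollouts, not just the winner, feed memory extraction.
    return best, rollouts

def fake_rollout(query: str, seed: int) -> dict:
    # Toy sampler: most seeds converge on the same strategy.
    return {"answer": "filter by category" if seed % 5 else "broad search"}

best, pool = parallel_matts(fake_rollout, "find Sony headphones", k=5)
```

Note that the disagreeing rollouts are kept: the contrast between the majority and the outliers is exactly the signal the paper says standard test-time scaling throws away.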
This creates a virtuous cycle. Existing memories in the ReasoningBank guide the agent to more promising solutions, while the diverse experiences generated through scaling allow the agent to create and store higher quality memories in the ReasoningBank.
“This positive feedback loop positions memory-driven experience scaling as a new scaling dimension for agents,” the researchers wrote.
ReasoningBank in action
The researchers tested the framework on the WebArena (web browsing) and SWE-Bench Verified (software engineering) benchmarks, using models such as Google's Gemini 2.5 Pro and Anthropic's Claude 3.7 Sonnet. They compared ReasoningBank to baselines including memory-free agents and agents using trajectory-based or workflow-based memory frameworks.
The results show that ReasoningBank consistently outperforms these baselines across all datasets and LLM backbones. On WebArena, it improved the overall success rate by up to 8.3% compared to a memory-free agent, generalized better to harder cross-domain tasks, and reduced the number of interaction steps needed to complete tasks. When combined with MaTTS, performance improved further under both parallel and sequential scaling, consistently outperforming standard test-time scaling.
This efficiency gain has a direct impact on operating costs. Yan points to a case where a memory-free agent took eight attempts just to find the right product filter on a website. "These trial-and-error costs can be avoided by leveraging relevant insights from ReasoningBank," he noted. "In this case, operating costs are cut almost in half." It also improves the user experience, since issues are resolved faster.
For enterprises, ReasoningBank helps develop cost-effective agents that can learn from experience and adapt over time in areas such as complex workflows, software development, customer support, and data analysis. The paper concludes: “Our findings suggest a practical path toward building adaptive lifelong learning agents.”
Yan noted that their findings point toward truly compositional intelligence. Coding agents, for example, can learn individual skills such as API integration or database management from individual tasks. "Over time, these modular skills become building blocks that agents can flexibly recombine to solve more complex tasks," he said, suggesting a future where agents autonomously accumulate knowledge and manage entire workflows with minimal human oversight.
