Sakana AI’s TreeQuest: Deploy multi-model teams that outperform individual LLMs by 30%




Japanese AI lab Sakana AI has introduced a new method that allows multiple large language models (LLMs) to cooperate on a single task, effectively creating a “dream team” of AI agents. The method, called Multi-LLM AB-MCTS, enables models to perform trial and error and combine their distinct strengths to solve problems that are too complex for any individual model.

For enterprises, this approach provides a means to develop more robust and capable AI systems. Instead of being locked into a single provider or model, companies can dynamically leverage the best aspects of different frontier models, assigning the right AI to the right part of a task to achieve superior results.

The power of collective intelligence

Frontier AI models are evolving rapidly. However, each model has distinct strengths and weaknesses derived from its unique training data and architecture. One model might excel at coding, while another excels at creative writing. Sakana AI’s researchers argue that these differences are a feature, not a bug.

“We see these biases and varied aptitudes not as limitations, but as valuable resources for creating collective intelligence,” the researchers stated in their blog post. They believe that just as humanity’s greatest achievements come from diverse teams, AI systems can also achieve more by working together. “By pooling their intelligence, AI systems can solve problems that cannot be overcome by any single model.”

Thinking longer at inference time

Sakana AI’s new algorithm is an “inference-time scaling” technique (also known as “test-time scaling”), an area of research that has become very popular in the past year. While most of the focus in AI has been on “training-time scaling” (making models bigger and training them on larger datasets), inference-time scaling improves performance by allocating more computational resources after a model has already been trained.

One common approach is to use reinforcement learning to encourage models to generate longer, more detailed chain-of-thought (CoT) sequences, as seen in popular models such as OpenAI o3 and DeepSeek-R1. Another, simpler method is repeated sampling, where the model is given the same prompt multiple times to generate a variety of potential solutions, similar to a brainstorming session. Sakana AI’s work combines and advances these ideas.
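The repeated-sampling (Best-of-N) idea can be sketched in a few lines of Python. Here `mock_llm` and `score` are hypothetical stand-ins for a real sampled model call and a task-specific verifier (such as running unit tests against generated code):

```python
def mock_llm(prompt: str, attempt: int) -> str:
    """Hypothetical stand-in for a sampled LLM call (temperature > 0).
    Cycles through canned answers so this sketch stays deterministic."""
    candidates = ["answer-a", "answer-b", "answer-c"]
    return candidates[attempt % len(candidates)]

def score(answer: str) -> float:
    """Task-specific verifier, e.g. unit tests for generated code."""
    return {"answer-a": 0.2, "answer-b": 0.9, "answer-c": 0.5}[answer]

def best_of_n(prompt: str, n: int) -> tuple[str, float]:
    """Sample the same prompt n times, keep the highest-scoring answer."""
    samples = [mock_llm(prompt, i) for i in range(n)]
    return max(((s, score(s)) for s in samples), key=lambda pair: pair[1])

best, best_score = best_of_n("solve the puzzle", n=5)
print(best, best_score)  # answer-b 0.9
```

The key limitation, which AB-MCTS addresses, is that every sample here is independent: nothing learned from one attempt informs the next.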

“Our framework offers a smarter, more strategic version of Best-of-N (aka repeated sampling),” Takuya Akiba, research scientist at Sakana AI and co-author of the paper, told VentureBeat. “It complements reasoning techniques like long CoT through RL. By dynamically selecting the search strategy and the appropriate LLM, this approach maximizes performance within a limited number of LLM calls, delivering better results on complex tasks.”

How adaptive branching search works

At the core of the new method is an algorithm called Adaptive Branching Monte Carlo Tree Search (AB-MCTS). It enables an LLM to effectively perform trial and error by intelligently balancing two different search strategies: “searching deeper” and “searching wider.” Searching deeper involves taking a promising answer and repeatedly refining it, while searching wider means generating completely new solutions from scratch. AB-MCTS combines these approaches, allowing the system to improve a good idea but also to pivot and try something new if it hits a dead end or discovers another promising direction.

To accomplish this, the system uses Monte Carlo Tree Search (MCTS), a decision-making algorithm famously used by DeepMind’s AlphaGo. At each step, AB-MCTS uses a probabilistic model to decide whether it is more promising to refine an existing solution or to generate a new one.
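As a rough illustration of the deeper-vs-wider trade-off (a toy sketch, not Sakana AI’s actual algorithm), each iteration of a search loop can either perturb the best solution found so far or draw a fresh one. Here “solutions” are just numbers scored by distance to a hidden target:

```python
import random

TARGET = 0.8

def score(x: float) -> float:
    """Toy scorer: solutions closer to TARGET are better."""
    return 1.0 - abs(x - TARGET)

def ab_search_sketch(steps: int, seed: int = 1) -> float:
    """Each iteration either refines the current best ("deeper") or draws
    a brand-new solution ("wider"). AB-MCTS makes this choice with a
    learned probability model rather than the fixed coin flip used here."""
    rng = random.Random(seed)
    solutions = [rng.random()]          # initial wide sample
    for _ in range(steps):
        best = max(solutions, key=score)
        if rng.random() < 0.5:          # deeper: small perturbation of best
            candidate = best + rng.uniform(-0.1, 0.1)
        else:                           # wider: a completely fresh solution
            candidate = rng.random()
        solutions.append(candidate)
    return max(solutions, key=score)
```

Replacing the fixed 50/50 coin flip with a model that learns, from observed scores, when refinement is paying off is what makes the real algorithm “adaptive.”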

Different test-time scaling strategies (Source: Sakana AI)

The researchers took this a step further with Multi-LLM AB-MCTS, which not only decides “what” to do (refine vs. generate) but also “which” LLM should do it. At the start of a task, the system does not know which model is best suited to the problem. It begins by trying a balanced mix of the available LLMs and, as it progresses, learns which models are more effective, allocating more of the workload to them over time.
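This learn-as-you-go allocation across models resembles a multi-armed bandit. A minimal Thompson-sampling sketch (with hypothetical `MockModel` stand-ins for real LLM endpoints, not Sakana AI’s implementation) looks like this:

```python
import random

class MockModel:
    """Hypothetical stand-in for an LLM with an unknown success rate."""
    def __init__(self, name: str, success_rate: float):
        self.name, self.success_rate = name, success_rate

    def attempt(self, rng: random.Random) -> bool:
        return rng.random() < self.success_rate

def allocate_calls(models, budget: int = 60, seed: int = 7) -> dict:
    """Thompson-sampling-style allocation: models that succeed more often
    receive a growing share of the remaining LLM-call budget."""
    rng = random.Random(seed)
    wins = {m.name: 1 for m in models}      # Beta(1, 1) priors
    losses = {m.name: 1 for m in models}
    counts = {m.name: 0 for m in models}
    for _ in range(budget):
        # Sample a plausible success rate per model; call the highest.
        sampled = {m.name: rng.betavariate(wins[m.name], losses[m.name])
                   for m in models}
        chosen = max(models, key=lambda m: sampled[m.name])
        counts[chosen.name] += 1
        if chosen.attempt(rng):
            wins[chosen.name] += 1
        else:
            losses[chosen.name] += 1
    return counts

models = [MockModel("strong", 0.9),
          MockModel("weak-a", 0.2),
          MockModel("weak-b", 0.2)]
counts = allocate_calls(models, budget=60)
print(counts)
```

Early on, every model gets a fair share of calls; as evidence accumulates, the sampled rates for the stronger model dominate and it receives most of the budget.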

Putting the AI ‘dream team’ to the test

The researchers tested their Multi-LLM AB-MCTS system on the ARC-AGI-2 benchmark. ARC (Abstraction and Reasoning Corpus) is designed to test a human-like ability to solve novel visual reasoning problems, which makes it notoriously difficult for AI.

The team used a combination of frontier models, including o4-mini, Gemini 2.5 Pro, and DeepSeek-R1.

The collective of models was able to find correct solutions for over 30% of the 120 test problems, a score that significantly outperformed any of the models working alone. The system also demonstrated the ability to dynamically assign the best model to a given problem: on tasks where a clear path to a solution existed, the algorithm quickly identified the most effective LLM and used it more frequently.

AB-MCTS vs. individual models (Source: Sakana AI)

More impressively, the team observed instances where the models solved problems that were previously impossible for any single one of them. In one case, a solution generated by the o4-mini model was incorrect. However, the system passed this flawed attempt to DeepSeek-R1 and Gemini 2.5 Pro, which were able to analyze the error, correct it, and ultimately produce the right answer.

“This demonstrates that Multi-LLM AB-MCTS can flexibly combine frontier models to solve previously unsolvable problems, pushing the limits of what is achievable by using LLMs as a collective intelligence,” the researchers write.

AB-MCTS can select different models at different stages of solving a problem (Source: Sakana AI)

“In addition to the individual pros and cons of each model, the tendency to hallucinate can differ significantly among them,” Akiba said. “By creating an ensemble with a model that is less likely to hallucinate, it could be possible to achieve the best of both worlds: powerful logical capabilities and strong grounding. Since hallucination is a major issue in a business context, this approach could help mitigate it.”

From research to real-world applications

To help developers and businesses apply this technique, Sakana AI has released the underlying algorithm as an open-source framework called TreeQuest, available under an Apache 2.0 license (usable for commercial purposes). TreeQuest provides a flexible API, allowing users to implement Multi-LLM AB-MCTS for their own tasks with custom scoring and logic.
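TreeQuest’s actual API may differ from the following; this is a hypothetical, self-contained sketch of the generate-and-score callback pattern that such a framework implies, with every name invented for illustration. The user supplies per-model functions that take a parent state and return a new state plus a custom score; the framework decides which function to call and which node to expand:

```python
import random

def make_generate_fn(step_size: float, rng: random.Random):
    """Hypothetical stand-in for one LLM: proposes a state refined from
    its parent. A real framework would call functions of this shape and
    use the returned score to steer the search."""
    def generate(parent_state):
        base = 0.0 if parent_state is None else parent_state
        new_state = base + rng.uniform(0.0, step_size)
        new_score = min(new_state / 10.0, 1.0)  # custom, task-specific score
        return new_state, new_score
    return generate

rng = random.Random(0)
generate_fns = {
    "model-a": make_generate_fn(1.0, rng),
    "model-b": make_generate_fn(2.0, rng),
}

# Naive stand-in for the search loop: alternate models, keep the best node.
best_state, best_score = None, -1.0
frontier = None
for step in range(20):
    name = "model-a" if step % 2 == 0 else "model-b"
    state, s = generate_fns[name](frontier)
    if s > best_score:
        best_state, best_score = state, s
        frontier = state
print(best_score)
```

The design point this illustrates is the separation of concerns: the user owns state representation and scoring, while the framework owns the search policy over models and nodes.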

“While we are in the early stages of applying AB-MCTS to specific business-oriented problems, our research reveals significant potential in several areas,” Akiba said.

Beyond the ARC-AGI-2 benchmark, the team was able to successfully apply AB-MCTS to tasks like complex algorithmic coding and improving the accuracy of machine learning models.

“AB-MCTS could also be highly effective for problems that require iterative trial and error, such as optimizing performance metrics of existing software,” Akiba said. “For example, it could be used to automatically find ways to improve the response latency of a web service.”

The release of practical and open source tools could pave the way for a new class of more powerful and reliable enterprise AI applications.


