Large language models (LLMs) excel at using textual reasoning to understand the context of a document and provide a logical answer about its contents. But these same LLMs often struggle to correctly answer even the simplest math problems.
Textual reasoning is usually a less-than-ideal way to deliberate over computational or algorithmic tasks. While some LLMs can generate code like Python to handle symbolic queries, the models don’t always know when to use code, or which kind of code would work best.
LLMs, it seems, may need a coach to steer them toward the best technique.
Enter CodeSteer, a smart assistant developed by MIT researchers that guides an LLM to switch between code and text generation until it correctly answers a query.
CodeSteer, itself a smaller LLM, automatically generates a series of prompts to iteratively steer the larger LLM. It reviews the model’s current and previous answers after each round and offers guidance on how to fix or refine the solution until it deems the answer correct.
The researchers found that augmenting a larger LLM with CodeSteer boosted its accuracy on symbolic tasks, such as multiplying numbers, playing Sudoku, and stacking blocks, by more than 30 percent. It also enabled less sophisticated models to outperform more advanced models with enhanced reasoning skills.
This advance could improve the problem-solving capabilities of LLMs for complex tasks that are especially difficult to solve with textual reasoning alone, such as generating paths for robots in uncertain environments or scheduling shipments in an international supply chain.
“There is a race to develop better and better models that are capable of doing everything, but we’ve taken a complementary approach. Researchers have spent years developing effective technologies and tools to tackle problems in many domains. We want to enable LLMs to select the right tools and methods, and make use of others’ expertise to enhance their own capabilities,” says Chuchu Fan, principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS).
Fan, the senior author of the study, is joined on a paper about the work by Yongchao Chen, a graduate student in LIDS; Yilun Hao, a graduate student in AeroAstro; Yueying Liu, a graduate student at the University of Illinois at Urbana-Champaign; and MIT-IBM Watson AI Lab research scientist Yang Zhang. The research will be presented at the International Conference on Machine Learning.
An LLM “trainer”
Ask an LLM which number is bigger, 9.11 or 9.9, and it will often give the wrong answer by using textual reasoning. But ask it to answer the same question using code, and it can generate and execute a Python script to compare the two numbers, easily solving the problem.
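The kind of script an LLM might generate for this comparison can be as short as a few lines. This snippet is illustrative only, not taken from the paper:

```python
# Comparing two decimals with code avoids the textual-reasoning trap
# of treating "9.11" as larger because "11" is bigger than "9".
a, b = 9.11, 9.9
print(f"Is {a} bigger than {b}? {a > b}")  # prints: Is 9.11 bigger than 9.9? False
```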
Trained to understand and predict human language, LLMs are more likely to answer queries using text, even when code would be more effective. And while some LLMs have learned to generate code through fine-tuning, they often generate an incorrect or less efficient version of the code.
Rather than trying to retrain a powerful LLM like GPT-4 or Claude to improve these capabilities, the MIT researchers fine-tune a smaller, lightweight LLM to guide a larger model between text and code. Fine-tuning the smaller model doesn’t change the larger LLM, so there is no risk of undermining the larger model’s other abilities.
“We were also inspired by humans. In sports, a trainer may not be better than the star athlete on the team, but the trainer can still give helpful suggestions to guide the athlete. This steering method works for LLMs, too,” Chen says.
This trainer, CodeSteer, works in conjunction with the larger LLM. It first reviews a query and determines whether text or code is suited to the problem, and which sort of code would be best.
It then generates a prompt for the larger LLM, telling it to use a coding method or textual reasoning to answer the query. The larger model follows this prompt to answer the query and sends the result back to CodeSteer, which reviews it.
If the answer is incorrect, CodeSteer will keep prompting the LLM to try different approaches that might fix the problem, such as incorporating a search algorithm or constraint into its Python code, until the answer is correct. A sketch of this loop appears below.
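The Python sketch below only illustrates the overall steering loop described above; the `steer`, `solve`, and `check` methods are hypothetical stand-ins, not CodeSteer’s real interface or prompts:

```python
# A simplified sketch of the CodeSteer loop, under assumed interfaces:
# a small steering model and a large solver model, each exposing
# hypothetical methods (steer, solve, check) invented for illustration.
def codesteer_loop(query, small_llm, large_llm, max_rounds=10):
    history = []
    for _ in range(max_rounds):
        # The small model inspects the query and past attempts, then
        # decides whether to prompt for code or for textual reasoning
        # (e.g., "write Python that uses a search algorithm").
        guidance = small_llm.steer(query, history)
        # The larger model follows the generated prompt.
        answer = large_llm.solve(query, guidance)
        history.append((guidance, answer))
        # The small model reviews the result; stop once it deems it correct.
        if small_llm.check(query, answer):
            return answer
    return history[-1][1]  # best effort after the round limit
```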
“We found that oftentimes, a larger LLM will try to be lazy and use shorter, less efficient code that will not carry the correct symbolic calculation. We designed CodeSteer to avoid this phenomenon,” Chen says.
A symbolic checker evaluates the code’s complexity and sends a signal to CodeSteer if the code is too simple or inefficient. The researchers also incorporated a self-answer checker into CodeSteer, which prompts the LLM to generate code that calculates the answer, to verify that it is correct.
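As a rough illustration of what such a complexity signal could look like, here is a minimal, hypothetical checker that flags code with very few statements and no control flow. The heuristic and its threshold are assumptions made for illustration, not the paper’s actual checker:

```python
import ast

def looks_too_simple(code: str, min_statements: int = 3) -> bool:
    """Hypothetical heuristic: flag code that is likely too trivial to
    carry out a real symbolic computation."""
    tree = ast.parse(code)
    # Loops, branches, and function definitions suggest real computation.
    has_control_flow = any(
        isinstance(node, (ast.For, ast.While, ast.If, ast.FunctionDef))
        for node in ast.walk(tree)
    )
    return len(tree.body) < min_statements and not has_control_flow

print(looks_too_simple("print(9.11 > 9.9)"))  # True: one trivial statement
```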
Tackling complex tasks
As they designed CodeSteer, the researchers couldn’t find suitable symbolic datasets to fine-tune and test the model, since many existing benchmarks don’t indicate whether a given query is best solved with text or code.
So they gathered a corpus of 37 complex symbolic tasks, including spatial reasoning, mathematics, order reasoning, and optimization, and built their own dataset, called SymBench. They implemented a fine-tuning approach that leverages SymBench to maximize the performance of CodeSteer.
In their experiments, CodeSteer outperformed all nine baseline methods they evaluated, boosting average accuracy from 53.3 percent to 86.4 percent. It maintained similar performance even on unseen tasks, and across a variety of LLMs.
In addition, a general-purpose model augmented with CodeSteer can achieve higher accuracy than state-of-the-art models designed to focus on complex reasoning and planning, while requiring much less computation.
“Our method uses an LLM’s own capabilities. By augmenting an LLM with the ability to smartly use coding, we can take a model that is already very strong and improve its performance even more,” Chen says.
In the future, the researchers want to streamline CodeSteer to speed up its iterative prompting process. They are also studying how to effectively fine-tune a unified model with the ability to switch between textual reasoning and code generation, rather than relying on a separate assistant.
“The authors present an elegant solution to the critical challenge of tool utilization in LLMs. This simple yet impactful method enables state-of-the-art LLMs to achieve significant performance improvements without requiring direct fine-tuning,” says Jinsung Yoon, a staff research scientist at Google Cloud AI who was not involved with this work. “This research represents a substantial contribution that promises to significantly enhance the application of LLMs to the diverse range of tasks with which they currently struggle.”
“Their success in training a smaller, specialized model to strategically guide larger, advanced models is particularly impactful,” says Chi Wang, a senior staff scientist at Google DeepMind who was not involved in the work. “This intelligent collaboration among diverse AI ‘agents’ paves the way for more robust and versatile applications in complex real-world scenarios.”
This research is supported, in part, by the U.S. Office of Naval Research and the MIT-IBM Watson AI Lab.