
As language models (LMs) get better at tasks like image generation, trivia questions, and simple math, you might think human-like reasoning is just around the corner. In reality, they still lag far behind us on complex tasks. Try playing Sudoku with one, for example: fill in the numbers 1 through 9 so that each appears only once in every row, column, and section of the 9-by-9 grid. Your AI opponent can check whether you’ve filled it out correctly, but it usually can’t fill in the boxes on its own, or it does so inefficiently.
Whether an LM is trying to solve a sophisticated puzzle, design a molecule, or write a mathematical proof, it struggles with open-ended requests that come with strict rules to follow. These models are often better at telling users how to approach such challenges than at tackling them directly. Practical problem-solving also requires an LM to weigh a wide range of options while adhering to constraints. Smaller LMs cannot reliably do this on their own; large language models (LLMs) can, especially when optimized for reasoning tasks, but they are slow to respond and use large amounts of computational power.
In response to this predicament, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) developed a collaborative approach in which an LLM drafts the plan and divides the work of carrying out that strategy among smaller LMs. Their technique helps small LMs produce more accurate responses than leading LLMs such as OpenAI’s GPT-4o, approaching the accuracy of top reasoning systems such as o1 while being more efficient than both. In the framework, called “Distributed Constraints through Inferential Programming with Language Models” (or “DisCIPL”), a large model guides smaller “follower” models toward accurate responses when writing things like text blurbs, grocery lists with budgets, and travel itineraries.
The inner workings of DisCIPL resemble contracting a company for a specific job. When a user submits a request, the “boss” model carefully considers how to proceed with the project and then communicates those instructions and guidelines clearly to the smaller models. It can also modify the followers’ output as needed, for example by replacing an expression from one model that doesn’t fit a poem with a better option from another model.
The LLM communicates with its followers in a language they all understand: a programming language for steering LMs called “LLaMPPL.” Developed by MIT’s Probabilistic Computing Project in 2023, it allows users to encode specific rules that guide a model toward a desired outcome. For example, LLaMPPL can be used to generate error-free code by incorporating a programming language’s rules within the instructions. A prompt such as “Write an 8-line poem with exactly 8 words on each line” is encoded as a LLaMPPL program that queues up the small models, each contributing different parts of the answer.
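To make that idea concrete, here is a minimal, hypothetical Python sketch of what such a planner-written generation program might look like. It is not the actual LLaMPPL API; `follower_next_word` and `WORD_BANK` are stand-ins for a small follower model’s next-word proposals.

```python
import random

# Stand-in vocabulary for the follower model's proposals (illustration only).
WORD_BANK = ["river", "stone", "light", "morning", "quiet", "wind", "gold", "sleep"]

def follower_next_word(context: str) -> str:
    """Stand-in for a small follower LM proposing the next word given the text so far."""
    return random.choice(WORD_BANK)

def generate_poem(num_lines: int = 8, words_per_line: int = 8) -> str:
    """Generate a poem whose structure satisfies the prompt's rules by construction."""
    lines = []
    for _ in range(num_lines):
        words = []
        # The constraint is enforced structurally: the loop asks the follower
        # for exactly words_per_line words, so no post-hoc checking is needed.
        while len(words) < words_per_line:
            words.append(follower_next_word(" ".join(words)))
        lines.append(" ".join(words))
    return "\n".join(lines)

if __name__ == "__main__":
    print(generate_poem())
```

The point of the sketch is that the rule lives in the program’s structure, so even a small model that cannot count words reliably ends up producing output that satisfies the constraint.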
MIT doctoral student Gabriel Grand, lead author of the paper introducing this work, says DisCIPL allows LMs to steer each other toward better responses, increasing overall efficiency. “We are working to improve the inference efficiency of LMs, especially in the many modern applications of these models that generate output according to constraints,” says Grand, who is also a CSAIL researcher. “Language models consume more energy as more people use them, which means we need models that can provide accurate answers while using minimal computing power.”
“It’s really exciting to see new alternatives to standard language model inference,” says Alane Suhr, an assistant professor at the University of California at Berkeley, who was not involved in the study. “This work brings a new approach to language modeling and LLM inference that significantly reduces latency through parallelization, requires significantly fewer parameters than current LLMs, and improves task performance over standard serialized inference. It also opens up opportunities to explore the transparency, interpretability, and controllability of model outputs, though significant open questions remain about deploying these techniques.”
An underdog story
In terms of accuracy and efficiency, one might assume that a large LM is simply “better” at complex prompts than a small one. DisCIPL offers a surprising alternative: if the strengths of smaller models are combined, they can achieve similar results with greater efficiency.
The researchers note that, in theory, dozens of LMs of any size could be connected and work together within the DisCIPL framework. In their writing and reasoning experiments, they used GPT-4o, one of the models that helps ChatGPT generate responses, as the planner LM. It drew up plans for several “Llama-3.2-1B” models (a small system developed by Meta), which filled in each word (or token) of the response.
This collective approach was pitted against three comparable baselines: a follower-only setup powered by Llama-3.2-1B, GPT-4o running on its own, and the industry-leading o1 reasoning system that helps ChatGPT work through more complex questions such as coding requests and math problems.
DisCIPL was first tested on its ability to write sentences and paragraphs that follow explicit rules. The models were given very specific prompts, such as writing a sentence of exactly 18 words in which the fourth word is “Glasgow,” the eighth word is “in,” and the eleventh word is “and.” The system handled such requests well, producing coherent output with accuracy and consistency similar to o1.
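As an illustration of how such positional rules become easy to verify once written as code, the short sketch below implements a checker for that example prompt; the sample sentence is invented for demonstration and does not come from the paper.

```python
def satisfies_constraints(sentence: str) -> bool:
    """Check the example rule: exactly 18 words, with 'Glasgow' as the 4th word,
    'in' as the 8th, and 'and' as the 11th (positions are 1-indexed)."""
    words = sentence.strip().rstrip(".").split()
    required = {4: "Glasgow", 8: "in", 11: "and"}
    return len(words) == 18 and all(
        words[position - 1] == word for position, word in required.items()
    )

# Invented example sentence (not from the paper) that happens to satisfy the rule.
print(satisfies_constraints(
    "We finally reached Glasgow after midnight, arriving in heavy rain, "
    "and found the old train station completely empty."
))  # True
```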
Faster, cheaper, better
The experiments also revealed that the main components of DisCIPL are much cheaper to run than state-of-the-art systems. Existing reasoning models such as OpenAI’s o1 reason in text, whereas DisCIPL “reasons” by writing more compact Python code. In fact, the researchers found that DisCIPL reduced the amount of inference by 40.1 percent and cut costs by 80.2 percent compared to o1.
DisCIPL’s efficiency gains come partly from using smaller Llama models as followers, which are 1,000 to 10,000 times cheaper per token than comparable reasoning models. This also makes DisCIPL more scalable: the researchers were able to run dozens of Llama models in parallel at a fraction of the cost.
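The pattern below is a simplified sketch of that scaling idea, assuming a generic `generate` function standing in for a follower call and a `check` function standing in for the encoded constraints; it illustrates parallel candidate generation and filtering, not the paper’s actual inference machinery.

```python
from concurrent.futures import ThreadPoolExecutor

def run_followers(generate, check, num_followers: int = 32):
    """Launch many cheap follower generations in parallel and keep only the
    candidates that satisfy the constraints."""
    with ThreadPoolExecutor(max_workers=num_followers) as pool:
        candidates = list(pool.map(lambda _: generate(), range(num_followers)))
    return [candidate for candidate in candidates if check(candidate)]
```

Because each follower call is inexpensive, screening many parallel candidates this way can cost far less than a single pass through a large reasoning model.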
That wasn’t the only surprising finding, according to the CSAIL researchers. Their system also held up well against o1 on real-world tasks such as creating ingredient lists, planning travel itineraries, and writing grant proposals with character limits. GPT-4o, on the other hand, struggled with these demands; the text it generated often failed to place keywords in the correct parts of sentences. The follower-only baseline generally finished last overall because it had difficulty following instructions.
“In recent years, we have seen some impressive results from approaches that use language models to ‘automatically formalize’ mathematical and robotics problems by expressing them in code,” says senior author Jacob Andreas, MIT associate professor of electrical engineering and computer science and CSAIL principal investigator. “What I find most interesting about this paper is that LMs can now be used to automatically formalize text generation itself, allowing for the same kinds of efficiency gains and guarantees that we’ve seen in those other fields.”
In the future, the researchers plan to extend the framework into a more fully recursive approach, allowing the same model to serve as both planner and follower. Grand adds that DisCIPL could also be extended to mathematical reasoning tasks where answers are difficult to verify. The team also plans to test the system’s ability to satisfy vague user preferences that cannot be explicitly captured in code, rather than strict constraints. Thinking more broadly, the researchers want to use the largest models available, but note that such experiments are computationally expensive.
Grand and Andreas co-authored the paper with CSAIL principal investigator and MIT professor Joshua Tenenbaum, MIT Department of Brain and Cognitive Sciences principal research scientist Vikash Mansinghka, and Yale University assistant professor Alex Lew SM ’20, PhD ’25. The CSAIL researchers presented the work at the Conference on Language Modeling in October and at IVADO’s “Deploying Autonomous Agents: Lessons, Risks, and Real-World Implications” workshop in November.
Their research was supported, in part, by the MIT Quest for Intelligence, the Siegel Family Foundation, the MIT-IBM Watson AI Lab, a Sloan Research Fellowship, Intel, the Air Force Office of Scientific Research, the Defense Advanced Research Projects Agency, the Office of Naval Research, and the National Science Foundation.
