
Large language models (LLMs) like ChatGPT can help you write essays and plan menus almost instantly. Until recently, though, they were also easy to trip up. Models that relied on language patterns to respond to user queries often failed at math problems and struggled with complex reasoning. Then, rather suddenly, they got much better at these things.
A new generation of LLMs, known as reasoning models, is trained to solve complex problems. Like humans, they need time to think through such problems. And remarkably, scientists at MIT's McGovern Institute for Brain Research have found that the kinds of problems that demand the most processing from reasoning models are the very same problems that people need to take their time with. In other words, they report today in the journal PNAS, the "cost of thinking" for a reasoning model mirrors the cost of thinking for a human.
The researchers, led by Evelina Fedorenko, an associate professor of brain and cognitive sciences and a researcher at the McGovern Institute, concluded that in at least one important respect, reasoning models take a human-like approach to thinking. That's not by design, they point out. "The people building these models don't care if they do it like humans do; they just want a system that works robustly and produces the correct response under all kinds of conditions," Fedorenko says. "The fact that we're seeing some convergence is really surprising."
Reasoning models
Like many forms of artificial intelligence, the new reasoning models are artificial neural networks: computational tools that learn how to process information when given data and a problem to solve. Artificial neural networks have been very successful at many of the tasks the brain's own neural networks excel at, and in some cases neuroscientists have found that the best-performing networks share certain aspects of information processing with the brain. Still, some scientists argued that artificial neural networks were not ready to take on the more sophisticated aspects of human intelligence.
"Not too long ago, I was one of the people who said, 'These models are very good in areas like perception and language, but we're still a long way from having neural network models that can make inferences,'" Fedorenko says. "Then these large reasoning models emerged and were able to perform much better on many thinking tasks, such as solving math problems or writing computer code."
Andrea Gregor de Varda, a K. Lisa Yang ICoN Center fellow and a postdoc in Fedorenko's lab, explains that reasoning models solve problems step by step. "At some point, people realized that models needed more space to perform the actual calculations needed to solve complex problems," he says. "Once we forced the models to break problems down into parts, performance started to improve significantly."
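For readers who want to see what "breaking a problem into parts" looks like in practice, here is a minimal Python sketch contrasting a direct prompt with a step-by-step one. The prompt wording and the helper functions are illustrative assumptions, not materials from the study.

```python
# Illustrative sketch only -- these prompts and helpers are hypothetical,
# not the prompts used in the study.

def build_direct_prompt(problem: str) -> str:
    """Ask the model for an answer in a single leap."""
    return f"{problem}\nAnswer:"

def build_step_by_step_prompt(problem: str) -> str:
    """Ask the model to break the problem into parts, giving it room
    (extra tokens) to carry out intermediate calculations."""
    return (
        f"{problem}\n"
        "Work through this step by step, writing out each intermediate "
        "result, then state the final answer."
    )

problem = "A pack holds 12 pens. You buy 7 packs and give away 15 pens. How many remain?"
print(build_direct_prompt(problem))
print()
print(build_step_by_step_prompt(problem))
```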
To encourage models to take the problem-solving steps that lead to correct solutions, engineers can use reinforcement learning: during training, the model is rewarded for correct answers and penalized for incorrect ones. "The model itself explores the problem space," de Varda says. "Because the behavior that leads to a positive reward is reinforced, the correct solution will be produced more often."
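The reward-and-penalty idea can be sketched with a toy example. In the snippet below, a stand-in "model" chooses between two hypothetical strategies; attempts that yield correct answers earn a positive reward and are sampled more often over time. The strategy names and success rates are invented for illustration.

```python
import random

# Toy illustration of reinforcement from rewards -- not the actual training
# setup. Strategy names and success probabilities are invented.
strategies = {"one_shot_guess": 0.3, "step_by_step": 0.9}
preferences = {name: 1.0 for name in strategies}  # sampling weights
learning_rate = 0.1

for _ in range(2000):
    names = list(preferences)
    chosen = random.choices(names, weights=[preferences[n] for n in names])[0]

    # Reward +1 if this attempt produced a correct answer, -1 otherwise.
    reward = 1.0 if random.random() < strategies[chosen] else -1.0

    # Behavior that earns positive reward is reinforced (made more likely).
    preferences[chosen] = max(0.01, preferences[chosen] + learning_rate * reward)

total = sum(preferences.values())
print({name: round(w / total, 2) for name, w in preferences.items()})
```

Run this and the step-by-step strategy ends up dominating the sampling weights, which is the behavior de Varda describes: solutions that earn rewards are produced more often.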
Models trained this way are far more likely than their predecessors to arrive at the same answer a human would when given a reasoning problem. Their step-by-step approach means that reasoning models can take a little longer to find an answer than earlier LLMs, but their responses are worth the wait, since they often get the right answer where earlier models would have failed.
The fact that models require some time to tackle complex problems already suggests similarities with human thinking. If you ask humans to solve difficult problems instantly, they will probably fail too. De Varda wanted to examine this relationship more systematically. So he gave reasoning models and human volunteers the same set of problems and tracked not only whether they got the answers right, but also how much time and effort it took them to get there.
Time and tokens
That meant measuring, in milliseconds, the time it took people to answer each question. For the models, de Varda used a different metric. Measuring processing time would be uninformative, since it depends more on the computer hardware than on the effort the model puts into solving the problem. So instead, he tracked the tokens that make up the model's internal chain of thought. "The models generate tokens that are not meant for the user to see and work with; they just track the internal computation being performed," de Varda explains. "It's as if the model were talking to itself."
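As a rough illustration of the token-based measure, the sketch below counts tokens in a hypothetical reasoning trace; a simple whitespace split stands in for a real tokenizer, and the trace text is invented.

```python
# Rough illustration of using token counts as a proxy for "thinking cost".
# The whitespace split stands in for the model's real tokenizer, and the
# reasoning trace below is invented, not model output from the study.

def count_reasoning_tokens(reasoning_trace: str) -> int:
    """Return the number of tokens in a model's internal chain of thought."""
    return len(reasoning_trace.split())

trace = (
    "First, 7 packs of 12 pens is 84 pens. "
    "Giving away 15 leaves 84 - 15 = 69. "
    "So the answer is 69."
)
print(count_reasoning_tokens(trace))  # longer traces = more "thinking" effort
```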
Both the humans and the reasoning models were asked to solve seven different types of problems, including numerical calculations and intuitive reasoning, with a set of questions for each problem class. The harder a given problem was, the longer it took people to solve it, and the more tokens the reasoning model generated in arriving at its own solution.
Likewise, the classes of problems that took humans the longest to solve were the same ones that required the most tokens from the models. Arithmetic problems were the least demanding, while a group of problems called ARC challenges, in which transformations had to be inferred from pairs of colored grids and then applied to new examples, were the most costly for both humans and models.
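The relationship can be pictured as a correlation, across problem classes, between average human solution time and average model token count. The numbers below are invented purely to illustrate the comparison, not data from the paper.

```python
from statistics import correlation  # Python 3.10+

# Invented per-class averages, ordered from least to most demanding
# (e.g., arithmetic through ARC-style grid puzzles); not the study's data.
human_time_ms = [1800, 2500, 4200, 6100, 9500, 14000, 22000]
model_tokens  = [120,  160,  310,  450,  700,  1100,  1900]

r = correlation(human_time_ms, model_tokens)  # Pearson correlation
print(f"correlation across the seven problem classes: r = {r:.2f}")
```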
De Varda and Fedorenko say the striking agreement in the costs of thinking reveals one way in which reasoning models think like humans. That does not mean, however, that the models reproduce human intelligence. The researchers still want to know whether the models use representations of information similar to those in the human brain, and how those representations are transformed into solutions to problems. They are also curious whether the models can handle problems that require knowledge of the world that is not spelled out in the texts used to train them.
The researchers point out that even though reasoning models generate internal monologues when solving problems, they don't necessarily use language to think. "If you look at the output that these models produce during reasoning, it often contains errors or bits of nonsense, even when the model ultimately arrives at the correct answer. So the actual internal computations likely take place in an abstract, non-linguistic representation space, much as humans don't use language to think," de Varda says.
