Large reasoning models almost certainly can think



There has been a lot of fuss lately about the idea that large reasoning models (LRMs) cannot think, mainly because of a research article published by Apple, "The Illusion of Thinking." Apple argues that LRMs must not really be thinking; they merely perform pattern matching. The evidence they provide is that LRMs with chain-of-thought (CoT) reasoning become unable to carry the computation through a predefined algorithm as the problem grows larger.

This is a fundamentally flawed argument. If you ask someone who already knows the algorithm for solving the Tower of Hanoi to solve an instance with, say, 20 disks, he or she will almost certainly fail. By that logic, we would have to conclude that humans cannot think either. However, this argument only shows that there is no evidence that LRMs cannot think; that alone does not mean LRMs can think, only that we cannot be sure they do not.
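To see why the 20-disk case is an unreasonable test of thinking, here is a minimal Python sketch (my illustration, not something from the Apple paper) of the textbook recursive Tower of Hanoi solution. Even with the algorithm known perfectly, an n-disk instance takes 2^n - 1 moves, so 20 disks require enumerating 1,048,575 moves.

```python
# Minimal sketch: the textbook recursive Tower of Hanoi solution.
# Knowing the algorithm does not make executing it at scale easy:
# n disks require 2**n - 1 moves, so 20 disks need 1,048,575 of them.

def hanoi(n, source, target, spare, moves):
    """Append the full move sequence for n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # move n-1 disks out of the way
    moves.append((source, target))               # move the largest disk
    hanoi(n - 1, spare, target, source, moves)   # move the n-1 disks back on top

moves = []
hanoi(20, "A", "C", "B", moves)
print(len(moves))  # 1048575, i.e. 2**20 - 1
```

A human who knows the procedure fails on 20 disks for the same mundane reason: the output is simply too long to produce reliably, not because the human cannot think.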

In this article, I make a bolder claim: LRMs can almost certainly think. I say "almost" because there is always a chance that further research will surprise us, but I think my argument is fairly conclusive.

What is thinking?

Before trying to understand whether LRMs can think, we need to define what we mean by thinking, and the definition must be one according to which humans clearly do think. We are concerned here only with thinking in the context of problem solving. Roughly, human problem solving involves the following processes.

1. Problem representation (frontal and parietal lobes)

When you think about a problem, the process engages your prefrontal cortex. This area is responsible for working memory, attention, and executive function. These features allow you to keep a problem in mind, break it down into subcomponents, and set goals. The parietal cortex helps encode the symbolic structure of math and puzzle problems.

2. Mental simulation (inner speech and visual imagery)

This has two components. One is the auditory loop that lets us talk to ourselves, which is very similar to CoT generation. The other is visual imagery, which lets us manipulate objects visually; spatial reasoning was so important for navigating the world that we developed dedicated machinery for it. The auditory part is linked to Broca's area and the auditory cortex, circuits reused from the language system; the visual cortex and parietal cortex primarily handle the visual component.

3. Pattern matching and search (hippocampus and temporal lobe)

These actions depend on knowledge accumulated from past experience and long-term memory.

  • The hippocampus helps recall relevant memories and facts.

  • The temporal lobe provides semantic knowledge such as meaning, rules, and categories.

This is similar to how neural networks rely on training to handle tasks.

4. Monitoring and evaluation (anterior cingulate cortex)

Our anterior cingulate cortex (ACC) monitors for errors, contradictions, and dead ends; it is where we notice that something is going wrong. This process, too, is essentially based on pattern matching from prior experience.

5. Insight or Reframing (Default Mode Network and Right Hemisphere)

When you get stuck, your brain can shift into the default mode network, a more relaxed, internally focused network. This is when you step back, let go of the current thread of thought, and sometimes "suddenly" see a new angle (the classic "aha!" moment).

This is similar to how DeepSeek-R1 was trained for CoT reasoning without any CoT examples in its training data. Note also that the brain keeps processing data and learning continuously while it solves problems.

An LRM, in contrast, cannot update itself based on real-world feedback while it predicts or generates. However, in DeepSeek-R1's CoT training, learning did occur while the model was trying to solve problems; in other words, it was updated as it reasoned.

Similarities between CoT reasoning and biological thinking

LRMs do not have all of the features above. For example, an LRM is very unlikely to perform much visual reasoning inside its circuitry, though some may occur; CoT generation certainly does not produce intermediate images.

Most humans can build spatial models in their heads to solve problems. Does that mean we can conclude that LRMs cannot think? I disagree. Some people find it difficult to form spatial models of the concepts they are thinking about, a condition called aphantasia. People with this condition think just fine; in fact, they go about their lives as if they lacked nothing at all. Many of them are actually good at symbolic reasoning and quite good at mathematics, often enough to compensate for the lack of visual reasoning. We might expect our neural network models to be able to work around this limitation too.

If we look at the human thought process described above more abstractly, we can see that it primarily involves the following:

1. Pattern matching is used to recall learned experience, represent the problem, and monitor and evaluate the chain of thought.

2. Working memory stores all intermediate steps.

3. Backtracking search detects that the current chain of thought is going nowhere and backtracks to a reasonable earlier point (a minimal sketch follows this list).
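As a loose analogy only, not a description of how any LRM is implemented, here is a minimal Python sketch of backtracking search. The names `expand` and `is_goal` are hypothetical, problem-specific callbacks I introduce just for illustration.

```python
# Minimal sketch of backtracking search: explore a branch of reasoning,
# and if it dead-ends, return to the last open branch and try another.
# `expand` and `is_goal` are hypothetical, problem-specific callbacks.

def backtracking_search(state, expand, is_goal, depth_limit=16):
    """Depth-first search that backtracks when a branch dead-ends."""
    if is_goal(state):
        return [state]
    if depth_limit == 0:
        return None                      # dead end: abandon this branch
    for next_state in expand(state):
        path = backtracking_search(next_state, expand, is_goal, depth_limit - 1)
        if path is not None:
            return [state] + path        # a branch below reached the goal
    return None                          # every branch failed: backtrack

# Toy usage: reach 12 from 1 by adding 1 or doubling, within 5 steps.
# Pure incrementing dead-ends at the depth limit, forcing backtracking.
print(backtracking_search(1, lambda s: [s + 1, s * 2], lambda s: s == 12,
                          depth_limit=5))   # prints [1, 2, 3, 6, 12]
```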

Pattern matching in an LRM is achieved through training: the point of training is to learn both knowledge about the world and the patterns for processing that knowledge effectively. Because an LRM is a layered network, the entire working memory must fit within a single layer's context. Knowledge of the world and the patterns to follow are stored in the weights, and processing happens between layers using the learned patterns stored as model parameters.

Note that even with CoT, the entire text, including the input, the CoT, and whatever output has already been generated, must fit into each layer's context. The working memory is just that context at each layer (which, for attention mechanisms, includes the KV cache).
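To make the working-memory point concrete, here is a toy single-head attention step with a KV cache, written in plain NumPy. The shapes and names are illustrative assumptions rather than any particular model's implementation; the point is only that each newly generated token attends over the keys and values of everything accumulated so far, so the prompt, the CoT, and the output all have to fit in the cache at every layer.

```python
# Toy single-head attention step with a KV cache: the cache is the layer's
# "working memory", and it grows with the full context (prompt + CoT + output).
# Dimensions and random inputs are stand-ins for real projected activations.

import numpy as np

d = 8                                # toy head dimension
k_cache, v_cache = [], []            # this layer's working memory

def attend(q, k_new, v_new):
    """Cache the new token's key/value, then attend over the whole cache."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = np.stack(k_cache)            # shape (context_len, d), grows every step
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()         # softmax over the entire context so far
    return weights @ V               # context-dependent output for this token

rng = np.random.default_rng(0)
for step in range(5):                # five decoding steps
    q, k, v = rng.normal(size=(3, d))
    out = attend(q, k, v)
print(len(k_cache))                  # 5: cache length equals context length
```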

In fact, CoT is very similar to how we talk to ourselves (which is most of the time). We almost always verbalize our thoughts, and so do CoT reasoners.

There is also ample evidence that CoT reasoners take backtracking steps when a particular line of reasoning looks futile. In fact, this is what Apple's researchers observed when they asked LRMs to solve larger instances of simple puzzles: the LRMs correctly recognized that solving the puzzle directly would not fit in their working memory, so they tried to find shortcuts instead, just as humans do. This is further evidence that LRMs think rather than blindly follow predefined patterns.

But why would a next-token predictor learn to think?

Neural networks of sufficient size can learn any computation, and that includes thinking. But can a system that merely predicts the next word learn to think? Yes, and let me explain in detail.

The usual objection is that an LRM cannot think because, in the end, it is just predicting the next token; it is "glorified autocomplete." This view is fundamentally wrong. Not because an LRM is not "autocomplete," but because the assumption that autocomplete requires no thinking is wrong. In fact, predicting the next word is far from a limited form of expression. On the contrary, it is the most general form of knowledge representation anyone could hope for. Let me explain.

When we want to represent knowledge, we need a language or symbolic system in which to do it. There are various formal languages that are very precise about what they can express, but such languages are fundamentally limited in the kinds of knowledge they can represent.

For example, first-order predicate logic cannot quantify over predicates, so it cannot express properties of all predicates that satisfy a particular property. A statement such as "every property that holds of Socrates also holds of Plato" quantifies over properties themselves and simply cannot be written in first-order logic.

Of course, there are higher-order predicate calculi that can express predicates of predicates to arbitrary depth. But even these struggle to express ideas that lack precision or are abstract in nature.

Natural languages, by contrast, are expressive enough to describe any concept at any level of detail or abstraction. You can even describe concepts about natural language using natural language itself. That makes it a strong candidate for representing knowledge.

The challenge, of course, is that this very expressiveness makes it hard to process information encoded in natural language algorithmically. But we do not necessarily need to work out how to do that by hand; we can simply program the machine with data, through a process called training.

A next-token predictor essentially computes a probability distribution over the next token given the context of the preceding tokens. Any machine that aims to compute this probability accurately must somehow represent knowledge of the world.

A simple example: consider the incomplete sentence "The highest mountain in the world is Mt." To predict the next word, "Everest," that knowledge must be stored somewhere in the model. And when a task requires the model to compute an answer or solve a puzzle, the next-token predictor must emit CoT tokens that advance the reasoning toward the answer.

This means that, even though it predicts one token at a time, the model must already represent at least the next few steps of the argument internally, in its working memory, enough to keep itself on a logical path.
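As a toy illustration of this, echoing the Everest example above, here is a sketch in which the "model" is a hard-coded lookup table standing in for a trained network, with probabilities invented purely for illustration. A real LRM computes the same kind of conditional distribution with billions of parameters, and its CoT tokens are generated by the same kind of loop.

```python
# Toy sketch: next-token prediction as repeated greedy sampling from
# P(next token | context). The lookup table stands in for a trained network
# and its probabilities are made up purely for illustration.

TOY_MODEL = {
    ("The", "highest", "mountain", "in", "the", "world", "is"):
        {"Mount": 0.9, "K2": 0.1},
    ("The", "highest", "mountain", "in", "the", "world", "is", "Mount"):
        {"Everest": 0.99, "Fuji": 0.01},
}

def next_token(context):
    """Greedy decoding: return the most probable next token, or None."""
    dist = TOY_MODEL.get(tuple(context), {})
    return max(dist, key=dist.get) if dist else None

context = ["The", "highest", "mountain", "in", "the", "world", "is"]
while (tok := next_token(context)) is not None:
    context.append(tok)   # each prediction becomes part of the context
print(" ".join(context))  # The highest mountain in the world is Mount Everest
```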

If you think about it, humans also predict the next token while speaking, or while thinking in our inner voice. A perfect autocomplete system, one that always emits the right token and thereby produces the right answer, would have to be omniscient. We will never reach that point, of course, because not every answer is even computable.

However, a parameterized model that represents knowledge through its tunable parameters, and that learns from data and reinforcement, can reliably learn to think.

Does the thinking actually work?

After all, the ultimate test of thinking is whether a system can solve problems that require thinking. If a system can answer previously unseen questions that require some degree of reasoning, it must have learned how to think, or at least reason, to arrive at the answer.

We know that proprietary LRMs perform very well on certain reasoning benchmarks. However, since some of those models may have been tuned on the benchmark test sets through back channels, we will consider only open-source models, for fairness and transparency.

On several standard reasoning benchmarks, LRMs can solve a significant number of logic-based questions. While it is true that they still lag behind human performance in many cases, it is important to note that the human baselines are often obtained from people specifically trained on those benchmarks. In some cases, LRMs actually outperform the average untrained human.

Conclusion

Based on the benchmark results, the striking similarities between CoT reasoning and biological thinking, and the theoretical understanding that a system with sufficient representational power, enough training data, and adequate compute can perform any computable task, LRMs largely meet these criteria.

Therefore, it is reasonable to conclude that LRMs almost certainly have the ability to think.

Debasish Ray Chawdhuri works at Talentica Software and is a PhD candidate in cryptography at IIT Bombay.
