When AI reasoning goes wrong: Microsoft Research shows that more tokens can mean more problems


Large language models (LLMs) are increasingly capable of complex reasoning through “inference-time scaling,” a set of techniques that allocate more computational resources during inference to generate answers. However, a new study from Microsoft Research reveals that the effectiveness of these scaling methods isn’t universal: performance gains vary widely across models, tasks, and problem complexity.

The core finding is that simply throwing more compute at a problem during inference doesn’t guarantee better or more efficient results. The findings can help enterprises better understand cost volatility and model reliability as they look to integrate advanced AI reasoning into their applications.

Putting scaling methods to the test

The Microsoft Research team conducted an extensive empirical analysis across nine state-of-the-art foundation models. This included both “conventional” models such as GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro, and Llama 3.1 405B, as well as models specifically fine-tuned for enhanced reasoning through inference-time scaling. These included OpenAI’s o1 and o3-mini, Anthropic’s Claude 3.7 Sonnet, Google’s Gemini 2 Flash Thinking, and DeepSeek R1.

They evaluated these models using three distinct inference-time scaling approaches (a brief sketch of the latter two follows the list):

  1. Standard Chain-of-Thought (CoT): The baseline method in which the model is asked to answer step by step.
  2. Parallel Scaling: The model generates multiple independent answers for the same question, and an aggregator (such as a majority vote or selecting the best answer) produces the final result.
  3. Sequential Scaling: The model generates an answer repeatedly, using feedback from a critic (potentially the model itself) to refine the answer in subsequent attempts.
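
To make the latter two approaches concrete, here is a minimal Python sketch of parallel and sequential scaling. The "generate" and "critique" callables are hypothetical stand-ins for whatever LLM API is in use, and the aggregation and refinement logic is illustrative rather than the paper’s actual implementation.

```python
# Minimal sketch of parallel and sequential inference-time scaling.
# "generate" and "critique" are hypothetical callables wrapping an LLM API;
# they stand in for whatever model interface is used and are not the paper's code.
from collections import Counter
from typing import Callable

def parallel_scaling(generate: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n independent answers and aggregate them with a majority vote."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def sequential_scaling(generate: Callable[[str], str],
                       critique: Callable[[str, str], str],
                       prompt: str, rounds: int = 3) -> str:
    """Generate an answer, then refine it over several rounds of critic feedback."""
    answer = generate(prompt)
    for _ in range(rounds):
        feedback = critique(prompt, answer)  # the critic may be the same model
        answer = generate(f"{prompt}\n\nPrevious answer: {answer}\n"
                          f"Feedback: {feedback}\nRevise your answer.")
    return answer
```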

These approaches were tested on eight challenging benchmark datasets covering a wide range of tasks that benefit from reasoning: math and STEM reasoning (AIME, Omni-MATH, GPQA), calendar planning (BA-Calendar), NP-hard problems (3SAT, TSP), and spatial reasoning (Spatial Maps).

Several benchmarks include problems at varying difficulty levels, giving a more nuanced understanding of how scaling behaves as problems become harder.

“The availability of difficulty tags for Omni-MATH, TSP, 3SAT, and BA-Calendar enables us to analyze how accuracy and token usage scale with difficulty in inference-time scaling,” the researchers write.

The researchers evaluated the Pareto frontier of LLM reasoning by analyzing both accuracy and computational cost (i.e., the number of tokens generated). This helps identify how efficiently models achieve their results.

Inference-time scaling Pareto frontier. Credit: arXiv
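
As a rough illustration of this kind of analysis, the sketch below computes an accuracy-versus-token-cost Pareto frontier from per-model evaluation results. The "Result" record and dominance rule are assumptions for illustration, not the study’s code.

```python
# Sketch: compute the accuracy-vs-cost Pareto frontier from per-model results.
# A model is on the frontier if no other model is at least as accurate while
# generating no more tokens (and strictly better on one of the two).
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    accuracy: float     # fraction of benchmark questions answered correctly
    mean_tokens: float  # average number of tokens generated per query

def dominates(a: Result, b: Result) -> bool:
    return (a.accuracy >= b.accuracy and a.mean_tokens <= b.mean_tokens
            and (a.accuracy > b.accuracy or a.mean_tokens < b.mean_tokens))

def pareto_frontier(results: list[Result]) -> list[Result]:
    return [r for r in results if not any(dominates(o, r) for o in results)]
```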

They also introduced a measure of the gap between conventional and reasoning models, comparing the best possible performance of conventional models (using an ideal “best-of-N” selection) against the average performance of reasoning models, to estimate the potential gains achievable through better training or verification techniques.
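
A minimal sketch of how such a gap could be estimated from repeated attempts per question is shown below; the data layout (lists of per-attempt correctness flags) and function names are assumptions, not the study’s actual tooling.

```python
# Sketch: estimate the gap between a conventional model's best-of-N ceiling and a
# reasoning model's average accuracy. Each inner list holds correctness flags for
# repeated attempts at one question; the layout is illustrative.

def best_of_n_accuracy(attempts_per_question: list[list[bool]]) -> float:
    """Accuracy if an ideal verifier always picked a correct attempt when one exists."""
    return sum(any(a) for a in attempts_per_question) / len(attempts_per_question)

def average_accuracy(attempts_per_question: list[list[bool]]) -> float:
    """Expected accuracy of a single randomly chosen attempt per question."""
    return sum(sum(a) / len(a) for a in attempts_per_question) / len(attempts_per_question)

def conventional_to_reasoning_gap(conventional: list[list[bool]],
                                  reasoning: list[list[bool]]) -> float:
    return best_of_n_accuracy(conventional) - average_accuracy(reasoning)
```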

More compute isn’t always the answer

The study yielded several important insights that challenge common assumptions about inference-time scaling.

Gains vary widely: Models tuned for reasoning generally outperform conventional ones on these tasks, but the degree of improvement varies considerably depending on the specific domain and task, and gains often diminish as problem complexity increases. For example, performance improvements seen on math problems did not always translate equally to scientific reasoning or planning tasks.

Token inefficiency is rife: The researchers observed high variability in token consumption, even between models achieving similar accuracy. For example, on the AIME 2025 math benchmark, DeepSeek-R1 used more than five times as many tokens as Claude 3.7 Sonnet to reach roughly comparable average accuracy.

More tokens do not mean higher accuracy: Contrary to the intuition that longer reasoning chains mean better reasoning, the study found this is not always true. “Surprisingly, we also observe that longer generations relative to the same model can sometimes be an indicator of models struggling, rather than improved reflection,” the paper states. “Similarly, when comparing different reasoning models, higher token usage is not always associated with better accuracy. These findings motivate the need for more purposeful and cost-effective scaling approaches.”

Cost nondeterminism: Perhaps most concerning for enterprise users, repeated queries to the same model for the same problem can result in highly variable token usage. This means the cost of running a query can fluctuate significantly, even when the model consistently provides the correct answer.

Variance in response length across repeated queries (sharp spikes indicate low variance). Credit: arXiv

The potential of verification mechanisms: Scaling performance consistently improved across all models and benchmarks when simulated with a “perfect verifier” (using best-of-N results).

Conventional models sometimes match reasoning models: By significantly increasing inference calls (up to 50x more in some experiments), conventional models such as GPT-4o can approach the performance levels of dedicated reasoning models, especially on less complex tasks. However, these gains diminish rapidly in highly complex settings, indicating that brute-force scaling has its limits.

On some tasks, the accuracy of GPT-4o continues to improve with parallel and sequential scaling. Credit: arXiv

Implications for the enterprise

These findings carry considerable weight for developers and enterprise adopters of LLMs. The “cost nondeterminism” issue is particularly stark, as it makes budgeting difficult. As the researchers point out, “Ideally, developers and users would prefer models for which the standard deviation on token usage per instance is low for cost predictability.”

“The profiling done at Microsoft Research can be useful for developers as a tool to pick models that are less volatile for the same prompt or for different prompts,” said Besmira Nushi of Microsoft Research. “Ideally, one would want to pick a model with a low standard deviation for correct inputs.”

A model whose distribution peaks to the left consistently generates the same number of tokens for a given task. Credit: arXiv
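
A minimal sketch of this kind of profiling is shown below, assuming you have logged token counts from repeated runs of the same prompts; the data layout and function name are illustrative, not part of the study.

```python
# Sketch: profile cost nondeterminism by measuring how much token usage varies
# across repeated runs of the same prompt. "token_counts" maps
# model name -> prompt -> list of token counts from repeated queries (illustrative).
import statistics

def token_volatility(token_counts: dict[str, dict[str, list[int]]]) -> dict[str, float]:
    """Average per-prompt standard deviation of generated tokens, per model."""
    volatility = {}
    for model, per_prompt in token_counts.items():
        stdevs = [statistics.pstdev(counts)
                  for counts in per_prompt.values() if len(counts) > 1]
        volatility[model] = sum(stdevs) / len(stdevs) if stdevs else 0.0
    return volatility

# Lower values mean more predictable per-query costs for that model.
```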

The study also provides useful insight into the correlation between a model’s accuracy and its response length. For example, one chart in the study shows that math generations longer than roughly 11,000 tokens have a very slim chance of being correct, and such generations should either be stopped at that point or restarted with some sequential feedback. However, Nushi points out that models that permit these post hoc mitigations also show a cleaner separation between correct and incorrect samples.
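
A minimal sketch of such a length-based mitigation appears below. It assumes a hypothetical streaming "stream_tokens" API and treats the ~11,000-token figure as a task-specific budget drawn from the reported finding, not a universal rule.

```python
# Sketch: a simple length-based mitigation. "stream_tokens" is a hypothetical
# streaming-generation callable; the ~11,000-token figure is treated here as a
# task-specific budget taken from the reported finding, not a universal rule.
from typing import Callable, Iterable

TOKEN_BUDGET = 11_000

def generate_with_budget(stream_tokens: Callable[[str], Iterable[str]],
                         prompt: str,
                         budget: int = TOKEN_BUDGET) -> tuple[str, bool]:
    """Return (answer_so_far, truncated). Truncated runs can be restarted with feedback."""
    tokens = []
    for tok in stream_tokens(prompt):
        tokens.append(tok)
        if len(tokens) >= budget:
            return "".join(tokens), True  # stop here rather than letting it run on
    return "".join(tokens), False
```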

“Ultimately, it is also the responsibility of model builders to think about reducing accuracy and cost nondeterminism, and we expect a lot of this to happen as the methods mature,” Nushi said. “Alongside cost nondeterminism, accuracy nondeterminism also applies.”

Another important finding is the consistent performance boost from perfect verifiers, which highlights a key area for future work: building robust and broadly applicable verification mechanisms.

“The availability of stronger verifiers can have different types of impact,” Nushi said. “If used efficiently, these can also shorten the reasoning traces.”

Strong verifiers can also become a central part of enterprise agentic AI solutions. Many enterprise stakeholders already have such verifiers in place, which may need to be repurposed for agentic solutions, such as SAT solvers, logistic validity checkers, and so on.

“The question for the future is how these existing techniques can be combined with AI-driven interfaces, and what is the language that connects the two,” Nushi said. “The necessity of connecting the two comes from the fact that users will not always formulate their queries in a formal way; they will want to use a natural language interface and expect the solutions in a similar format or in a final action (for example, proposing a meeting invite).”

 
