Google’s new AI training method helps small models tackle complex reasoning



Researchers from Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very difficult multi-step reasoning tasks. Supervised reinforcement learning (SRL) reformulates problem solving as a series of logical “actions” and provides rich learning signals during training.

This approach allows smaller models to learn complex problems that were previously out of reach with other common training techniques. Experiments show that SRL not only excels on mathematical reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.

SRL is a versatile training framework that can elevate smaller, less expensive models to higher reasoning capabilities.

Limitations of current LLM reasoning training

Recent advances in training large language models (LLMs) to reason have been driven primarily by reinforcement learning with verifiable rewards (RLVR), a method in which the model is rewarded based on the correctness of its final answer. By repeatedly attempting to solve a problem and receiving feedback on the final result, the model gradually learns effective problem-solving strategies.
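To make the all-or-nothing nature of this signal concrete, here is a minimal Python sketch of an outcome-based reward (an illustration of the general RLVR idea using simple string matching, not the paper’s implementation):

```python
# Minimal sketch of an RLVR-style outcome reward (illustrative, not the paper's code).
# The model is rewarded only for the correctness of its final answer;
# intermediate steps earn nothing on their own.

def outcome_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the final answer matches the verifiable reference, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

# A rollout that is correct up to the last step but slips at the end
# earns the same reward as a completely wrong attempt:
print(outcome_reward("42", "42"))  # 1.0
print(outcome_reward("41", "42"))  # 0.0 -- no credit for partially correct work
```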

However, the success of this outcome-based approach depends on the model finding the correct solution within a limited number of attempts, or “rollouts.” Because each rollout is computationally expensive, the model cannot keep trying indefinitely. This method hits a wall when a problem is so difficult that the model rarely, if ever, finds the correct answer within its budget.

This creates a significant learning bottleneck. In many multi-step reasoning problems, a model may solve several steps correctly, but a single mistake can derail it and lead to a wrong final answer. Under RLVR, that entire effort receives a negative reward, and the model learns nothing from its partially correct work. It is an all-or-nothing approach that provides no granular feedback and leaves rewards sparse.

An alternative method is supervised fine-tuning (SFT), in which the model learns from examples containing complete reasoning processes demonstrated by experts. Although SFT can instill reasoning ability, it often leads to overfitting: the model simply learns to imitate the trajectories in its training data rather than generalizing to problems beyond the examples it has seen. The problem is compounded by the fact that high-quality, human-created training data is both scarce and expensive to produce.

As the paper points out, these limitations leave "a critical gap in training small open source models to effectively learn difficult problems."

How supervised reinforcement learning works

SRL introduces a framework that reformulates problem solving as a "sequential decision-making process," striking a balance between pure outcome-based RL and pure imitation learning. Rather than optimizing only for the final answer, or forcing the model to imitate an expert’s entire thought process, SRL teaches the model to reproduce the sequence of key actions that form the backbone of the expert’s reasoning. This lets the model learn to take actions similar to an expert’s while developing its own internal reasoning style.

In the SRL framework, an expert demonstration is broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation; for a software engineering agent, it could be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train smaller models.
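As a rough illustration of this decomposition (a hypothetical sketch; the data structures and field names below are not from the paper), an expert demonstration can be sliced into per-step training examples, each pairing the context so far with the next expert action:

```python
# Hypothetical sketch: turning one expert demonstration into step-wise training
# examples. The structure and field names are illustrative, not from the paper.

from dataclasses import dataclass

@dataclass
class StepExample:
    problem: str         # the original task statement
    prefix: list[str]    # expert actions already taken
    next_action: str     # the action the model should learn to produce here

def decompose_trajectory(problem: str, expert_actions: list[str]) -> list[StepExample]:
    """Create one training example per intermediate expert action."""
    return [
        StepExample(problem=problem, prefix=expert_actions[:i], next_action=action)
        for i, action in enumerate(expert_actions)
    ]

examples = decompose_trajectory(
    "Solve 2x + 6 = 10",
    ["subtract 6 from both sides", "divide both sides by 2", "x = 2"],
)
print(len(examples))  # 3 step-wise examples from a single demonstration
```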

According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. "SRL is somewhere in between: it captures the structured flexibility of real-world problem solving, where there are not only multiple valid strategies but also a clear notion of what “good reasoning” looks like at each step," Hsu told VentureBeat. "This makes SRL well-suited for domains such as data science automation and perhaps supply chain optimization: tasks where intermediate reasoning adds value, not just the final answer."

During training, the model first generates an "inner monologue" (its internal reasoning process, enclosed in special tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model’s predicted action and the expert’s action. This step-wise reward scheme delivers dense, fine-grained feedback that lets the model learn and improve even when its overall solution is not perfect. This addresses the sparse-reward problem that hampers RLVR.
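A minimal sketch of such a step-wise reward, assuming a simple string-similarity metric for action matching (the paper’s actual action-matching criterion may differ), looks like this:

```python
# Minimal sketch of a dense, step-wise reward: score each predicted action by its
# similarity to the expert's action, so partially correct work still earns credit.
# The similarity metric here (difflib ratio) is an assumption for illustration.

from difflib import SequenceMatcher

def step_reward(predicted_action: str, expert_action: str) -> float:
    """Similarity in [0, 1] between the model's action and the expert's action."""
    return SequenceMatcher(None, predicted_action.strip(), expert_action.strip()).ratio()

def trajectory_rewards(predicted: list[str], expert: list[str]) -> list[float]:
    """One reward per step, rather than a single reward at the very end."""
    return [step_reward(p, e) for p, e in zip(predicted, expert)]

rewards = trajectory_rewards(
    ["subtract 6 from both sides", "divide by 3", "x = 2"],
    ["subtract 6 from both sides", "divide both sides by 2", "x = 2"],
)
print(rewards)  # the mismatched second step is penalized locally; correct steps still score 1.0
```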

SRL in action

The researchers’ experiments show that SRL significantly outperforms strong baselines on both challenging mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in the model, such as interleaved planning and self-verification, which improve the quality of its solutions rather than merely lengthening its outputs.

For enterprise leaders, improved performance is only valuable if it does not come with runaway costs. Hsu noted that models trained with SRL reason efficiently. "The benefits come not from verbosity, but from improved quality and structure of the reasoning," he said. "In terms of efficiency, models trained with SRL are roughly equivalent to the base model in token usage. SRL is not designed to reduce inference cost, but it does deliver stronger reasoning performance without increasing it."

For the math evaluation, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and with RLVR (using the GRPO algorithm popularized by models such as DeepSeek-R1) across four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average improvement over the other methods.

The team then extended SRL to agentic software engineering, a domain important for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a task resolution rate of 14.8%, a relative improvement of 74% over the SFT-based model. This demonstrates SRL’s ability to train more capable AI agents for complex, real-world programming tasks.

A new standard for high-stakes AI?

The paper’s strongest results came from combining the two methods: first using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, the researchers applied SRL as a pre-training stage and RLVR in post-training, and observed an average improvement of 3.7%, demonstrating a powerful curriculum-learning strategy.
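Conceptually, the curriculum chains the two stages, as in the placeholder sketch below (the function names and data are illustrative, not the authors’ training pipeline):

```python
# Illustrative two-stage curriculum: SRL first to instill step-wise reasoning,
# then RLVR to sharpen it against final-answer correctness. The functions here
# are placeholders that stand in for full training runs.

def train_with_srl(model: str, expert_trajectories: list[str]) -> str:
    """Stage 1: dense, step-wise rewards derived from expert actions."""
    print(f"SRL stage on {len(expert_trajectories)} expert trajectories")
    return model + "+srl"

def train_with_rlvr(model: str, problems: list[str]) -> str:
    """Stage 2: sparse, outcome-based rewards on verifiable final answers."""
    print(f"RLVR stage on {len(problems)} verifiable problems")
    return model + "+rlvr"

expert_trajectories = ["<expert trajectory 1>", "<expert trajectory 2>"]
verifiable_problems = ["<math problem with a checkable answer>"]

model = train_with_srl("base-7b-model", expert_trajectories)
model = train_with_rlvr(model, verifiable_problems)
print(model)  # base-7b-model+srl+rlvr
```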

This raises the question of whether this could become a new blueprint for building specialized AI.

"We see SRL as a strong foundation." Sue said. "In a sense, SRL provides a curriculum that teaches the model to think and act step by step, before refining the behavior with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL stages, but also makes the inference more interpretable and generalizable. This is important for high-stakes applications."

Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. But he is optimistic about the path forward. "While high-quality expert trajectories remain important," he concluded, "we believe the next big leap forward will come from automating data generation and filtering, leveraging powerful teacher models and self-improving student models to bootstrap new data."
