Google’s new AI training method helps small models tackle complex reasoning

Researchers from Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very difficult multi-step reasoning tasks. Supervised Reinforcement Learning (SRL) reformulates problem solving as a sequence of logical “actions,” providing rich learning signals during training.
This approach allows smaller models to learn complex problems that were previously out of reach for other common training techniques. Experiments show that SRL not only excels in mathematical reasoning tests, but also generalizes effectively to agentic software engineering tasks.
SRL is a versatile training framework that can elevate smaller, less expensive models to higher reasoning capabilities.
The limits of current training in LLM reasoning
Recent advances in training large language models (LLMs) for reasoning have been largely driven by reinforcement learning with verifiable rewards (RLVR), a method in which a model is rewarded based on the correctness of its final answer. By repeatedly trying to solve problems and getting feedback on the final result, the model gradually learns effective problem-solving strategies.
However, the success of this outcome-based approach depends on the model's ability to discover a correct solution within a limited number of attempts, or "rollouts." Since each rollout is computationally expensive, models cannot keep trying indefinitely. This method hits a wall when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.
This creates a critical learning bottleneck. In many multi-step reasoning problems, a model may correctly solve several steps but be derailed by a single error, arriving at a wrong answer. With RLVR, all of that effort receives a negative reward, and the model learns nothing from its partially correct work. It's an all-or-nothing approach that fails to provide granular feedback and delivers only sparse rewards.
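A minimal sketch of this all-or-nothing signal, assuming the final answer sits on the last line of the model's output (the reward function and answer parsing below are illustrative assumptions, not the paper's implementation):

```python
# Sketch of an outcome-only (RLVR-style) reward: the whole rollout is scored by a
# single binary check on the final answer, so a solution that is correct up to the
# last step earns the same zero reward as one that is wrong from the start.

def rlvr_reward(model_solution: str, reference_answer: str) -> float:
    """Return 1.0 only if the final answer matches the reference, else 0.0."""
    # Assumption for illustration: the answer is the last non-empty line of the output.
    lines = [line for line in model_solution.strip().splitlines() if line.strip()]
    final_answer = lines[-1].strip() if lines else ""
    return 1.0 if final_answer == reference_answer.strip() else 0.0
```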
An alternative method is supervised fine-tuning (SFT), in which the model learns from examples containing the complete reasoning process laid out by experts. Although SFT can instill reasoning skills, it often leads to overfitting: the model simply learns to imitate the trajectories in the training data instead of generalizing to problems beyond the examples it has seen. This problem is compounded by the fact that high-quality, human-created training data is both scarce and expensive to produce.
As the paper notes, these limitations leave "a critical gap for training small open source models to efficiently learn difficult problems."
How supervised reinforcement learning works
SRL introduces a framework that reformulates problem solving as a "sequential decision-making process," finding a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to imitate an expert’s entire thought process, SRL teaches the model to reproduce a sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to take actions similar to those of an expert while developing its own internal reasoning style.
Under SRL, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a significant step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it might be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model.
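A hedged sketch of that decomposition, assuming each expert trajectory is a plain list of action strings (the `StepExample` structure and `expand_trajectory` helper are hypothetical names for illustration, not the paper's data pipeline):

```python
# Turn one expert trajectory into step-wise training examples: each example
# conditions on the problem plus the expert's earlier actions and asks the model
# to produce the next action, so a single demonstration yields many learning signals.

from dataclasses import dataclass

@dataclass
class StepExample:
    context: str        # problem statement plus the expert actions taken so far
    target_action: str  # the next expert action the model should reproduce

def expand_trajectory(problem: str, expert_actions: list[str]) -> list[StepExample]:
    examples = []
    for i, action in enumerate(expert_actions):
        prefix = "\n".join(expert_actions[:i])
        context = f"{problem}\n{prefix}" if prefix else problem
        examples.append(StepExample(context=context, target_action=action))
    return examples

# A three-step algebra solution, for instance, becomes three training examples.
```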
According to I-Hung Hsu, a Google researcher and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. "SRL falls in the middle: it captures the structured flexibility of real-world problem solving, where there are multiple valid strategies but also clear notions of what 'good reasoning' looks like at each step," Hsu told VentureBeat. "This makes SRL well suited for areas like data science automation or possibly supply chain optimization – tasks that reward strong intermediate reasoning rather than just final answers."
During training, the model first generates an "inner monologue" (its internal reasoning process, enclosed in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model's predicted action and the expert's action. This dense, step-wise feedback lets the model learn from partially correct work instead of waiting for a fully correct final answer.
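As a rough sketch of such a step-wise signal, assuming the monologue is wrapped in <think> tags and the reward is a simple text-similarity score between the committed action and the expert's action (the parsing and the use of `difflib` here are assumptions for illustration, not the paper's exact metric):

```python
# Score the action the model commits to (the text outside its <think>...</think>
# monologue) by how closely it matches the expert's action for that step.

import re
from difflib import SequenceMatcher

def extract_action(model_output: str) -> str:
    """Strip the <think>...</think> monologue and keep the committed action."""
    return re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL).strip()

def step_reward(model_output: str, expert_action: str) -> float:
    """Reward in [0, 1] based on similarity to the expert's action at this step."""
    action = extract_action(model_output)
    return SequenceMatcher(None, action, expert_action).ratio()
```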
SRL in action
The researchers’ experiments show that SRL significantly outperforms strong baselines on complex mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve the quality of solutions without simply producing longer outputs.
For business leaders, performance gains are only useful if they do not come with runaway costs. Hsu points out that models trained with SRL are more efficient in their reasoning. "The gains come from better quality and structure of reasoning, not from verbosity," he said. "In terms of efficiency, models trained by SRL are roughly comparable to the base model in terms of token usage… even though SRL is not designed to reduce inference cost, it achieves better reasoning performance without increasing it."
For the mathematics tests, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and with RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance improvement over the other methods.
The team then extended SRL to agentic software engineering, a critical area for enterprise automation. They trained Qwen2.5-Coder-7B-Instruct, a model specialized for coding, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was compared to the original base model and to SWE-Gym-7B, a strong baseline fine-tuned with SFT. The SRL model solved 14.8% of tasks, a 74% relative improvement over the SFT-based model. This demonstrates SRL's ability to train more competent AI agents for complex, real-world programming tasks.
A new standard for high-stakes AI?
The paper’s most striking results came from combining methods: first using SRL to teach fundamental reasoning, then using RLVR to refine that skill. When the researchers used SRL for pre-training and applied RLVR in post-training, they observed an average improvement of 3.7%, demonstrating a powerful curriculum learning strategy.
This raises the question of whether this could become a new model for building specialized AI.
"We consider the SRL as a solid foundation," Hsu said. "In a sense, SRL provides a curriculum – instructional models for thinking and acting step by step – before refining those behaviors with results-based reinforcement learning. This SRL-driven approach not only stabilizes the later stage of RL, but also makes the reasoning more interpretable and generalizable, which is essential for high-stakes applications."
Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. He is nevertheless optimistic about the path forward. "Although high-quality expert trajectories remain important," he concluded, "we believe the next big step will come from automating their generation and filtering, leveraging strong teacher models or even self-improving student models to bootstrap new data."