How Google’s “Internal RL” Could Unlock AI Agents in the Long Term

Google researchers have developed a technique that makes it easier for AI models to learn complex reasoning tasks that typically cause LLMs to hallucinate or collapse. Instead of training LLMs via next-token prediction, their technique, called internal reinforcement learning (internal RL), steers the model’s internal activations toward developing a high-level, step-by-step solution for the input problem.

Ultimately, this could provide a scalable path to creating autonomous agents capable of handling complex reasoning and real-world robotics without the need for constant manual guidance.

The limits of predicting the next token

Reinforcement learning plays a key role in post-training LLMs, particularly for complex reasoning tasks that require long-term planning. However, the problem lies in the architecture of these models. LLMs are autoregressive, meaning they generate sequences one token at a time. When these models explore new strategies during training, they do so by making small, random changes to the next token or action. This exposes a deeper limitation: predicting the next token forces models to search for solutions at the wrong level of abstraction, making long-term reasoning inefficient even when the model “knows” what to do.

This token-by-token approach works well for basic language modeling, but it breaks down on long-horizon tasks where rewards are sparse. If the model relies solely on random sampling at the token level, the probability of stumbling onto the correct multi-step solution is infinitesimal, "of the order of one in a million," according to the researchers.
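
To make that intuition concrete, here is a back-of-the-envelope calculation (our own illustration, not the paper's; the per-step probability is an assumed number) showing why a sparse, end-of-task reward is almost never reached by random token-level exploration:

```python
# Illustrative arithmetic: if an agent must get each of 20 consecutive steps
# right, and random token-level exploration hits the correct continuation at
# each step with probability p, the chance of ever seeing the sparse
# end-of-task reward in a single rollout is p**20.
p = 0.5            # assumed per-step success probability (for illustration only)
steps = 20         # a 20-step task, as in the example cited above
hit_rate = p ** steps
print(f"chance of a successful rollout: {hit_rate:.2e}")        # ~9.5e-07, on the order of one in a million
print(f"expected rollouts before the first reward: ~{1 / hit_rate:,.0f}")
```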

The problem is not just that models get confused; it’s that they explore at the wrong level of abstraction. In comments provided to VentureBeat, Yanick Schimpf, co-author of the paper, notes that in a 20-step task, an agent can get lost in the fine details of a single step, or lose track of the overall goal.

"We argue that when faced with a problem with an abstract structure… [goal-oriented exploration] that’s what you want," Schimpf said. By first solving the problem at the abstract level, the agent commits to a path, ensuring that it will not do so. "getting lost in one of the stages of reasoning" and fail to complete the larger workflow.

To solve this problem, the field has long turned to hierarchical reinforcement learning (HRL). HRL attempts to solve complex problems by breaking them down into a hierarchy of temporally abstract actions (high-level subroutines that represent different steps in the solution) rather than treating a task as a single chain of tokens.
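
The general structure of HRL can be sketched in a few lines. The following is a toy illustration of the idea, not the paper's code; the option names, termination rule, and random policies are all hypothetical stand-ins:

```python
import random

# A high-level policy picks a temporally abstract "option" (a subroutine), and
# a low-level policy executes primitive actions until that option terminates.

OPTIONS = ["fetch_key", "open_door", "reach_goal"]      # hypothetical subroutines

def high_level_policy(state):
    # Choose which subgoal to pursue next (random here, purely for illustration).
    return random.choice(OPTIONS)

def low_level_policy(state, option):
    # Choose a primitive action conditioned on the active option.
    return random.choice(["up", "down", "left", "right"])

def option_terminated(state, option, t):
    # Toy termination condition: each option runs for a fixed number of steps.
    return t >= 5

def rollout(env_step, state, horizon=30):
    for _ in range(horizon // 5):
        option = high_level_policy(state)               # decide at the abstract level
        t = 0
        while not option_terminated(state, option, t):
            action = low_level_policy(state, option)    # act at the primitive level
            state = env_step(state, action)
            t += 1
    return state

# Dummy environment whose state is just a step counter.
final_state = rollout(lambda s, a: s + 1, state=0)
```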

However, discovering these appropriate subroutines remains a long-standing challenge. Current HRL methods often "converge towards degenerate options" that do not represent meaningful behaviors. Even modern, sophisticated methods like GRPO (a popular RL algorithm used for sparse-reward tasks) fail in complex environments because they cannot effectively bridge the gap between low-level execution and high-level planning.

Steering the LLM’s internal reasoning

To overcome these limitations, the Google team developed internal RL. The key insight is that advanced autoregressive models already "know" how to perform complex, multi-step tasks internally, even if they are not explicitly trained to do so.

Since these complex behaviors are encoded in the model’s residual stream (the numerical values that carry information through the layers of the network), the researchers introduced an "internal neural network controller," or metacontroller. Instead of monitoring and modifying output tokens, the metacontroller steers the model’s behavior by applying changes to the model’s internal activations in intermediate layers.
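
The mechanics of nudging intermediate activations can be sketched with a small PyTorch example. This is a minimal illustration of activation steering in general, not the paper's actual metacontroller; the toy base model, the hook point (layer 3), and the controller architecture are all assumptions:

```python
import torch
import torch.nn as nn

class ToyBaseModel(nn.Module):
    """Stand-in for a frozen autoregressive model with a residual stream."""
    def __init__(self, d_model=64, n_layers=6, vocab=100):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        h = self.embed(tokens)
        for layer in self.layers:
            h = h + torch.tanh(layer(h))        # toy residual block
        return self.head(h)

class MetaController(nn.Module):
    """Maps the current hidden state to a steering offset."""
    def __init__(self, d_model=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh(),
                                 nn.Linear(d_model, d_model))

    def forward(self, h):
        return self.net(h)

base = ToyBaseModel()
controller = MetaController()

# Hook an intermediate layer (layer 3, chosen arbitrarily) and nudge its
# output with the controller's offset before the remaining layers run.
def steer(module, inputs, output):
    return output + controller(output)

handle = base.layers[3].register_forward_hook(steer)
logits = base(torch.randint(0, 100, (1, 10)))   # steered forward pass
handle.remove()
```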

This nudge steers the model toward a specific useful state. The base model then automatically generates the sequence of individual steps needed to achieve this goal, because it has already seen these patterns during its initial pre-training.

The metacontroller is trained without human-labeled examples. Instead, the researchers use a self-supervised framework in which the model analyzes a complete behavior sequence and works backwards to infer the high-level hidden intent that best explains the observed actions.
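
One common way to set up this kind of "work backwards from behavior" objective is an encoder that compresses a whole trajectory into a latent intent, paired with a decoder that must reproduce the actions from that intent alone. The sketch below is our own reading of that pattern, with hypothetical module names and random stand-in data, not the paper's implementation:

```python
import torch
import torch.nn as nn

class IntentEncoder(nn.Module):
    """Looks at a complete behavior sequence and infers a latent intent vector."""
    def __init__(self, action_dim=8, latent_dim=16):
        super().__init__()
        self.rnn = nn.GRU(action_dim, 32, batch_first=True)
        self.to_latent = nn.Linear(32, latent_dim)

    def forward(self, trajectory):                     # (batch, T, action_dim)
        _, h = self.rnn(trajectory)
        return self.to_latent(h[-1])                   # inferred high-level intent

class IntentConditionedDecoder(nn.Module):
    """Has to explain (reconstruct) the observed actions from the intent alone."""
    def __init__(self, action_dim=8, latent_dim=16):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, 32, batch_first=True)
        self.to_action = nn.Linear(32, action_dim)

    def forward(self, intent, T):
        z = intent.unsqueeze(1).expand(-1, T, -1)      # feed the intent at every step
        out, _ = self.rnn(z)
        return self.to_action(out)

encoder, decoder = IntentEncoder(), IntentConditionedDecoder()
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

trajectory = torch.randn(4, 20, 8)                     # stand-in behavior data
intent = encoder(trajectory)                           # work backwards to the hidden intent
recon = decoder(intent, T=trajectory.shape[1])
loss = nn.functional.mse_loss(recon, trajectory)       # no human labels needed
loss.backward()
opt.step()
```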

During the internal RL phase, updates are applied to the metacontroller, which shifts training from predicting the next token to learning the high-level actions that can lead to the solution.
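
To show what "learning high-level actions" rather than next tokens can look like in practice, here is a self-contained toy: a tiny controller picks abstract goals at a few decision points and is updated with a REINFORCE-style rule from a sparse, end-of-episode reward. The goal set, target plan, and environment are all hypothetical, and the frozen base model that would expand each goal into low-level tokens is omitted:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

N_GOALS, N_STAGES = 4, 3
controller = nn.Linear(N_STAGES, N_GOALS)             # tiny metacontroller policy
optimizer = torch.optim.Adam(controller.parameters(), lr=0.1)
target_plan = [2, 0, 3]                               # reward only for this exact plan (sparse)

for episode in range(500):
    log_probs, plan = [], []
    for stage in range(N_STAGES):
        obs = torch.nn.functional.one_hot(torch.tensor(stage), N_STAGES).float()
        dist = Categorical(logits=controller(obs))
        goal = dist.sample()                          # explore in the space of abstract goals
        log_probs.append(dist.log_prob(goal))
        plan.append(goal.item())
    reward = 1.0 if plan == target_plan else 0.0      # sparse reward, granted only at the end
    loss = -torch.stack(log_probs).sum() * reward     # credit assigned to high-level choices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```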

To understand the practical value of this, consider an enterprise agent responsible for code generation. Today there is a difficult trade-off: you need "low temperature" (predictability) to get the syntax right, but "high temperature" (creativity) to solve the logical puzzle.

"Internal RL could facilitate this by allowing the model to explore the space of abstract actions, i.e. structuring logic and method calls, while delegating the token-level realization of these actions to the robust, lower-temperature distribution of the base model." Schimpf said. The agent explores the solution without breaking the syntax.

Researchers investigated two methods for applying this controller. In the first, the base autoregressive model is pre-trained on a behavioral dataset and then frozen, while the metacontroller is trained to steer the frozen model’s residual stream. In the second, the metacontroller and the base model are jointly optimized, with the parameters of both networks being updated simultaneously.
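
In optimizer terms, the difference between the two regimes comes down to which parameters receive gradients. A minimal sketch, assuming the base model and metacontroller are ordinary PyTorch modules like the toy ones sketched earlier:

```python
import torch

def build_optimizer(base_model, metacontroller, freeze_base=True, lr=1e-4):
    if freeze_base:
        # Variant 1: pre-train the base model, then freeze it and train only
        # the metacontroller that steers its residual stream.
        for p in base_model.parameters():
            p.requires_grad_(False)
        params = list(metacontroller.parameters())
    else:
        # Variant 2: jointly optimize both networks, updating all parameters.
        params = list(base_model.parameters()) + list(metacontroller.parameters())
    return torch.optim.Adam(params, lr=lr)

# Usage with stand-in modules:
opt = build_optimizer(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8), freeze_base=True)
```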

Internal RL in action

To evaluate the effectiveness of internal RL, researchers conducted experiments in hierarchical environments designed to confuse traditional learners. These included a discrete grid world and a continuous control task where a quadruped "ant" robot must coordinate joint movements. Both environments used sparse rewards with very long action sequences.

While baselines such as GRPO and CompILE failed to learn the tasks within a million episodes because of the difficulty of credit assignment over long horizons, internal RL achieved high success rates with a small number of training episodes. By choosing high-level goals rather than individual low-level steps, the metacontroller significantly reduced the search space. This allowed the model to identify which high-level decisions led to success, making credit assignment efficient enough to overcome the sparse-reward problem.
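
A rough illustration of why the search space shrinks (our own numbers, not the paper's): random search over primitive actions must cover every timestep, while search over subgoals only has to cover a handful of decision points.

```python
# Assumed, illustrative quantities: 4 primitive actions over a 30-step horizon
# versus 4 candidate subgoals at 3 decision points.
primitive_actions, horizon = 4, 30
subgoals, decision_points = 4, 3
print(f"token/action-level search space: {primitive_actions ** horizon:.1e}")   # ~1.2e18 sequences
print(f"subgoal-level search space:      {subgoals ** decision_points}")        # 64 plans
```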

The researchers noted in particular that the "frozen" approach was superior. When the base model and metacontroller were co-trained from scratch, the system failed to develop meaningful abstractions. However, when applied to a frozen model, the metacontroller successfully discovered key control points without any human labels, aligning its internal switching mechanism almost perfectly with the ground-truth moments when an agent completed one subgoal and started the next.

While the industry is currently focused on reasoning models that generate verbose "thought chains" to solve problems, Google’s research points to a different, perhaps more efficient, future.

"Our study joins a growing body of work suggesting that “internal reasoning” is not only feasible but potentially more effective than token-based approaches." Shame said. "Additionally, these silent “thoughts” can be decoupled from specific input modalities – a property that could be particularly relevant to the future of multimodal AI."

If internal reasoning can be guided without being outsourced, the future of AI agents may depend less on incentive strategies and more on how we can access and drive what the models already represent internally. For companies betting on autonomous systems that must plan, adapt, and act over long horizons, this shift could matter more than any new benchmark in reasoning.
