Scaling LLM Reinforcement Learning with ProRL v2: Prolonged Training for Continuous Improvement
Sources: https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2, developer.nvidia.com
TL;DR
- ProRL v2 is the latest evolution of Prolonged Reinforcement Learning (ProRL) for large language models (LLMs), designed to test the effects of extended RL training (NVIDIA Research).
- It builds on the REINFORCE++ baseline and adds methods like Clip-Higher and Dynamic Sampling to improve exploration and learning efficiency.
- Core innovations include the clipped PPO (PPO-Clip) loss, Global Batch Normalization, and periodic resets of the reference policy to maintain learning momentum.
- A scheduled cosine length penalty and a KL penalty help balance informativeness, conciseness, and policy stability.
- Evaluations show state-of-the-art performance and sustained gains across math, code, and reasoning benchmarks, even with reduced training context length. Open-source models and benchmarks are provided to enable reproducibility and broader validation.
Context and background
The AI community has long debated whether LLMs can keep improving through sustained reinforcement learning, or whether capabilities plateau after conventional training schedules. NVIDIA Research introduces ProRL v2 as the next step in probing this question, extending the idea of Prolonged Reinforcement Learning (ProRL) and testing the effects of thousands of additional RL steps on LLMs. The project builds on established RL techniques but introduces rigorous regularization, broad domain coverage, and targeted exploration strategies to push past traditional boundaries. The goal is not merely resampling familiar solutions, but genuinely expanding what the model can discover over time. In this setting, ProRL v2 leverages a stabilized RL framework built on the REINFORCE++ baseline, which uses local mean and global batch advantage normalization to improve training stability in RL with verifiable rewards (RLVR). This work also incorporates explicit mechanisms to encourage exploration and reduce noise in gradient estimates, enabling models to continue learning and improving in challenging tasks. The evaluation spans math, code generation, and diverse reasoning benchmarks to assess whether extended RL can yield robust gains across domains, including out-of-distribution tasks (NVIDIA Research).
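The digest does not give the exact normalization formula, but a minimal sketch of the idea (subtract a per-prompt local mean, then scale by a global batch statistic) might look like the following; the grouping convention and the use of the batch standard deviation are assumptions rather than confirmed implementation details.

```python
import numpy as np

def reinforce_pp_baseline_advantages(rewards, group_ids, eps=1e-6):
    """Sketch of local-mean / global-batch advantage normalization.

    rewards:   1D array of scalar rewards, one per sampled response.
    group_ids: 1D array mapping each response to the prompt it answers,
               so responses to the same prompt form one "group".
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    group_ids = np.asarray(group_ids)
    advantages = np.empty_like(rewards)

    # Local step: center each response's reward on its own prompt-group mean.
    for g in np.unique(group_ids):
        mask = group_ids == g
        advantages[mask] = rewards[mask] - rewards[mask].mean()

    # Global step: scale by a statistic computed over the whole batch
    # (assumed here to be the standard deviation of the centered rewards).
    return advantages / (advantages.std() + eps)

# Example: two prompts, three sampled responses each.
adv = reinforce_pp_baseline_advantages(
    rewards=[1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    group_ids=[0, 0, 0, 1, 1, 1],
)
print(adv)
```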
What’s new
ProRL v2 introduces several innovations designed to stabilize long-horizon RL while expanding what the model can discover. Key elements include:
- A PPO-Clip loss at the core of policy updates to restrict how far the new policy can diverge from the old one, helping maintain stability during prolonged training (see the loss sketch after this list).
- Global Batch Normalization in the REINFORCE++ baseline, which prevents instability from small group sizes by centering rewards within each prompt group and then normalizing across the global batch.
- Clip-Higher, which uses a higher upper bound of the PPO clipping range to promote sampling diversity and mitigate policy entropy collapse.
- Dynamic Sampling, which discards prompts with uniform group rewards (e.g., all correct or all incorrect) to reduce noise in gradient estimates.
- A scheduled cosine length penalty to encourage concise, token-efficient outputs by cycling the penalty on and off at regular intervals.
- A KL penalty to keep policy updates aligned with a reference. Periodic resets of the reference policy occur every 200–500 RL steps (or upon KL spikes or stalled validation) to prevent stagnation and keep learning aligned with current capabilities.
- The combination of these strategies helps avoid overfitting to a static reference or fixed context length, supporting improved accuracy and broader reasoning capabilities over time. The table below summarizes the innovations.

| Innovation | Purpose | Impact |
| --- | --- | --- |
| PPO-Clip loss | Stabilizes policy updates | Limits divergence and improves stability during repeated updates |
| Global Batch Normalization | Stabilizes value estimates across small groups | Reduces sensitivity to reward patterns and group size |
| Clip-Higher | Encourages exploration and diversity | Maintains learning momentum under extended RL |
| Dynamic Sampling | Reduces gradient noise | Improves signal quality for updates |
| Scheduled cosine length penalty | Promotes conciseness in outputs | Balances informativeness with token efficiency |
| KL penalty + periodic resets | Keeps policy aligned while allowing adaptation | Prevents stagnation and supports continuous improvement |
- ProRL v2 was evaluated across math, code generation, and diverse reasoning benchmarks. The results show new state-of-the-art performance and sustained improvement even when training context length is reduced from 16K to 8K.
- As of this writing, the model remains in continuous training, with accuracy still improving and open-source models and benchmarks available for community exploration. The work highlights that extended RL can meaningfully expand LLM reasoning capabilities beyond conventional pipelines.
- Practitioners are encouraged to leverage ProRL as a reproducible foundation and training recipe for advancing model performance in RL contexts. The project also points to open-source models and benchmarks as a path for broader validation and collaboration. Ready to get started? Explore ProRL models on Hugging Face.
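To make the interaction between the PPO-Clip loss and Clip-Higher concrete, here is a minimal token-level sketch of a clipped surrogate objective with an asymmetric clipping range, written in PyTorch. The epsilon values, tensor shapes, and the token-averaged reduction are illustrative assumptions, not settings reported for ProRL v2.

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, mask,
                           eps_low=0.2, eps_high=0.28):
    """PPO-Clip surrogate with a Clip-Higher style asymmetric range.

    logp_new / logp_old: per-token log-probabilities of the sampled tokens
                         under the current and behavior policies, shape [B, T].
    advantages:          per-token advantages, shape [B, T].
    mask:                1 for response tokens, 0 for padding, shape [B, T].
    eps_low / eps_high:  lower / upper clip offsets; setting eps_high > eps_low
                         is the Clip-Higher idea (values here are examples).
    """
    mask = mask.to(advantages.dtype)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    per_token_loss = -torch.minimum(unclipped, clipped)
    # Average only over real response tokens.
    return (per_token_loss * mask).sum() / mask.sum().clamp_min(1.0)

# Tiny usage example with random tensors standing in for real rollout data.
B, T = 2, 4
loss = clipped_surrogate_loss(
    logp_new=torch.randn(B, T), logp_old=torch.randn(B, T),
    advantages=torch.randn(B, T), mask=torch.ones(B, T),
)
print(loss.item())
```

Raising eps_high above eps_low lets tokens with positive advantages gain probability mass for longer before the clip takes effect, which is how a wider upper range encourages sampling diversity.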
Why it matters (impact for developers/enterprises)
For developers and enterprises seeking to push the boundaries of LLM capabilities, ProRL v2 offers a practical framework to test and realize sustained improvements through prolonged reinforcement learning. By combining a stabilized RL objective with exploration-friendly mechanisms, ProRL v2 aims to expand what models can discover rather than simply rehash familiar solutions. The visible gains across math, code, and reasoning tasks suggest that extended RL can yield robustness across challenging and out-of-distribution scenarios, potentially translating to more capable assistants, better code-generation tools, and stronger analytical reasoning components. The halving of training context length (from 16K to 8K) while still achieving accuracy gains demonstrates potential reductions in computational cost without sacrificing performance. The inclusion of a reproducible training recipe, open-source models, and benchmarks lowers barriers to adoption, enabling organizations to validate, extend, and apply these techniques within their own pipelines and in production-grade environments. The emphasis on regular resets and dynamic sampling also supports long-running training regimes, reducing the risk of stagnation and helping teams maintain progress across iterations. In sum, ProRL v2 offers a structured path to sustained improvements in LLMs through prolonged RL, with practical implications for researchers, builders, and enterprise users aiming to extend model capabilities and reasoning reach (NVIDIA Research).
Technical details or Implementation (how it works)
At the heart of ProRL v2 is the clipped proximal policy optimization (PPO-Clip) loss, which stabilizes policy updates by restricting how much the new policy can diverge from the old one. The design also draws on the REINFORCE++ baseline, which employs local mean and global batch advantage normalization to enhance training stability in RL with verifiable rewards (RLVR). This normalization keeps the algorithm robust to reward patterns that could otherwise destabilize learning: rewards are centered within each prompt group before being normalized across the global batch, which manages sensitivity to small group sizes. Several innovations specifically address exploration, noise, and efficiency:
- Clip-Higher increases the upper bound of the PPO clipping range to promote sampling diversity and mitigate policy entropy collapse.
- Dynamic Sampling discards prompts with all-1 (fully correct) or all-0 (fully incorrect) group rewards, reducing gradient noise and improving learning efficiency.
- A scheduled cosine length penalty is applied to promote concise outputs, cycling on and off to balance informativeness with token economy.
- A KL penalty keeps the policy close to a reference, while periodic resets (every 200–500 RL steps or upon KL spikes) set the reference policy to the current policy without clearing optimizer state, helping to avoid overfitting to outdated guidance (a sketch of the sampling filter and reference-reset logic follows below).

Together, these components curb overfitting and encourage continuous improvement in accuracy and overall performance. The approach is designed to prevent the model from becoming trapped by a fixed context length or a stubborn reference policy, enabling ongoing improvements in the model's reasoning abilities. The results show sustained gains across math, code, and reasoning tasks, with robust outcomes on challenging and out-of-distribution benchmarks. The work presents a reproducible foundation and practical training recipe for researchers and practitioners seeking to push LLM performance through prolonged RL. Open-source models and benchmarks accompany the release, inviting broader validation and collaboration (NVIDIA Research).
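As noted above, a compact sketch of the sampling filter and the reference-reset logic might look like the following. The binary 0/1 reward convention, the naive per-token KL estimate, the reset cadence default, and the KL-spike threshold are all illustrative assumptions rather than details taken from the post.

```python
import copy
import torch

def keep_informative_prompts(group_rewards):
    """Dynamic Sampling sketch: drop prompts whose sampled responses all
    succeed or all fail, since a uniform group carries no learning signal.

    group_rewards: dict mapping prompt_id -> list of 0/1 rewards.
    Returns the prompt_ids worth keeping for the policy update.
    """
    return [
        pid for pid, rewards in group_rewards.items()
        if 0 < sum(rewards) < len(rewards)
    ]

class ReferencePolicyManager:
    """KL penalty against a frozen reference, with periodic hard resets.

    `policy` is assumed to be a torch.nn.Module; the coefficients and the
    spike threshold below are illustrative defaults.
    """

    def __init__(self, policy, kl_coef=0.01, reset_every=300, kl_spike=1.0):
        self.reference = copy.deepcopy(policy).eval()   # frozen snapshot
        self.kl_coef = kl_coef
        self.reset_every = reset_every                  # e.g. 200-500 RL steps
        self.kl_spike = kl_spike

    def kl_penalty(self, logp_policy, logp_reference, mask):
        # Simple per-token KL estimate between the policy and the reference,
        # averaged over real response tokens.
        mask = mask.to(logp_policy.dtype)
        kl = (logp_policy - logp_reference) * mask
        return self.kl_coef * kl.sum() / mask.sum().clamp_min(1.0)

    def maybe_reset(self, policy, step, mean_kl, validation_stalled):
        # Reset the reference to the current policy on a fixed cadence,
        # on a KL spike, or when validation stops improving. The optimizer
        # state is deliberately left untouched.
        if step % self.reset_every == 0 or mean_kl > self.kl_spike or validation_stalled:
            self.reference = copy.deepcopy(policy).eval()
```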
Key takeaways
- ProRL v2 demonstrates that LLMs can achieve sustained improvements through extended RL beyond conventional schedules.
- A combination of PPO-Clip loss, Clip-Higher, Dynamic Sampling, and Global Batch Normalization stabilizes training and promotes exploration.
- Periodic reference-policy resets and a scheduled cosine length penalty help avoid stagnation and promote token-efficient outputs (a penalty-schedule sketch follows this list).
- The approach yields state-of-the-art performance across math, code, and reasoning benchmarks, even with shorter context lengths.
- Open-source models and benchmarks provide a reproducible path for validation and extension by the community.
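Because the post describes the length penalty only at a high level (a cosine-shaped penalty that is cycled on and off), the sketch below shows one plausible reading: a penalty that ramps with a cosine of how close a response is to a length budget, active only on alternating windows of training steps. The budget, window length, and penalty scale are made-up illustration values.

```python
import math

def scheduled_cosine_length_penalty(step, response_len, max_len=8192,
                                    window=50, max_penalty=0.5):
    """One possible scheduled cosine length penalty (illustrative only).

    step:         current RL training step, used to cycle the penalty on/off.
    response_len: number of generated tokens in the response.
    max_len:      length budget; responses at or beyond it get the full penalty.
    window:       penalty is active for `window` steps, then off for `window`.
    max_penalty:  penalty magnitude applied at the length budget.
    Returns a non-negative value to subtract from the response's reward.
    """
    penalty_active = (step // window) % 2 == 0
    if not penalty_active:
        return 0.0
    frac = min(response_len / max_len, 1.0)
    # Cosine ramp: ~0 penalty for short responses, max_penalty near the budget.
    return max_penalty * 0.5 * (1.0 - math.cos(math.pi * frac))

# Example: a 6,000-token response during an "on" window.
print(scheduled_cosine_length_penalty(step=120, response_len=6000))
```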
FAQ
- What is ProRL v2?
ProRL v2 is the latest evolution of Prolonged Reinforcement Learning for LLMs, designed to test the effects of thousands of additional RL steps and to push sustained improvements beyond typical RL schedules.
- How does ProRL v2 stabilize long-horizon RL training?
It combines a PPO-Clip loss, Global Batch Normalization, Clip-Higher, Dynamic Sampling, a scheduled cosine length penalty, and a KL penalty, along with periodic resets of the reference policy.
- Is ProRL v2 open source and available to the community?
Yes, open-source models and benchmarks are available to enable reproducibility and further exploration. Ready to get started? Explore ProRL models on Hugging Face.
- Does ProRL v2 reduce training costs while maintaining or improving accuracy?
The evaluation shows improved accuracy even when the training context length is reduced from 16K to 8K, indicating potential computational efficiency benefits in addition to performance gains.