Reinforcement Learning with NVIDIA NeMo-RL: Megatron-Core Backed Optimized Training Throughput

Source: https://developer.nvidia.com/blog/reinforcement-learning-with-nvidia-nemo-rl-megatron-core-support-for-optimized-training-throughput (developer.nvidia.com)

TL;DR

  • NeMo-RL v0.3 adds Megatron-Core backend for optimized training throughput on massive language models.
  • The Megatron-Core backend addresses the activation-recompute overhead and throughput limitations seen with PyTorch DTensor (FSDP2) at the scale of hundreds of billions of parameters, delivering higher throughput while preserving convergence.
  • The release supports dense and Mixture of Experts (MoE) models, 6D parallelism, long-context training with sequence lengths up to 16k, and context parallelism when needed.
  • Training can be launched in the same way as with DTensor, and configuration is streamlined via policy.megatron_cfg in YAML; Megatron arguments are forwarded during training.
  • NeMo-RL continues to enable native HuggingFace integration and PyTorch native parallelisms, with optimizations tailored for GPU kernels and high-throughput workloads.

Context and background

NVIDIA’s NeMo-RL project originally shipped training support through PyTorch DTensor (FSDP2), enabling native HuggingFace ecosystem compatibility, quick experimentation, and scaling with PyTorch parallelisms such as FSDP2, tensor parallel, sequence parallel, and context parallel. As model sizes grow toward hundreds of billions of parameters, however, the DTensor path incurs activation recompute overhead and lacks the optimized NVIDIA CUDA kernels and other performance improvements needed for best throughput. These limitations motivated the integration of a backend designed specifically for large-scale models: Megatron-Core.

The Megatron-Core backend is built on GPU-optimized techniques and high-throughput performance enhancements. It implements a 6D parallelism strategy that optimizes both communication and computation patterns and supports a diverse range of model architectures. NeMo-RL v0.3 brings Megatron-Core support to post-training workflows, giving developers access to optimized training while maintaining convergence comparable to prior backends.

Beyond performance, NeMo-RL v0.3 emphasizes usability. Although Megatron-Core exposes many low-level settings, the release automates much of the tuning behind the scenes and presents a smaller, more intuitive set of configuration options, lowering the barrier to adopting Megatron-based training for large-scale models.

The release also reports practical results, including performance comparisons and long-context training evidence. For example, tests on Llama 3.3 70B with a 16k sequence length using the Megatron backend illustrate the practical reach of Megatron-Core for long-context tasks; measurements were averaged over multiple steps to reflect typical training behavior, and convergence was preserved with these optimizations. NVIDIA also notes that Megatron-Core training can be used alongside DTensor for long-context scenarios, reflecting a flexible, hybrid approach to parallelism when necessary. Together, these capabilities support a wide range of model architectures and sequence lengths while offering a clear path to more efficient post-training.

For a deeper look and working examples, see the NVIDIA NeMo-RL documentation and the official NVIDIA blog post on Megatron-Core-backed NeMo-RL training.

What’s new

NeMo-RL v0.3 introduces Megatron-Core as a first-class backend for post-training workflows. Key enhancements include:

  • Megatron-Core backend integration for high-throughput training of large language models, with improved communication/computation patterns via 6D parallelism.
  • Support for both dense models and Mixture of Experts (MoE) architectures.
  • Long-context training support, including sequences up to 16k, with ongoing optimization for even longer contexts.
  • Seamless enabling via configuration: add the policy.megatron_cfg section to your YAML configuration and set enabled to true; all arguments within that section are forwarded to Megatron during training (see the configuration sketch after this list).
  • Training workflow parity with DTensor: launching Megatron-based training follows the same process as DTensor-based runs, with the same setup guidance from the NeMo-RL README and related guides.
  • Automated tuning and sensible defaults reduce the complexity of optimizing Megatron-Core settings, while exposing essential controls for advanced users.
  • Performance demonstrations show superior training throughput relative to PyTorch DTensor at similar convergence properties, with specific results reported for representative models such as Llama 3.3 70B at long contexts.
  • Context parallelism remains compatible with both Megatron-Core and DTensor, enabling long-context training when required.

To get started, explore the complete working examples and configuration files in the NeMo-RL v0.3 release and official documentation. The Megatron-based workflow supports both dense and MoE models, enabling a broad range of large-model post-training scenarios.
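
As a concrete illustration of the configuration step above, the sketch below shows roughly where policy.megatron_cfg sits in a NeMo-RL YAML file. Only the policy.megatron_cfg section and its enabled flag are described in the release notes; the surrounding keys and values are illustrative placeholders rather than a verbatim excerpt of a shipped config.

```yaml
# Minimal sketch: enabling the Megatron-Core backend in a NeMo-RL YAML config.
# Only `policy.megatron_cfg.enabled` is taken from the release description;
# the surrounding structure and the model id are illustrative placeholders.
policy:
  model_name: "meta-llama/Llama-3.3-70B-Instruct"   # placeholder model id
  megatron_cfg:
    enabled: true   # switch the training backend from DTensor to Megatron-Core
    # Any additional arguments placed in this section are forwarded to
    # Megatron during training, per the NeMo-RL v0.3 release notes.
```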

Why it matters (impact for developers/enterprises)

  • Throughput gains on very large models: Megatron-Core is designed to maximize GPU utilization and reduce step times for models with hundreds of billions of parameters, addressing the suboptimal throughput observed on the DTensor path.
  • Easier adoption of large-model post-training: NeMo-RL now provides an automatic tuning workflow and straightforward YAML configuration, lowering the barrier for teams to adopt Megatron-Core without deep low-level tuning.
  • Broad model support: The combination of dense and MoE support expands the set of architectures that can be efficiently post-trained, enabling more flexible experimentation and deployment.
  • Long-context capability: With 16k context windows and ongoing optimizations for longer contexts, developers can push performance on tasks requiring extended dependencies, aligning with real-world use cases.
  • Hybrid parallelism flexibility: The ability to use Megatron-Core alongside DTensor for long-context scenarios provides a practical pathway for teams to optimize current pipelines while evaluating Megatron-Core benefits.
  • Convergence parity: Despite higher throughput, convergence properties remain comparable to prior approaches, enabling reliable training outcomes and reproducible results.

Technical details or Implementation

  • Megatron-Core backend: Built with GPU-optimized kernels and a 6D parallelism strategy to optimize both communication and computation across model architectures.
  • Model support: Both dense and MoE models are supported, enabling a wide spectrum of large-model post-training workflows.
  • Configuration and automation: The Megatron-Core integration is designed to reduce manual tuning. Users add policy.megatron_cfg to their YAML configuration and set enabled to true; all arguments in that section are forwarded to Megatron at training time, providing a unified workflow (see the sketch after this list).
  • Launch parity: Megatron-based training launches in the same manner as DTensor-based runs, following the NeMo-RL guidance and the reproducible DeepScaleR workflows.
  • Long-context and context parallelism: Megatron-Core supports long-context training and can be used with context parallelism in conjunction with DTensor for very long sequences; practical results include testing with 16k sequence lengths on large models like Llama 3.3 70B.
  • Performance characteristics: Timing measurements indicate improved throughput with Megatron-Core while preserving convergence properties, enabling faster experiments and training cycles.
  • Supporting features: The release references performance enhancements like sequence packing and importance sampling as part of the Megatron-Core optimization toolkit.
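
To make the long-context and parallelism knobs above more tangible, here is a slightly fuller sketch. The field names for parallelism and sequence length below are assumptions chosen to mirror common Megatron-Core terminology, not verified names from the NeMo-RL v0.3 schema; consult the release's example configs for the exact keys.

```yaml
# Hedged sketch of a longer-context Megatron-Core setup. Field names other than
# `policy.megatron_cfg.enabled` are assumed/illustrative, not confirmed against
# the shipped NeMo-RL v0.3 configuration schema.
policy:
  max_total_sequence_length: 16384      # 16k context, as reported for Llama 3.3 70B
  megatron_cfg:
    enabled: true
    tensor_model_parallel_size: 8       # assumed knob names mirroring Megatron-Core terms
    pipeline_model_parallel_size: 4
    context_parallel_size: 2            # context parallelism for long sequences
    sequence_parallel: true
    # Extra keys placed here are forwarded to Megatron at training time.
```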

Tables and comparisons (high level)

| Backend path | Training throughput | Convergence properties | Model scale support |
|---|---|---|---|
| DTensor (FSDP2) | Moderate to high; limited by activation recompute/memory overhead on very large models | Comparable to prior methods | Large models approaching hundreds of billions of parameters (with caveats) |
| Megatron-Core | Higher throughput via GPU-optimized kernels and 6D parallelism | Preserves the same convergence properties as DTensor for a given setup | Dense and MoE architectures, including very large models (e.g., Llama 3.3 70B) |

Key takeaways

  • Megatron-Core brings GPU-optimized training throughput to NeMo-RL, addressing DTensor limitations for ultra-large models.
  • The v0.3 release enables both dense and MoE models with 6D parallelism and long-context training, including 16k sequences.
  • Configuration is simplified through policy.megatron_cfg, with arguments forwarded to Megatron during training, and training launches remain familiar to DTensor users.
  • Long-context training with Megatron-Core can be used alongside DTensor, offering flexible deployment for enterprise-scale workloads.
  • NVIDIA emphasizes native HuggingFace integration and compatibility with PyTorch native parallelisms, now extended with Megatron-Core performance optimizations.

FAQ

  • What is Megatron-Core in NeMo-RL v0.3?

    Megatron-Core is a backend designed to deliver optimized training throughput for very large language models, implemented with GPU-optimized kernels and a 6D parallelism strategy. It is integrated into NeMo-RL v0.3 for post-training workflows.

  • How do I enable Megatron-Core in NeMo-RL?

    Add the policy.megatron_cfg section to your YAML configuration and set enabled to true. All arguments within that section are forwarded to Megatron during training.

  • Does Megatron-Core support MoE models?

    Yes. Megatron-Core supports both dense and MoE model architectures.

  • Can I use Megatron-Core with long-context training?

    Yes. Megatron-Core supports long-context training and has demonstrated performance with sequences up to 16k, with ongoing optimizations for longer contexts.

  • Is the Megatron-Core workflow compatible with DTensor launches?

    Yes. Megatron-based training can be launched in the same way as DTensor training, and can be used in combination with DTensor for long-context scenarios.
