NVIDIA NeMo-RL Megatron-Core: Optimized Training Throughput

Source: NVIDIA Developer Blog, https://developer.nvidia.com/blog/reinforcement-learning-with-nvidia-nemo-rl-megatron-core-support-for-optimized-training-throughput/

Overview

NVIDIA NeMo-RL (NeMo Reinforcement Learning) initially shipped training support through PyTorch DTensor (also known as FSDP2), enabling native Hugging Face integration and scaling with PyTorch-native parallelisms such as FSDP2, tensor parallelism, sequence parallelism, and context parallelism. As model sizes push toward hundreds of billions of parameters, however, the DTensor path incurs activation-memory and recompute overhead that slows step times, and it lacks the optimized CUDA kernels and other performance enhancements needed for high throughput. Megatron-Core addresses these limitations.

The NeMo-RL v0.3 release adds Megatron-Core backend support, along with detailed documentation, example scripts, and configuration files for post-training large models efficiently. Megatron-Core delivers GPU-optimized techniques and high-throughput performance enhancements built around a 6D parallelism strategy that optimizes communication and computation patterns and supports a wide range of model architectures. NeMo-RL exposes these optimizations during post-training while automating much of the complex tuning behind a simpler, more intuitive configuration surface.

Enabling Megatron-based training is straightforward: add a policy.megatron_cfg section to your YAML configuration and set enabled: true. A complete working example is provided in the documentation, and all arguments within that section are forwarded to Megatron during training. Once the section is added and enabled, launching training proceeds exactly as it does with DTensor, as described in the README and the guide on reproducing DeepScaleR.

Megatron-Core-based training supports both dense and Mixture-of-Experts (MoE) models. Step-time breakdowns for Group Relative Policy Optimization (GRPO) on multiple models show higher training throughput than DTensor while preserving convergence properties. GPU-optimized kernels, 4D parallelism, sequence packing, and importance sampling contribute to the throughput gains, and long-context training is supported up to 16K sequence lengths, with context parallelism available through both Megatron-Core and DTensor, as illustrated by results for large models like Llama 3.3 70B. For researchers and engineers targeting post-training reinforcement learning on very large models, Megatron-Core provides a path to higher throughput, longer context support, and broader model coverage; the NeMo-RL documentation, example configs, and scripts are the place to start.
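
For orientation, the sketch below shows what an enabled megatron_cfg section might look like. The parallelism fields and values are illustrative assumptions, not documented defaults; consult the complete working example in the NeMo-RL documentation for the supported fields.

policy:
  megatron_cfg:
    enabled: true
    # The fields below are illustrative; check the NeMo-RL example configs
    # for the exact names and sensible values for your model and cluster.
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 2
    context_parallel_size: 1
    sequence_parallel: true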

Key features

  • Megatron-Core backend support in NeMo-RL v0.3, enabling post-training optimizations for large models.
  • GPU-optimized kernels and high-throughput performance improvements.
  • 6D parallelism strategy to optimize communication and computation; supports a wide range of model architectures.
  • 4D parallelism used in conjunction with Megatron-Core, plus sequence packing and importance sampling (a sequence-packing configuration sketch follows this list).
  • Long-context training support, with tests demonstrating effectiveness up to 16K sequence lengths.
  • Context parallelism with Megatron-Core and DTensor for long-context training scenarios.
  • Support for both dense and Mixture-of-Experts (MoE) models.
  • Simplified configuration via policy.megatron_cfg; automatic handling of much tuning behind the scenes.
  • Compatible with existing DTensor workflows; training is launched the same way as with DTensor.
  • Documentation, complete working examples, and scripts to help reproduce results like those shown for Llama 3.3 70B.
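
Sequence packing, listed among the throughput features above, is toggled in the policy section of the config. The snippet below is a sketch: the sequence_packing block and the algorithm name are assumptions modeled on typical NeMo-RL GRPO example configs and should be verified against the shipped YAML files.

policy:
  sequence_packing:
    # Assumed field names; confirm against the NeMo-RL example GRPO configs.
    enabled: true
    algorithm: "modified_first_fit_decreasing"  # illustrative bin-packing strategy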

Common use cases

  • Post-training reinforcement learning for very large language models (model sizes reaching hundreds of billions of parameters).
  • Scenarios requiring long-context training (demonstrated at 16K sequence lengths) to preserve context across long input streams (a context-parallelism configuration sketch follows this list).
  • Experiments and production workflows involving dense models as well as Mixture-of-Experts (MoE) architectures.
  • Rapid experimentation and iteration within the HuggingFace ecosystem, leveraging Megatron-Core optimizations for throughput.
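
For the long-context use case, the maximum sequence length and context parallelism are set in the policy config. The sketch below is illustrative; the field names max_total_sequence_length and context_parallel_size are assumptions to be checked against the NeMo-RL long-context documentation.

policy:
  max_total_sequence_length: 16384   # 16K context, as demonstrated in the source
  megatron_cfg:
    enabled: true
    context_parallel_size: 2         # shards long sequences across GPUs (illustrative value)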

Setup & installation

To enable Megatron-Core in NeMo-RL, add the Megatron configuration to your YAML config:

policy:
  megatron_cfg:
    enabled: true

All arguments within the megatron_cfg section are forwarded to Megatron during training. After adding the section and enabling it, you are ready to train a model: launching training works the same way as with DTensor, as described in the README and the guide on reproducing DeepScaleR. Megatron-Core-based training supports both dense and MoE models. For a minimal start, make sure the megatron_cfg section is present and enabled in your config, then follow the standard training workflow documented in the project README.
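
As an illustration of the unchanged launch path, a GRPO run might be started as shown below. The script name, flag, and config path are indicative only, based on the NeMo-RL examples; check the README for the exact commands.

# Illustrative launch command (verify script and config paths in the NeMo-RL README)
uv run python examples/run_grpo_math.py --config path/to/your_grpo_megatron_config.yaml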

Quick start

  • Ensure the Megatron-Core backend is enabled in your NeMo-RL configuration:
policy:
  megatron_cfg:
    enabled: true
  • Use the same training launch path you would use for the DTensor workflow. See the DTensor guide and the NeMo-RL README for the exact commands and scripts to reproduce the DeepScaleR workflow.
  • A minimal example using the Megatron backend is the YAML snippet above; from there, refer to the documentation for the complete working configuration example and the script files provided.
  • For model choices, Megatron-Core supports dense models and Mixture-of-Experts (MoE) variants; you can apply the same training routines with the Megatron backend enabled.
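
For MoE variants, Megatron-Core adds expert parallelism on top of the settings above. The expert_model_parallel_size field below is a standard Megatron-Core argument, but whether it is exposed directly under policy.megatron_cfg is an assumption; confirm against the NeMo-RL MoE example configs.

policy:
  megatron_cfg:
    enabled: true
    expert_model_parallel_size: 8   # distributes MoE experts across GPUs (illustrative value)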

Quick start (minimal runnable example)

  • Minimal YAML to enable Megatron-Core:
policy:
  megatron_cfg:
    enabled: true
  • Launch training using the same DTensor-based entry points as described in the NeMo-RL README; Megatron configuration arguments will be forwarded automatically to Megatron during training.

Pros and cons

Pros:

  • Substantially higher training throughput for large models compared with the PyTorch DTensor path.
  • GPU-optimized kernels and dedicated 4D/6D parallelism reduce communication bottlenecks and improve compute efficiency.
  • Supports long-context training (tested up to 16K sequence lengths) and large architectures like Llama 3.3 70B.
  • Dense and MoE model support provides flexibility for diverse workloads.
  • Automatic tuning reduces the complexity of configuring Megatron-Core, simplifying adoption for new users.
  • Context parallelism with Megatron-Core and DTensor enables long-context training scenarios.

Cons:

  • Megatron-Core introduces many low-level settings; although NeMo-RL automates much of the tuning, some advanced users may still need to adjust options for specific workloads.
  • The setup requires adding and enabling Megatron-specific config in YAML, which is an additional step beyond the base DTensor workflow.

Alternatives (brief comparisons)

| Aspect | PyTorch DTensor (FSDP2) path | Megatron-Core (NeMo-RL) |
|---|---|---|
| Throughput on very large models | Slower step times due to activation memory and recompute overhead; lacks optimized CUDA kernels | Higher throughput via GPU-optimized kernels and a 6D parallelism strategy; 4D parallelism with sequence packing and importance sampling |
| Convergence properties | Maintains convergence with the same properties as other PyTorch-based training | Maintains convergence properties; optimized for throughput without sacrificing convergence |
| Complexity / tuning | Lower surface-level configuration; more manual tuning may be required for scale | NeMo-RL automates much tuning; simplified config via policy.megatron_cfg |
| Model support | Broad support through PyTorch parallelisms | Dense and MoE models supported; long-context training demonstrated |
| Context length | Dependent on DTensor capabilities | Demonstrated up to 16K sequence lengths; ongoing optimizations for long-context training |

Pricing or License

Not specified in the source material.
