NVFP4 Trains with 16-Bit Precision and the Speed of 4-Bit for Large-Scale Pretraining
Source: https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/
TL;DR
- NVIDIA extends NVFP4 from inference to pretraining, enabling 4-bit precision to accelerate large-scale transformer pretraining while preserving FP8/BF16-like accuracy.
- In a 12B Hybrid Mamba-Transformer experiment, NVFP4 trained from scratch with a 10-trillion-token dataset, showing stable convergence and FP8-like validation behavior.
- Measured GEMM performance on Blackwell Ultra vs Hopper indicates up to 7x speedups, highlighting significant throughput gains for large-scale pretraining.
- The NVFP4 pretraining recipe targets dynamic range, gradient volatility, and numerical stability, and remains in the research phase with ongoing collaborations across leading cloud providers and AI labs.
Context and background
As AI workloads scale to multi-billion-parameter foundation models, organizations increasingly rely on token throughput and training efficiency to unlock new capabilities. Inference has already driven successive waves of precision optimization, from FP32 to FP16, FP8, and NVIDIA’s NVFP4, in pursuit of lower latency, higher throughput, and better efficiency. Pretraining presents distinct challenges: many models still rely on BF16 or FP8 for stability and convergence, yet training consumes the bulk of compute, power, and time, so it benefits immensely when memory is reduced, arithmetic throughput is increased, and communication is optimized. In this context, 4-bit precision offers a way to push more tokens through the same hardware without sacrificing accuracy, enabling faster experimentation and scaling to frontier models.

4-bit pretraining requires carefully designed quantization techniques to preserve model effectiveness while handling gradients and updates robustly. NVIDIA describes a purpose-built NVFP4 pretraining recipe that addresses the core challenges of dynamic range, gradient volatility, and numerical stability in large-scale training. Blackwell Ultra is highlighted as the architecture that supports FP4 formats with massive FP4 FLOPs throughput on GB200 and GB300, enabling efficient 4-bit training by accelerating narrow-precision matrix operations while preserving the scale and parallelism needed for convergence. Measured GEMM performance demonstrates notable acceleration, including a 7x speedup over the Hopper generation, and NVIDIA reports that 4-bit pretraining can maintain accuracy close to FP8/BF16 baselines when paired with the right quantization recipe.

Active collaborations and continued engagement around NVFP4 are ongoing with a broad ecosystem of organizations, including Amazon Web Services, Cohere, Google Cloud, Kimi AI, Microsoft AI, Mistral, OpenAI, Perplexity, Reflection, and Runway.
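The memory side of the argument above can be made concrete with back-of-envelope arithmetic: halving the bits per parameter halves the bytes needed to hold the weights. A minimal sketch for a 12B-parameter model, ignoring NVFP4's small per-block scale-factor overhead:

```python
# Back-of-envelope weight storage for a 12B-parameter model at different
# precisions. FP4 packs two 4-bit values per byte; the small per-block scale
# overhead of the real NVFP4 format is ignored here for simplicity.

def weight_gigabytes(n_params: float, bits_per_param: float) -> float:
    """Return gigabytes needed to store n_params at the given bit width."""
    return n_params * bits_per_param / 8 / 1e9

N = 12e9  # 12B parameters
for name, bits in [("BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{name}: {weight_gigabytes(N, bits):.1f} GB")
# BF16: 24.0 GB, FP8: 12.0 GB, FP4: 6.0 GB
```

The same factor-of-two reductions apply to the bandwidth needed to move those weights, which is part of why narrow precision helps communication as well as arithmetic.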
The NVFP4 effort sits in the broader context of building scalable AI factories where compute, data, and engineering choices together determine what models can be trained at scale.
What’s new
NVFP4 is being extended from inference to the pretraining workflow, enabling a 4-bit precision path for large-scale transformer pretraining. A dedicated NVFP4 pretraining recipe has been developed to address the primary obstacles of 4-bit training: dynamic range, gradient volatility, and numerical stability. The architecture and tooling enable efficient narrow-precision training through specialized techniques that keep convergence stable and accuracy on par with FP8/BF16 baselines.

In an illustrative experiment, a 12B Hybrid Mamba-Transformer model was trained from scratch using NVFP4, after an FP8 baseline had shown close alignment with 16-bit results in prior studies. The model was trained on a 10-trillion-token dataset using a phased data-blending approach: Phase 1 used an initial data mix, with the blend switching at roughly the 70% point of training (Phase 2) and again at the 90% point (Phase 3). This configuration demonstrated that NVFP4 can support trillion-token-scale pretraining with stable convergence and robust downstream performance, and the NVFP4-pretrained model matched the FP8 baseline across multiple downstream tasks and intelligence domains.

A key hardware insight comes from Blackwell Ultra, where FP4-based GEMM throughput is significantly higher, enabling the observed acceleration. The 7x GEMM speedup is reported relative to the Hopper generation, a substantial improvement in training throughput that translates to faster iteration cycles and the potential to scale to larger frontier models.
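The phased blending schedule described above can be sketched as a simple lookup on the fraction of the token budget consumed. The phase boundaries (70% and 90% of training) follow the description in this section; the mix names are illustrative placeholders, not the actual dataset composition:

```python
# Sketch of a three-phase data-blending schedule: the blend switches at 70%
# and 90% of the token budget. Mix names are hypothetical placeholders.

TOTAL_TOKENS = 10_000_000_000_000  # 10T-token budget, as in the experiment

PHASES = [
    (0.70, "phase1_mix"),  # first 70% of tokens
    (0.90, "phase2_mix"),  # 70% -> 90%
    (1.00, "phase3_mix"),  # final 10%
]

def current_mix(tokens_seen: int) -> str:
    """Pick the data mix for the current point in training."""
    frac = tokens_seen / TOTAL_TOKENS
    for boundary, mix in PHASES:
        if frac < boundary:
            return mix
    return PHASES[-1][1]  # at or past the end of the budget

print(current_mix(5_000_000_000_000))   # mid-training -> phase1_mix
print(current_mix(9_500_000_000_000))   # last 10%     -> phase3_mix
```

A data loader would consult such a schedule at each step to decide which blend of corpora to sample from; the real recipe's blend contents and exact switch points are not detailed in the source.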
Why it matters (impact for developers/enterprises)
Throughput and efficiency are not abstract metrics in large-scale pretraining; they directly determine which model scales are feasible and how quickly breakthroughs can be validated. NVFP4’s 4-bit pretraining path promises meaningful reductions in memory footprint and compute time, which can translate into shorter training cycles, more experiments per unit of compute, and the ability to train larger models or explore more ambitious architectures. By delivering FP8/BF16-like accuracy with dramatically higher throughput, NVFP4 redefines what AI factories can achieve and aligns with the industry push toward energy-efficient, high-performance AI development. For developers and enterprises, these advances imply faster prototyping, more aggressive experimentation, and potentially lower operating costs during the long tail of large-scale model development. The ongoing collaboration with cloud providers and AI labs signals a path toward broader evaluation, validation, and potential deployment pipelines that leverage 4-bit pretraining as part of scalable AI infrastructure.
Technical details or Implementation
NVIDIA emphasizes that 4-bit pretraining is non-trivial and requires a dedicated recipe to ensure both stability and accuracy. The NVFP4 approach centers on a purpose-built pretraining recipe that addresses the core challenges of 4-bit training: dynamic range, gradient volatility, and numerical stability. Blackwell was the first NVIDIA architecture to natively support FP4 formats, and its high-throughput FP4 operations on GB200/GB300 chips enable efficient 4-bit training by accelerating narrow-precision matrix multiplications while preserving the scale and parallelism needed for convergence on large models.

The NVFP4 pretraining study used a 12B Hybrid Mamba-Transformer model trained on 10 trillion tokens via a phased data-blending scheme. The baseline used FP8 precision, which prior work has shown to closely match 16-bit precision. The same model was then trained from scratch with NVFP4, demonstrating stable convergence without the instabilities typical of ultra-low precision. Validation loss curves for NVFP4 tracked closely with the FP8 baseline throughout training, indicating that 4-bit training dynamics can resemble higher-precision behavior when coupled with an appropriate quantization recipe. Downstream evaluation of the NVFP4-pretrained model showed parity with FP8 across a broad set of domains, reinforcing the potential of 4-bit pretraining to support large-scale frontier models without sacrificing performance. The overall narrative positions NVFP4 as a foundational technology that, once matured, could enable faster scaling of AI factories while maintaining the quality required for production-grade models.
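To make the dynamic-range challenge concrete, the sketch below simulates micro-block scaled FP4 quantization in NumPy. The 16-element block size and the E2M1 value grid follow the public NVFP4 format description, but the scale handling is deliberately simplified: real NVFP4 kernels store each block scale in FP8 (E4M3) with an additional tensor-level FP32 scale, and the full training recipe involves more than this rounding step.

```python
import numpy as np

# Simplified simulation of micro-block FP4 (E2M1) quantization: each block of
# 16 values shares one scale so the block's max maps onto the largest FP4
# magnitude (6.0). Per-block scaling is what lets a 4-bit grid cover the wide
# dynamic range of weights and gradients.

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
BLOCK = 16

def quantize_dequantize(x: np.ndarray) -> np.ndarray:
    """Round each 16-element block of x to the nearest scaled FP4 value."""
    x = x.reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0  # all-zero block: avoid divide-by-zero
    # Nearest-neighbor lookup of each scaled magnitude on the FP4 grid.
    idx = np.abs(np.abs(x) / scale - FP4_GRID[:, None, None]).argmin(axis=0)
    return np.sign(x) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
w_q = quantize_dequantize(w).reshape(-1)
print("max abs quantization error:", np.abs(w - w_q).max())
```

Because the scale is recomputed per 16-element block, an outlier only degrades the resolution of its own block rather than the whole tensor, which is one reason block-scaled formats tolerate the volatile gradient magnitudes seen in pretraining.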
Key takeaways
- NVFP4 expands from inference to pretraining, enabling 4-bit precision in large-scale transformer pretraining while preserving FP8/BF16-like accuracy.
- A 12B Hybrid Mamba-Transformer model trained with NVFP4 on 10 trillion tokens achieved stable convergence and FP8-aligned validation behavior.
- Hardware-accelerated gains were observed, with up to 7x GEMM speedups on Blackwell Ultra compared with Hopper; implications for training throughput are substantial.
- The NVFP4 pretraining recipe targets core challenges of 4-bit training—dynamic range, gradient volatility, and numerical stability—and remains in the research phase with active collaborations across major AI organizations.
- Downstream performance after NVFP4 pretraining matches FP8 baselines across multiple domains, supporting its potential for scalable, efficient frontier-model development.
FAQ
- What is NVFP4 in this context?
  NVFP4 is a 4-bit precision format used for pretraining to accelerate large-scale transformer training while aiming to preserve FP8/BF16-like accuracy.
- What model size and data scale were involved in the demonstration?
  A 12B Hybrid Mamba-Transformer model trained on roughly 10 trillion tokens using a phased data-blending approach.
- What hardware performance was observed?
  Measured GEMM performance on Blackwell Ultra showed up to a 7x speedup over the Hopper generation.
- Is NVFP4 pretraining production-ready?
  Not yet; the work is in the research phase with ongoing collaborations, and broader validation and deployment considerations are forthcoming.
- How does NVFP4 affect accuracy?
  In the reported study, NVFP4 achieved validation losses and downstream performance that matched the FP8 baseline across tested domains.
References
- https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/