NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit

Source: NVIDIA Developer Blog, https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/

Overview

AI workloads have grown exponentially in recent years, not only in the deployment of large language models (LLMs) but also in the demand to process tokens during pretraining and post-training. As organizations scale compute infrastructure to train and deploy multi-billion-parameter foundation models, token throughput becomes mission critical: AI factories are increasingly defined by how many tokens they can push through to unlock the next wave of model capabilities.

AI-optimized data formats are a key innovation in this effort. Narrow-precision computation has already transformed inference, moving from FP32 and FP16 down to FP8 and, most recently, NVIDIA's NVFP4, a 4-bit format purpose-built to deliver exceptional inference latency, throughput, and efficiency while maintaining production-grade accuracy. Post-training quantization (PTQ) has shown NVFP4 to be a force multiplier for inference throughput at comparable accuracy, but a major challenge remains upstream in pretraining, where foundation models still rely on BF16 or FP8 for stability and convergence.

Training is where AI factories spend the bulk of their compute, power, and time. Power budgets are fixed and GPU cycles are scarce, so developers must account for every bit, token, and epoch; throughput directly determines what scale of models can be built, how many experiments can be run, and how quickly breakthroughs arrive. This is where 4-bit precision becomes transformative. NVIDIA is now extending NVFP4 to the pretraining phase, which is a foundational shift in how large models can be trained at scale rather than an incremental optimization. In the era of AI factories, where compute is the engine of progress, precision is no longer a backend detail; it is a strategic advantage.

NVFP4 pretraining is still in the research phase, exploring and validating the potential of 4-bit precision for large-scale model pretraining. Active engagements and continued collaboration around NVFP4 are ongoing with leading organizations such as Amazon Web Services, Cohere, Google Cloud, Kimi AI, Microsoft AI, Mistral, OpenAI, Perplexity, Reflection, and Runway.

4-bit quantization reduces the precision of model weights and activations to just 4 bits, a dramatic drop from the typical 16-bit or 32-bit floating-point formats. Pretraining at 4 bits is challenging because gradients and updates must be handled very carefully to preserve accuracy while improving overall training speed, and specialized techniques and recipes are required to map high-precision tensors onto a much smaller set of quantized values without losing effectiveness.
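For intuition, here is a minimal PyTorch sketch of what this mapping looks like as simulated ("fake") quantization: split a tensor into small blocks, scale each block so its largest magnitude fits the FP4 (E2M1) grid, and snap every value to the nearest representable point. This is an illustration of the general idea only, not NVIDIA's recipe or kernels; the block size, the nearest-neighbor search, and keeping block scales in full precision are simplifying assumptions.

```python
# Illustrative sketch only: simulates NVFP4-style block quantization in PyTorch
# ("fake quantization" in full precision), not NVIDIA's actual kernels or recipe.
# Assumptions: the FP4 (E2M1) value grid {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6},
# a per-block scale over 16 contiguous elements, and block scales kept as floats.
import torch

E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Quantize-dequantize x so it only holds values a scaled FP4 grid can represent."""
    flat = x.reshape(-1, block_size)                    # split into contiguous blocks
    scale = flat.abs().amax(dim=1, keepdim=True) / 6.0  # map each block's max to FP4 max (6.0)
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    scaled = flat / scale
    # Snap each scaled value to the nearest representable E2M1 magnitude.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    quant = torch.sign(scaled) * E2M1_GRID[idx]
    return (quant * scale).reshape(x.shape)             # dequantize back to the input shape

# The quantization error stays small relative to each block's own scale.
w = torch.randn(4, 64)
print((w - fake_quant_nvfp4(w)).abs().mean().item())
```

In actual NVFP4 training the quantized values and their block scales feed FP4 tensor cores directly rather than being dequantized in software as done here.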
By cutting memory needs, boosting arithmetic throughput, and reducing communication overhead, 4-bit pretraining allows factories to push significantly more tokens through the same hardware. With the right quantization recipe, it can deliver accuracy on par with FP8/BF16 while dramatically raising throughput, unlocking faster convergence cycles, more experiments per unit of compute, and scaling to frontier models of unprecedented size.

To enable pretraining at 4-bit precision, NVIDIA developed a purpose-built NVFP4 pretraining recipe that addresses the core challenges of dynamic range, gradient volatility, and numerical stability in large-scale training. The recipe combines several techniques chosen for their performance and accuracy, because narrow-precision formats are practical in large-scale pretraining only if they preserve both model accuracy and stable convergence.

Blackwell is the first NVIDIA architecture to natively support FP4 formats. The massive FP4 FLOPs throughput of GB200 and GB300 enables efficient 4-bit training by accelerating narrow-precision matrix operations while maintaining the scale and parallelism needed for large-model convergence, making these systems ideal for next-generation AI factories deploying FP4-based pretraining. Figure 1 shows measured GEMM performance with Blackwell Ultra, a 7x speedup over the Hopper generation. Modern LLMs rely on matrix multiplication, particularly in their fully connected (linear) layers, as a core computational element, so the efficiency of these operations is crucial. Because FP4 precision executes these operations faster and more efficiently, the observed GEMM acceleration means the entire pretraining process, from forward propagation to gradient updates, runs significantly faster, reducing time-to-train and enabling rapid development of larger models.

To assess the viability of 4-bit precision in large-scale training, experiments compared FP8 and NVFP4 on a 12-billion-parameter model built on a hybrid Mamba-Transformer architecture (12B Hybrid Mamba-Transformer model), similar to NVIDIA Nemotron Nano 2. The model was trained on 10 trillion tokens using a phased data-blending approach: the dataset mix switches at 70% of pretraining for the second phase and again at 90% for the third phase.

A version of the 12B Hybrid Mamba-Transformer model was first trained with 8-bit precision (FP8), which previous studies have shown closely matches 16-bit precision, and therefore served as the baseline for comparison. The same 12B model was then trained from scratch using NVFP4, demonstrating that this new low-precision format can support full pretraining at trillion-token scale. The NVFP4 run converged stably, without the training instabilities or divergence issues that typically plague ultra-low-precision training. Figure 3 shows that NVFP4's validation loss curve closely matches the FP8 baseline throughout training; the quantization techniques outlined above keep the 4-bit training dynamics close to those of higher-precision runs even with aggressive bit-width reduction.
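For reference, the phased data blending used in this experiment can be thought of as a simple schedule over training progress. The sketch below assumes only the phase boundaries stated above (a switch in the dataset mix at 70% and again at 90% of training); the dataset names and mixture weights are made-up placeholders, not the blend NVIDIA used.

```python
# Hypothetical illustration of a phased data-blending schedule. Only the phase
# boundaries (70% and 90% of training) come from the post; the datasets and
# weights below are placeholders.
def blend_for_progress(progress: float) -> dict[str, float]:
    """Return dataset mixture weights for the current fraction of training completed."""
    if progress < 0.70:    # phase 1: initial pretraining mix
        return {"web": 0.70, "code": 0.20, "math": 0.10}
    if progress < 0.90:    # phase 2: switch to a different blend
        return {"web": 0.50, "code": 0.30, "math": 0.20}
    return {"web": 0.30, "code": 0.35, "math": 0.35}  # phase 3: final blend

# Example: query the blend at a few points in a 10-trillion-token run.
for tokens_seen in (1e12, 7.5e12, 9.5e12):
    print(f"{tokens_seen:.1e} tokens -> {blend_for_progress(tokens_seen / 10e12)}")
```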
We then compared the NVFP4-pretrained 12B Hybrid Mamba-Transformer model to the higher-precision FP8 baseline across a range of downstream tasks and intelligence domains. Figure 4 shows that NVFP4 matches the performance of FP8 across all domains, highlighting its effectiveness. This finding strengthens the initial hypothesis: NVFP4 is a robust choice for pretraining LLMs even at trillion-token scale, underscoring its potential for efficient large-scale frontier-model training.

NVIDIA's NVFP4 format is redefining the landscape of AI training, setting a new benchmark for speed, efficiency, and purposeful innovation. By enabling 4-bit pretraining, NVFP4 empowers AI factories to scale more rapidly and sustainably, paving the way for the next era of generative AI. As a dynamic and evolving technology, NVFP4 continues to unlock new opportunities for teams building frontier models, driving progress in energy-efficient, high-performance AI. With its breakthrough in compute efficiency, 4-bit pretraining opens the door to more advanced architectures, larger training runs, and significantly more tokens, fueling the future of intelligent systems.

Key features

  • 4-bit NVFP4 pretraining recipe designed to address dynamic range, gradient volatility, and numerical stability in large-scale training.
  • Blackwell is the first NVIDIA architecture with native FP4 support; its substantial FP4 FLOPs throughput on GB200 and GB300 accelerates narrow-precision matrix operations.
  • Measured GEMM performance with Blackwell Ultra shows a 7x speedup over Hopper, driving end-to-end pretraining acceleration (see the linear-layer sketch after this list).
  • Throughput and memory reductions enable significantly more tokens to be processed on the same hardware, supporting trillion-token-scale pretraining.
  • Demonstrated stability and convergence during 4-bit pretraining on a 12B Hybrid Mamba-Transformer model trained on 10 trillion tokens, using phased data blending.
  • Baseline comparison against FP8 shows similar validation loss trajectories and downstream task performance across domains.
  • Collaborations with major players (AWS, Cohere, Google Cloud, Kimi AI, Microsoft AI, Mistral, OpenAI, Perplexity, Reflection, Runway) to explore NVFP4 in real-world settings.
  • 4-bit precision reduces memory and increases arithmetic throughput while maintaining production-grade accuracy in pretraining contexts.
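
As referenced in the GEMM bullet above, the sketch below shows where block-quantized FP4 values would enter a model's linear layers during the forward pass. It reuses the fake_quant_nvfp4 helper from the earlier sketch (an illustrative assumption, not an NVIDIA API); a real NVFP4 run executes the matmul itself on FP4 tensor cores and also needs careful handling of the backward pass, neither of which this simulation models.

```python
# Sketch of a linear layer whose weights and activations are (fake-)quantized to
# the FP4 grid before the matmul. The matmul here still runs in the original
# precision, so this models only the numerics, not the Blackwell FP4 speedup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quant_nvfp4(self.weight)   # block-quantize the weight matrix
        x_q = fake_quant_nvfp4(x)             # block-quantize the activations
        return F.linear(x_q, w_q, self.bias)  # the GEMM that FP4 tensor cores would accelerate

layer = FakeQuantLinear(1024, 4096)
out = layer(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 4096])
```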

Common use cases

  • Pretraining large language models at scale (trillion-token regimes) to improve throughput and infrastructure efficiency.
  • AI factories seeking to maximize tokens processed per unit of compute while preserving training stability and accuracy.
  • Research and development of frontier-model architectures that require efficient, scalable pretraining pipelines.

Setup & installation

Not specified in the provided source. Setup and installation commands are not described in the NVIDIA Dev Blog excerpt.

# Setup and installation commands are not provided in the source.

Quick start

Not provided in the source. A minimal runnable example is not specified.

# Quick start not provided in the source

Pros and cons

  • Pros:
    • Substantial reductions in memory and increases in arithmetic throughput enable larger token throughput in pretraining.
    • FP4-based pretraining can achieve stable convergence similar to FP8 baselines across multiple downstream tasks.
    • 7x GEMM speedup reported on Blackwell Ultra versus Hopper, improving overall pretraining speed.
    • Maintains production-grade accuracy while operating in 4-bit precision.
    • Enables trillion-token-scale training with dedicated 4-bit recipes and data-blending strategies.
    • Active collaborations indicate industry validation and interest.
  • Cons:
    • Training at 4-bit precision is still in the research phase and requires specialized techniques and recipes.
    • Validation here is on a 12B model with a specific architecture and dataset; broad generalization to all models is not stated.
    • Inference-focused benefits of NVFP4 exist, but the current emphasis is on pretraining; broader deployment implications are still being explored.

Alternatives (brief comparisons)

| Option | Focus | Throughput vs. accuracy | Notes |
|---|---|---|---|
| NVFP4 (4-bit pretraining) | 4-bit pretraining with FP16-like accuracy | High throughput; matches FP8 on downstream tasks | Research-phase; specialized recipe required |
| FP8 (baseline for pretraining) | 8-bit precision; higher precision than 4-bit | Lower memory footprint than FP16 but higher than 4-bit | Served as the baseline for comparison; stability established in prior studies |
| BF16 / FP16 (pretraining stability) | Higher precision; traditional baselines | Stable and well understood, but higher memory/compute needs | Not the focus of NVFP4; mentioned as comparison points for stability and convergence |

Pricing or License

Not specified in the source.
