Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training
Sources: https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/ (NVIDIA Developer Blog)
TL;DR
- NVIDIA demonstrates a practical fine-tuning workflow for gpt-oss that uses high-precision supervised fine-tuning (SFT) followed by quantization aware training (QAT) to recover accuracy in FP4 while preserving deployment efficiency.
- The workflow upcasts to BF16 for SFT, then applies QAT to return to MXFP4 precision, enabling both alignment and low-precision deployment benefits.
- In evaluation, the recipe raised pass rates on two downstream tasks from baselines of 16% and 30% to 98% after SFT + QAT.
- NVFP4, a newer FP4 format designed for training and inference on NVIDIA Blackwell, shows consistently better validation loss (2–3% improvement) and promises tighter convergence for deeper reasoning tasks.
- The MXFP4 recipe can be adapted to NVFP4 with a single line change; upcoming NVFP4 support in TensorRT-LLM will broaden adoption across frameworks.
- The end-to-end workflow is implemented in the NVIDIA Model Optimizer repository, with a convenience script to export to standard PyTorch checkpoints and deployment paths via TensorRT-LLM.
Context and background
Open-source foundation model releases have energized the AI community with architectural innovations and new capabilities. The gpt-oss family is the first open-source model suite released by OpenAI since GPT-2, delivering a mixture-of-experts (MoE) architecture, a 128K context length, and adjustable reasoning depth. The largest variant, gpt-oss-120B, achieves performance on open benchmarks comparable to OpenAI's closed-source o3 and o4 models.

Despite strong benchmark results, deploying foundation models in production, especially in low-fault-tolerance domains such as healthcare and finance, typically requires post-training techniques to optimize performance and reliability, and the model's native MXFP4 precision poses unique fine-tuning challenges. NVIDIA's analysis highlights that stable, accurate fine-tuning of gpt-oss directly in FP4 is not yet established, which motivates a two-stage approach: upcast to higher precision to stabilize gradient accumulation, run supervised fine-tuning (SFT) at that precision, then apply QAT to return to FP4 while preserving task-specific performance. This SFT + QAT workflow aims to deliver both alignment and deployment efficiency. The workflow is anchored in practical tooling: the Model Optimizer repository provides the complete recipe, and the approach builds on Hugging Face's gpt-oss-recipes, OpenAI Cookbook datasets, and established NVIDIA tooling such as the second-generation NVIDIA Transformer Engine. The goal is to recover accuracy in FP4 while retaining the throughput benefits that make low-precision deployments attractive for production systems.
What’s new
- The core recommendation is to perform high-precision fine-tuning first (BF16) to stabilize gradients, then apply QAT to bring the model back to MXFP4 precision (a minimal sketch follows this list). Skipping the high-precision stage and going straight to QAT tends to yield lower accuracy.
- Evaluation on two downstream tasks shows dramatic improvements: non-English reasoning using a multilingual dataset from the OpenAI Cookbook, and reducing unnecessary refusals of safe user prompts using Amazon's FalseReject dataset. Pre-recipe scores were 16% and 30%, respectively; post-recipe pass rates reached 98% on both tasks.
- The NVIDIA team compares MXFP4 with NVFP4, noting that NVFP4 generally converges more reliably and yields 2–3% better validation loss across tasks. NVFP4 is designed for FP4 training and inference and leverages the second-generation NVIDIA Transformer Engine for higher performance.
- The MXFP4 recipe can be updated to NVFP4 with a single line change, demonstrating a straightforward migration path as NVFP4 support expands in TensorRT-LLM and other frameworks.
- With NVIDIA Blackwell, NVFP4 enables up to 15 PFLOPS of FP4 compute on Ultra-class hardware, offering tighter convergence and more headroom for stricter accuracy thresholds and deeper reasoning. E4M3 FP8 scaling during the fake-quantization phase helps the base weights adapt more effectively to the target precision.
- After finishing the recipe, a convenience script in the Model Optimizer repository exports the BF16-trained checkpoint to MXFP4, and the resulting MXFP4 checkpoints have been tested with upstream SGLang, TensorRT-LLM, and vLLM. Deployment can be performed with TensorRT-LLM 1.1.0rc1.
- The central challenge remains: recover accuracy in FP4 while preserving the efficiency advantages of low precision. The proposed path—upcast to BF16 for SFT, then apply QAT—addresses this gap by adapting weights to the low-precision target while reinforcing task-specific behavior.
- Looking ahead, NVFP4 support in TensorRT-LLM and other open-source inference frameworks will broaden adoption, enabling NVFP4 with the same SFT + QAT workflow for greater accuracy in gpt-oss deployments.
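To make the two-stage sequence concrete, here is a minimal sketch assuming Hugging Face Transformers for the BF16 SFT stage and NVIDIA TensorRT Model Optimizer (modelopt) for the QAT stage. The config name, the `run_sft` and `calibrate` helpers, and the exact upcast path are placeholders rather than the recipe from the Model Optimizer repository; consult the repository for the authoritative scripts.

```python
# Minimal sketch of the two-stage recipe: high-precision SFT, then QAT.
# Assumptions: Hugging Face Transformers for loading/training and
# modelopt.torch.quantization for fake quantization. Config names and the
# run_sft / calibrate helpers are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM
import modelopt.torch.quantization as mtq

# Stage 1: upcast the native MXFP4 checkpoint to BF16 and run supervised
# fine-tuning (the recipe's exact upcast path may differ).
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b", torch_dtype=torch.bfloat16
)
run_sft(model, train_dataset)  # hypothetical helper: your usual BF16 SFT loop

# Stage 2: insert FP4 fake-quantization ops, then keep training (QAT) so the
# BF16 weights adapt to the low-precision target.
qat_config = mtq.MXFP4_DEFAULT_CFG  # name assumed; check your modelopt release
model = mtq.quantize(model, qat_config, forward_loop=calibrate)  # calibrate: hypothetical
run_sft(model, train_dataset, lr=1e-5)  # QAT phase, typically shorter, lower LR
```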
Why it matters (impact for developers/enterprises)
For developers and enterprises, the ability to deploy powerful open-source models like gpt-oss in FP4 while preserving or enhancing accuracy offers a compelling ROI. The combination of SFT and QAT helps recover task-specific performance without sacrificing the efficiency gains of low-precision inference. In safety-sensitive domains, improved alignment and reduced refusals translate into more usable and trustworthy AI systems. As hardware advances, the introduction of NVFP4 could unlock even greater accuracy gains when paired with QAT. NVIDIA’s Blackwell architecture and accompanying tooling—such as the second-generation Transformer Engine and TensorRT-LLM—are positioned to deliver tighter convergence and larger margins for stricter thresholds and deeper reasoning in production deployments. The ability to adapt MXFP4 checkpoints to NVFP4 with minimal code changes lowers the barrier to adoption and accelerates deployment timelines.
Technical details or Implementation
- Core workflow: upcast to BF16 for SFT, followed by QAT to MXFP4 for deployment. This sequence stabilizes gradient accumulation at higher precision and then adapts weights for the target low-precision format.
- Hyperparameters and training duration for QAT can be tuned; skipping the high-precision step reduces final accuracy, so a high-precision fine-tuning phase is recommended before QAT.
- The two key downstream evaluation tasks demonstrate the practical impact of this approach:

| Task | Baseline | Post-workflow pass rate |
| --- | --- | --- |
| Non-English reasoning (OpenAI Cookbook multilingual dataset) | 16% | 98% |
| Safe-prompt refusals (Amazon FalseReject dataset) | 30% | 98% |
- For MXFP4 to NVFP4 migration, a single-line code update is sufficient to adapt the recipe (see the sketch after this list), after which NVFP4 validation loss consistently improves by 2–3% across tasks.
- NVFP4 is a precision format designed for FP4 training and inference, letting developers tap up to 15 PFLOPS of FP4 compute on NVIDIA Blackwell Ultra for greater efficiency and accuracy. Its E4M3 FP8 scaling helps minimize quantization error in the forward pass, aiding weight adaptation.
- The standardized path includes exporting the trained BF16 checkpoint to MXFP4 via the Model Optimizer's convenience script, followed by deployment through validated stacks such as TensorRT-LLM, SGLang, and vLLM (a deployment sketch follows this list).
- The described workflow aligns with ongoing efforts to integrate gpt-oss NVFP4 support into NVIDIA TensorRT-LLM and other open-source inference frameworks, signaling broader accessibility once NVFP4 support is fully available.
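The "single-line change" described above is, under reasonable assumptions, just swapping the quantization config handed to Model Optimizer. The identifiers below are illustrative and may differ across releases.

```python
# Hypothetical illustration of the MXFP4 -> NVFP4 switch: only the config
# object passed to mtq.quantize changes. Config names are assumptions;
# check your Model Optimizer release for the exact identifiers.
import modelopt.torch.quantization as mtq

# qat_config = mtq.MXFP4_DEFAULT_CFG   # original MXFP4 recipe (name assumed)
qat_config = mtq.NVFP4_DEFAULT_CFG     # NVFP4 variant (name assumed)
model = mtq.quantize(model, qat_config, forward_loop=calibrate)  # calibrate: hypothetical helper
```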
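Once the QAT checkpoint has been exported to MXFP4 with the repository's convenience script, serving it is conventional. A rough sketch using the TensorRT-LLM Python LLM API follows; the checkpoint path is a placeholder and the API surface may vary slightly between TensorRT-LLM releases.

```python
# Rough deployment sketch with the TensorRT-LLM LLM API (the article cites
# TensorRT-LLM 1.1.0rc1). The model path is a placeholder for the exported
# MXFP4 checkpoint produced by the Model Optimizer convenience script.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/gpt-oss-120b-qat-mxfp4")  # placeholder path
outputs = llm.generate(
    ["Summarize quantization aware training in one sentence."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```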
Key takeaways
- A two-stage fine-tuning path (SFT in higher precision followed by QAT to FP4) effectively recovers accuracy for gpt-oss in deployment-friendly precision.
- The workflow delivers dramatic improvements on targeted tasks, with 98% pass rates after the recipe on both evaluated tasks.
- NVFP4 offers potential accuracy and convergence benefits over MXFP4, with better validation loss and alignment for tougher tasks.
- Migration from MXFP4 to NVFP4 is streamlined, requiring only a single-line update in the code path.
- The NVIDIA Model Optimizer repository provides end-to-end tooling to export, validate, and deploy the resulting checkpoints in production environments.
FAQ
- What is the core idea behind QAT in this workflow? Quantization Aware Training adapts model weights to the target FP4 precision while preserving the accuracy gained during high-precision training (a conceptual sketch follows this FAQ).
- Why upcast to BF16 before QAT? Upcasting stabilizes gradient accumulation during fine-tuning, enabling more reliable subsequent QAT to recover FP4 accuracy.
- What are MXFP4 and NVFP4? Both are FP4 precision formats for model weights and computation; MXFP4 is the format gpt-oss ships in natively, while NVFP4 is a newer FP4 format designed for training and inference on NVIDIA Blackwell hardware.
- How can I deploy the fine-tuned model? After convergence, export to a standard PyTorch checkpoint via the Model Optimizer convenience script, then deploy with frameworks such as TensorRT-LLM (the article references TensorRT-LLM 1.1.0rc1).
- Where can I find the complete recipe? The complete SFT + QAT recipe is provided in the NVIDIA Model Optimizer repository, and it is noted as adaptable to NVFP4 as framework support arrives.
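For readers new to QAT, the following is a purely conceptual sketch of the fake-quantization idea behind it, not the Model Optimizer implementation: the forward pass snaps weights onto an FP4 (E2M1) value grid, while a straight-through estimator lets gradients update the underlying high-precision weights. Block-scaling details (such as the E4M3 scales mentioned above) are omitted.

```python
# Conceptual illustration of fake quantization with a straight-through
# estimator (STE). Not the Model Optimizer implementation; block scaling
# and per-tensor calibration are intentionally left out.
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, levels):
        # Snap each weight to the nearest representable low-precision level.
        idx = torch.argmin((w.unsqueeze(-1) - levels).abs(), dim=-1)
        return levels[idx]

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: treat quantization as the identity.
        return grad_out, None

# Positive half of the FP4 E2M1 value grid; negatives added by symmetry.
fp4_levels = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
levels = torch.cat([-fp4_levels.flip(0), fp4_levels])

w = torch.randn(8, requires_grad=True)
loss = FakeQuantSTE.apply(w, levels).sum()
loss.backward()   # gradients flow to the full-precision weights
print(w.grad)     # all ones, thanks to the STE
```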