OpenAI GPT OSS: Open-source MoE models (120B/20B) with MXFP4 under Apache 2.0
Sources: https://huggingface.co/blog/welcome-openai-gpt-oss, Hugging Face Blog
Overview
GPT OSS is a highly anticipated open-weights release from OpenAI, designed for strong reasoning, agentic tasks, and versatile developer use cases. It comprises two mixture-of-experts (MoE) models: gpt-oss-120b with 117B parameters and gpt-oss-20b with 21B parameters. Both use the MXFP4 4-bit quantization scheme, enabling fast inference (thanks to the small number of active parameters per token) while keeping resource usage low. The larger model fits on a single H100 GPU, while the smaller one runs within 16GB of memory and is well suited to consumer hardware and on-device applications.
GPT OSS is released under the Apache 2.0 license, with a minimal usage policy that emphasizes safe, responsible, and democratic use while giving users control over how they deploy and use the models. By using gpt-oss, you agree to comply with all applicable law. OpenAI describes this release as a meaningful step toward broad AI accessibility, and Hugging Face welcomes OpenAI to the community.
The models are accessible via Hugging Face's Inference Providers service, using the same infrastructure as the OpenAI demo on gpt-oss.com. They are designed to run in a variety of environments with flexible tooling: single-GPU inference, multi-GPU setups via accelerate or torchrun, consumer hardware, and enterprise endpoints.
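As a quick illustration, here is a minimal local-inference sketch with transformers; the model identifiers are the published openai/gpt-oss-20b and openai/gpt-oss-120b checkpoints, while the prompt and generation settings are illustrative assumptions:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # use "openai/gpt-oss-120b" on a single H100-class GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)
# MXFP4 weights load automatically when the required kernels/Triton stack is present;
# otherwise transformers falls back to bfloat16, which increases memory use.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain MXFP4 quantization in one paragraph."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))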
Key features
- Two open-weight models: gpt-oss-120b (117B parameters) and gpt-oss-20b (21B parameters).
- Mixture-of-experts (MoEs) with MXFP4 4-bit quantization for memory savings and faster inference.
- Large model fits on a single H100 GPU; 20B runs on ~16GB RAM, suitable for consumer hardware and on-device deployments.
- Apache 2.0 license with a minimal usage policy; emphasis on safe, responsible, and democratic use; you agree to comply with applicable law.
- Accessible via Hugging Face Inference Providers; OpenAI-compatible Responses API support for flexible chat-style interactions.
- Integration readiness: transformers, accelerate, and kernels are recommended, with Triton 3.4+ for MXFP4 support on CUDA hardware.
- Hardware compatibility: MXFP4 kernels were originally targeted at Hopper/Blackwell but now also run on Ada, Ampere, and Tesla GPUs; AMD Instinct is supported through ROCm kernels; MegaBlocks MoE kernels offer a faster path when MXFP4 isn't available.
- On Hopper (H100/H200) cards, upgrading the kernels library enables the optimized MXFP4 kernels; attention sinks are supported through Flash Attention 3 (via the vLLM kernels).
- The 120B model can run on a single GPU and also scales across multiple GPUs with accelerate or torchrun; default parallelization plans exist in the Transformers ecosystem (sketched after this list).
- Llama.cpp offers native MXFP4 support with Flash Attention across the Metal, CUDA, and Vulkan backends; both the 120B and 20B GGUF builds can be served with llama-server (also sketched after this list).
- The stack supports tool use and fine-tuning workflows via trl and its SFTTrainer; enterprises can deploy through the Azure AI Model Catalog and the Dell Enterprise Hub.
- On the hardware side, GPT OSS has been verified on AMD Instinct GPUs, with initial ROCm kernel support available and MegaBlocks MoE kernel acceleration on AMD MI300-series hardware.
- GPT OSS is a reasoning-focused family: outputs may include a reasoning trace, so evaluations should use large generation budgets to capture the trace and then filter it out before scoring.
- Output is structured into channels (e.g., analysis and final); for user-facing responses you typically render only the final channel when tools are not used (see the filtering sketch after this list).
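As referenced in the multi-GPU item above, here is a hedged sketch of multi-GPU inference using a default Transformers parallelization plan; the script name, GPU count, and tp_plan usage are assumptions about your setup (accelerate launch works similarly):
# save as run_tp.py (hypothetical name) and launch with, e.g.:
#   torchrun --nproc_per_node=4 run_tp.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# tp_plan="auto" asks transformers to apply its default tensor-parallel sharding plan
# across the launched processes.
model = AutoModelForCausalLM.from_pretrained(model_id, tp_plan="auto", torch_dtype="auto")

inputs = tokenizer("Summarize GPT OSS in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))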
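For the llama.cpp path mentioned above, a hedged sketch of serving a GGUF build with llama-server; the repository names assume the ggml-org conversions, and additional flags (context size, host/port) are version-dependent:
llama-server -hf ggml-org/gpt-oss-20b-GGUF
Use ggml-org/gpt-oss-120b-GGUF for the larger model.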
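And for the channel-structured output noted in the last two items, a hedged sketch of keeping only the final channel from a raw harmony-formatted completion; the channel markers are assumptions about the raw token format, and chat-template decoding may already strip them for you:
def extract_final_channel(raw_text: str) -> str:
    # Assumed harmony-style markers: <|channel|>final<|message|> ... terminated by <|return|> or <|end|>.
    marker = "<|channel|>final<|message|>"
    if marker not in raw_text:
        return raw_text  # no explicit channels; return the text unchanged
    final_part = raw_text.split(marker, 1)[1]
    # Trim at whichever terminator appears first, if any.
    for terminator in ("<|return|>", "<|end|>"):
        if terminator in final_part:
            final_part = final_part.split(terminator, 1)[0]
    return final_part.strip()

# Evaluation harnesses would score only this final text, dropping the analysis (reasoning) trace.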
Common use cases
- Private/local deployments and on-device inference on consumer hardware.
- Real-time endpoints for customer-facing chat and agentic tasks in enterprise environments.
- Tool use and reasoning tasks that require extended generation with a focus on reasoning steps.
- Fine-tuning and experimentation with SFTTrainer in trl to adapt the models to domain tasks (see the fine-tuning sketch after this list).
- Deployments via cloud or on-prem partners (Azure AI Model Catalog, Dell Enterprise Hub).
- Running on AMD ROCm environments with initial ROCm kernel support and on NVIDIA CUDA hardware with MXFP4 and Flash Attention 3 where supported.
- Evaluation workflows that require large generation sizes to capture reasoning traces before producing final results.
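For the fine-tuning use case above, a hedged sketch using trl's SFTTrainer; the dataset identifier and hyperparameters are placeholders rather than values from the release, and PEFT/LoRA can be layered on top via peft_config:
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset: swap in any chat- or text-formatted dataset supported by SFTTrainer.
dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="gpt-oss-20b-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
)
trainer = SFTTrainer(
    model="openai/gpt-oss-20b",  # SFTTrainer loads the checkpoint from the Hub
    args=config,
    train_dataset=dataset,
)
trainer.train()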
Setup & installation
# Install prerequisites (per the release guidance; quote version specifiers so the shell does not interpret them)
pip install --upgrade "transformers>=4.55.1" accelerate
pip install --upgrade kernels
pip install --upgrade "triton>=3.4"
Note: On Hopper cards (H100/H200), upgrade the kernels library and follow the release notes to enable the optimized MXFP4 kernels.
Note: Attention sinks via Flash Attention 3 are provided through the kernels library (the vllm-flash-attn3 kernels published by the kernels-community organization on the Hub) and are enabled when loading the model rather than via a separate pip package, as sketched below.
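A hedged sketch of enabling them when loading a model; the attn_implementation string points at the kernels-community repository on the Hub, and actual support depends on your GPU and library versions:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
    # Pulls the Flash Attention 3 kernels (with attention-sink support) through the kernels library.
    attn_implementation="kernels-community/vllm-flash-attn3",
)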
Quick server or API-based usage can go through Hugging Face Inference Providers; make sure you have an API token and use the openai/gpt-oss-20b or openai/gpt-oss-120b model identifier.
Quick start
Minimal runnable example querying a GPT OSS model through Hugging Face Inference Providers (using the OpenAI-compatible chat completions route; exact provider routing may vary):
import os
import requests
# OpenAI-compatible chat completions endpoint for Hugging Face Inference Providers
API_URL = "https://router.huggingface.co/v1/chat/completions"
headers = {"Authorization": f"Bearer {os.environ.get('HF_API_TOKEN')}"}
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain how GPT OSS uses MoE and MXFP4 quantization."}],
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
This sends a basic chat request to a model hosted via Inference Providers. Replace the model identifier with openai/gpt-oss-120b as needed and supply your token.
Pros and cons
- Pros:
  - Open-weight release under Apache 2.0; two model sizes offer trade-offs between latency and capacity.
  - MoE with MXFP4 quantization enables memory savings and fast inference on supported hardware.
  - The large model fits on a single H100; the 20B runs in about 16GB of memory, enabling consumer hardware and on-device deployments.
  - Broad hardware support (CUDA with MXFP4, Flash Attention 3, ROCm on AMD, and optimizations via vLLM kernels).
  - Integrations with Hugging Face Inference Providers, an OpenAI-compatible Responses API, and enterprise deployment paths (Azure, Dell).
  - Fine-tuning and tooling support via trl and SFTTrainer; ready for enterprise workflows.
- Cons:
  - The models are designed as reasoning models and may need large generation budgets for evaluation and inference quality.
  - Some optimizations (MXFP4, Flash Attention 3) require compatible hardware and library versions (e.g., Triton 3.4+) to realize the memory and speed benefits.
  - If the GPU/software stack does not support MXFP4, a bfloat16 fallback is used, increasing the memory footprint.
  - Reasoning traces must be filtered from outputs to avoid parsing issues in metric calculations.
Alternatives (brief comparisons)
| Approach | Key characteristic | Pros | Cons |
|---|---|---|---|
| GPT OSS (MoE + MXFP4) | Open-weight 120B/20B MoE with MXFP4 quantization | Memory efficiency; fast inference; single-GPU runs; Apache 2.0 | Requires a compatible hardware and software stack for MXFP4; specialized setup may be needed |
| MegaBlocks MoE kernels | MoE kernel acceleration without MXFP4 | Works when MXFP4 is not available; can improve speed on some GPUs | Higher memory usage without MXFP4; needs bfloat16 |
| Llama.cpp with MXFP4 | Native MXFP4 support with Flash Attention across backends | Broad backend compatibility; simple deployment paths | May require integration work with specific model families |
| Cloud/OpenAI API path | Hosted OpenAI API alternatives | Simpler management; no local infra | Ongoing usage costs; data leaves the local environment |
Pricing or License
- License: Apache 2.0, with a minimal usage policy. By using gpt-oss, you agree to comply with all applicable law. The release emphasizes safety, responsibility, and democratic access while maximizing user control over deployment and usage. The weights themselves carry no license fee; costs come from the compute or hosted inference you choose.
References
- OpenAI GPT OSS announcement, Hugging Face Blog: https://huggingface.co/blog/welcome-openai-gpt-oss
More resources
- CUDA Toolkit 13.0 for Jetson Thor: Unified Arm Ecosystem and More. Unified CUDA toolkit for Arm on Jetson Thor with full memory coherence, multi-process GPU sharing, OpenRM/dmabuf interoperability, NUMA support, and better tooling across embedded and server-class targets.
- Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap. Leverage GPU memory swap (model hot-swapping) to share GPUs across multiple LLMs, reduce idle GPU costs, and improve autoscaling while meeting SLAs.
- Improving GEMM Kernel Auto-Tuning Efficiency with nvMatmulHeuristics in CUTLASS 4.2. Introduces nvMatmulHeuristics to quickly select a small set of high-potential GEMM kernel configurations for CUTLASS 4.2, drastically reducing auto-tuning time while approaching exhaustive-search performance.
- Make ZeroGPU Spaces faster with PyTorch ahead-of-time (AoT) compilation. Learn how PyTorch AoT compilation speeds up ZeroGPU Spaces by exporting a compiled model once and reloading instantly, with FP8 quantization, dynamic shapes, and careful integration with the Spaces GPU workflow.
- Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training. Guide to fine-tuning gpt-oss with SFT + QAT to recover FP4 accuracy while preserving efficiency, including upcasting to BF16, MXFP4, NVFP4, and deployment with TensorRT-LLM.
- How Small Language Models Are Key to Scalable Agentic AI. Explores how small language models enable cost-effective, flexible agentic AI alongside LLMs, with NVIDIA NeMo and Nemotron Nano 2.