Welcome GPT OSS: OpenAI's Open-Source GPT-OSS MoE Models Arrive on Hugging Face
Source: https://huggingface.co/blog/welcome-openai-gpt-oss

TL;DR

  • GPT OSS introduces two open-weight, mixture-of-experts models: GPT OSS 120B (gpt-oss-120b) and GPT OSS 20B (gpt-oss-20b), both using MXFP4 4-bit quantization; the MoE design keeps the number of active parameters per token low, speeding inference.
  • The 120B model fits on a single H100 GPU; the 20B model runs on systems with as little as 16 GB of RAM, enabling consumer hardware and on-device deployments.
  • The models are Apache 2.0 licensed with a minimal usage policy, aiming for safe, responsible, and democratic use while maximizing user control; access is via Hugging Face Inference Providers.
  • The release is integrated with OpenAI-compatible interfaces (the Responses API) and supports various deployment options (Azure, Dell, on-prem) through partner ecosystems. It also emphasizes tooling and performance enhancements via the kernels library, vLLM's Flash Attention 3 kernels, and MXFP4 support across CUDA generations.

Context and background

OpenAI has published GPT OSS as a much-anticipated open-weights release designed for strong reasoning, agentic tasks, and versatile developer use cases. The team highlights two core models under the GPT OSS umbrella: a large model with approximately 117B parameters (gpt-oss-120b) and a smaller model with about 21B parameters (gpt-oss-20b). Both use a mixture-of-experts (MoE) architecture and a 4-bit quantization scheme named MXFP4; the MoE routing keeps the number of active parameters per token low, which speeds inference while keeping resource usage comparatively modest. The 120B model fits on a single Nvidia H100 GPU, while the 20B model is designed to run within 16 GB of memory, making it particularly suitable for consumer hardware and on-device applications. The release aligns with OpenAI's broader mission to broaden AI's benefits through open ecosystems, and Hugging Face frames the move as a meaningful step for the community. Access is facilitated via Hugging Face's Inference Providers, which lets users send requests to supported providers using standard JavaScript or Python code; this infrastructure powers the OpenAI demo on gpt-oss.com and can be integrated into users' own projects, as sketched below.
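
A minimal request sketch using the Hugging Face Hub client, assuming the Hub model id "openai/gpt-oss-120b" and that an Inference Provider serving it is enabled for your account; adjust the model id and parameters as needed:

```python
# Query gpt-oss-120b through Hugging Face Inference Providers (assumed model id).
from huggingface_hub import InferenceClient

client = InferenceClient(model="openai/gpt-oss-120b")

response = client.chat_completion(
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```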

What’s new

OpenAI GPT OSS marks a notable expansion in the landscape of open-source AI models for reasoning-based tasks. Key highlights include:

  • Two models, gpt-oss-120b (117B parameters) and gpt-oss-20b (21B parameters), both mixture-of-experts (MoE) models with MXFP4 4-bit quantization for efficient inference.
  • Runs on consumer and enterprise hardware: the 20B model on GPUs with as little as 16 GB of memory using MXFP4; the 120B model on a single H100 GPU with MXFP4; multi-GPU deployments supported via accelerate or torchrun.
  • Licensing and policy: Apache 2.0 license with a minimal usage policy emphasizing safe, responsible, and democratic use with strong user controls.
  • Rich tooling ecosystem: transformers (v4.55.1+), accelerate, and kernels are required; Triton 3.4+ is recommended for MXFP4 on CUDA. With this setup, optimized MXFP4 kernels are downloaded on first use, delivering substantial memory savings (see the loading sketch after this list).
  • Inference and performance optimizations: vLLM’s Flash Attention 3 kernels with sink attention are packaged and integrated; recommended for Hopper cards (H100/H200) with PyTorch 2.7 or 2.8. AMD ROCm support is included via kernels to broaden hardware compatibility.
  • Integration and deployment options: GPT OSS is verified on AMD Instinct hardware and is supported via Azure AI Model Catalog and Dell Enterprise Hub for secured, enterprise-grade deployments; on-prem deployments are facilitated by optimized containers and Dell hardware integration.
  • Additional tooling: native MXFP4 support in Llama.cpp with Flash Attention across Metal, CUDA, and Vulkan via llama-server; Hugging Face Space demonstrates AMD hardware compatibility.
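
As referenced above, here is a minimal loading sketch with transformers, assuming the Hub model id "openai/gpt-oss-20b"; on supported CUDA GPUs with the kernels library and Triton 3.4+ the optimized MXFP4 path is used, otherwise the weights load in bf16:

```python
# Load gpt-oss-20b with the transformers text-generation pipeline (assumed model id).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # picks the appropriate dtype/quantized path for the hardware
    device_map="auto",    # places the model on the available GPU(s)
)

messages = [{"role": "user", "content": "Summarize MXFP4 quantization in one sentence."}]
outputs = generator(messages, max_new_tokens=128)
print(outputs[0]["generated_text"][-1]["content"])  # last message is the assistant reply
```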

Why it matters (impact for developers/enterprises)

The GPT OSS release is positioned to empower developers and enterprises to incorporate large, reasoning-focused language models into real-world pipelines without relying solely on opaque, closed APIs. Key implications include:

  • On-device and on-prem deployment paths expand options for private data handling, compliance, and latency-sensitive use cases. The 20B model’s 16 GB memory footprint unlocks consumer hardware and edge deployments, while the 120B model remains accessible on high-end GPUs in data centers.
  • Open licensing (Apache 2.0) together with a minimal usage policy lowers the barriers to experimentation and integration, enabling teams to build, fine-tune, and deploy with fewer IP restrictions.
  • The combination of ML optimizations (MXFP4 4-bit quantization, MoE, and accelerated kernels) and broad hardware support (CUDA generations, AMD ROCm, Hopper-based acceleration) is designed to deliver practical throughput improvements for real-time inference scenarios.
  • Enterprise ecosystems are supported through major partners and catalog integrations (Azure AI Model Catalog, Dell Enterprise Hub), enabling secure deployments, autoscaling, and monitoring within established corporate infrastructure.
  • The emphasis on tool use in reasoning tasks, including the ability to structure outputs with explicit reasoning traces and channels, reflects a practical approach to evaluation and responsible usage, helping teams calibrate prompts and metrics for complex reasoning tasks.

Technical details and implementation

  • Architecture and quantization: Both models use a mixture-of-experts (MoE) architecture and employ MXFP4, a 4-bit quantization scheme; the MoE routing reduces the number of active parameters per token while the quantization keeps memory use low, together enabling faster inference. The large model (gpt-oss-120b) is reported to fit on a single H100 GPU with MXFP4, while the smaller model (gpt-oss-20b) can run on GPUs with 16 GB of memory using MXFP4. If MXFP4 is unavailable or the GPU is not compatible, the weights can be loaded in unpacked bf16 form.
  • Hardware and software stack: The models require transformers (v4.55.1+), accelerate, and kernels. For MXFP4 on CUDA hardware, Triton 3.4+ and the kernels library are recommended; optimized MXFP4 kernels are downloaded automatically on first use. This combination yields substantial memory savings and enables 20B inference within roughly 16 GB of GPU memory, which covers consumer cards such as the RTX 3090 and 4090 as well as hosted platforms like Colab and Kaggle.
  • Kernel and acceleration options: The vLLM project provides optimized Flash Attention 3 kernels that support sink attention, and Hugging Face integrates these kernels for performance gains. On Hopper-based GPUs, users can install the latest kernels and download pre-compiled kernel code from the kernels-community repository as described in the open-source workflows.
  • GPU compatibility and fallbacks: If a GPU supports MXFP4, that is the recommended path. If not, MegaBlocks MoE kernels can be used instead, but they require bf16 weights and therefore consume more memory. AMD Instinct (e.g., MI300-series) support is available through the AMD ROCm kernels, broadening hardware coverage.
  • Deployment patterns: GPT OSS can be run in server configurations via transformers serve, or across 2-GPU/4-GPU setups using accelerate. The example snippets demonstrate how to launch a server on two H100 GPUs and how to make requests via the OpenAI-compatible Responses API or the standard Completions API, so existing tooling and prompts can be adapted with minimal changes; a request sketch follows this list.
  • Integration and fine-tuning: The GPT OSS models are fully integrated with TRL (Hugging Face's Transformer Reinforcement Learning library) workflows, and ship with fine-tuning examples using SFTTrainer to help developers get started; a minimal sketch also follows this list.
  • Ecosystem-level deployment: Hugging Face collaborates with Azure on the Azure AI Model Catalog to bring the models to secured endpoints for managed online deployments, and with Dell on optimized containers that enable on-prem deployments with enterprise-grade security. The models are also available via Hugging Face Inference Providers with an OpenAI-compatible Responses API, enabling flexible chat-style interactions.
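
A hedged sketch of querying a served model through the OpenAI-compatible interface, for example a local server started with transformers serve or vLLM; the base URL, port, and model id are assumptions to adapt to your deployment:

```python
# Send a chat request to a local OpenAI-compatible server (assumed URL, port, and model id).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "List three agentic use cases for open-weight models."}],
    max_tokens=200,
)
print(completion.choices[0].message.content)
```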
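
And a minimal supervised fine-tuning sketch with TRL's SFTTrainer; the dataset name and hyperparameters below are illustrative placeholders, not values from the release:

```python
# Fine-tune gpt-oss-20b with TRL's SFTTrainer on an example chat dataset (placeholder values).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # any chat-format dataset works

trainer = SFTTrainer(
    model="openai/gpt-oss-20b",
    train_dataset=dataset,
    args=SFTConfig(output_dir="gpt-oss-20b-sft", per_device_train_batch_size=1),
)
trainer.train()
```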

Tables: quick facts

| Model | Parameters | Typical memory / GPU | Notes |
|---|---:|---|---|
| GPT OSS 120B | ~117B | Fits on a single H100 with MXFP4 | MoE, 4-bit quantization; multi-GPU support via accelerate/torchrun |
| GPT OSS 20B | ~21B | ~16 GB with MXFP4; ~48 GB in bf16 fallback | Consumer hardware friendly; on-device deployment possible |

Why it matters (summary for developers and enterprises)

GPT OSS represents a concrete move toward open, auditable, and deployable AI tools for reasoning tasks. The combination of MoE architecture, 4-bit MXFP4 quantization, and broad hardware support creates a practical path for researchers and engineers to experiment with large-scale reasoning models without sacrificing privacy or control. The licensing and ecosystem support from Hugging Face, along with cloud and on-prem deployment options, lowers barriers to adoption in enterprise contexts where latency, data locality, and governance are critical. The alignment with Azure and Dell deployments demonstrates a focus on enterprise-grade deployment pipelines, autoscaling, and security features while maintaining accessibility for independent developers and smaller teams.

Key takeaways

  • GPT OSS delivers two open-weight MoE models with 4-bit MXFP4 quantization to balance performance and resource usage.
  • The larger model fits on a single H100, while the smaller model can run on typical consumer GPUs with 16 GB RAM.
  • Apache 2.0 licensing and a minimal usage policy aim to broaden access while emphasizing safety and responsibility.
  • Inference providers, optimized kernels, and multi-GPU deployment options enable flexible production use across cloud, edge, and on-prem environments.
  • AMD ROCm support and Llama.cpp MXFP4 integration widen hardware compatibility; enterprise partnerships (Azure, Dell) support managed deployments.

FAQ

  • What are the GPT OSS models and their sizes?

    GPT OSS includes two open-weight models: gpt-oss-120b (~117B parameters) and gpt-oss-20b (~21B parameters). Both use Mixture-of-Experts and MXFP4 quantization.

  • What hardware is required to run these models effectively?

    The 120B model can run on a single H100 GPU with MXFP4. The 20B model can run on GPUs with 16 GB RAM using MXFP4, with bf16 as a fallback if MXFP4 is unavailable.

  • How do I access and deploy these models?

    They are accessible via Hugging Face Inference Providers and are integrated with the OpenAI-compatible Responses API. Deployments can be on Azure AI Model Catalog and Dell Enterprise Hub, among other environments.

  • What software stack is required for optimal performance?

    You should use transformers (v4.55.1+), accelerate, and kernels; Triton 3.4+ is recommended for MXFP4 on CUDA. With the kernels library installed, optimized MXFP4 kernels are downloaded automatically on first use. If MXFP4 is not used, MegaBlocks MoE kernels are an alternative, at the cost of higher (bf16) memory usage.

  • Are there any notes about evaluation or generation for these models?

    GPT OSS models are reasoning models that require large generation sizes for evaluations. They emit a reasoning trace in dedicated output channels; parsing should strip the reasoning trace and keep only the final answer before computing metrics, as in the sketch below. The 20B model has published scores for certain evaluation benchmarks under these conditions.
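
    A hypothetical post-processing sketch for keeping only the final answer before metric computation; the channel marker string is an assumption about the output format, not a confirmed token name:

```python
# Strip the reasoning trace and keep the final-channel text (marker string is an assumption).
def extract_final_answer(raw_output: str,
                         final_marker: str = "<|channel|>final<|message|>") -> str:
    """Return the text after the last final-channel marker, or the full text if absent."""
    if final_marker in raw_output:
        return raw_output.split(final_marker)[-1].strip()
    return raw_output.strip()
```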
