Welcome GPT OSS: OpenAI’s Open-Source 120B and 20B MoE Models Arrive

Source: https://huggingface.co/blog/welcome-openai-gpt-oss

TL;DR

  • OpenAI released GPT OSS, a new open-source model family with two MoE models: gpt-oss-120b (120B parameters) and gpt-oss-20b (20B parameters).
  • Both are mixture-of-experts (MoE) architectures quantized to 4 bits with MXFP4; the MoE design keeps the active parameters per token low and the quantization keeps the memory footprint small, enabling fast inference with modest resources.
  • The 20B model can run on GPUs with 16 GB of RAM; the 120B model fits on a single H100 GPU. Both are accessible via Hugging Face Inference Providers and licensed under Apache 2.0 with a minimal usage policy.
  • The release supports on-demand deployment through OpenAI-compatible APIs (Responses API) and is integrated with Azure and Dell offerings for enterprise deployments.
  • The models leverage tool use during reasoning, require long generation lengths for evaluation, and come with guidance on how to evaluate them and how to handle reasoning traces in outputs.

Context and background

OpenAI’s GPT OSS represents a significant step in the company’s stated mission to make the benefits of AI broadly accessible and to contribute to the open-source ecosystem. The release brings two open-weight MoE models to the community: a large 120B-parameter model and a smaller 20B-parameter model. Both are quantized with MXFP4, a four-bit scheme that shrinks the memory footprint during inference and thus lowers resource requirements while preserving performance across a range of tasks. According to the release, the models are licensed under Apache 2.0 with a minimal usage policy that emphasizes safe, responsible, and democratic use while preserving developer control. You can learn more in the Hugging Face blog post announcing GPT OSS and in the accompanying demonstrations at gpt-oss.com.

What’s new

GPT OSS introduces two open-weight models: gpt-oss-120b and gpt-oss-20b. Both are mixture-of-experts architectures, so only a subset of parameters is active per token, and both ship quantized with MXFP4 (4-bit), which keeps the memory footprint small enough for fast inference on modest hardware. The 120B model fits on a single H100 GPU, while the 20B model runs on consumer-grade hardware with as little as 16 GB of RAM, expanding on-device and on-premise use cases.

The models are served through Hugging Face’s Inference Providers, so requests can be sent to OpenAI-compatible backends using the same Python or JavaScript code you’d use with other providers. The blog post highlights an OpenAI-compatible Responses API, designed for more flexible and intuitive chat interactions, and provides examples with the Fireworks AI provider.

On the software side, the integration relies on a recent transformers release (v4.55.1 or later), accelerate, and kernels, and Triton 3.4+ is recommended to unlock MXFP4 kernels on CUDA hardware; if MXFP4 is not available for a given GPU, a bfloat16 fallback is used. The post also discusses optimized attention kernels (Flash Attention 3 with sink attention) and the option to use MegaBlocks MoE kernels on certain hardware combinations, noting their memory trade-offs.

In terms of ecosystem and tooling, the models have been verified on AMD Instinct hardware, with initial ROCm kernel support announced in the Kernels library, a Hugging Face Space for AMD hardware testing, and ongoing work to broaden GPU compatibility and kernel coverage. The post explains how to run multi-GPU setups (for example with four GPUs) using accelerate or torchrun and provides practical snippets for local experimentation with transformers serve and the Responses API. The models are designed to use tools during reasoning and to support a variety of enterprise deployments via the Azure and Dell ecosystems.
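For orientation, here is a minimal sketch of that Inference Providers flow using the standard OpenAI Python client (chat completions rather than the Responses API). The router URL and the provider-suffixed model identifier follow the blog’s examples but may change, so treat them as illustrative and check the model card for current values.

```python
import os

from openai import OpenAI

# Point the standard OpenAI client at Hugging Face's Inference Providers router.
# The base URL and model ID below follow the blog's examples and may change;
# they are illustrative rather than canonical.
client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],  # a Hugging Face access token
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b:fireworks-ai",  # ":fireworks-ai" pins the provider
    messages=[
        {"role": "user", "content": "Explain MXFP4 quantization in two sentences."}
    ],
)

print(response.choices[0].message.content)
```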

Why it matters (impact for developers/enterprises)

  • Accessibility and deployment choices: The 20B model’s 16 GB RAM requirement makes on-device and consumer hardware deployments viable, enabling private or local deployments without a large-scale data-center footprint. The 120B model, while larger, can be run on a single H100 GPU, with scalable multi-GPU options for broader workloads. This balance broadens the audience for high-capability AI models.
  • Open-source licensing and governance: Apache 2.0 licensing paired with a minimal usage policy aims to maximize community control and responsible usage, aligning with a broader open-source ethos while framing acceptable use within legal contexts.
  • Ecosystem integrations: Availability via Hugging Face Inference Providers and compatibility with OpenAI’s Responses API enables developers to build apps with familiar interfaces while leveraging open weights. Enterprise integrations with Azure AI Model Catalog and the Dell Enterprise Hub extend deployment options to managed, enterprise-grade environments.
  • Hardware and software ecosystem momentum: The release ties into a broader hardware-aware inference stack—MXFP4 quantization, Flash Attention 3, and optimization with kernels—while supporting AMD ROCm and NVIDIA CUDA hardware. This reflects ongoing collaboration across runtimes, kernels, and accelerators to maximize performance.
  • Research and evaluation emphasis: The GPT OSS release emphasizes reasoning capabilities that rely on large generation sizes for evaluation. The blog provides guidance on filtering reasoning traces when computing metrics, highlighting careful evaluation practices for reasoning-enabled models.
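As a concrete illustration of that evaluation point, the sketch below strips the analysis (reasoning) channel from raw model output before scoring and keeps only the final channel. The channel and message marker strings are hypothetical placeholders: the real delimiters are defined by the model’s response format, so take them from the model card rather than from this example.

```python
def extract_final_channel(
    raw_output: str,
    channel_marker: str = "<|channel|>",   # hypothetical delimiter; see the model card
    message_marker: str = "<|message|>",   # hypothetical delimiter; see the model card
) -> str:
    """Return only the 'final' channel text, dropping reasoning traces.

    The marker strings are illustrative assumptions, not the confirmed format.
    """
    final_text = raw_output
    # Walk the channel segments and keep the body of the last 'final' one.
    for segment in raw_output.split(channel_marker):
        name, _, body = segment.partition(message_marker)
        if name.strip() == "final":
            final_text = body
    return final_text


# Example: score only the user-visible answer, not the analysis trace.
raw = (
    "<|channel|>analysis<|message|>Let me reason step by step...\n"
    "<|channel|>final<|message|>The answer is 42."
)
print(extract_final_channel(raw))  # -> "The answer is 42."
```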

Technical details and implementation

  • Model family and quantization: GPT OSS consists of two MoE models, gpt-oss-120b and gpt-oss-20b, quantized with MXFP4 (4-bit). The MoE design keeps the number of active parameters per token low, and the 4-bit quantization shrinks the memory footprint, together enabling faster execution and lower resource usage on compatible hardware.
  • Hardware requirements and deployment options: The 20B model runs on GPUs with 16 GB of RAM, including consumer cards such as certain RTX models and cloud environments like Colab or Kaggle. The 120B model fits on a single H100 GPU, with options to scale across multiple GPUs using accelerate or torchrun. If MXFP4 isn’t available for a given GPU, the models load in a bfloat16 unpacked form from the quantized weights.
  • Software stack and optimization: The project builds on the Transformers library (v4.55.1+), accelerate, and kernels. It recommends Triton 3.4+ to enable optimized MXFP4 kernels on CUDA hardware and points to the vLLM-optimized Flash Attention 3 kernels, including support for sink attention. For Hopper-class GPUs, the post notes tested performance with PyTorch 2.7 and 2.8 and provides guidance on installing updated kernels to pull pre-compiled optimized code from the kernels-community repository. A minimal local-inference sketch follows this list.
  • Alternative kernel paths: If MXFP4 is unavailable, MegaBlocks MoE kernels are suggested, though these require running the model in bfloat16 and come with higher memory usage than MXFP4. The post emphasizes choosing MXFP4 when the GPU supports it, with MegaBlocks as a fallback option.
  • Tool use and evaluation guidance: The GPT OSS models are trained to leverage tool use as part of their reasoning. The blog includes a dedicated evaluation setup using lighteval and notes the importance of parsing the reasoning trace correctly to avoid parsing errors in metrics, particularly for math and instruction-following evaluations. For the 20B model, specific evaluation numbers are cited (IFEval: 69.5 ± 1.9; AIME25: 63.3 ± 8.9 at pass@1).
  • Output structure and safety: GPT OSS structures its outputs into channels, typically an analysis channel (the reasoning trace) and a final channel (the answer). The recommended practice is to surface only the final channel content in user-visible responses, keeping the reasoning trace for cases where it is explicitly needed, such as training or analysis, and otherwise separate from user-deliverable text.
  • Experimental and enterprise pathways: The models are accessible via Hugging Face Inference Providers, and there are examples of running servers with multiple GPUs, including two-H100 setups, with Python-based or JavaScript-based clients. The blog points to dedicated model cards, a guide, and a Python snippet to illustrate simple inference workflows.
  • Ecosystem partners and rollout: In addition to Hugging Face, the release aligns with Azure and Dell for enterprise deployments. The GPT OSS models are available in the Azure AI Model Catalog (GPT OSS 20B and GPT OSS 120B) and Dell Enterprise Hub for on-prem deployments. These integrations illustrate a path from open weights to secured managed endpoints and on-prem infrastructure.
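To complement the stack notes above, here is the minimal local-inference sketch referenced in the software bullet. It assumes transformers v4.55.1+ is installed; with Triton 3.4+ the MXFP4 kernels are picked up automatically, and otherwise the weights are unpacked to bfloat16, which needs more memory.

```python
from transformers import pipeline

# Minimal local-inference sketch for the 20B model. With a recent transformers
# release the MXFP4 checkpoint is used directly when compatible kernels are
# present; otherwise the weights fall back to bfloat16.
generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",  # place layers on the available GPU(s)
)

messages = [
    {"role": "user", "content": "Summarize what MXFP4 quantization changes at inference time."},
]

outputs = generator(messages, max_new_tokens=256)
# The chat pipeline returns the conversation with the assistant turn appended.
print(outputs[0]["generated_text"][-1]["content"])
```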

Key takeaways

  • GPT OSS delivers two open-weight MoE models, 120B and 20B, with MXFP4 4-bit quantization to balance performance and resource needs.
  • The 20B model runs on consumer hardware with 16 GB RAM; the 120B model can fit on a single H100, with scalable multi-GPU options.
  • Apache 2.0 licensing with a minimal usage policy promotes open, responsible use while preserving user control.
  • Inference is facilitated through Hugging Face Inference Providers and supports an OpenAI-compatible Responses API for flexible chat interactions.
  • The release integrates with Azure and Dell for enterprise deployments and includes AMD ROCm support and CUDA-accelerated kernel paths, with ongoing optimization for various GPUs.

FAQ

  • What models are included in GPT OSS?

    The GPT OSS family includes two MoE models: gpt-oss-120b and gpt-oss-20b.

  • What quantization is used, and why does it matter?

    The models use MXFP4, a 4-bit quantization scheme that shrinks the memory footprint of the weights, enabling faster inference and lower resource usage on compatible hardware.

  • What hardware is required to run GPT OSS?

    The 20B model can run on GPUs with 16 GB of RAM, while the 120B model fits on a single H100 GPU; multi-GPU configurations are supported for larger workloads using accelerate or torchrun (a minimal multi-GPU sketch appears after this FAQ).

  • Under what license are the models released?

    The models are released under the Apache 2.0 license with a minimal usage policy.

  • How can developers access and deploy GPT OSS?

    Access is provided via Hugging Face Inference Providers, with an OpenAI-compatible Responses API, and enterprise deployments are supported through Azure AI Model Catalog and Dell Enterprise Hub.
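To ground the multi-GPU answer above, here is a minimal sketch, assuming a machine whose combined GPU memory can hold the 120B weights. device_map="auto" relies on accelerate to shard the layers across visible GPUs, which is one way to realize the configurations mentioned (the blog also describes torchrun-based launches); the prompt and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Multi-GPU sketch for the 120B model: device_map="auto" lets accelerate shard
# the layers across all visible GPUs. Prompt and generation settings are
# illustrative.
model_id = "openai/gpt-oss-120b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",  # shard across the available GPUs
)

messages = [{"role": "user", "content": "Give one use case for open-weight MoE models."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```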
