Delivering 1.5 M TPS Inference on NVIDIA GB200 NVL72: OpenAI gpt-oss Models From Cloud to Edge

Sources: https://developer.nvidia.com/blog/delivering-1-5-m-tps-inference-on-nvidia-gb200-nvl72-nvidia-accelerates-openai-gpt-oss-models-from-cloud-to-edge, developer.nvidia.com

TL;DR

  • NVIDIA and OpenAI optimized the gpt-oss-120b and gpt-oss-20b open-weight models for accelerated inference on the NVIDIA Blackwell architecture, delivering up to 1.5 million tokens per second (TPS) on a GB200 NVL72 system.
  • The gpt-oss models are text-reasoning LLMs with chain-of-thought and tool-calling, built on a Mixture of Experts (MoE) design with SwiGLU activations and RoPE attention across a 128k context window.
  • The models are released in FP4 precision, which fits on a single 80 GB data center GPU and is natively supported by Blackwell. Training used NVIDIA H100 GPUs, with gpt-oss-120b requiring over 2.1 million hours and gpt-oss-20b roughly 10x less.
  • NVIDIA collaborations span Hugging Face Transformers, Ollama, and vLLM, with TensorRT-LLM optimized kernels and deployment tooling. Day 0 performance is supported on Blackwell and Hopper platforms.
  • Dynamo, TensorRT-LLM, and NIM microservices provide flexible, low-latency deployment options from cloud to edge, including 32k ISL improvements and disaggregated serving.

Context and background

NVIDIA and OpenAI began pushing the boundaries of AI with the early launch of NVIDIA DGX systems in 2016. The collaboration has continued into OpenAI gpt-oss-20b and gpt-oss-120b, now optimized for accelerated inference on the NVIDIA Blackwell architecture. This work delivers up to 1.5 million tokens per second on a GB200 NVL72 rack-scale system, illustrating the pace of innovation from cloud to edge deployments.

The gpt-oss models are text-reasoning LLMs with chain-of-thought capabilities and tool-calling features. They use a Mixture of Experts (MoE) design with SwiGLU activations. Attention layers employ RoPE with a 128k context window, alternating between a full-context pass and a sliding 128-token window. The models are released in FP4 precision, which fits on a single 80 GB data-center GPU and is natively supported by Blackwell. Training occurred on NVIDIA H100 Tensor Core GPUs, with the gpt-oss-120b model requiring over 2.1 million training hours and the gpt-oss-20b model requiring about 10x fewer.

NVIDIA collaborated with OpenAI and the broader open-source community to maximize performance and validate accuracy. In addition to Hugging Face Transformers, Ollama, and vLLM, NVIDIA contributed optimized kernels and model enhancements via TensorRT-LLM. The result is a cohesive deployment path that covers the software platform from development to production, enabling Day 0 performance on primary NVIDIA platforms. The launch emphasizes a holistic deployment story from data center to edge, built on tools and ecosystems that developers already rely on: vLLM for spinning up OpenAI-compatible web servers, and Docker-based deployment via the NVIDIA/TensorRT-LLM GitHub repository with pre-downloaded model checkpoints from Hugging Face. Day 0 performance was demonstrated across Blackwell and Hopper platforms, underscoring NVIDIA's objective of high-throughput, low-cost-per-token inference on next-generation GPUs. Developers can use TensorRT-LLM through a Python API in JupyterLab, or deploy optimized models via NVIDIA Dynamo for long-input scenarios, as described in the accompanying deployment guides and cookbook resources.
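The workflows above assume the model checkpoints are available locally. As a minimal sketch (not taken from the post itself), the Hugging Face Hub client can pre-download a checkpoint before building a TensorRT-LLM engine or starting a server; the repo id and local path below are assumptions:

```python
# Minimal sketch: pre-download a gpt-oss checkpoint from Hugging Face
# before serving it with TensorRT-LLM, vLLM, or Transformers.
# The repo id and local directory are assumptions, not values from the post.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="openai/gpt-oss-120b",           # assumed Hugging Face repo id
    local_dir="./checkpoints/gpt-oss-120b",  # any writable local directory
)
print(f"Checkpoint files available at {local_path}")
```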

What’s new

The primary advancement is the delivery of up to 1.5 million tokens per second for the gpt-oss-120b model on a GB200 NVL72 rack, enabling roughly 50,000 concurrent users. This performance marks a significant extension of inference throughput for large open-weight models on the Blackwell platform and demonstrates effective scaling across the hardware stack, including the 72 Blackwell GPUs acting as a single device via fifth-generation NVIDIA NVLink and NVLink Switch. Key new elements include:

  • FP4 precision across the models, with native hardware support on Blackwell, allowing each model to fit on a single 80 GB data-center GPU.
  • A second-generation Transformer Engine with FP4 Tensor Cores to improve throughput for large models.
  • Integration with TensorRT-LLM for optimized kernels and model enhancements, along with open-source tooling and workflows that include vLLM, Hugging Face Transformers, Ollama, and related ecosystems.
  • Deployment options and tooling that span from cloud to edge, including Docker deployment guides, the OpenAI Cookbook integration, and pre-packaged NVIDIA NIM microservices for enterprise deployments.
  • A Dynamo-based deployment path that disaggregates the inference pipeline to optimize long input sequences, delivering a 4x improvement in interactivity at 32k input sequence length (ISL) on Blackwell.
  • Day-0 support for gpt-oss-120b and gpt-oss-20b across NVIDIA Blackwell and Hopper platforms, with optimized accuracy and performance verified in collaboration with OpenAI and the community.

These advancements also highlight a broad ecosystem approach: developers can run the models on professional workstations with RTX PRO GPUs or GeForce RTX AI PCs, access pre-packaged, portable NVIDIA NIM microservices, and test configurations via the NVIDIA API Catalog or OpenAI Cookbook examples; a minimal client sketch follows.
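Because NIM microservices and the NVIDIA API Catalog expose an OpenAI-compatible API, the standard openai Python client can be pointed at them. The base URL, model id, and API-key handling below are assumptions used to illustrate the pattern, not values from the post:

```python
# Minimal sketch: query a gpt-oss endpoint through an OpenAI-compatible API.
# Base URL, model id, and key handling are assumptions; substitute the values
# shown in the NVIDIA API Catalog or by your own NIM deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed API Catalog endpoint
    api_key="YOUR_NVIDIA_API_KEY",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # assumed catalog model id
    messages=[{"role": "user", "content": "Summarize the GB200 NVL72 rack architecture."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The same client code should also work against a locally hosted NIM container or a vLLM server by changing `base_url`, since both expose OpenAI-compatible endpoints.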

Why it matters (impact for developers/enterprises)

For developers and enterprises building AI-powered applications, the ability to run large open-weight models with high throughput and low token cost is transformative. The combination of 1.5M TPS on a single GB200 NVL72 rack and FP4 efficiency reduces the per-token cost while expanding latency-friendly use cases across cloud and edge environments. Day 0 support across Blackwell and Hopper platforms helps shorten time-to-value for organizations seeking rapid deployment of gpt-oss models. Additionally, the MoE-based gpt-oss models offer scalable inference that benefits from specialized activation and attention mechanisms, enabling more capable reasoning and tool use in production.

The ecosystem around deployment tools, including Dynamo for disaggregated serving, TensorRT-LLM for optimized kernels, and NIM microservices for portable deployment, helps enterprises tailor serving architectures to their data-privacy, latency, and throughput requirements. From a deployment perspective, developers can rely on established workflows: vLLM for web servers, Hugging Face Transformers integrations for model management, and Docker-based environments to simplify provisioning. Combined with RTX-based workstations and cloud-ready pipelines, these tools support a spectrum of use cases, from research prototyping to production-grade services.

Technical details or Implementation

This work centers on a set of architectural and software optimizations designed to maximize inference performance for gpt-oss-120b and gpt-oss-20b on Blackwell. Core technical elements include:

  • Model design: text-reasoning LLMs with chain-of-thought and tool-calling, built on a Mixture of Experts (MoE) design with SwiGLU activations. Attention uses RoPE with a 128k context, alternating between full-context processing and a sliding 128-token window.
  • Precision and hardware: models released in FP4 precision, fitting on 80 GB GPUs, with native FP4 support on Blackwell. Training used H100 GPUs, with the 120b model requiring over 2.1 million hours and the 20b model roughly 10x less.
  • Hardware architecture: 2nd-generation Transformer Engine with FP4 Tensor Cores and fifth-generation NVLink/NVLink Switch, enabling 72 Blackwell GPUs to act as a single processor for inference workloads.
  • Software and frameworks: NVIDIA optimized kernels via TensorRT-LLM, with integration through Hugging Face Transformers, Ollama, and vLLM. Deployment guidance and model downloads are provided via the NVIDIA/TensorRT-LLM GitHub repository and the OpenAI Cookbook, with model checkpoints sourced from Hugging Face.
  • Deployment patterns: developers can run TensorRT-LLM through a Python API in JupyterLab (a minimal sketch follows this list), or deploy pre-packaged NIM microservices in enterprise environments. The Dynamo option enables disaggregated inference, improving performance for long sequences without increasing GPU budgets.
  • 32k ISL improvements: Dynamo enables a 4x improvement in interactivity at 32k ISL on Blackwell through disaggregated serving, LLM-aware routing, elastic autoscaling, and low-latency data transfer.
  • Platforms and accessibility: models can run on RTX PRO workstations or GeForce RTX AI PCs with minimum VRAM requirements, and enterprises can access NIM microservices through the NVIDIA API Catalog UI or the OpenAI Cookbook guides.
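As referenced in the deployment-patterns bullet above, TensorRT-LLM exposes a high-level Python LLM API that can be driven from a notebook. The following is a minimal sketch under the assumption that the checkpoint is addressable by a Hugging Face repo id; consult the TensorRT-LLM deployment guide for the engine and parallelism options appropriate to gpt-oss on Blackwell or Hopper:

```python
# Minimal sketch of the TensorRT-LLM Python (LLM) API, e.g. inside JupyterLab.
# The model id and sampling settings are assumptions; see the TensorRT-LLM
# deployment guide for gpt-oss-specific configuration.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b")  # assumed repo id or local checkpoint path

prompts = ["Explain why FP4 precision reduces the memory footprint of a 120B model."]
sampling = SamplingParams(max_tokens=128, temperature=0.7)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```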

Key facts and a quick spec table

| Model | Size (B) | Context window | Inference note | Training hours (approx.) |
|---|---:|---|---|---:|
| gpt-oss-120b | 120 | 128k RoPE context, full vs. sliding window | Up to 1.5M tokens/s on GB200 NVL72 | >2.1M hours |
| gpt-oss-20b | 20 | 128k RoPE context, full vs. sliding window | Not specified for TPS here | ~1/10 of 120b hours |

  • High-throughput serving: The 1.5M TPS for the 120b model on GB200 NVL72 demonstrates the capacity of the Blackwell-based rack to handle demanding workloads, with an estimated 50,000 concurrent users.
  • Deployment ecosystem: TensorRT-LLM provides optimized kernels; vLLM can be used to spin up an OpenAI-compatible web server (see the sketch after this list); the NVIDIA Dynamo platform supports disaggregated inference and LLM-aware routing for longer inputs.
  • Developer experience: The integration with the Transformers library, together with Docker containers and the OpenAI Cookbook guide, provides end-to-end workflows from model download to production deployment.
  • Availability and accessibility: The optimized models are packaged as NVIDIA NIM microservices for deployment on GPU-accelerated infrastructure, with enterprise-focused options accessible via the NVIDIA API Catalog and developer guides.
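As noted in the deployment-ecosystem bullet, vLLM can serve the models behind an OpenAI-compatible web server, for example by starting it with `vllm serve openai/gpt-oss-120b` (model id assumed). The streaming client below is a minimal sketch; the port is vLLM's default and the parameters are assumptions:

```python
# Minimal sketch: stream tokens from a local vLLM OpenAI-compatible server.
# Assumes the server was started with something like: vllm serve openai/gpt-oss-120b
# Port 8000 is vLLM's default; the model id is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "List three use cases for tool-calling LLMs."}],
    max_tokens=200,
    stream=True,  # stream deltas to observe interactivity/latency directly
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```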

Key takeaways

  • High-throughput inference from cloud to edge: 1.5M TPS on GB200 NVL72 for gpt-oss-120b demonstrates data-center scale, while RTX-based systems extend deployment to the edge.
  • FP4 efficiency and substantial training investment: The models' FP4 precision, natively supported by Blackwell, and the more than 2.1 million H100 training hours behind gpt-oss-120b underpin the performance gains.
  • Comprehensive deployment toolkit: TensorRT-LLM, vLLM, Dynamo, and NIM microservices cover development to production paths.
  • Long-sequence optimization: Dynamo’s disaggregated inference delivers notable gains at 32k ISL.
  • Day 0 availability across platforms: Blackwell and Hopper support ensures early access to optimized models in diverse environments.

FAQ

  • What are gpt-oss models?

    OpenAI gpt-oss-120b and gpt-oss-20b are open-weight, text-reasoning LLMs with MoE architecture and tool-calling features.

  • What hardware enables the reported performance?

    A GB200 NVL72 rack-scale system built on NVIDIA Blackwell GPUs, leveraging NVLink for a single-system experience.

  • What tooling supports deployment?

    TensorRT-LLM for optimized kernels, vLLM for web-server setup, Hugging Face Transformers integration, and Dynamo for disaggregated inference; NIM microservices provide portable enterprise deployment.

  • How can developers start testing these models?

    Use the NVIDIA TensorRT-LLM deployment guide, OpenAI Cookbook guidance, Docker-based environments, and pre-packaged NIM microservices via NVIDIA’s developer ecosystem.

  • Where can I read more or access the original details?

    See the NVIDIA developer blog: https://developer.nvidia.com/blog/delivering-1-5-m-tps-inference-on-nvidia-gb200-nvl72-nvidia-accelerates-openai-gpt-oss-models-from-cloud-to-edge
