Nemotron Nano 2 9B: Open Reasoning Model with 6x Throughput for Edge and Enterprise
Sources: https://huggingface.co/blog/nvidia/supercharge-ai-reasoning-with-nemotron-nano-2, NVIDIA Dev Blog
Overview
NVIDIA Nemotron Nano 2 9B is an open model in the Nemotron family designed for enterprise-grade reasoning and agentic AI. It combines a hybrid Transformer–Mamba backbone with a configurable thinking budget to balance accuracy, throughput and cost, making it suitable for edge and PC footprints while preserving strong reasoning capabilities. The model is released with open weights, open datasets and training techniques to support the open-source community, and it targets reasoning across math, coding, science, tool use and general instruction following. Nemotron Nano 2 is built to fit within common edge GPU memory limits and to deliver low-latency thinking for agent workflows.
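The constant-memory property of the Mamba side of the hybrid backbone is easiest to see in a toy recurrence. The sketch below is illustrative only, not the actual Mamba-2 kernel, and all dimensions are made up: a diagonal linear state-space update touches a fixed-size state per token, so memory per token stays constant, unlike a Transformer's KV cache, which grows with sequence length.

```python
import numpy as np

# Toy diagonal linear state-space recurrence (illustrative, not Mamba-2).
d_state, d_model = 16, 8
A = np.random.uniform(0.9, 0.99, size=d_state)   # per-channel decay
B = np.random.randn(d_state, d_model) * 0.1      # input projection
C = np.random.randn(d_model, d_state) * 0.1      # output projection

state = np.zeros(d_state)                        # fixed-size memory
for token_embedding in np.random.randn(1000, d_model):
    state = A * state + B @ token_embedding      # O(d_state * d_model) per step
    y = C @ state                                # output for this token

# After 1,000 tokens, `state` is still just d_state floats; an attention
# layer would instead be holding 1,000 cached key/value vectors.
```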
Key features
- Hybrid Transformer–Mamba backbone designed for long thinking traces
- Majority of layers are Mamba‑2 selective state‑space modules with linear time and constant memory per token
- Interleaved attention islands preserve Transformer strength for linking distant facts
- 128k context window for long context reasoning
- 6x higher throughput versus the next best open model
- Configurable thinking budget to control how much internal reasoning the model does
- Post‑training process including supervised fine‑tuning on reasoning on/off data, reinforcement learning and preference optimization
- Model compression and distillation from a 12B base to a 9B Nano 2 using pruning and logit‑based distillation
- Open weights, open datasets and training techniques via NVIDIA's open science initiative
- Reasoning modes: Reasoning ON with chain-of-thought tokens, and Reasoning OFF for direct responses
- Configurable thinking budget that can cut reasoning-token generation and potentially lower inference costs by up to 60% (see the sketch after this list)
- Designed to fit within the memory limits of a single NVIDIA A10G GPU while running 128k-context inference
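The post experiments with a thinking budget but does not include the control code. Below is a minimal sketch of how such a budget can be enforced client-side: cap the reasoning phase at a token budget, force-close the think block, then let the model write the visible answer. The model ID, the `<think>`/`</think>` tags and the raw-prompt format are assumptions; a real client would apply the model's chat template, so take the exact controls from the model card.

```python
# Client-side thinking budget sketch against a vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed Hugging Face model ID

def generate_with_budget(prompt: str, budget: int = 256) -> str:
    # Phase 1: let the model think, stopping at the budget or at </think>.
    thinking = client.completions.create(
        model=MODEL,
        prompt=f"{prompt}\n<think>\n",   # assumed tag; real usage applies the chat template
        max_tokens=budget,
        stop=["</think>"],
    ).choices[0].text
    # Phase 2: close the reasoning block and generate the final answer.
    return client.completions.create(
        model=MODEL,
        prompt=f"{prompt}\n<think>\n{thinking}\n</think>\n",
        max_tokens=512,
    ).choices[0].text
```

Truncating the trace at a fixed budget is what makes cost predictable: the reasoning phase can never exceed `budget` tokens regardless of how long the model would otherwise think.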
Common use cases
- Edge and PC/edge footprint deployments where latency matters
- Enterprise‑grade reasoning and agentic AI workflows
- Multistep problem solving across math, coding, science, tool use and safety
- Tool calling and RAG lookups where memory and throughput matter
- Long context reasoning tasks that require sustained thinking without growing memory usage
- Scenarios requiring configurable accuracy and cost through the thinking budget
Setup & installation
The source describes spinning up a vLLM server for Nemotron Nano 2 and experimenting with a thinking budget, and notes that the model will also be downloadable and deployable via NVIDIA NIM in the future. Specific installation commands are not included in the provided material; see the official technical report for detailed setup steps.
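Since the excerpt omits commands, here is a minimal sketch of loading the model with vLLM's offline Python API. The Hugging Face model ID nvidia/NVIDIA-Nemotron-Nano-9B-v2 and the flags shown are assumptions, not from the source; verify them against the model card. The equivalent server-side setup the post alludes to would be `vllm serve <model>` on the CLI.

```python
# Minimal vLLM offline-inference sketch (assumed model ID and flags).
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # assumed model ID
    trust_remote_code=True,                     # hybrid architectures often require this
    max_model_len=131072,                       # 128k context, if GPU memory allows
)
params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain why the sky is blue."], params)
print(outputs[0].outputs[0].text)
```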
Quick start
A minimal, runnable quick start is not provided in the source; the article walks through a vLLM server setup and a thinking budget example. See the technical report for concrete steps, or the hedged sketch below.
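The following is a hedged quick-start sketch, not taken from the source: a chat completion against a locally running vLLM server. The "/think" system-prompt switch for Reasoning ON is an assumption based on common Nemotron conventions; confirm the actual control strings in the model card.

```python
# Quick-start sketch: chat completion against a local vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # assumed model ID
    messages=[
        {"role": "system", "content": "/think"},  # assumed Reasoning ON switch
        {"role": "user", "content": "A train covers 120 km in 1.5 hours. What is its average speed?"},
    ],
    max_tokens=1024,
    temperature=0.6,
)
print(response.choices[0].message.content)
```

Switching the system prompt to the Reasoning OFF control (e.g. "/no_think", again an assumption) should yield a direct answer without a chain-of-thought trace.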
Pros and cons
Pros
- Leading accuracy in its size class across reasoning tasks
- High throughput enabling low‑latency agent steps
- Open weights and data to support community experimentation
- Flexible thinking budget to right-size accuracy and cost
Cons
- Requires careful memory budgeting and hardware planning (the post details fitting within A10G memory limits)
- Complex compression and distillation pipeline (teacher–student setup; see the sketch after this list)
- Tuning the thinking budget for different domains may require experimentation
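The teacher-student step mentioned above typically minimizes a KL divergence between softened teacher and student token distributions. The post does not give NVIDIA's exact recipe, so the following is a generic logit-based distillation loss in PyTorch, for orientation only.

```python
import torch
import torch.nn.functional as F

# Generic logit-based distillation loss (standard technique, not NVIDIA's
# exact recipe): the 9B student matches the 12B teacher's softened
# next-token distributions via KL divergence.
def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, scaled by T^2 to keep gradient magnitudes comparable
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

# Example with random logits of shape (batch, seq, vocab):
student = torch.randn(2, 16, 32000)
teacher = torch.randn(2, 16, 32000)
print(distillation_loss(student, teacher).item())
```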
Alternatives (brief comparisons)
- 12B base Nemotron model is used as the teacher for distillation to obtain the 9B Nano 2; the 12B base consumes about 22.9 GiB of memory for weights (bfloat16)
- The Nano 2 9B is designed to fit within the A10G memory limit with a target budget of about 19.66 GiB and a 5% buffer
- Other open models in the Nemotron family aim to balance accuracy and throughput; Nano 2 claims a 6x throughput advantage over the next best open model

| Model | Parameters | Context | Throughput note | Memory / budget | Notes |
|---|---:|---:|---:|---:|---|
| Nemotron Nano 2 9B | 9B | 128k | 6x higher than next best open model | 19.66 GiB budget; 5% buffer; 1.3 GiB for vision encoder | Open weights, datasets and training techniques; post-training and distillation used |
| Nemotron 12B base (teacher) | 12B | 128k | — | 22.9 GiB for weights (bfloat16) | Used as teacher for distillation to Nano 2; larger memory footprint |
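To make the memory figures in the table concrete, here is a quick bf16 weight-footprint check. The arithmetic is mine, not from the post, and uses the nominal 12B/9B parameter counts, so it only approximates the cited numbers.

```python
# bf16 stores 2 bytes per parameter, so weights take params * 2 / 2**30 GiB.
def bf16_weight_gib(n_params: float) -> float:
    return n_params * 2 / 2**30

print(f"12B teacher: {bf16_weight_gib(12e9):.1f} GiB")  # ~22.4 GiB; the post cites 22.9 GiB (base is slightly over 12B params)
print(f"9B Nano 2:   {bf16_weight_gib(9e9):.1f} GiB")   # ~16.8 GiB of weights, leaving headroom in the ~19.66 GiB budget for KV cache and the 1.3 GiB vision encoder
```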
Pricing or License
The post emphasizes open weights, open datasets and training techniques as part of NVIDIA's open science initiative. No pricing details are provided in the material.
More resources
CUDA Toolkit 13.0 for Jetson Thor: Unified Arm Ecosystem and More
Unified CUDA toolkit for Arm on Jetson Thor with full memory coherence, multi-process GPU sharing, OpenRM/dmabuf interoperability, NUMA support, and better tooling across embedded and server-class targets.
Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap
Leverage GPU memory swap (model hot-swapping) to share GPUs across multiple LLMs, reduce idle GPU costs, and improve autoscaling while meeting SLAs.
Improving GEMM Kernel Auto-Tuning Efficiency with nvMatmulHeuristics in CUTLASS 4.2
Introduces nvMatmulHeuristics to quickly select a small set of high-potential GEMM kernel configurations for CUTLASS 4.2, drastically reducing auto-tuning time while approaching exhaustive-search performance.
Make ZeroGPU Spaces faster with PyTorch ahead-of-time (AoT) compilation
Learn how PyTorch AoT compilation speeds up ZeroGPU Spaces by exporting a compiled model once and reloading instantly, with FP8 quantization, dynamic shapes, and careful integration with the Spaces GPU workflow.
Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training
Guide to fine-tuning gpt-oss with SFT + QAT to recover FP4 accuracy while preserving efficiency, including upcasting to BF16, MXFP4, NVFP4, and deployment with TensorRT-LLM.
How Small Language Models Are Key to Scalable Agentic AI
Explores how small language models enable cost-effective, flexible agentic AI alongside LLMs, with NVIDIA NeMo and Nemotron Nano 2.