How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo
Source: NVIDIA Dev Blog, https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
TL;DR
- KV Cache offloading reduces GPU memory pressure and enables longer context windows for large-scale LLMs.
- NVIDIA Dynamo offloads KV Cache from GPU memory to CPU RAM, local SSDs, or remote storage using the low-latency NIXL transfer library.
- Integration with LMCache and vLLM enables cache reuse, reduces recomputation, and improves throughput in multi-user, high-context workloads.
- Real-world tests from partners show high-throughput KV Cache movement and reduced Time to First Token, enabling scalable, high-concurrency inference.
Context and background
Inference is increasingly a bottleneck as AI models grow. Large language models (LLMs) rely heavily on attention data stored in the Key-Value (KV) Cache created during the prefill phase of inference. The KV Cache stores intermediate attention data that helps the model stay contextually relevant during generation. However, the KV Cache grows linearly with prompt length and must reside in GPU memory during generation for fast access. As models expand context windows, sometimes reaching millions of tokens, the KV Cache becomes a major constraint because GPU memory is both limited and costly.
In use cases like multi-turn conversations, deep research, and code generation, the KV Cache must be kept in memory for extended periods. When GPU memory limits are reached, inference systems face trade-offs that impact cost, latency, and capability.
The latest Dynamo release addresses this bottleneck by enabling KV Cache offloading to more scalable storage, allowing KV Cache blocks to move from limited GPU memory to cost-effective storage such as CPU RAM, local SSDs, or remote storage. The offloading is powered by the low-latency NIXL transfer library, which moves KV Cache blocks between GPU memory and external storage without interrupting inference. By lowering the need for additional GPUs and reducing recomputation of cached input tokens, Dynamo enables longer context windows, higher concurrency, and reduced infrastructure costs. Storage offload is most effective when the KV Cache exceeds GPU memory and cache reuse outweighs the transfer overhead, which makes Dynamo especially valuable in long-context, high-concurrency, or resource-constrained inference environments. It is designed to integrate with popular inference engines like vLLM and open-source tools like LMCache, promoting open architecture and flexible deployment options.
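To make the memory pressure concrete, the KV Cache footprint can be estimated from the model's layer count, KV head count, head dimension, and sequence length. The sketch below is a back-of-the-envelope calculation; the model dimensions are illustrative assumptions for a generic dense transformer, not the published configuration of any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Estimate KV Cache size: two tensors (K and V) per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative dimensions for a hypothetical dense transformer with grouped-query
# attention (assumptions for this example), storing the cache in FP16.
size = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128, seq_len=130_000)
print(f"~{size / 1e9:.1f} GB of KV Cache for one 130K-token sequence")
```

Multiply an estimate like this by the number of concurrent sequences and the cache quickly exceeds the HBM left over after model weights, which is the pressure Dynamo's offloading is designed to relieve.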
What’s new
NVIDIA Dynamo introduces several key capabilities to tackle KV Cache bottlenecks:
- KV Cache offloading to scalable storage: Offload KV Cache blocks from GPU memory to CPU RAM, local SSDs, or networked storage, enabling larger context windows and higher concurrency without requiring more GPUs.
- Low-latency data transfer: The NIXL library provides fast KV Cache movement between GPU memory and external storage, helping maintain inference latency.
- KV Cache management with KVBM: The KV Block Manager coordinates memory usage and storage access, separating memory management from specific model engines and standardizing storage access to simplify integration and scalability.
- Open architecture and integrations: Dynamo can work with LMCache for caching and reusing memory across CPUs, local/remote storage, and is designed to integrate with vLLM and third-party components.
- Real-world validation: Tests with Vast demonstrated 35 GB/s of throughput to a single H100 GPU using the GDS plugin for KV Cache movement; separate tests with Qwen3-32B and 130K-token prompts showed reduced Time to First Token (TTFT) when precomputed KV Cache was reused from storage. WEKA demonstrated high-throughput KV Cache movement using its Augmented Memory Grid with a DGX system, reaching read throughput of up to 270 GB/s across eight GPUs.
- Setup and monitoring: To use KVBM with LMCache and vLLM, follow the setup steps in the Dynamo documentation (see the configuration sketch below). Grafana dashboards at http://localhost:3001 expose KVBM metrics for monitoring KV Cache offloading and onboarding.
- Benchmarking: The documentation also covers benchmarking KVBM with LMBenchmark (from LMCache), including baseline comparisons against a standard vLLM deployment with KVBM turned off.
This open architecture supports choosing between built-in functionality and third-party integrations, emphasizing interoperability and scalable cache management.
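The exact enablement steps live in the Dynamo documentation. As a rough illustration of what an LMCache-backed vLLM deployment can look like, the sketch below uses vLLM's KV-transfer connector interface; the connector name, environment variables, and model choice are assumptions for illustration and should be checked against the Dynamo and LMCache docs for the versions you run.

```python
import os

# Illustrative LMCache settings (assumed variable names; consult the LMCache
# docs for the authoritative list). Here we offload KV blocks to CPU RAM.
os.environ.setdefault("LMCACHE_CHUNK_SIZE", "256")         # tokens per cached block
os.environ.setdefault("LMCACHE_LOCAL_CPU", "True")         # enable the CPU RAM tier
os.environ.setdefault("LMCACHE_MAX_LOCAL_CPU_SIZE", "40")  # GB of host RAM to use

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route vLLM's KV Cache through the LMCache connector (the connector name and
# role values are assumptions; verify against your vLLM/LMCache versions).
llm = LLM(
    model="Qwen/Qwen3-32B",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
)

# Repeated long prompts can now hit the offloaded cache instead of re-running prefill.
out = llm.generate(["Summarize the following design doc: ..."],
                   SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```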
Why it matters (impact for developers/enterprises)
KV Cache offloading enables models with longer context windows and higher user concurrency without requiring proportionally larger GPU clusters. By moving KV Cache data to cost-effective storage, inference services can reduce GPU memory usage, allowing clusters to serve more users simultaneously and lowering the overall cost per token. Offloading also avoids expensive KV Cache recomputation, contributing to faster response times and an improved user experience. For developers and enterprises, Dynamo provides a practical path to scaling large-context generative AI deployments. The KV Block Manager (KVBM) standardizes access to storage, making it easier to integrate with a range of engines and storage backends. The open-architecture approach also supports partnerships with storage providers and inference frameworks, helping teams optimize latency, throughput, and total cost of ownership.
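As a rough illustration of the concurrency argument, consider how many long-context sessions can keep their KV Cache resident before and after offloading. All numbers below are illustrative assumptions, not measurements from the post.

```python
# Back-of-the-envelope: long-context sessions that fit before and after offload.
# All capacities and sizes are assumptions for illustration.
GPU_HBM_GB        = 80    # e.g., a single H100
MODEL_WEIGHTS_GB  = 60    # hypothetical weight footprint on this GPU
KV_PER_SESSION_GB = 34    # hypothetical 130K-token session (see earlier estimate)
HOST_RAM_GB       = 2048  # CPU RAM tier available for offload

hbm_left = GPU_HBM_GB - MODEL_WEIGHTS_GB
print("Sessions resident in HBM only:      ", hbm_left // KV_PER_SESSION_GB)
print("Sessions resident with CPU offload: ", (hbm_left + HOST_RAM_GB) // KV_PER_SESSION_GB)
```

In practice, active blocks still have to be staged back into HBM during generation; the point is that idle or reusable cache no longer has to occupy scarce GPU memory.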
Technical details and implementation
Dynamo’s architecture centers on offloading the KV Cache from GPU memory to scalable storage while maintaining efficient, low-latency access during generation. The offload is managed by the Dynamo KV Block Manager (KVBM), which coordinates memory management and storage access across different engines. By decoupling memory management from specific inference engines, KVBM simplifies integration and scalability and lets storage and compute evolve independently.
A core design principle is openness. Dynamo integrates with LMCache, an open-source system for caching and reusing KV data across CPU memory and local and remote storage. LMCache provides a KV caching layer for inference engines such as vLLM and supports offloading frequently used data, such as conversation histories or prompts, to cost-effective storage, along with eviction and retrieval strategies suited to high-volume workloads.
Partner integrations illustrate Dynamo’s versatility:
- Vast: Tested high-performance KV Cache movement between GPU memory and storage using the Vast OS integration on a DGX H100 system, demonstrating persistent KV Cache reuse with Qwen3-32B prompts.
- WEKA: Demonstrated an RDMA-based, zero-copy data path that streams KV Cache from a token warehouse to GPUs at near-memory speeds, validating the feasibility of disaggregated inference without bottlenecks. On a DGX system with eight H100 GPUs, WEKA achieved read throughput of up to 270 GB/s across the GPUs.
Storage offload options include CPU RAM, local SSDs, and remote network storage, with the NIXL transfer library providing the low-latency transport needed to move KV Cache blocks quickly without disrupting ongoing inference. When cache reuse is significant and the benefit of avoiding recomputation outweighs the transfer overhead, Dynamo delivers improved throughput and reduced latency across large-scale deployments.
To enable KVBM with LMCache and vLLM, the Dynamo documentation outlines the required steps and environment configuration. Grafana dashboards at http://localhost:3001 offer visibility into KV Cache offloading and onboarding activity, while the LMBenchmark guidance helps teams evaluate KVBM performance against baselines.
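The "reuse must outweigh transfer overhead" condition can be reasoned about with simple arithmetic. The sketch below compares recomputing a long prefill against reloading its KV blocks from storage; the prefill speed and cache size are placeholder assumptions, while the 35 GB/s link echoes the Vast GDS measurement above.

```python
# Back-of-the-envelope: is it faster to reload KV Cache than to recompute prefill?
# All inputs are illustrative assumptions.
prompt_tokens       = 130_000   # long-context prompt
prefill_tok_per_sec = 10_000    # assumed prefill throughput for a large model
kv_cache_gb         = 34        # assumed KV footprint of that prompt
link_gb_per_sec     = 35        # e.g., the GDS path measured in the Vast test

recompute_s = prompt_tokens / prefill_tok_per_sec
reload_s    = kv_cache_gb / link_gb_per_sec

print(f"Recompute prefill:   {recompute_s:.1f} s")
print(f"Reload from storage: {reload_s:.1f} s")
print("Offload pays off" if reload_s < recompute_s else "Recompute is cheaper")
```

The further the reload time drops below the recompute time, the larger the TTFT win, which matches the reduced-TTFT results reported when precomputed caches are reused from storage.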
KV Cache offload architecture: a quick look
| Storage backend | Typical use cases |
| --- | --- |
| CPU RAM | Long-context, multi-user inference |
| Local SSDs | Burst workloads, long sessions |
| Remote network storage | Large-scale, distributed inference |
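How KV blocks are steered to these tiers depends on the deployment. As one hedged illustration, LMCache exposes configuration for CPU, local-disk, and remote tiers; the variable names, paths, and sizes below are assumptions modeled on LMCache-style settings and should be verified against the LMCache documentation for your release.

```python
import os

# Illustrative tier configuration (assumed LMCache-style settings; names, paths,
# and sizes are placeholders, not verified values).
tiering = {
    "LMCACHE_LOCAL_CPU": "True",                       # CPU RAM: hot, low-latency tier
    "LMCACHE_MAX_LOCAL_CPU_SIZE": "64",                # GB of host RAM for KV blocks
    "LMCACHE_LOCAL_DISK": "file:///nvme/kv_cache/",    # local SSD: bursts, long sessions
    "LMCACHE_MAX_LOCAL_DISK_SIZE": "512",              # GB of SSD capacity
    "LMCACHE_REMOTE_URL": "lm://cache-server:65432",   # remote tier for shared reuse
}
os.environ.update(tiering)
# With the environment set, launch vLLM with the LMCache connector as sketched
# earlier so cached blocks spill across the configured tiers.
```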
Key takeaways
- KV Cache offloading with Dynamo reduces GPU memory pressure and enables longer prompts and higher concurrency.
- The KV Block Manager coordinates memory and storage access, enabling storage-based cache reuse and scalable integration.
- LMCache and vLLM integrations support caching, eviction strategies, and reduced recomputation, improving throughput.
- Real-world tests demonstrate substantial transfer speeds and reduced latency, validating Dynamo’s effectiveness in disaggregated inference.
- An open architecture gives teams flexibility to choose built-in features or third-party integrations while pursuing cost efficiency and scalability.
FAQ
- What problem does KV Cache offloading solve? It moves KV Cache data from GPU memory to scalable storage to support longer context windows and higher concurrency without proportionally more GPUs.
- What is the role of KVBM in Dynamo? KVBM powers cache offloading and memory coordination, separating memory management from engine-specific logic and standardizing storage access.
- How does LMCache fit into Dynamo? LMCache provides a KV Cache layer for inference engines like vLLM and supports offloading frequently used data to cost-effective storage with smart eviction and retrieval strategies.
- What performance results have partners demonstrated? Tests include 35 GB/s of throughput to a single H100 GPU via GDS, up to 270 GB/s of read throughput across eight GPUs with WEKA's RDMA-based path, and reduced TTFT when reusing precomputed KV Cache from storage.
References
- NVIDIA Dev Blog: https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
- Grafana metrics dashboard: http://localhost:3001