How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo

Source: https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/ (NVIDIA Developer Blog)

TL;DR

  • KV Cache offloading reduces GPU memory pressure and enables longer context windows for large-scale LLMs.
  • NVIDIA Dynamo offloads KV Cache from GPU memory to CPU RAM, local SSDs, or remote storage using the low-latency NIXL transfer library.
  • Integration with LMCache and vLLM enables cache reuse, reduces recomputation, and improves throughput in multi-user, high-context workloads.
  • Real-world tests from partners show high-throughput KV Cache movement and reduced Time to First Token, enabling scalable, high-concurrency inference.

Context and background

Inference is increasingly a bottleneck as AI models grow. Large language models (LLMs) rely heavily on the Key-Value (KV) Cache, the intermediate attention data created during the prefill phase that keeps the model contextually relevant during generation. The KV Cache grows linearly with prompt length and must reside in GPU memory during generation for fast access. As context windows expand, sometimes to millions of tokens, the KV Cache becomes a major constraint because GPU memory is both limited and costly. In use cases such as multi-turn conversations, deep research, and code generation, the KV Cache must be kept in memory for extended periods. When GPU memory limits are reached, inference systems face trade-offs that affect cost, latency, and capability.

The latest Dynamo release addresses this bottleneck by enabling KV Cache offloading to more scalable storage, allowing rapid transfer of KV Cache blocks from limited GPU memory to cost-effective storage such as CPU RAM, local SSDs, or remote storage. The offload is powered by the low-latency NIXL transfer library, which moves KV Cache blocks between GPU memory and external storage without interrupting inference. By lowering the need for additional GPUs and avoiding recomputation of cached input tokens, Dynamo enables longer context windows, higher concurrency, and reduced infrastructure costs.

Storage offload is most effective when the KV Cache exceeds GPU memory and the benefit of cache reuse outweighs the transfer overhead, which makes Dynamo especially valuable in long-context, high-concurrency, or resource-constrained inference environments. It is designed to integrate with popular inference engines like vLLM and open-source tools like LMCache, promoting an open architecture and flexible deployment options.
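
The memory pressure is straightforward to estimate: each token adds roughly 2 (keys and values) x layers x KV heads x head dimension x bytes per element to the cache. The minimal sketch below works that arithmetic through for long-context prompts; the model dimensions and dtype are illustrative assumptions, not figures from the Dynamo post.

```python
# Back-of-the-envelope KV Cache sizing (illustrative; model dimensions are assumptions).
# Per token: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element.

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 64,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Approximate KV Cache footprint for a single sequence (FP16/BF16 by default)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return num_tokens * per_token

if __name__ == "__main__":
    for tokens in (8_000, 130_000, 1_000_000):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>9,} tokens -> ~{gib:,.1f} GiB of KV Cache")
```

With these assumed dimensions, a 130K-token prompt already consumes tens of GiB for a single sequence, which is why offloading to CPU RAM, SSDs, or remote storage becomes attractive at scale.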

What’s new

NVIDIA Dynamo introduces several key capabilities to tackle KV Cache bottlenecks:

  • KV Cache offloading to scalable storage: Offload KV Cache blocks from GPU memory to CPU RAM, local SSDs, or networked storage, enabling larger context windows and higher concurrency without requiring more GPUs.
  • Low-latency data transfer: The NIXL library provides fast KV Cache movement between GPU memory and external storage, helping maintain inference latency.
  • KV Cache management with KVBM: The KV Block Manager coordinates memory usage and storage access, separating memory management from specific model engines and standardizing storage access to simplify integration and scalability.
  • Open architecture and integrations: Dynamo works with LMCache to cache and reuse KV data across CPU memory and local/remote storage, and is designed to integrate with vLLM and third-party components. This open approach lets teams choose between built-in functionality and third-party integrations, emphasizing interoperability and scalable cache management.
  • Real-world validation: Tests with Vast demonstrated 35 GB/s of KV Cache throughput to a single H100 GPU using the GDS plugin; separate tests with Qwen3-32B and 130K-token prompts showed reduced Time to First Token (TTFT) when precomputed KV Cache was reused from storage. WEKA demonstrated high-throughput KV Cache movement with its Augmented Memory Grid on a DGX system, reaching read throughput of up to 270 GB/s across eight GPUs.
  • Setup, monitoring, and benchmarking: To use KVBM with LMCache and vLLM, users follow the setup steps in the Dynamo documentation. Grafana dashboards (http://localhost:3001) expose KVBM metrics for monitoring KV offloading and onboarding. For benchmarking KVBM with LMBenchmark (from the LMCache project), the documentation describes baseline comparisons against a standard vLLM deployment with KVBM turned off; a simple TTFT probe is sketched after this list.
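
One way to run such a comparison is to measure Time to First Token directly against the serving endpoint. The sketch below streams a completion from an OpenAI-compatible vLLM server and times the first token; the endpoint address, model name, and prompt are placeholders, and this is a rough stand-in for the full LMBenchmark methodology described in the Dynamo docs.

```python
# Rough TTFT probe against an OpenAI-compatible vLLM endpoint (sketch only; the
# endpoint, model id, and prompt below are illustrative placeholders).
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed vLLM serving address
MODEL = "Qwen/Qwen3-32B"                           # illustrative model id
PROMPT = "Summarize the following document: " + "lorem ipsum " * 2000  # long prompt

def time_to_first_token() -> float:
    """Seconds from sending the request until the first streamed token arrives."""
    payload = {"model": MODEL, "prompt": PROMPT, "max_tokens": 32, "stream": True}
    start = time.perf_counter()
    with requests.post(ENDPOINT, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # vLLM streams Server-Sent Events: "data: {...}" lines, ending in "data: [DONE]".
            if line and line.startswith(b"data: ") and line != b"data: [DONE]":
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token was produced")

if __name__ == "__main__":
    # Run once against a baseline vLLM deployment and once with KVBM/LMCache
    # enabled, reusing the same prompt, to compare cold vs. cache-reuse TTFT.
    print(f"TTFT: {time_to_first_token():.3f} s")
```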

Why it matters (impact for developers/enterprises)

KV Cache offloading enables models with longer context windows and higher user concurrency without requiring proportionally larger GPU clusters. By moving KV Cache data to cost-effective storage, inference services can reduce GPU memory usage, allowing clusters to serve more users simultaneously and lowering the overall cost per token. Offloading also avoids expensive KV Cache recomputation, contributing to faster response times and an improved user experience. For developers and enterprises, Dynamo provides a practical path to scale large-context generative AI deployments. The KV Block Manager (KVBM) standardizes access to storage, making it easier to integrate with a range of engines and storage backends. The open-architecture approach also supports partnerships with storage providers and inference frameworks, helping teams optimize latency, throughput, and total cost of ownership.

Technical details or Implementation

Dynamo’s architecture centers on offloading the KV Cache from GPU memory to scalable storage while maintaining efficient, low-latency access during generation. The KV Cache offload is managed by the Dynamo KV Block Manager (KVBM), which coordinates memory management and storage access across different engines. By decoupling memory management from specific inference engines, KVBM simplifies integration and scalability and enables storage and compute to evolve independently.

A core design principle is openness. Dynamo integrates with LMCache, an open-source system for caching and reusing memory across CPUs and local and remote storage. LMCache provides a KV caching layer for inference engines such as vLLM, and supports offloading frequently used data, such as conversation histories or prompts, to cost-effective storage, along with eviction and retrieval strategies suitable for high-volume workloads.

Partner integrations illustrate Dynamo’s versatility:

  • Vast: Tested high-performance KV Cache movement between GPU and storage on a DGX H100 system, using the Vast OS integration and the GDS plugin to reach 35 GB/s to a single H100 GPU. With Qwen3-32B and 130K-token prompts, reusing persistent KV Cache from storage reduced TTFT.
  • WEKA: Demonstrated an RDMA-based, zero-copy data path that streams KV Cache from a token warehouse to GPUs at near-memory speeds, validating the feasibility of disaggregated inference without bottlenecks. On a DGX system with eight H100 GPUs, WEKA achieved read throughput of up to 270 GB/s across the GPUs.

Storage offload options include CPU RAM, local SSDs, and remote network storage. The NIXL transfer library provides the low-latency transport required to move KV Cache blocks quickly without disrupting ongoing inference. When KV Cache reuse is significant and the benefit of avoiding recomputation outweighs the transfer overhead, Dynamo delivers improved throughput and reduced latency across large-scale deployments; a simplified sketch of this tiered offload flow appears below.

To enable KVBM with LMCache and vLLM, the Dynamo documentation outlines practical steps and environment configurations. Grafana dashboards accessible at http://localhost:3001 offer visibility into KV onboarding and offloading activity, while LMBenchmark guidance helps teams evaluate KVBM performance against baselines.
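
Conceptually, the offload path behaves like a tiered cache: hot KV blocks stay in GPU memory, colder blocks spill to CPU RAM and then to SSD or remote storage, and blocks are onboarded back when a request reuses them. The sketch below is a simplified illustration of that eviction-and-onboarding flow under assumed tier names and capacities; it is not the KVBM or NIXL API.

```python
# Conceptual tiered KV-block cache (illustration only; not the KVBM/NIXL API).
# Hot blocks live in the "gpu" tier; LRU victims cascade to cheaper tiers and are
# promoted back to GPU memory when a request reuses them.
from collections import OrderedDict

TIERS = ["gpu", "cpu_ram", "local_ssd", "remote"]   # assumed tier ordering
CAPACITY = {"gpu": 4, "cpu_ram": 16, "local_ssd": 64, "remote": float("inf")}  # in blocks

class TieredKVCache:
    def __init__(self) -> None:
        self.tiers = {name: OrderedDict() for name in TIERS}

    def _evict(self, tier_idx: int) -> None:
        """If a tier is over capacity, push its least-recently-used block down one tier."""
        name = TIERS[tier_idx]
        tier = self.tiers[name]
        if len(tier) <= CAPACITY[name]:
            return
        block_id, block = tier.popitem(last=False)   # LRU victim
        if tier_idx + 1 < len(TIERS):
            self.tiers[TIERS[tier_idx + 1]][block_id] = block
            self._evict(tier_idx + 1)

    def put(self, block_id: str, block: bytes) -> None:
        """Insert a freshly computed KV block into GPU memory, evicting as needed."""
        self.tiers["gpu"][block_id] = block
        self._evict(0)

    def get(self, block_id: str):
        """Look a block up across tiers; on a hit, onboard it back to GPU memory."""
        for name in TIERS:
            if block_id in self.tiers[name]:
                block = self.tiers[name].pop(block_id)
                self.put(block_id, block)            # promote / re-onboard to GPU
                return block
        return None                                  # miss: the engine must recompute
```

The trade-off mirrors the guidance above: offloading pays off when onboarding a cached block is cheaper than recomputing it during prefill.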

KV Cache offload architecture: a quick look

| Storage backend | Typical use cases |
| --- | --- |
| CPU RAM | Long-context, multi-user inference |
| Local SSDs | Burst workloads, long sessions |
| Remote network storage | Large-scale, distributed inference |

Key takeaways

  • KV Cache offloading with Dynamo reduces GPU memory pressure and enables longer prompts and higher concurrency.
  • The KV Block Manager coordinates memory and storage access, enabling storage-based cache reuse and scalable integration.
  • LMCache and vLLM integrations support caching, eviction strategies, and reduced recomputation, improving throughput.
  • Real-world tests demonstrate substantial transfer speeds and reduced latency, validating Dynamo’s effectiveness in disaggregated inference.
  • An open architecture gives teams flexibility to choose built-in features or third-party integrations while pursuing cost efficiency and scalability.

FAQ

  • What problem does KV Cache offloading solve?

    It moves KV Cache data from GPU memory to scalable storage to support longer context windows and higher concurrency without proportionally more GPUs.

  • What is the role of KVBM in Dynamo?

    KVBM is the system that powers cache offloading and memory coordination, separating memory management from engine-specific logic and standardizing storage access.

  • How does LMCache fit into Dynamo?

    LMCache provides a KV Cache layer for inference engines like vLLM and supports offloading frequently used data to cost-effective storage with smart eviction and retrieval strategies.

  • What performance results have partners demonstrated?

    Tests include 35 GB/s throughput to a single H100 GPU via GDS, and up to 270 GB/s read throughput across eight GPUs with WEKA’s RDMA-based path, plus reduced TTFT when reusing precomputed KV cache from storage.
