
Reducing Cold Start Latency for LLM Inference with NVIDIA Run:ai Model Streamer

Source: NVIDIA Developer Blog, https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/

TL;DR

  • NVIDIA Run:ai Model Streamer is an open source Python SDK that concurrently reads model weights from storage and streams them directly into GPU memory to reduce cold-start latency for LLM inference. NVIDIA Dev Blog
  • Benchmark results across storage types show significant reductions in total readiness time when using Model Streamer, especially on higher-throughput storage; in some cases, times drop by more than 70% versus baselines like the HF Safetensors Loader or Tensorizer. NVIDIA Dev Blog
  • The study highlights that storage choice and concurrency are key levers to achieve faster time-to-inference for cloud-based LLM deployments. NVIDIA Dev Blog

Context and background

Deploying large language models (LLMs) for production inference faces a persistent bottleneck: cold-start latency. Loading model weights, often tens to hundreds of gigabytes, into GPU memory can stall responsiveness and complicate scaling under unpredictable demand. In cloud and hybrid environments, the download-then-load workflow compounds this latency further. NVIDIA's Run:ai Model Streamer, an open source Python SDK, aims to mitigate these delays by overlapping storage reads with GPU transfers. Its backend is a high-performance C++ engine that reads tensors concurrently from storage into CPU buffers and streams them to GPU memory, so storage reads and PCIe transfers proceed simultaneously. NVIDIA Dev Blog

What’s new

  • The Model Streamer architecture enables concurrency: multiple threads read tensors from storage into CPU memory while previously read tensors are moved from CPU to GPU, overlapping storage I/O with PCIe transfers to reduce overall loading time. Because the CPU and GPU subsystems operate largely independently, both stages run in parallel with minimal CPU intervention, maximizing streaming throughput. NVIDIA Dev Blog
  • It remains compatible with the Safetensors format, avoiding weight conversion in common workflows. In benchmarking, Model Streamer combined with vLLM was contrasted against the HF Safetensors Loader and CoreWeave Tensorizer across storage options (a usage sketch follows this list). NVIDIA Dev Blog
  • The experiments used a mix of storage backends—GP3 SSD, IO2 SSD, and Amazon S3—and highlighted how throughput limits of storage hardware become the practical cap on gains for both Model Streamer and competing loaders. NVIDIA Dev Blog
  • In cloud tests, Model Streamer demonstrated clear advantages in total readiness time when comparing against the HF Safetensors Loader and Tensorizer, including strong gains on S3. NVIDIA Dev Blog
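To make the vLLM comparison above concrete, the snippet below sketches how a model might be loaded through the streamer from Python. It assumes vLLM exposes a runai_streamer load format and a concurrency knob via model_loader_extra_config; treat the parameter names, config key, and model identifier as illustrative and verify them against your vLLM version and the Model Streamer documentation.

```python
# Hedged sketch: loading a Safetensors checkpoint into vLLM via the Run:ai
# Model Streamer load format. Assumed knobs: load_format="runai_streamer" and
# a "concurrency" key in model_loader_extra_config (verify against your vLLM docs).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",       # any Safetensors checkpoint (illustrative)
    load_format="runai_streamer",                    # stream weights concurrently at load time
    model_loader_extra_config={"concurrency": 16},   # assumed: number of reader threads
)

outputs = llm.generate(["Say hello in one sentence."])
print(outputs[0].outputs[0].text)
```

The only change relative to a default vLLM setup is the loader configuration; the serving code itself stays the same, which is what lets the streamer reduce readiness time without touching the inference path.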

Why it matters (impact for developers/enterprises)

  • Reduced cold-start latency directly improves end-user experience and operational efficiency for serving LLMs in production. Faster readiness times enable more responsive chatbots, faster content generation, and better scale-out under burst demand. NVIDIA Dev Blog
  • For deployments in the cloud, the ability to saturate storage throughput and overlap reads with GPU transfers helps reduce total time-to-inference, which can translate to lower latency SLAs and more predictable autoscaling behavior. NVIDIA Dev Blog
  • The work emphasizes that choices around storage hardware and concurrency configurations materially affect LLM serving performance, guiding infrastructure decisions for enterprises running large models. NVIDIA Dev Blog

Technical details or Implementation

NVIDIA’s Model Streamer is designed to accelerate loading of large model weights into GPU memory from diverse storage sources, including network file systems, cloud storage, and local disks. The core idea is to read and transfer tensors concurrently: while some tensors are being read from storage into CPU memory, others are moved from CPU memory to GPU memory over PCIe, overlapping storage I/O with host-to-device transfers in real time. The multi-threaded backend assigns a unique identifier to each tensor, enabling parallel reading and transfer while preserving tensor boundaries and data layout. The model loaders and the inference engine (vLLM) were tested on an AWS g5.12xlarge instance with NVIDIA A10G GPUs and 2nd Gen AMD EPYC CPUs, a balanced configuration for high-throughput parallel data handling. NVIDIA Dev Blog
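The following is a minimal conceptual sketch of this producer-consumer pattern in Python, not the Model Streamer's C++ implementation. The read_tensor helper is a hypothetical stand-in for whatever reads one tensor from storage, and the sketch assumes a CUDA device is available.

```python
# Conceptual sketch only: overlap concurrent storage reads with CPU-to-GPU copies
# using a reader thread pool, a bounded queue, pinned buffers, and a CUDA stream.
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

import torch

def stream_to_gpu(tensor_names, read_tensor, device="cuda:0", concurrency=16):
    done = object()                                # sentinel marking end of stream
    q = queue.Queue(maxsize=concurrency * 2)       # bounded queue gives backpressure
    copy_stream = torch.cuda.Stream()              # dedicated stream for H2D copies

    def reader(name):
        # read_tensor is hypothetical: returns a CPU torch.Tensor for one tensor
        cpu_t = read_tensor(name).pin_memory()     # pinned memory enables async copies
        q.put((name, cpu_t))

    def produce():
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            list(pool.map(reader, tensor_names))   # concurrent storage reads
        q.put(done)

    threading.Thread(target=produce, daemon=True).start()

    gpu_tensors = {}
    with torch.cuda.stream(copy_stream):
        while True:
            item = q.get()
            if item is done:
                break
            name, cpu_t = item
            # The copy runs on copy_stream while reader threads keep fetching from storage.
            gpu_tensors[name] = cpu_t.to(device, non_blocking=True)
    copy_stream.synchronize()                      # ensure all copies have landed
    return gpu_tensors
```

The key point the sketch illustrates is that the reader threads and the host-to-device copies never wait on each other: storage bandwidth and PCIe bandwidth are consumed at the same time, which is where the cold-start savings come from.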

Comparisons and benchmarks

  • Model Streamer was benchmarked against the HF Safetensors Loader and CoreWeave Tensorizer under cold-start conditions across three storage types: GP3 SSD, IO2 SSD, and Amazon S3. The HF Safetensors Loader memory-maps the weights file for zero-copy loading on the CPU and then transfers tensors to the GPU with cudaMemcpy, but it does not support S3, so it was excluded from those tests. Tensorizer streams model data tensor by tensor from HTTP/HTTPS or S3 sources, which requires converting weights into its serialized format in some workflows. NVIDIA Dev Blog
  • In the GP3 SSD tests, Model Streamer reached a throughput of roughly 1 GiB/s, and the benefit of higher concurrency was clear: a single thread loaded the model in 47.56 seconds, while concurrency 16 cut that to 14.34 seconds. The HF Safetensors Loader landed close to the single-thread figure at 47.99 seconds, and Tensorizer improved from 50.74 seconds (1 worker) to 16.11 seconds (16 workers), showing the same concurrency-driven pattern. The GP3 volume's bandwidth became the practical bottleneck for further gains. NVIDIA Dev Blog
  • With IO2 SSD, the gains for Model Streamer were even larger: a single concurrent thread took 43.71 seconds, while raising concurrency to 8 dropped the time to 7.53 seconds (roughly 2 GiB/s). The HF Safetensors Loader remained around 47 seconds at low concurrency, and Tensorizer achieved 10.36 seconds with eight workers. The tests observed practical throughput ceilings of about 2 GiB/s for Model Streamer and 1.6 GiB/s for Tensorizer, suggesting limits of the AWS environment rather than of the loaders. NVIDIA Dev Blog
  • On S3, since the HF Safetensors Loader does not support it, the comparison covered only Model Streamer and Tensorizer. Model Streamer delivered a total readiness time of 23.18 seconds, while Tensorizer required 65.18 seconds at best (16 workers), a strong advantage for Model Streamer on cloud object storage. Caching effects were observed when tests were repeated in quick succession, so to ensure cold-start conditions the authors imposed a minimum wait between runs. NVIDIA Dev Blog

Consolidated results for vLLM integrations

  • For the end-to-end setup with vLLM, Model Streamer reduced total readiness time to 35.08 seconds on GP3 and 28.28 seconds on IO2, versus 66.13 seconds (GP3) and 62.69 seconds (IO2) for the HF Safetensors Loader. Tensorizer achieved 36.19 seconds (GP3) and 30.88 seconds (IO2). On S3, Model Streamer reached 23.18 seconds total readiness, while Tensorizer required 65.18 seconds. These results illustrate Model Streamer's efficiency advantage on both block storage (EBS volumes) and cloud object storage (S3). NVIDIA Dev Blog

Practical takeaways and setup guidance

  • The experiments underscore two practical levers for reducing cold-start latency: storage throughput and concurrency. Higher-throughput storage (IO2) combined with a tuned concurrency level yielded the most dramatic improvements for Model Streamer in cloud environments. NVIDIA Dev Blog
  • When using S3, Model Streamer consistently outperformed Tensorizer in the tested configurations, highlighting its suitability for cloud-based LLM deployment. The caching behavior observed on AWS S3 also underscores the importance of enforcing cold-start conditions, for example by waiting between runs, when benchmarking in production-like setups. NVIDIA Dev Blog
  • Compatibility with the Safetensors format means teams can adopt Model Streamer without a weight format conversion in many workflows, preserving existing tooling while gaining faster load times (a standalone usage sketch follows). NVIDIA Dev Blog
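The snippet below sketches standalone use of the SDK: stream a Safetensors file and move tensors to the GPU as they arrive. The class and method names (SafetensorsStreamer, stream_file, get_tensors) and the RUNAI_STREAMER_CONCURRENCY environment variable follow the project's public README as best recalled here and should be treated as assumptions to verify against the current documentation; the file path is a placeholder.

```python
# Hedged sketch of standalone SDK usage; verify names against the Model Streamer README.
import os

os.environ["RUNAI_STREAMER_CONCURRENCY"] = "16"    # assumed knob for reader threads

from runai_model_streamer import SafetensorsStreamer

file_path = "/path/to/model.safetensors"           # placeholder path
gpu_tensors = {}

with SafetensorsStreamer() as streamer:
    streamer.stream_file(file_path)                # start concurrent reads from storage
    for name, tensor in streamer.get_tensors():    # tensors yielded as reads complete
        gpu_tensors[name] = tensor.to("cuda:0", non_blocking=True)
```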

Key takeaways

  • Concurrency reduces model loading time by overlapping storage reads with GPU transfers.
  • Storage throughput is a practical ceiling; higher-throughput storage enables larger gains.
  • Model Streamer provides clear advantages on GP3, IO2, and S3 storage compared to HF Safetensors Loader and Tensorizer in the tested configurations.
  • S3-based deployments benefit particularly from Model Streamer in terms of total readiness time, although caching effects should be accounted for in repeated tests.
  • The approach reinforces the value of aligning storage strategy with model loading pipelines to shorten time-to-inference for LLMs.

FAQ

  • How does concurrency affect loading times?

    Increasing concurrency (more threads reading from storage to CPU memory) significantly reduces model loading time, up to the storage throughput limit. For example, on GP3, concurrency 16 reduced loading time from about 48 s to 14 s; on IO2, concurrency of 8 yielded about 7.5 s. [NVIDIA Dev Blog](https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/)

  • How does Model Streamer compare with HF Safetensors Loader and Tensorizer?

    Model Streamer generally outperformed both in the tested scenarios, especially on higher-throughput storage and on cloud object storage (S3). For example, the S3 tests showed 23.18 s with Model Streamer vs 65.18 s for Tensorizer; the GP3 and IO2 benchmarks also favored Model Streamer under tuned concurrency. [NVIDIA Dev Blog](https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/)

  • Are there storage limitations to consider?

    Yes. Even with high concurrency, the practical throughput limits of GP3 and IO2 storage affected gains, indicating that storage infrastructure can cap improvements. The tests note AWS infrastructure-related ceilings rather than loader limitations. [NVIDIA Dev Blog](https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/)

  • Is Safetensors format conversion required to use Model Streamer?

    No. Model Streamer remains compatible with the Safetensors format, avoiding weight conversion in typical workflows. [NVIDIA Dev Blog](https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/)
