Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6

Source: https://developer.nvidia.com/blog/build-high-performance-vision-ai-pipelines-with-nvidia-cuda-accelerated-vc-6/ (NVIDIA Developer Blog)

TL;DR

  • GPU compute throughput is rising faster than traditional data pipelines; this gap can cause GPU starvation when I/O, PCIe transfers, and CPU-bound steps don’t keep pace with model workloads. VC-6 is a codec designed for massively parallel execution that maps naturally to GPUs.
  • NVIDIA and V-Nova’s CUDA-accelerated VC-6 (alpha) provides native multi-resolution hierarchy, selective decoding, and selective data recall to reduce I/O and memory bandwidth while delivering AI-ready tensors on the GPU.
  • The CUDA path ships as a native CUDA library with a pre-compiled Python wheel (vc6_cuda12) for easy installation; decoder outputs can expose __cuda_array_interface__ to work with CuPy, PyTorch, and nvImageCodec. Partial decode and ROI support let you fetch only the bytes needed for a target LoQ or region.
  • Early tests on the DIV2K dataset show I/O savings of approximately 37% to 72% (depending on the target resolution), reflecting lower network, storage, PCIe, and VRAM traffic versus full-resolution decoding. The approach aligns VC-6’s flexible hierarchy with CUDA’s SIMT model to maximize throughput.
  • Even in its early alpha form, the CUDA path demonstrates significant performance gains over OpenCL and CPU while laying the groundwork for further batching optimizations and GPU-wide efficiency improvements. For more details, see the NVIDIA Dev Blog post on CUDA-accelerated VC-6.

Context and background

As GPU performance scales, data pipelines must keep pace to avoid wasting compute cycles on the accelerator. Traditional codecs typically present a flat block of pixels and require reading the entire file to satisfy any output resolution, forcing substantial data movement and CPU-side work. VC-6 (SMPTE ST 2117-1) is designed from the ground up for modern compute architectures, generating an efficient multi-resolution hierarchy rather than a single image block. This structure enables selective decoding and fetching, allowing different components (color planes, echelons, or tiles) to be accessed independently and in parallel. The result is a pathway from compressed bits to model-ready tensors that better matches the parallelism of GPUs.

The standard’s multi-resolution hierarchy enables powers-of-two downscaling (e.g., 8K → 4K → Full HD): decoding progressively reconstructs the image from a root LoQ up through finer levels by upsampling each level and adding its residuals. The architectural principles behind VC-6 emphasize low inter-dependency and orthogonal hierarchies, which support concurrent processing and high throughput; this contrasts with the serial, CPU-bound stages of traditional pipelines and maps naturally onto the massively parallel work performed by GPUs.

In AI training pipelines, frameworks such as PyTorch schedule data loading with parallelism to hide latency. V-Nova’s effort to tailor VC-6 for CUDA aims to minimize CPU copies and synchronization points and to maximize GPU utilization in demanding AI workloads, reflecting a broader industry push to reduce data-to-tensor latency and to enable rapid iteration of vision models on accelerated hardware.

What makes VC-6 especially relevant for AI pipelines is its selective data recall. Instead of reading entire bitstreams, consumers fetch only the bytes needed for the target LoQ, ROI, or plane. This reduces network transfer, storage I/O, PCIe bandwidth, and memory traffic, which translates into higher effective throughput for data loaders and more room to increase batch sizes without changing model code. In practice, this means faster bitstream-to-tensor conversion and a smoother feed of data to the accelerator.
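To make the reconstruction idea concrete, here is a minimal NumPy sketch of a residual pyramid. It illustrates only the data flow described above (root LoQ, upsample, add residuals); the function names and the nearest-neighbour upsampler are illustrative assumptions, not VC-6’s actual filters or API.

    import numpy as np

    def upsample2x(plane: np.ndarray) -> np.ndarray:
        # Nearest-neighbour 2x upsample. VC-6's real upsampler is more
        # sophisticated; this only illustrates the shape of the data flow.
        return plane.repeat(2, axis=0).repeat(2, axis=1)

    def reconstruct(root: np.ndarray, residuals: list) -> np.ndarray:
        # Start at the coarsest level-of-quality (LoQ) and refine upward:
        # each step upsamples the previous level and adds its residual.
        image = root
        for residual in residuals:
            image = upsample2x(image) + residual
        return image

    # Stopping early yields a valid lower-resolution image without ever
    # touching the bytes that encode the finer levels.
    root = np.zeros((4, 4))
    residuals = [np.ones((8, 8)), np.ones((16, 16))]
    full = reconstruct(root, residuals)        # 16x16 image
    half = reconstruct(root, residuals[:1])    # 8x8 image, fewer bytes read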

What’s new

The CUDA-accelerated VC-6 path represents a dedicated effort to translate VC-6’s architectural strengths into a GPU-first AI workflow. The current release is an alpha-stage CUDA implementation, with native batching and additional optimizations planned as AI requirements evolve. Porting VC-6 from OpenCL to CUDA unlocks deeper integration with common AI tooling and eliminates extra CPU copies and synchronization points, enabling a more seamless path from storage to tensor on modern NVIDIA GPUs. Key user-facing capabilities include:

  • A native CUDA library that exploits VC-6’s hierarchical, multi-resolution structure for AI workloads.
  • A pre-compiled Python wheel (vc6_cuda12) that simplifies installation via pip and provides a high-level interface to encode, decode, and transcode VC-6 bitstreams.
  • Decoder outputs that implement the CUDA array interface; when the output memory type is GPU_DEVICE, decoded images expose __cuda_array_interface__ and can be used directly with CuPy, PyTorch, and nvImageCodec (see the sketch after this list).
  • Partial decoding and region-of-interest (ROI) support; decode functions accept optional ROI parameters and can operate on in-memory bitstreams, fetching only the data required, which enables partial recall and faster data access.
  • Mechanisms to peek at headers and report target LoQ sizes, enabling efficient preflight sizing for memory and I/O planning.
  • A trajectory toward greater batching and reduced kernel-level overhead via CUDA graphs and kernel fusion, with the goal of higher throughput and lower latency in end-to-end pipelines.

The CUDA path aligns VC-6’s architecture with the SIMT execution model of NVIDIA GPUs, minimizing inter-dependencies to maximize parallelism. A table in the source summarizes the architectural benefits of VC-6, underscoring selective resolution, ROI decoding, and selective recall as the core features that deliver AI-friendly efficiency. The CUDA implementation is designed to complement existing libraries and to serve workloads where selective LoQ/ROI decoding and GPU-resident data offer immediate advantages. In early testing on RTX PRO 6000 Blackwell Server Edition hardware with the DIV2K dataset (800 images), the approach demonstrated meaningful throughput gains over the CPU and OpenCL paths, especially in throughput mode, which benefits from larger, consolidated kernels.
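As a rough sketch of what this workflow could look like in practice: the class, method, and parameter names below (Decoder, decode(), mem_type) are assumptions introduced for illustration, since the alpha API surface may differ; only the wheel name vc6_cuda12 and the GPU_DEVICE memory type come from the source.

    # pip install vc6_cuda12   (pre-compiled wheel for CUDA 12)
    # Hypothetical API sketch: Decoder, decode(), and the mem_type keyword
    # are illustrative placeholders, not the published interface.
    import cupy as cp
    import torch
    import vc6_cuda12 as vc6   # assumed import name

    decoder = vc6.Decoder()    # codec object for decode/encode/transcode
    with open("frame.vc6", "rb") as f:
        bitstream = f.read()

    # Request GPU-resident output; the decoded image then exposes
    # __cuda_array_interface__ and never round-trips through host memory.
    image = decoder.decode(bitstream, mem_type="GPU_DEVICE")

    # Zero-copy views in CuPy and PyTorch via the CUDA array interface.
    arr = cp.asarray(image)
    tensor = torch.as_tensor(arr, device="cuda")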

How the CUDA implementation translates to performance

The architecture supports a chain of upsampling kernels across LoQs, reconstructing images progressively. At lower LoQs the grids are small and kernel-launch overhead dominates, while at higher resolutions each kernel carries denser per-pixel work. CUDA-specific optimizations, such as kernel fusion and graph-based execution, are expected to reduce overhead between successive kernels and better utilize GPU resources. Nsight Systems traces reveal that early decodes can underutilize the GPU when each upsample kernel launches only a small grid. Redirecting workloads to larger, more coalesced grids improves scheduler efficiency and enables multiple decodes to run in parallel without competing for kernel launches, which is particularly important when deploying multiple decoders in a single pipeline to sustain high data throughput.

In practice, the test setup used a pseudo-batch approach to simulate native batching by running several asynchronous single-image decoders in parallel across CPU threads and GPU streams. The results showed that moving from CPU/OpenCL to CUDA yields a clear performance uplift, with the GPU performing the heavy lifting on the residual decode and reconstruction steps while the CPU handles bitstream parsing and root-node work. The alpha stage demonstrates that even with a limited feature set, CUDA offers robust gains and a path toward further improvements as features mature.
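A pseudo-batch of this kind might be wired up as below. This is a sketch under assumptions: the vc6 decoder names are hypothetical placeholders (as in the earlier sketch), and the thread-plus-stream layout is one plausible way to realize the pattern the source describes, not the actual test harness.

    # Pseudo-batching sketch: several independent single-image decoders run
    # concurrently, each on its own CPU thread and CUDA stream, so kernel
    # launches from different images can overlap on the GPU.
    from concurrent.futures import ThreadPoolExecutor
    import cupy as cp
    import vc6_cuda12 as vc6   # assumed import name (see earlier sketch)

    NUM_WORKERS = 4
    decoders = [vc6.Decoder() for _ in range(NUM_WORKERS)]
    streams = [cp.cuda.Stream(non_blocking=True) for _ in range(NUM_WORKERS)]

    def decode_one(idx: int, bitstream: bytes):
        # Each slot owns a decoder and a stream; work submitted on distinct
        # streams can be scheduled concurrently by the GPU.
        slot = idx % NUM_WORKERS
        with streams[slot]:
            return decoders[slot].decode(bitstream, mem_type="GPU_DEVICE")

    bitstreams = [open(f"img_{i}.vc6", "rb").read() for i in range(8)]
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        images = list(pool.map(decode_one, range(len(bitstreams)), bitstreams))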

Technical details and implementation

VC-6’s design hinges on a hierarchical, multi-resolution approach that enables independent, parallel access to components such as color planes, echelons, or image tiles. Its two orthogonal hierarchies support concurrent processing and minimize serial dependencies, making the codec a strong match for GPUs’ SIMT model. The CUDA implementation leverages this structure to map decoding tasks to GPU threads and blocks in a way that preserves parallelism while reducing CPU-GPU synchronization. Specific features enabled by the CUDA port include:

  • Native CUDA workflows that maximize throughput for AI pipelines and reduce CPU copies.
  • A Python packaging model that publishes a pre-compiled wheel (vc6_cuda12) for easy installation and rapid experimentation with PyTorch, CuPy, and related frameworks.
  • The ability to obtain outputs as CUDA-accessible arrays, enabling seamless integration with GPU-accelerated frameworks and libraries.
  • Partial decoding capabilities, where a decoder can be instructed to read only the data required for a given LoQ or ROI.
  • A lightweight header inspection utility that reports the required size for a target LoQ, enabling efficient preallocation and I/O planning (see the sketch after this list).

From a developer perspective, the VC-6 CUDA path provides a practical building block for accelerating data pipelines today. It offers an AI-native alternative to traditional CPU/OpenCL-based workflows and aligns with the needs of high-throughput vision AI workloads, where data ingestion, decoding, and tensor preparation must keep pace with powerful neural networks.
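A sketch of the preflight-plus-partial-decode flow follows. The names peek_header(), size_for_loq(), and the loq/roi keywords are assumptions introduced for illustration; the source confirms the capabilities (header peeking, LoQ size reporting, ROI decode) but not these exact names.

    # Preflight and partial decode, sketched with hypothetical names:
    # peek_header(), size_for_loq(), and the loq/roi keywords are
    # illustrative, not the published API.
    import vc6_cuda12 as vc6   # assumed import name

    with open("frame.vc6", "rb") as f:
        head = f.read(4096)               # leading bytes only, not the file

    info = vc6.peek_header(head)          # hypothetical header inspection
    target_loq = 2                        # e.g. two power-of-two steps down
    nbytes = info.size_for_loq(target_loq)  # preallocation / I/O planning

    # Fetch just the bytes the target LoQ needs, then decode a region.
    with open("frame.vc6", "rb") as f:
        payload = f.read(nbytes)

    decoder = vc6.Decoder()
    tile = decoder.decode(payload, loq=target_loq, roi=(0, 0, 512, 512))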

Key takeaways

  • VC-6 is purpose-built to exploit modern GPU architectures through a multi-resolution, hierarchical design that enables selective decoding and data recall.
  • The CUDA-accelerated path brings native CUDA optimizations to VC-6, delivering significant performance improvements over CPU and OpenCL in early testing and enabling efficient integration with popular AI tooling.
  • Partial decoding, ROI-focused access, and selective data recall dramatically reduce I/O and memory bandwidth requirements, helping data loaders scale batch sizes and throughput in AI pipelines.
  • A pre-compiled Python wheel and __cuda_array_interface__ support make it straightforward to experiment with VC-6 in PyTorch, CuPy, and related ecosystems.
  • The ongoing alpha work aims to increase native batching, optimize kernel execution, and further minimize CPU-GPU synchronization overhead.

FAQ

  • What is VC-6 and why is it relevant for vision AI pipelines?

    VC-6 is an image and video codec standardized as SMPTE ST 2117-1, designed for direct, efficient interaction with modern compute architectures, with a focus on hierarchical, multi-resolution decoding and selective data recall to accelerate AI workloads. The CUDA-accelerated path maps this architecture to GPUs to reduce data movement and improve throughput.

  • How do I install and start using VC-6 with CUDA?

    The CUDA package vc6_cuda12 is distributed as a pre-compiled Python wheel, enabling installation via pip and quick creation of codec objects for encoding, decoding, and transcoding. The decoder can expose the CUDA array interface for integration with libraries like PyTorch and CuPy.

  • Can VC-6 decode only part of a file or image? How does ROI work?

    Yes. Partial decoding and region-of-interest support allow the decoder to read and process only the data required for a given LoQ or ROI, reducing I/O and memory usage. A utility is provided to report the required size for the target LoQ, enabling efficient data access planning.

  • What are the current limitations and future directions of the CUDA path?

    The CUDA path is currently in alpha, with native batching and further optimizations on the roadmap. Ongoing work focuses on improving throughput, reducing kernel launch overhead, and expanding GPU-side processing to better exploit CUDA graphs and kernel fusion.
