What’s New in CUDA Toolkit 13.0 for Jetson Thor: Unified Arm Ecosystem and More
Source: https://developer.nvidia.com/blog/whats-new-in-cuda-toolkit-13-0-for-jetson-thor-unified-arm-ecosystem-and-more/ (NVIDIA Developer Blog)
TL;DR
- CUDA Toolkit 13.0 unifies the toolchain for Arm platforms, bringing server-class and embedded devices under a single CUDA runtime and toolkit (Orin remains on its current path for now).
- Jetson Thor gains Unified Virtual Memory with full coherence: the GPU can seamlessly access pageable host memory, including mmap()-backed buffers, and cudaMallocManaged() allocations now support concurrent CPU-GPU access.
- GPU sharing and resource-management features improve multi-process workload efficiency: Multi-Process Service (MPS), green contexts for deterministic SM allocation, and forthcoming MIG support.
- Developer tooling and interoperability expand with nvidia-smi, NVML, OpenRM, and dmabuf-based memory sharing, plus NUMA support on Tegra.
Context and background
CUDA Toolkit 13.0 for Jetson Thor is built on the NVIDIA Blackwell GPU architecture and introduces a unified CUDA toolkit for Arm platforms. This unification eliminates the need to maintain separate toolchains for SBSA-compliant server-class systems and next-generation embedded systems like Thor, with Orin kept on its current path for now. The move unlocks significant productivity gains by enabling a single binary build to run across simulation environments (e.g., GB200, DGX Spark) and target embedded hardware without code changes. The unification also extends to containers, consolidating image ecosystems and reducing CI overhead while improving portability across evolving GPU generations and platforms. Unified tooling and containers pave the way for concurrent use of integrated and discrete GPUs on Jetson and IGX platforms, delivering a smoother and more efficient edge-computing experience.
What’s new
CUDA 13.0 introduces several major capabilities for Jetson Thor and Arm-based systems:
- Unified Arm toolkit: A single CUDA toolkit across server-class and embedded devices; Orin remains on its current path for now.
- Unified Virtual Memory (UVM) with full coherence: Jetson Thor now supports UVM with host-page-table access to pageable memory; GPU-side accesses are cached on the GPU, and coherence is maintained by the hardware interconnect.
- Memory model enhancements: pageable host memory is accessible to the GPU through host page tables; allocations via cudaMallocManaged() report concurrentManagedAccess = 1, enabling CPU-GPU concurrency without explicit copies (though CUDA allocations are not GPU-cached).
- Memory interoperability example: a file mapped with mmap() can be used directly in a GPU kernel; both input and output buffers can be mmap()-backed and cached in L2, avoiding explicit cudaMemcpy() calls (see the sketch after this list).
- GPU sharing and scheduling: Multi-Process Service (MPS) enables multiple processes to share the GPU concurrently, consolidating workloads into a single context to improve occupancy and throughput without app-code changes.
- Green contexts and future MIG: lightweight pre-allocated CUDA contexts (green contexts) provide deterministic SM allocations; used with MPS, they help preserve latency guarantees in multi-process workloads. MIG (Multi-Instance GPU) is referenced as an upcoming feature to partition the GPU into isolated slices.
- Developer tooling enhancements: nvidia-smi and NVIDIA Management Library (NVML) support bring improved GPU visibility and control on Jetson Thor; basic metrics are available, with some features (clock, power, thermal queries; per-process utilization; SoC memory monitoring) not yet available.
- OpenRM and dmabuf interoperability: CUDA can import and export dmabuf memory via the CUDA External Resource Interoperability API; dmabuf-based sharing is available on OpenRM platforms, with imports using CUDA external memory types and exports via cuMemGetHandleForAddressRange(). Support can be checked with cuDeviceGetAttribute(CU_DEVICE_ATTRIBUTE_HOST_ALLOC_DMA_BUF_SUPPORTED).
- NUMA support for Tegra: NUMA awareness enables explicit memory-placement control on multi-socket systems while remaining compatible with single-socket configurations, streamlining development.
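To make the memory interoperability example concrete, here is a minimal sketch, not code from the post: the file name, histogram kernel, and launch configuration are assumptions, and it presumes a device such as Jetson Thor that reports pageableMemoryAccessUsesHostPageTables = 1.

```cpp
// Hedged sketch: use an mmap()-backed file and a malloc()-backed output buffer
// directly from a GPU kernel, with no cudaMemcpy() staging.
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

__global__ void histogram(const uint8_t* data, size_t n, unsigned int* bins) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[data[i]], 1u);
}

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.pageableMemoryAccessUsesHostPageTables) {
        printf("GPU cannot walk host page tables on this platform\n");
        return 1;
    }

    // Map the input file; the pointer refers to pageable host memory.
    int fd = open("input.bin", O_RDONLY);              // hypothetical input file
    struct stat st{};
    fstat(fd, &st);
    size_t n = (size_t)st.st_size;
    auto* data = (uint8_t*)mmap(nullptr, n, PROT_READ, MAP_PRIVATE, fd, 0);

    // Plain malloc()/calloc() memory is likewise directly visible to the GPU.
    auto* bins = (unsigned int*)calloc(256, sizeof(unsigned int));

    histogram<<<(unsigned)((n + 255) / 256), 256>>>(data, n, bins);
    cudaDeviceSynchronize();

    printf("bin[0] = %u\n", bins[0]);
    munmap(data, n); close(fd); free(bins);
    return 0;
}
```

Because the GPU walks the host page tables, no cudaHostRegister() or staging copy is needed; the kernel reads the mapped file and writes the heap-allocated bins in place.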
Why it matters (impact for developers/enterprises)
The unification of the CUDA toolkit for Arm platforms simplifies development, testing, and deployment pipelines. Developers can build once, simulate on high-performance systems, and deploy the exact same binary to embedded targets such as Jetson Thor without code changes, reducing duplication and CI overhead. A single source of truth for builds across simulation and edge platforms improves consistency and portability as GPU generations evolve. UVM with full coherence enables direct use of system memory on the GPU, simplifying data workflows, while MPS, green contexts, and MIG offer avenues to improve GPU utilization, occupancy, and real-time performance in robotics and other edge workloads.
Technical details or Implementation
- Unified toolchain and containers: The CUDA 13.0 approach unifies the toolkit across SBSA server-class devices and embedded Thor devices; the only exception is Orin (sm_87).
- Unified Virtual Memory (UVM) on Jetson: Jetson Thor exposes pageable host memory via host page tables to the GPU, with cudaDeviceProp::pageableMemoryAccessUsesHostPageTables = 1. Memory allocated via mmap() or malloc() can be used directly by the GPU; cudaMallocManaged() allocations report concurrentManagedAccess = 1, supporting CPU-GPU concurrency. CUDA allocations are not GPU-cached under 13.0.
- Example workflow: a file mapped with mmap() can be used as input to a GPU kernel (e.g., histogram) with the output mapped similarly, all staying in GPU-L2 cache without separate CUDA allocations.
- MPS (Multi-Process Service): The service consists of two binaries, nvidia-cuda-mps-control and nvidia-cuda-mps-server. To use it, start the control daemon, run applications as MPS clients (sharing the pipe and log directories), and stop the daemon when finished; client processes then share the GPU within a single context, improving occupancy and throughput. The CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable limits the fraction of SMs available to each client.
- Green contexts and SM allocation: Green contexts pre-assign SM resources to CUDA contexts to ensure deterministic execution (a driver-API sketch follows this list). When used with MPS, multiple processes can run concurrently with non-overlapping SM allocations, provided CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is set appropriately. Example usage includes combining green contexts with MIG slices for real-time robotics workloads (e.g., SLAM, object detection, motion planning).
- MIG (Multi-Instance GPU): Mentioned as an upcoming feature to partition the GPU into isolated slices, enabling deterministic performance within each slice.
- NVIDIA tools and monitoring: nvidia-smi and NVML integration provides device details, utilization, and basic capability information on Jetson Thor, though some deeper queries (clock/power/thermal, per-process utilization, SoC memory monitoring) are not yet exposed; a minimal NVML query is sketched below.
- OpenRM and dmabuf interoperability: CUDA can import dmabuf memory through the External Resource Interoperability API and treat it as a CUDA external memory type; export is possible via cuMemGetHandleForAddressRange() on supported OpenRM platforms (see the export sketch below). Applications can check support using cuDeviceGetAttribute(CU_DEVICE_ATTRIBUTE_HOST_ALLOC_DMA_BUF_SUPPORTED).
- NUMA support: Tegra NUMA support enables explicit memory placement optimization for multi-socket systems, improving performance and compatibility when porting NUMA-aware apps from dGPU platforms (see the placement sketch below).
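As an illustration of green contexts, the following driver-API sketch carves a fixed group of SMs out of device 0 and binds subsequent work to it. This is a minimal sketch, assuming the cuGreenCtx*/cuDevResource* entry points from recent CUDA releases; the SM count of 8 is arbitrary and error handling is omitted.

```cpp
// Hedged sketch: create a green context over a fixed SM slice (CUDA driver API).
// Build with: nvcc green_ctx.cpp -lcuda   (file name is illustrative)
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // Query the device's SM resource and split off one group of at least 8 SMs,
    // leaving the remainder for other contexts or processes.
    CUdevResource smResource, split, remaining;
    cuDeviceGetDevResource(dev, &smResource, CU_DEV_RESOURCE_TYPE_SM);

    unsigned int nbGroups = 1;
    cuDevSmResourceSplitByCount(&split, &nbGroups, &smResource, &remaining,
                                0 /*useFlags*/, 8 /*minCount*/);

    // Describe the carved-out resource and create a green context from it.
    CUdevResourceDesc desc;
    cuDevResourceGenerateDesc(&desc, &split, 1);

    CUgreenCtx greenCtx;
    cuGreenCtxCreate(&greenCtx, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM);

    // Work submitted while this context is current is confined to the slice.
    // Under MPS, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE can cap each client's share
    // so that several processes hold non-overlapping SM allocations.
    CUcontext ctx;
    cuCtxFromGreenCtx(&ctx, greenCtx);
    cuCtxSetCurrent(ctx);
    // ... launch kernels here ...

    cuGreenCtxDestroy(greenCtx);
    return 0;
}
```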
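A minimal NVML query along the lines of the monitoring support described above; the specific metrics shown are assumptions, and as the post notes, some queries are not yet exposed on Thor, so return codes should be checked.

```cpp
// Hedged sketch: basic NVML device query (link with -lnvidia-ml).
#include <nvml.h>
#include <cstdio>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;

    nvmlDevice_t device;
    nvmlDeviceGetHandleByIndex_v2(0, &device);

    char name[NVML_DEVICE_NAME_BUFFER_SIZE];
    nvmlDeviceGetName(device, name, sizeof(name));
    printf("GPU 0: %s\n", name);

    // Deeper queries (clocks, power, thermals, per-process utilization,
    // SoC memory) may return NVML_ERROR_NOT_SUPPORTED on Thor today.
    nvmlMemory_t mem{};
    if (nvmlDeviceGetMemoryInfo(device, &mem) == NVML_SUCCESS)
        printf("memory used: %llu / %llu bytes\n",
               (unsigned long long)mem.used, (unsigned long long)mem.total);

    nvmlShutdown();
    return 0;
}
```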
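For the dmabuf path, a hedged sketch of the support check and the export call is below; which allocation types can actually be exported on a given platform is an assumption here (the attribute name suggests host allocations), so treat it as illustrative rather than a reference implementation.

```cpp
// Hedged sketch: check dmabuf support, then export a host allocation's address
// range as a dmabuf file descriptor (error handling elided).
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    int dmabufSupported = 0;
    cuDeviceGetAttribute(&dmabufSupported,
                         CU_DEVICE_ATTRIBUTE_HOST_ALLOC_DMA_BUF_SUPPORTED, dev);
    if (!dmabufSupported) { printf("dmabuf export not supported\n"); return 0; }

    // Allocation type is an assumption; host-pinned memory matches the
    // HOST_ALLOC attribute queried above.
    void* ptr = nullptr;
    size_t size = 1 << 20;
    cudaMallocHost(&ptr, size);

    int dmabufFd = -1;
    if (cuMemGetHandleForAddressRange(&dmabufFd, (CUdeviceptr)(uintptr_t)ptr, size,
                                      CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, 0)
            == CUDA_SUCCESS)
        printf("exported dmabuf fd: %d\n", dmabufFd);  // share with other drivers/APIs

    cudaFreeHost(ptr);
    return 0;
}
```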
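Lastly, a hedged sketch of explicit NUMA placement combined with Thor's pageable-memory access; it uses libnuma rather than a CUDA API for placement, and the node index and buffer size are arbitrary choices.

```cpp
// Hedged sketch: place a buffer on a chosen NUMA node with libnuma, then let
// the GPU access it directly as pageable memory (build with: nvcc ... -lnuma).
#include <cuda_runtime.h>
#include <numa.h>
#include <cstdio>

__global__ void scale(float* data, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;
}

int main() {
    if (numa_available() < 0) { printf("libnuma not available\n"); return 1; }

    // Explicit placement on node 0; single-socket systems simply have one node.
    const int n = 1 << 20;
    float* data = (float*)numa_alloc_onnode(n * sizeof(float), 0);
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // Pageable access lets the GPU read and write this allocation in place on Thor.
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();
    printf("data[0] = %.1f\n", data[0]);

    numa_free(data, n * sizeof(float));
    return 0;
}
```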
Key takeaways
- CUDA 13.0 unifies the Arm toolkit for Jetson Thor, reducing toolchain and container fragmentation.
- Unified Virtual Memory and full coherence enable seamless CPU-GPU memory sharing on Jetson Thor.
- MPS, green contexts, and forthcoming MIG support improve multi-process GPU utilization and real-time determinism.
- OpenRM/dmabuf interoperability and NUMA support extend interoperability and performance across platforms.
- nvidia-smi and NVML support on Jetson Thor provide visibility and management capabilities for edge deployments.
FAQ
- What is the unified toolchain in CUDA 13.0 for Jetson Thor?
  It unifies the CUDA toolkit for Arm platforms, removing the need for separate toolchains for server-class SBSA servers and embedded Thor devices (Orin is an exception for now).
- How does Unified Virtual Memory affect development on Jetson Thor?
  UVM provides full coherence between host and device memory, allowing pageable host memory to be accessed by the GPU via host page tables, and enabling concurrent CPU-GPU memory usage with cudaMallocManaged() while not caching those allocations on the GPU.
- What tools help manage and monitor GPU usage on Jetson Thor?
  The NVIDIA tooling includes nvidia-smi and NVML for querying device details and usage metrics, with some advanced queries not yet available.
- What interoperability options exist for memory sharing?
  OpenRM with dmabuf support enables importing and exporting CUDA memory as dmabuf file descriptors via the CUDA External Resource Interoperability API.
- What role do MPS, green contexts, and MIG play in this release?
  MPS enables multiple processes to share the GPU concurrently; green contexts pre-allocate SMs for deterministic execution; MIG is referenced as an upcoming feature to partition the GPU into isolated slices for real-time workloads.