How to Improve CUDA Kernel Performance with Shared Memory Register Spilling (CUDA 13.0)
Source: https://developer.nvidia.com/blog/how-to-improve-cuda-kernel-performance-with-shared-memory-register-spilling/ (NVIDIA Developer Blog)
Overview
When a CUDA kernel requires more hardware registers than are available, the compiler spills the excess values to local memory, which resides in device global memory. Register spilling hurts performance because local memory accesses carry higher latency and bandwidth costs and add pressure on the L2 cache. CUDA 13.0 introduces a new optimization: PTXAS can now spill registers to shared memory. When the feature is enabled, the compiler spills into available shared memory first and falls back to local memory only when shared memory is insufficient, preserving correctness. Keeping spilled data in on-chip, low-latency shared memory holds it closer to the SM, complements the larger L1/L2 caches, and can yield notable gains in kernels with high register pressure, such as tight loops or frequently executed sections; previously, spills were always directed to local memory. Evaluated across CUDA kernels from the QUDA lattice QCD library, the feature showed typical gains in the 5–10% range, attributed to reduced or eliminated spills to local memory. The optimization is not available in toolkits prior to CUDA 13.0 and requires an explicit opt-in by developers targeting CUDA 13.0+.
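For context, register spilling can be observed in the compiler's resource usage report. The sketch below is an illustration, not code from the source: it shows the shape of a kernel with a large per-thread working set, and compiling it with nvcc's -Xptxas -v option prints register usage together with spill store and load byte counts, which indicates whether the kernel spills on a given architecture.
// Illustrative only: a kernel with a large per-thread working set that can spill.
// Compile with, e.g., nvcc -arch=sm_90 -Xptxas -v spill_demo.cu and look for the
// "bytes spill stores" / "bytes spill loads" counts in the ptxas resource report.
extern "C" __global__ void spill_prone_kernel(const float* in, float* out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;
    // A large per-thread array raises register pressure; if it exceeds the
    // available registers, the compiler spills the excess values.
    float acc[32];
    for (int i = 0; i < 32; ++i) {
        acc[i] = in[idx] * (float)(i + 1);
    }
    float sum = 0.0f;
    for (int i = 0; i < 32; ++i) {
        sum += acc[i] * acc[(i * 7) % 32];  // irregular reuse keeps many values live
    }
    out[idx] = sum;
}
Whether this particular kernel actually spills depends on the target architecture; the point is that the -Xptxas -v report makes spilling visible before and after any optimization.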
What changes in the compiler
In CUDA 13.0, PTXAS introduces an opt-in mechanism that redirects spills to shared memory when space permits. When enabled, the compiler spills to shared memory first and uses local memory only once the shared memory budget is exhausted. A kernel compiled with this feature can show tens of kilobytes of additional per-block shared memory in the compiler's resource usage output (for example, one demonstration kernel reports 46080 bytes of shared memory). The mechanism aims to keep spilled data close to the SM and to reduce L2 eviction pressure in performance-critical regions. The optimization is activated by inserting the PTX pragma enable_smem_spilling via inline assembly inside the function, directly after the function declaration; it is valid only at function scope. It is not recommended when launch bounds are not explicitly specified, because the compiler may then misestimate how many blocks will be resident per SM, and the shared memory reserved for spills could limit concurrency or reduce occupancy. When launch bounds are clearly defined and shared memory usage is stable, spilling registers to shared memory can improve performance; a sketch with explicit launch bounds follows.
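The sketch below shows how the opt-in might be combined with explicit launch bounds, in line with the guidance above. The kernel, its launch bounds values, and the exact inline assembly string are assumptions for illustration, not verbatim from the source.
// Hypothetical example: explicit launch bounds plus the shared memory spilling opt-in.
// __launch_bounds__(256, 2) asks the compiler to plan for 256-thread blocks with at
// least 2 resident blocks per SM, which helps it size the shared memory spill budget.
__global__ void __launch_bounds__(256, 2) spill_optin_kernel(const float* in, float* out, int n) {
    asm volatile(".pragma \"enable_smem_spilling\";");  // assumed opt-in form (CUDA 13.0+)
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;
    // ... register-heavy computation; excess registers may now spill to shared memory ...
    out[idx] = in[idx];
}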
Key features
- PTXAS support for spilling registers to shared memory in CUDA kernels (CUDA 13.0+).
- Spills prioritized to on‑chip shared memory; fallback to local memory if space runs out.
- Potential performance gains from reduced latency and less L2 contention in high‑pressure regions.
- Opt‑in via PTX inline assembly: enable_smem_spilling, placed after the function declaration.
- Valid only within a function scope; guidance favors kernels with explicit launch bounds.
- Demonstrated gains around 5–10% in workloads like QUDA lattice QCD kernels.
- Not available in CUDA toolkits prior to 13.0; requires CUDA 13.0 or newer.
- Most beneficial when register pressure is high and a portion of per-block shared memory would otherwise go unused, since that unused space can hold the spilled values.
Common use cases
- Kernels with high register pressure and tight loops where local memory spills were previously a bottleneck.
- Workloads where a significant portion of per‑block shared memory remains unused, making it feasible to allocate spilled data in on‑chip memory.
- Code paths where occupancy is constrained by register usage rather than shared memory requirements (see the occupancy sketch after this list).
- Applications with well‑defined launch bounds and predictable shared memory usage that can tolerate the extra per‑block shared memory footprint.
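Before opting in, it can help to confirm whether a kernel is actually limited by registers rather than by shared memory. The following is a minimal sketch using standard CUDA runtime queries; the kernel is a trivial stand-in and the block size of 256 is an illustrative choice, not guidance from the source.
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder standing in for the register-heavy kernel being analyzed.
__global__ void candidate_kernel(const float* in, float* out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) out[idx] = in[idx] * 2.0f;
}

int main() {
    cudaFuncAttributes attr{};
    cudaFuncGetAttributes(&attr, candidate_kernel);   // per-kernel registers, static shared memory, local memory
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, candidate_kernel,
                                                  /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("regs/thread: %d, static smem/block: %zu bytes, local mem/thread: %zu bytes, max blocks/SM @256: %d\n",
           attr.numRegs, attr.sharedSizeBytes, attr.localSizeBytes, maxBlocksPerSM);
    return 0;
}
A nonzero local memory figure per thread is a hint that the kernel already spills, which is the situation the shared memory spilling opt-in targets.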
Setup & installation (exact commands)
- Ensure you are using CUDA 13.0 or later.
- Opt in by inserting the PTX pragma enable_smem_spilling via inline assembly directly after the kernel's function declaration. The pragma is valid only within function scope and is recommended only when launch bounds are explicitly defined. Note: the source specifies this opt-in method and its constraints but does not provide installation commands or toolchain steps beyond requiring the CUDA 13.0 toolkit.
// Enable shared memory spilling (CUDA 13.0+)
// Place the opt-in directly after the kernel's function declaration (function scope only)
// PTX pragma: enable_smem_spilling, emitted via inline assembly
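Because spilled registers can only occupy shared memory that the kernel leaves unused, it can also be worth checking the device's shared memory limits up front. The following minimal sketch uses standard CUDA runtime attribute queries; it is illustrative and not part of the source's instructions.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int smemPerBlock = 0, smemPerSM = 0;
    const int device = 0;  // query the first GPU (illustrative choice)
    cudaDeviceGetAttribute(&smemPerBlock, cudaDevAttrMaxSharedMemoryPerBlock, device);
    cudaDeviceGetAttribute(&smemPerSM, cudaDevAttrMaxSharedMemoryPerMultiprocessor, device);
    printf("max shared memory per block: %d bytes, per SM: %d bytes\n", smemPerBlock, smemPerSM);
    return 0;
}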
Quick start (minimal runnable example)
Below are two minimal kernel sketches to illustrate the idea conceptually. The first shows a simple kernel without any spilling opt‑in. The second indicates where and how the spilling opt‑in would be applied in code, in line with the CUDA 13.0 guidance.
// Minimal kernel (baseline, no spilling opt‑in)
extern "C" __global__ void demo_kernel(float* in, float* out) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
float v = in[idx];
out[idx] = v * 2.0f;
}
// Minimal kernel (conceptual: spilling opt-in placed directly after the function declaration)
// The inline assembly form below is an assumption based on the description of the
// enable_smem_spilling pragma; check the exact syntax against the CUDA 13.0 documentation.
extern "C" __global__ void demo_kernel_smem_spill(float* in, float* out) {
    asm volatile(".pragma \"enable_smem_spilling\";");  // enable shared memory spilling (CUDA 13.0+)
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    float v = in[idx];
    out[idx] = v * 2.0f;
}
Note: The source describes the opt-in as a PTX pragma, enable_smem_spilling, inserted via inline assembly directly after the function declaration. The snippets above illustrate placement and intent; verify the exact directive syntax against the CUDA 13.0 documentation rather than treating it as verbatim.
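To exercise either kernel end to end, a minimal host-side driver along the following lines can be used; the element count, placeholder input, and launch configuration are illustrative choices, not values from the source.
#include <cuda_runtime.h>

// Host-side driver; assumes demo_kernel from the baseline sketch above is defined in the same file.
int main() {
    const int n = 1 << 20;                      // element count (illustrative, multiple of the block size)
    const size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_in, 0, bytes);                 // placeholder input data
    const int block = 256;
    const int grid = (n + block - 1) / block;
    demo_kernel<<<grid, block>>>(d_in, d_out);  // launch the baseline kernel
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}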
Pros and cons
- Pros
- Potential performance gains (typical 5–10%) in register‑pressure kernels by reducing spills to local memory.
- Keeps spilled data in on‑chip shared memory, reducing latency and L2 pressure when space is available.
- Provides a targeted optimization path for kernels with well‑defined launch bounds and stable shared memory usage.
- Cons
- Requires explicit opt‑in; not automatic for all kernels.
- Not applicable if per‑block shared memory is already saturated or if occupancy is not limited by register pressure.
- Misestimation of blocks per SM can occur if launch bounds are not specified, potentially reducing occupancy.
- Only available in CUDA 13.0 and later; older toolchains do not support shared memory register spilling.
Alternatives (brief comparisons)
| Approach | Pros | Cons |
|---|---|---|
| Shared memory register spilling (CUDA 13.0+) | Reduces local memory spills; leverages on-chip memory | Requires explicit opt-in; depends on available shared memory per block |
| Spilling to local memory (pre-CUDA 13.0 default) | Simple default behavior; no changes needed | Higher latency; potential L2 eviction; may reduce occupancy in high-pressure regions |
| Code optimization to lower register pressure | Can reduce spills across all toolchains; broader applicability | May require substantial refactoring; not always feasible |
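As an illustration of the last alternative, register pressure can sometimes be capped at compile time, for example with __launch_bounds__ or the nvcc option -maxrregcount. The sketch below is a generic example, not a recommendation from the source, and capping registers too aggressively can itself force spills.
// Illustrative only: constraining register usage so the compiler targets a given occupancy.
// Compile with, e.g., nvcc -maxrregcount=64 file.cu to cap registers globally,
// or use __launch_bounds__ per kernel as shown here.
__global__ void __launch_bounds__(256, 4) bounded_kernel(const float* in, float* out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;
    // The (256, 4) hint asks the compiler to keep register usage low enough for at
    // least 4 blocks of 256 threads per SM; too tight a bound can reintroduce spills,
    // so it should be tuned with -Xptxas -v feedback.
    out[idx] = in[idx] * 2.0f;
}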
Pricing or License
Pricing or license information is not provided in the referenced source. The feature is described as part of CUDA 13.0 tooling and documentation.