Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2

Source: https://developer.nvidia.com/blog/improving-gemm-kernel-auto-tuning-efficiency-on-nvidia-gpus-with-heuristics-and-cutlass-4-2/ (NVIDIA Developer Blog)

TL;DR

  • nvMatmulHeuristics provides fast heuristics for GEMMs to predict a small set of high-potential kernel configurations.
  • Integrated with CUTLASS 4.2, it dramatically shortens end-to-end kernel tuning from exhaustive searches to targeted candidate evaluation.
  • Early results show substantial time savings: tuning over 16 configurations achieves about 96% of peak performance in roughly 150 minutes, vs. over 700 minutes for exhaustive search (Llama 3 405B training on an H100 SXM GPU).
  • For DeepSeek-R1 671B on a B200 GPU, 8 configurations reach about 99% of exhaustive-search performance with more than 5x speedup in build and auto-tuning time. This enables practical JIT-like workflows in DL frameworks.
  • The approach favors cluster sizes that are statically known at compile time, which can yield more efficient kernels than the dynamic cluster sizing used by some precompiled kernels.
  • Together, these findings illustrate how a good heuristic can push performance toward the best possible kernel while dramatically reducing tuning costs. NVIDIA Developer Blog

Context and background

Selecting the best General Matrix Multiplication (GEMM) kernel for a given problem and hardware is a complex optimization problem. GEMM performance depends on a wide array of compile-time and runtime meta-parameters, including CTA-, warp-, and instruction-level tile sizes, kernel schedules, rasterization strategies, cluster dimensions, and split-k factors. Historically, finding the optimal kernel meant generating thousands of candidate configurations, compiling them, and running exhaustive auto-tuning. This brute-force workflow can take many hours and is a barrier to adoption in offline-compiled libraries like CUTLASS, as well as in JIT-compiled ecosystems such as Torch Inductor or OpenAI Triton, where fast model compilation is critical. The friction often leads users to settle for suboptimal kernels rather than pay for long tuning runs.

The NVIDIA blog introduces nvMatmulHeuristics, a GPU kernel meta-parameter optimization module designed to deliver a small, high-potential set of kernel configurations for a given GEMM problem and hardware. The module analyzes the operation's parameters and the target hardware's capabilities, then outputs a concise list of configurations with near-peak potential. The integration into CUTLASS 4.2 aims to turn kernel generation and tuning into a faster, more predictable process. The module is described as a core part of the cuBLAS heuristics and is available in early access for general use, alongside the CUTLASS integration. NVIDIA Developer Blog

The workflow shifts away from brute force toward targeted exploration. Users prepare a list of GEMM problems in JSON format, build CUTLASS with the heuristics-related CMake options, and then run auto-tuning on the small set of configurations emitted per problem. A generated CSV test list can drive automated profiling or be consumed by cutlass_profiler to reproduce results. The approach relies on locked clocks for consistent profiling results and is designed to slot into existing workflows and libraries. The feature is part of the CUTLASS ecosystem and is demonstrated with FP16 I/O and FP32 compute (HSH) examples in the documentation. NVIDIA Developer Blog
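To make the workflow concrete, here is a minimal Python sketch that writes such a problem list. The exact JSON schema expected by the heuristics-enabled build is defined in NVIDIA's documentation; the field names used here (m, n, k, layout, dtype) are illustrative assumptions, not the documented format.

    # Sketch: emit a GEMM problem list as JSON for the heuristics-enabled
    # CUTLASS build. Field names are illustrative assumptions; consult the
    # nvMatmulHeuristics documentation for the actual schema.
    import json

    problems = [
        # One entry per GEMM to tune: an FP16 I/O, FP32 compute (HSH)
        # problem in "tnn" layout (A transposed, B and C/D not).
        {"m": 8192, "n": 8192, "k": 8192, "layout": "tnn", "dtype": "HSH"},
        {"m": 4096, "n": 8192, "k": 16384, "layout": "tnn", "dtype": "HSH"},
    ]

    with open("gemm_problems.json", "w") as f:
        json.dump(problems, f, indent=2)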

What’s new

nvMatmulHeuristics introduces a practical, heuristic-driven workflow for GEMM kernel selection within CUTLASS. Rather than enumerating thousands of kernels, users can query a curated set of top-N configurations predicted by the heuristic model. The workflow for CUTLASS users typically follows these steps: prepare a JSON file listing GEMM problems, build CUTLASS with the library heuristics enabled, and specify -DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE= (pointing at the problem list) and -DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=N, where N is the number of configurations nvMatmulHeuristics will emit for each GEMM; a configure sketch follows below. The build step produces a CSV test list that enumerates all test cases required to run auto-tuning over the emitted configurations. This CSV can be consumed by custom benchmarking code or by cutlass_profiler to run the configurations out of the box. For consistent results, clocks should be locked during profiling. The feature ships with CUTLASS 4.2, with nvMatmulHeuristics itself available in early access. NVIDIA Developer Blog
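A configure step along these lines might look as follows. This is a sketch with placeholder source and build paths; only the two -D flags are taken from the post.

    # Sketch: configure CUTLASS with the heuristics integration, asking
    # nvMatmulHeuristics for 8 candidate configurations per GEMM problem.
    # Paths are placeholders; the -D flags are those described in the post.
    import subprocess

    subprocess.run(
        [
            "cmake",
            "-S", "cutlass",   # CUTLASS source checkout (placeholder path)
            "-B", "build",     # build directory (placeholder path)
            "-DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE=gemm_problems.json",
            "-DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=8",
        ],
        check=True,
    )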

An FP16 example and practical performance

For an FP16 input/output with FP32 compute (HSH) scenario, nvMatmulHeuristics can emit eight configurations. These eight kernels can be compiled and benchmarked, and the best among them chosen for deployment, often landing very close to the best kernel an exhaustive search would find. In the published results, the eight emitted configurations, built with statically known cluster sizes, reached 104% of a dynamic-cluster baseline on the same GEMMs, illustrating how focusing build and profiling effort on a small, curated set can reach peak-like performance at far lower cost than exhaustive tuning. NVIDIA Developer Blog
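The selection step itself is simple: benchmark each emitted configuration and keep the fastest. Below is a minimal sketch, assuming timings have already been collected into a mapping from kernel name to measured runtime; the names and numbers are made-up placeholders.

    # Sketch: pick the fastest of the N candidate kernels. Kernel names and
    # timings are placeholders for whatever your profiling run emits.
    measured_ms = {
        "cutlass_gemm_config_0": 1.42,
        "cutlass_gemm_config_1": 1.31,
        "cutlass_gemm_config_2": 1.55,
        # ... one entry per emitted configuration
    }

    best_kernel = min(measured_ms, key=measured_ms.get)
    print(f"deploy {best_kernel}: {measured_ms[best_kernel]:.2f} ms")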

Empirical results and takeaways

Two representative workloads demonstrate the impact of nvMatmulHeuristics. First, a Llama 3 405B training workload on an NVIDIA H100 SXM GPU showed that exhaustive search could require more than 700 minutes to identify the optimal kernel, while a search restricted to 16 candidate kernels reached 96% of peak performance in roughly 150 minutes, a dramatic reduction in total tuning time for only a modest drop from peak.

Second, a DeepSeek-R1 671B training workload on an NVIDIA B200 GPU showed that a small candidate set can reach 99% of exhaustive-search performance with more than a 5x speedup in build and auto-tuning time. In this case, the baseline used dynamic cluster sizes, whereas the recommended nvMatmulHeuristics configurations were built with cluster sizes statically known at compile time, resulting in 104% of the baseline performance for these GEMMs. These results illustrate how heuristics can push toward peak performance while dramatically reducing tuning costs, enabling practical deployment in framework-driven or offline compilation scenarios. NVIDIA Developer Blog
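As a quick sanity check on the reported numbers, a tiny Python sketch using only the figures quoted above:

    # Back-of-the-envelope check of the reported Llama 3 405B / H100 figures:
    # 16-candidate tuning finished in ~150 min vs. >700 min exhaustively.
    exhaustive_min, heuristic_min = 700, 150
    print(f"tuning time reduced by ~{exhaustive_min / heuristic_min:.1f}x "
          f"(at about 96% of peak performance)")  # ~4.7x, a lower bound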

Workflow implications for developers and enterprises

By focusing the tuning process on a small, high-potential set of kernels, nvMatmulHeuristics reduces the end-to-end time to deploy high-performance GEMMs in production systems. This makes it feasible to incorporate highly-optimized kernels into JIT-compiled stacks and offline libraries, accelerating development cycles for deep learning frameworks, compilers, and kernel libraries. It also enables more predictable performance outcomes across hardware generations by leveraging hardware-aware heuristics embedded in the workflow. The integration into CUTLASS 4.2 further streamlines adoption for developers already using CUTLASS and cuBLAS heuristics as part of their stack. NVIDIA Developer Blog

Technical details and implementation

The nvMatmulHeuristics workflow relies on a two-layer process: a predictive heuristics module that analyzes the GEMM problem parameters and hardware capabilities, and a CUTLASS integration that uses the predicted top-N configurations to build and benchmark a small, focused kernel set. Key steps include:

  • Prepare a GEMM problem list in JSON format, describing the transposition/layout of A, B, and C/D as needed (for example, FP16 GEMMs with tnn notation).
  • Build CUTLASS with the heuristics integration enabled and specify -DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE and -DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM to control how many configurations are emitted per problem.
  • The build step outputs a CSV test list enumerating all test cases to run; you can drive auto-tuning with your own benchmarking code or use cutlass_profiler to run the configurations out of the box.
  • For consistent profiling results, run with locked clocks; see the driver sketch after this list. The documentation provides further guidance on usage and integration.

The eight-configuration example for FP16 I/O with FP32 compute (HSH) demonstrates the practical effect: compiling and testing those eight kernels lets you select the best performer without the overhead of a full exhaustive search, while still achieving near-peak performance at a fraction of the end-to-end tuning time. NVIDIA Developer Blog
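For illustration, a small driver along these lines could lock clocks, walk the generated CSV, and invoke cutlass_profiler per test case. This is a sketch only: the CSV filename and column names are assumptions about the generated test list, not its documented format, and the clock values are placeholders for your GPU.

    # Sketch of an auto-tuning driver over the generated CSV test list.
    # CSV filename and column names are assumptions; clock values are
    # placeholders to adjust for your GPU.
    import csv
    import subprocess

    # Lock GPU clocks for consistent profiling (placeholder values).
    subprocess.run(["nvidia-smi", "--lock-gpu-clocks=1980,1980"], check=True)
    try:
        with open("heuristics_test_list.csv") as f:  # assumed filename
            for row in csv.DictReader(f):
                # Run each test case with cutlass_profiler; the column
                # names ("kernel", "m", "n", "k") are assumed.
                subprocess.run(
                    [
                        "cutlass_profiler",
                        "--operation=Gemm",
                        f"--kernels={row['kernel']}",
                        f"--m={row['m']}",
                        f"--n={row['n']}",
                        f"--k={row['k']}",
                    ],
                    check=True,
                )
    finally:
        # Restore default clock behavior after profiling.
        subprocess.run(["nvidia-smi", "--reset-gpu-clocks"], check=True)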

Key takeaways

  • Heuristics deliver a small, targeted set of high-potential GEMM configurations instead of a large brute-force search.
  • Integration into CUTLASS 4.2 enables a streamlined workflow with JSON problem lists, exported CSVs, and optional cutlass_profiler usage.
  • Real-world workloads show substantial time savings while maintaining near-optimal performance, supporting faster deployment in JIT and offline compilation scenarios.
  • Static cluster sizes at compile time can yield additional performance gains over dynamic clustering in some baselines.
  • nvMatmulHeuristics is a core part of cuBLAS heuristics and is available in early access for general use, with CUTLASS integration documented by NVIDIA. NVIDIA Developer Blog

FAQ

  • Q: What does nvMatmulHeuristics do? A: It analyzes GEMM problem parameters and hardware capabilities to predict a small set of high-potential kernel configurations for GEMMs.
  • Q: How is it used with CUTLASS? A: Build CUTLASS with heuristics enabled and provide a problem list file and a per-problem configuration count; the build step outputs a CSV that can be used with profiling tools or cutlass_profiler.
  • Q: What kind of performance gains can be expected? A: In reported results, selecting a small set of configurations achieved near-peak performance with substantially reduced tuning time (e.g., 16 configurations achieving about 96% of peak in ~150 minutes vs. >700 minutes for exhaustive search). Additional workloads showed similar gains and significant build-time reductions. NVIDIA Developer Blog
  • Q: Is nvMatmulHeuristics available publicly? A: It is described as a core part of the cuBLAS heuristics and is available in early access for general use, with CUTLASS integration. NVIDIA Developer Blog
