Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2
Source: https://developer.nvidia.com/blog/improving-gemm-kernel-auto-tuning-efficiency-on-nvidia-gpus-with-heuristics-and-cutlass-4-2/ (NVIDIA Developer Blog)
TL;DR
- nvMatmulHeuristics provides fast heuristics for GEMMs to predict a small set of high-potential kernel configurations.
- Integrated with CUTLASS 4.2, it dramatically shortens end-to-end kernel tuning from exhaustive searches to targeted candidate evaluation.
- Early results show substantial time savings: using 16 configurations achieves about 96% of peak performance in roughly 150 minutes, vs. over 700 minutes for exhaustive search (Llama 3 405B on H100 SXM).
- For DeepSeek-R1 671B on a B200 GPU, 8 configurations reach about 99% of exhaustive-search performance with more than 5x speedup in build and auto-tuning time. This enables practical JIT-like workflows in DL frameworks.
- The heuristic-recommended kernels use statically known cluster sizes at compile time, which can outperform precompiled kernels that rely on dynamic cluster sizing. These findings illustrate how a good heuristic can approach the best possible kernel while dramatically reducing tuning costs. NVIDIA Developer Blog
Context and background
Selecting the best General Matrix Multiplication (GEMM) kernel for a given problem and hardware is a complex optimization problem. GEMM performance depends on a wide array of compile-time and runtime meta-parameters, including CTA, warp, and instruction-level tile sizes, kernel schedules, rasterization strategies, cluster dimensions, and split-k factors. Historically, finding the optimal kernel involved generating thousands of candidate configurations, compiling them, and running exhaustive auto-tuning. This brute-force workflow can take many hours and presents a barrier to adoption in offline compiled libraries like CUTLASS, as well as in JIT-compiled ecosystems such as Torch Inductor or OpenAI Triton, where fast model compilation is critical. The friction often leads users to pick suboptimal kernels to avoid long tuning times.

The NVIDIA blog introduces nvMatmulHeuristics, a GPU kernel meta-parameter optimization module designed to deliver a small, high-potential set of kernel configurations for a given GEMM problem and hardware. The module analyzes the operation's parameters and the target hardware capabilities and outputs a concise list of configurations with near-peak potential. The integration into CUTLASS 4.2 aims to turn kernel generation and tuning into a faster, more predictable process. The feature is described as a core part of the cuBLAS heuristics and is available in early access for general use, alongside an integration into CUTLASS. NVIDIA Developer Blog

The workflow shifts away from brute force toward targeted exploration. Users prepare a list of GEMM problems in JSON format, build CUTLASS with the heuristics-related CMake options, and then run auto-tuning on the small set of configurations emitted per problem. A generated CSV test list can drive automated profiling or be consumed by cutlass_profiler to reproduce results. The approach is designed to deliver consistent profiling results by using locked clocks and to facilitate integration into existing workflows and libraries. The feature is part of the CUTLASS ecosystem and works with FP16 I/O and FP32 compute (HSH), as in the examples described in the documentation. NVIDIA Developer Blog
What’s new
nvMatmulHeuristics introduces a practical, heuristic-driven workflow for GEMM kernel selection within CUTLASS. Rather than enumerating thousands of kernels, users can query a curated set of top-N configurations predicted by the heuristic model. The workflow for CUTLASS users typically follows these steps: prepare a JSON file listing GEMM problems, build CUTLASS with the library heuristics enabled, and specify -DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE=<path to the JSON file> and -DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=N, where N is the number of configurations nvMatmulHeuristics emits for each GEMM. The build step produces a CSV test list that enumerates all test cases required to run auto-tuning over the emitted configurations. This CSV can be consumed by custom benchmarking code or by cutlass_profiler to run the configurations out of the box. For consistent results, clocks should be locked during profiling. The feature ships with CUTLASS 4.2, with nvMatmulHeuristics itself available in early access. NVIDIA Developer Blog
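The minimal sketch below illustrates this configuration step. Only the two -DCUTLASS_LIBRARY_HEURISTICS_* options come from the post; the JSON field names, GEMM shapes, and remaining CMake arguments (such as the target architecture) are illustrative assumptions, so check the nvMatmulHeuristics and CUTLASS documentation for the exact schema.

```python
# Minimal sketch of the problem-list and build-configuration step, assuming a
# hypothetical JSON schema. Only the two -DCUTLASS_LIBRARY_HEURISTICS_* options
# are taken from the post; field names, shapes, and the target architecture
# below are illustrative and should be checked against the documentation.
import json
import subprocess

# Example GEMM shapes for an FP16 I/O, FP32-compute (HSH) workload; "tnn"
# denotes the A/B/C layout convention mentioned in the post.
problems = [
    {"m": 8192, "n": 8192, "k": 8192, "layout": "tnn", "dtype": "HSH"},
    {"m": 4096, "n": 14336, "k": 4096, "layout": "tnn", "dtype": "HSH"},
]

with open("gemm_problems.json", "w") as f:
    json.dump(problems, f, indent=2)

# Configure CUTLASS so the library emits N heuristic-chosen configurations
# per problem (here N = 8, matching the HSH example in the post).
subprocess.run(
    [
        "cmake", "-S", ".", "-B", "build",
        "-DCUTLASS_NVCC_ARCHS=90a",  # illustrative: Hopper (H100) target
        "-DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE=gemm_problems.json",
        "-DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=8",
    ],
    check=True,
)
```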
An FP16 example and practical performance
For an FP16 input/output with FP32 compute (HSH) scenario, nvMatmulHeuristics can emit eight configurations. These eight kernels can be compiled and benchmarked, and the best among them chosen for deployment, often delivering performance very close to the best kernel discovered by exhaustive search. In the published results, these eight statically clustered configurations reached 104% of a baseline that used dynamic cluster sizes on the same GEMMs, illustrating how focusing the build and profiling effort on a small, curated set can reach peak-like performance at far lower cost than exhaustive tuning. NVIDIA Developer Blog
Empirical results and takeaways
Two representative workloads demonstrated the impact of nvMatmulHeuristics. First, a Llama 3 405B training workload on an NVIDIA H100 SXM GPU showed that exhaustive search could require more than 700 minutes to identify the optimal kernel, while a search restricted to 16 candidate kernels reached 96% of the peak performance in roughly 150 minutes. This highlights a dramatic reduction in total tuning time with only a modest drop from peak performance.

A second workload, a DeepSeek-R1 671B training workload on an NVIDIA B200 GPU, showed that a small candidate set can reach 99% of the exhaustive-search performance with more than a 5x speedup in build and auto-tuning time. In this case, the baseline used dynamic cluster sizes, whereas the recommended nvMatmulHeuristics configurations were built with known static cluster sizes at compile time, resulting in 104% of the baseline performance for these GEMMs. These results illustrate how heuristics can push toward peak performance while dramatically reducing tuning costs and enabling practical deployment in framework-driven or offline compilation scenarios. NVIDIA Developer Blog
Workflow implications for developers and enterprises
By focusing the tuning process on a small, high-potential set of kernels, nvMatmulHeuristics reduces the end-to-end time to deploy high-performance GEMMs in production systems. This makes it feasible to incorporate highly optimized kernels into JIT-compiled stacks and offline libraries, accelerating development cycles for deep learning frameworks, compilers, and kernel libraries. It also enables more predictable performance outcomes across hardware generations by leveraging hardware-aware heuristics embedded in the workflow. The integration into CUTLASS 4.2 further streamlines adoption for developers already using CUTLASS and cuBLAS heuristics as part of their stack. NVIDIA Developer Blog
Technical details and implementation
The nvMatmulHeuristics workflow relies on a two-layer process: a predictive heuristics module that analyzes the GEMM problem parameters and hardware capabilities, and a CUTLASS integration that uses the predicted top-N configurations to build and benchmark a small, focused kernel set. Key steps include:
- Prepare a GEMM problem list in JSON format, describing the transposition/layout of A, B, and C/D as needed (for example, FP16 GEMMs with tnn notation).
- Build CUTLASS with the heuristics integration enabled and specify -DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE and -DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM to control how many configurations are emitted per problem.
- The build step outputs a CSV test list enumerating all test cases to run; you can drive auto-tuning from this CSV with your own benchmarking code, or feed the configurations to cutlass_profiler out of the box (a minimal driver sketch appears after these steps).
- For consistent profiling results, run with locked clocks. The documentation provides guidance on usage and integration.

The eight-configuration example for FP16 I/O with FP32 compute (HSH) demonstrates the practical effect: compiling and testing those eight kernels makes it possible to select the best-performing one without the overhead of a full exhaustive search, while still achieving near-peak performance. Compared with exhaustive tuning, this substantially reduces the end-to-end time to reach a high-performance kernel. NVIDIA Developer Blog
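As referenced above, one minimal way to consume the generated test list is a small driver script around cutlass_profiler. The profiler options shown (--operation, --m/--n/--k, --kernels) are standard cutlass_profiler arguments, but the profiler path, CSV file name, and column names are assumptions for illustration and should be adapted to what the build actually emits.

```python
# Minimal sketch of driving cutlass_profiler from the generated CSV test list.
# The profiler path, CSV file name, and column names ("m", "n", "k",
# "kernel_name") are assumptions; adapt them to what the build actually emits.
import csv
import subprocess

PROFILER = "./build/tools/profiler/cutlass_profiler"  # typical build location

with open("heuristics_test_list.csv", newline="") as f:
    for row in csv.DictReader(f):
        # One profiler invocation per (problem, candidate kernel) pair.
        # --operation, --m/--n/--k, and --kernels are standard cutlass_profiler
        # options; lock GPU clocks beforehand for reproducible measurements.
        subprocess.run(
            [
                PROFILER,
                "--operation=Gemm",
                f"--m={row['m']}",
                f"--n={row['n']}",
                f"--k={row['k']}",
                f"--kernels={row['kernel_name']}",
            ],
            check=True,
        )
```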
Key takeaways
- Heuristics deliver a small, targeted set of high-potential GEMM configurations instead of a large brute-force search.
- Integration into CUTLASS 4.2 enables a streamlined workflow with JSON problem lists, exported CSVs, and optional cutlass_profiler usage.
- Real-world workloads show substantial time savings while maintaining near-optimal performance, supporting faster deployment in JIT and offline compilation scenarios.
- Static cluster sizes at compile time can yield additional performance gains over dynamic clustering in some baselines.
- nvMatmulHeuristics is a core part of cuBLAS heuristics and is available in early access for general use, with CUTLASS integration documented by NVIDIA. NVIDIA Developer Blog
FAQ
- Q: What does nvMatmulHeuristics do? A: It analyzes GEMM problem parameters and hardware capabilities to predict a small set of high-potential kernel configurations for GEMMs.
- Q: How is it used with CUTLASS? A: Build CUTLASS with heuristics enabled and provide a problem list file and a per-problem configuration count; the build step outputs a CSV that can be used with profiling tools or cutlass_profiler.
- Q: What kind of performance gains can be expected? A: In reported results, selecting a small set of configurations achieved near-peak performance with substantially reduced tuning time (e.g., 16 configurations achieving about 96% of peak in ~150 minutes vs. >700 minutes for exhaustive search). Additional workloads showed similar gains and significant build-time reductions. NVIDIA Developer Blog
- Q: Is nvMatmulHeuristics available publicly? A: It is described as a core part of the cuBLAS heuristics and is available in early access for general use, with CUTLASS integration. NVIDIA Developer Blog
References
- https://developer.nvidia.com/blog/improving-gemm-kernel-auto-tuning-efficiency-on-nvidia-gpus-with-heuristics-and-cutlass-4-2/
- The NVIDIA blog post cited above serves as the primary source for nvMatmulHeuristics, its integration with CUTLASS 4.2, and the highlighted results. NVIDIA Developer Blog