From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels
Sources: https://huggingface.co/blog/kernel-builder, Hugging Face Blog
Overview
Custom CUDA kernels can unlock meaningful performance gains for modern models, but turning a local kernel into a robust, production-ready component is non-trivial. The kernel-builder library from Hugging Face provides a workflow for developing a kernel locally, building it for multiple architectures, and publishing it for broad use. The guide walks through building a complete, modern CUDA kernel from scratch and addresses production and deployment concerns with engineering practices aimed at speed, efficiency, and maintainability.

The running example is a practical RGB-to-grayscale kernel built with PyTorch's modern C++ API and registered as a native PyTorch operator. The operator is exposed under the torch.ops namespace, enabling seamless integration with PyTorch graph execution and tooling such as torch.compile. A key architectural goal is support for hardware-specific backends: the same operator dispatches to a CUDA or CPU implementation depending on the input tensor's device.

The kernel-builder approach emphasizes reproducibility and collaboration. A flake.nix file locks exact versions of kernel-builder and its dependencies, avoiding "it works on my machine" issues. The end-to-end story starts with a minimal kernel and ends with a compliant, multi-version artifact set ready for distribution through the Hugging Face Hub. In production terms, the guide ties together kernel development, PyTorch bindings, and deployment: kernels are packaged so other developers can load them directly from their Hub repositories, bypassing traditional install steps and enabling straightforward versioning and governance.
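For example, once such a kernel is loaded and registered, the same call can run on either backend. This is a minimal sketch, not the article's code; the img2gray namespace and op name are assumptions based on the example kernel, and it assumes both CPU and CUDA implementations are registered:

```python
import torch

# Hypothetical operator path for illustration; the exact namespace and op name
# depend on the kernel's TORCH_LIBRARY registration.
img_cpu = torch.rand(1, 3, 64, 64)   # stays on the CPU
img_gpu = img_cpu.to("cuda")         # same data, moved to the GPU

# The same operator call dispatches to the CPU or CUDA backend based on the
# device of the input tensor -- no Python-side branching is required.
gray_cpu = torch.ops.img2gray.img2gray(img_cpu)
gray_gpu = torch.ops.img2gray.img2gray(img_gpu)

print(gray_cpu.device, gray_gpu.device)  # cpu, cuda:0
```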
Key features
- Local development of a CUDA kernel with a clean, predictable project structure.
- Ability to build for multiple architectures and PyTorch/CUDA versions from a single source.
- Reproducible builds via a flake.nix configuration that locks exact dependency versions.
- Native PyTorch operator registration so kernels appear as first-class operators under torch.ops.
- Support for hardware-specific backends (CUDA and CPU) through TORCH_LIBRARY_IMPL blocks (a Python-flavored sketch of this define/implement pattern follows this list).
- Automated Python bindings with a generated _ops module, exposing a stable namespace for registered functions.
- Iterative development workflow using nix develop for pre-installed dependencies and devShells.
- End-to-end packaging for Hub distribution, including build variant automation and versioning.
- Versioning and compatibility controls to minimize breaking changes for downstream users.
- Guidance on cleaning development artifacts and preparing the final artifacts for release.
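The article's registration is done in C++ via TORCH_LIBRARY and TORCH_LIBRARY_IMPL. As a rough, Python-only illustration of the same define-once, implement-per-backend pattern (not the article's code; the "demo" namespace, op name, and luma weights are made up here), PyTorch's torch.library API can be used like this:

```python
import torch

# Illustrative analogue of the C++ TORCH_LIBRARY / TORCH_LIBRARY_IMPL pattern.
lib = torch.library.Library("demo", "DEF")
lib.define("rgb_to_gray(Tensor img) -> Tensor")

def rgb_to_gray_cpu(img: torch.Tensor) -> torch.Tensor:
    # Reference implementation using standard luma weights.
    weights = torch.tensor([0.299, 0.587, 0.114], device=img.device)
    return (img * weights.view(1, 3, 1, 1)).sum(dim=1, keepdim=True)

def rgb_to_gray_cuda(img: torch.Tensor) -> torch.Tensor:
    # In the article this is where the hand-written CUDA kernel runs;
    # here we reuse the reference math so the sketch runs anywhere.
    return rgb_to_gray_cpu(img)

# One implementation per backend, mirroring separate TORCH_LIBRARY_IMPL blocks.
lib.impl("rgb_to_gray", rgb_to_gray_cpu, "CPU")
lib.impl("rgb_to_gray", rgb_to_gray_cuda, "CUDA")

# The op is now a first-class PyTorch operator under torch.ops.
x = torch.rand(1, 3, 8, 8)
print(torch.ops.demo.rgb_to_gray(x).shape)  # torch.Size([1, 1, 8, 8])
```

In the article itself, the schema is declared once and each backend gets its own implementation block; the Python API above mirrors that structure for readability.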
Common use cases
- Developing domain-specific CUDA kernels that align with PyTorch’s execution model and optimization strategies.
- Building multi-arch kernels that automatically dispatch to CUDA or CPU implementations depending on tensor device.
- Iterating locally with a reproducible environment, then publishing to the Hugging Face Hub for easy downstream usage.
- Creating production-grade kernels that can be pinned by version (via git tags like v1.2.3) to minimize breaking changes for dependents.
- Integrating new kernels into PyTorch graphs, with torch.compile compatibility for graph fusion and reduced overhead (a minimal sketch follows this list).
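A minimal sketch of the torch.compile integration, assuming the kernel is already loaded and registered; the op path is hypothetical, and tracing through a custom op also assumes the kernel registers the metadata torch.compile needs:

```python
import torch

# Hypothetical op path for illustration; a registered operator behaves like any
# other ATen op, so torch.compile can trace the surrounding work and fuse it.
def preprocess(img: torch.Tensor) -> torch.Tensor:
    img = img * 2.0 - 1.0                    # surrounding pointwise work
    return torch.ops.img2gray.img2gray(img)  # the custom operator

compiled = torch.compile(preprocess)

img = torch.rand(8, 3, 224, 224, device="cuda")
out = compiled(img)  # first call compiles; later calls reuse the cached graph
print(out.shape)
```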
Setup & installation (exact commands)
nix build . -L
This builds the kernel and produces a set of build artifacts, including CMakeLists.txt, pyproject.toml, setup.py, and a cmake directory.
nix develop
Enters a development shell with all necessary dependencies pre-installed. The devShell lets you select exact CUDA and PyTorch versions for iterative development; the article's example builds the kernel against PyTorch 2.7 with CUDA 12.6, illustrating how the devShell can be configured to target specific versions.
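A quick way to confirm the devShell provides the intended toolchain is to query PyTorch from inside the shell. This is a generic sanity check, not a command from the article:

```python
import torch

# Confirm the shell provides the PyTorch/CUDA combination you meant to target
# (e.g. PyTorch 2.7 / CUDA 12.6 in the article's example).
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
```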
Quick start (minimal runnable example)
The guide demonstrates building a practical kernel (RGB to grayscale) and registering it as a native PyTorch operator so it shows up under the torch.ops namespace. A minimal runnable example follows the general pattern of loading the kernel from its Hub repository and invoking the operator on a CUDA tensor. A typical workflow might look like this:
```python
import torch
from kernels import get_kernel  # Hugging Face `kernels` library for Hub-hosted kernels

# Loading the kernel from its Hub repository imports the compiled extension and
# registers it as a native PyTorch operator. The repo id below is a placeholder.
img2gray = get_kernel("your-username/img2gray")

# Prepare a small input image tensor on CUDA
img = torch.randn(1, 3, 224, 224, device='cuda')

# Call the registered operator under torch.ops. The exact namespace and op name
# depend on the repo's TORCH_LIBRARY block; "img2gray.img2gray" is assumed here.
gray = torch.ops.img2gray.img2gray(img)
print(gray.shape)
```
If you need to verify that the operator is registered, inspect torch.ops for the new entry under the expected namespace (e.g., img2gray) and run a quick forward pass on a small tensor.
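For a slightly stronger check, the sketch below (reusing the hypothetical op path from the example above) compares the kernel's output against a reference grayscale computed with plain tensor ops, assuming the kernel uses standard Rec. 601 luma weights:

```python
import torch

# Assumes the kernel has already been loaded and registered (see above) and that
# it uses the standard Rec. 601 luma weights; adjust if your kernel differs.
img = torch.rand(1, 3, 64, 64, device="cuda")

# True only if an op named "img2gray" exists in the "img2gray" namespace.
print(hasattr(torch.ops.img2gray, "img2gray"))

gray_kernel = torch.ops.img2gray.img2gray(img)

weights = torch.tensor([0.299, 0.587, 0.114], device=img.device).view(1, 3, 1, 1)
gray_ref = (img * weights).sum(dim=1, keepdim=True)

# Outputs should agree within a small tolerance if the weights match.
print(gray_kernel.shape)
print(torch.allclose(gray_kernel.view_as(gray_ref), gray_ref, atol=1e-4))
```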
Pros and cons
- Pros
- Reproducible builds with flake.nix lock files, reducing environment drift.
- Native PyTorch operator integration, enabling fusion opportunities with torch.compile and graph execution.
- Hardware-agnostic API that can dispatch to CUDA or CPU backends depending on input tensors.
- Hub-based distribution simplifies sharing and versioning across teams.
- Clear path to multi-version support and compliant kernels across PyTorch and CUDA releases.
- Cons
- The article does not enumerate formal cons; adoption requires familiarity with Nix-based workflows and kernel-build tooling.
Alternatives (brief comparisons)
- Ad-hoc CUDA kernel development without kernel-builder:
- Pros: Potentially faster to prototype for small experiments.
- Cons: Lacks reproducible builds, multi-arch automation, and hub-based distribution.
- Custom packaging without the hub workflow:
- Pros: Local control over distribution.
- Cons: No centralized versioning or easy cross-team sharing; more manual maintenance.
- Other PyTorch extension approaches (e.g., traditional TorchScript/CUDA extension patterns):
- Pros: Well-established ecosystem and tooling.
- Cons: May not provide the same level of automated multi-arch builds and Hub distribution as described.
Pricing or License
- License/pricing details are not explicitly provided in the article.