# From Zero to GPU: Building and Scaling Production-Ready CUDA Kernels
Sources: https://huggingface.co/blog/kernel-builder, Hugging Face Blog
## Overview

Custom CUDA kernels can unlock significant performance gains, but turning a lab prototype into a production-ready solution requires robust build processes, multi-architecture targets, and reliable deployment. The kernel-builder library from Hugging Face provides a cohesive workflow to develop a kernel locally, build it for multiple architectures, and publish it for wide reuse. This guide walks through building a complete, modern CUDA kernel while addressing real-world production concerns: keeping the project fast, efficient, and maintainable as it evolves. Once complete, other developers can access your kernel directly from the Hub.

The example uses PyTorch’s modern C++ API to register a function as a first-class native operator, showing how a clean project structure and tooling enable long-term maintainability. The repository structure and tooling emphasize reproducibility and portability across environments.

The kernel-builder workflow relies on a reproducible, shared environment via Nix, ensuring that builds are deterministic across machines. A flake.nix file locks exact versions of kernel-builder and its dependencies to avoid “it works on my machine” issues, so you can develop, test, and build in a consistent environment regardless of the host system.

Code and bindings are organized for both ease of use and future extensibility. The core GPU code (e.g., csrc/img2gray.cu) maps the work onto a natural 2D grid of threads, which aligns well with image-processing workloads. The operator is registered as a native PyTorch operator, exposing a function under the torch.ops namespace. The file torch-ext/torch_binding.cpp uses TORCH_LIBRARY_EXPAND to declare the operator in a way that is extensible for future backends. This architecture is central to the production story: compatibility with torch.compile allows your custom operator to be fused into larger graphs, reducing overhead and boosting end-to-end performance. Hardware-specific implementations for CUDA or CPU are provided in separate dispatch blocks (e.g., TORCH_LIBRARY_IMPL(img2gray, CUDA, …) and TORCH_LIBRARY_IMPL(img2gray, CPU, …)), so the same API works across devices.
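The original post walks through the actual source files, which are not reproduced here. As a rough orientation, the sketch below shows what a 2D-grid grayscale kernel and its operator registration could look like. It is a minimal illustration, not the code from the drbh/img2gray repository: it assumes a contiguous float RGB tensor of shape (3, H, W), uses PyTorch’s standard TORCH_LIBRARY and TORCH_LIBRARY_IMPL macros instead of the kernel-builder TORCH_LIBRARY_EXPAND wrapper, and collapses the csrc/torch-ext file split into a single listing.

```cuda
// Illustrative sketch only: a 2D-grid grayscale kernel plus operator
// registration. The real project splits this across csrc/img2gray.cu and
// torch-ext/torch_binding.cpp and declares the op via TORCH_LIBRARY_EXPAND.
#include <torch/torch.h>
#include <torch/library.h>
#include <c10/cuda/CUDAStream.h>
#include <cuda_runtime.h>

__global__ void img2gray_kernel(const float* __restrict__ rgb,
                                float* __restrict__ gray,
                                int height, int width) {
  // One thread per output pixel, laid out on a natural 2D grid.
  int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
  int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
  if (x >= width || y >= height) return;

  int plane = height * width;
  int idx = y * width + x;
  float r = rgb[idx];              // channel 0
  float g = rgb[idx + plane];      // channel 1
  float b = rgb[idx + 2 * plane];  // channel 2
  // Standard luminance weights.
  gray[idx] = 0.299f * r + 0.587f * g + 0.114f * b;
}

// CUDA implementation of the operator: validates the input, allocates the
// output, and launches the kernel on PyTorch's current CUDA stream.
torch::Tensor img2gray_cuda(const torch::Tensor& input) {
  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
  TORCH_CHECK(input.dim() == 3 && input.size(0) == 3,
              "expected a (3, H, W) RGB tensor");
  TORCH_CHECK(input.scalar_type() == torch::kFloat,
              "expected a float32 tensor");

  auto rgb = input.contiguous();
  int height = rgb.size(1);
  int width = rgb.size(2);
  auto gray = torch::empty({height, width}, rgb.options());

  dim3 block(16, 16);
  dim3 grid((width + block.x - 1) / block.x,
            (height + block.y - 1) / block.y);
  cudaStream_t stream = c10::cuda::getCurrentCUDAStream().stream();
  img2gray_kernel<<<grid, block, 0, stream>>>(rgb.data_ptr<float>(),
                                              gray.data_ptr<float>(),
                                              height, width);
  return gray;
}

// Declare the operator schema once, then attach per-backend implementations.
// kernel-builder wraps the declaration in TORCH_LIBRARY_EXPAND so the
// extension name is injected at build time; plain TORCH_LIBRARY is shown
// here for readability.
TORCH_LIBRARY(img2gray, ops) {
  ops.def("img2gray(Tensor input) -> Tensor");
}

TORCH_LIBRARY_IMPL(img2gray, CUDA, ops) {
  ops.impl("img2gray", &img2gray_cuda);
}

// A CPU fallback would be registered the same way:
// TORCH_LIBRARY_IMPL(img2gray, CPU, ops) { ops.impl("img2gray", &img2gray_cpu); }
```

With a registration along these lines, the operator becomes callable as torch.ops.img2gray.img2gray(...) and participates in the PyTorch dispatcher, which is what enables the CPU fallback and the torch.compile integration described above.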
A Python wrapper is generated so users can import and call the operator in a familiar Pythonic way. The wrapper module (e.g., _ops) is auto-generated by kernel-builder from a template to provide a stable namespace for the registered C++ functions.

## Key features
- Local kernel development in a reproducible Nix dev shell.
- Built-in multi-architecture support (CUDA and CPU backends) with automatic dispatch.
- Native PyTorch operator registration via TORCH_LIBRARY_EXPAND for seamless graph integration.
- Hardware-specific and portable backends via separate TORCH_LIBRARY_IMPL blocks.
- Clean project structure with a dedicated Python package exposing the operator under torch.ops.
- Reproducible builds with flake.nix locking exact versions of kernel-builder and dependencies.
- Automated, multi-version builds to cover different PyTorch and CUDA combinations.
- Hub publishing workflow to share kernels with the world (e.g., a repo like drbh/img2gray).
- Versioning support using semantic version tags (vx.y.z) and version bounds to stabilize downstream usage.
- A project-level kernels management workflow (pyproject.toml and a kernels CLI) to lock and coordinate kernel versions across a project.
- Clean separation between build-time artifacts and runtime artifacts to simplify distribution.
- Guidance on removing development artifacts before publishing to ensure compact releases.

## Common use cases
- Prototyping and validating high-performance CUDA kernels locally, then extending them to match target production workloads.
- Building multi-arch kernels (CUDA-enabled GPUs and a CPU fallback) that are selected automatically by PyTorch’s dispatcher.
- Packaging kernels for distribution via the Hugging Face Hub, enabling easy reuse across teams and projects.
- Integrating custom kernels with PyTorch graphs using torch.compile to minimize overhead and maximize fusion opportunities.
- Managing kernel versions at the project level with version pins and semantic versioning to avoid breaking downstream code.
- Operating in environments with fixed PyTorch/CUDA combinations while keeping a path to newer versions as needed.

## Setup & installation
Prerequisites: you will interact with Nix and the Hugging Face Hub during the workflow.

Build the kernel in a local, reproducible environment:

```bash
nix build . -L
```

Enter a development shell with pre-installed dependencies for iterative work and version selection (CUDA and PyTorch):

```bash
nix develop
```

Log in to the Hugging Face Hub to publish your kernel:

```bash
huggingface-cli login
```

After building and testing locally, push your kernel to a Hub repository (for example, drbh/img2gray) so others can use it. The exact commands depend on your VCS workflow, but the repository path will be your Hub target. The kernel-builder workflow and Hub publishing are designed to minimize friction for downstream users.
## Quick start (minimal runnable example)
This quick start assumes the kernel has already been built or fetched from the Hub so that its operator is registered with PyTorch. The operator is exposed as a native PyTorch operator under the torch.ops namespace. The example converts a synthetic RGB image to grayscale using the registered kernel.

```python
import torch

# Create a small synthetic RGB image on the GPU
rgb = torch.randn(1, 3, 64, 64, device="cuda")

# Call the registered kernel through the PyTorch dispatcher.
# The operator path is namespace.op_name as declared in torch_binding.cpp;
# adjust it to match the published kernel's registration and schema.
gray = torch.ops.img2gray.img2gray(rgb)

print(gray.shape)
```
## Pros and cons
- Pros:
  - Reproducible builds and dependency locking reduce “works on my machine” issues.
  - Multi-arch support ensures your kernel runs across CUDA GPUs and CPUs.
  - PyTorch-native registration enables seamless graph fusion and runtime integration.
  - Hub-based distribution makes sharing and versioning straightforward.
  - Semantic versioning and version bounds help maintain backward compatibility.
- Cons:
  - Initial setup can be complex, especially around multi-arch configuration and Hub publishing.
  - Build times may be long when generating many variants for different PyTorch/CUDA combinations.
  - Requires familiarity with Nix, PyTorch C++ extensions, and Hub workflows.

## Alternatives (brief comparisons)
| Approach | What it provides | Trade-offs |
|---|---|---|
| Manual CUDA kernel development + custom build scripts | Full control over builds, no Hub dependency | Higher maintenance, bespoke versioning, portability challenges |
| kernel-builder + Hub distribution (recommended) | Reproducible builds, multi-arch support, easy sharing | Requires Hub adoption and version management |

## Pricing or License
Pricing or licensing information is not explicitly provided in the source material.

## References
- https://huggingface.co/blog/kernel-builder
- https://huggingface.co/drbh/img2gray