Warp 1.9 Released: CUDA 13.0 Support, AOT Modules, and a Fully Warp-based Marching Cubes
Sources: https://github.com/NVIDIA/warp/releases/tag/v1.9.0, NVIDIA Dev Blog
TL;DR
- Warp 1.9 introduces CUDA 13.0 toolkit compatibility, a fully Warp-based and differentiable Marching Cubes implementation, and fixes a long-standing off-by-one bug.
- New AOT tooling enables flexible ahead-of-time module compilation with wp.compile_aot_module() and wp.load_aot_module(), including strip_hash=True to distribute pre-compiled modules without shipping source.
- The programming model gains more flexible indexing for composite types, direct IntEnum/IntFlag support, in-kernel local arrays via wp.zeros(), and three new indexed tile operations for advanced memory access patterns.
- CUDA graph capture is now fully supported for many solvers, with device-side convergence via wp.capture_while() and options to choose between host-side or device-side termination.
- Ongoing improvements cover warp.sparse, warp.fem, and broader support for tile-based execution, with automatic heuristics and manual overrides through tile_size.
Context and background
Warp continues to evolve as a Python-based GPU programming framework that targets high-performance kernels across CPU and GPU devices. The v1.9 release aligns Warp with CUDA 13.x tooling and expands its ahead-of-time (AOT) capabilities to support distribution of pre-compiled modules. A key highlight is the fully re-implemented Marching Cubes algorithm in Warp, transforming a previously native CUDA C++ implementation into Warp-native code that runs on both CPU and GPU. This change also resolves a longstanding off-by-one issue identified in prior work, underscoring Warp’s ongoing focus on correctness and portability. For developers evaluating CUDA ecosystem shifts, CUDA Toolkit 13.0 was released in early August, and Warp provides compatibility options to ease upgrades and driver constraints. See the release notes for details on these changes.
What’s new
Warp 1.9 ships with a broad set of improvements and new capabilities. The most visible changes include the rewritten Marching Cubes pipeline, compatibility with CUDA 13.0, and a suite of AOT enhancements. In addition to the major feature work, the update delivers more flexible indexing for composite types, direct support for IntEnum, and the ability to allocate local arrays inside kernels using wp.zeros(), with those arrays stored in registers for fast access. The Marching Cubes implementation, contributed by community developers, is fully differentiable and runs entirely in Warp on both CPU and GPU devices. The new AOT workflow introduces wp.compile_aot_module() and wp.load_aot_module(), with a strip_hash=True option that removes unique hashes from module and function names so pre-compiled artifacts can be distributed without shipping Python source.
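To illustrate, here is a minimal sketch of extracting an isosurface with wp.MarchingCubes, assuming the class keeps the constructor and surface() interface of earlier Warp releases (grid dimensions plus vertex and triangle capacities); the signed-distance field, grid size, and threshold below are illustrative rather than taken from the release notes.

```python
import warp as wp

wp.init()

nx, ny, nz = 64, 64, 64

@wp.kernel
def sphere_sdf(field: wp.array3d(dtype=float), radius: float):
    # Fill a 3D grid with the signed distance to a sphere centered in the volume.
    i, j, k = wp.tid()
    p = wp.vec3(float(i), float(j), float(k)) - wp.vec3(32.0, 32.0, 32.0)
    field[i, j, k] = wp.length(p) - radius

field = wp.zeros(shape=(nx, ny, nz), dtype=float, device="cuda")
wp.launch(sphere_sdf, dim=(nx, ny, nz), inputs=[field, 20.0], device="cuda")

# Capacity arguments bound the output buffers; actual counts are produced by surface().
mc = wp.MarchingCubes(nx=nx, ny=ny, nz=nz, max_verts=100_000, max_tris=200_000, device="cuda")
mc.surface(field=field, threshold=0.0)

print(mc.verts.shape, mc.indices.shape)
```

With the 1.9 rewrite, the same pipeline is expected to run on CPU devices as well, since the implementation is now pure Warp code rather than CUDA C++.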
AOT and deployment
- wp.compile_aot_module() and wp.load_aot_module() for flexible ahead-of-time compilation (see the sketch after this list)
- strip_hash=True to remove hashes from names for distribution of pre-compiled modules
- Documentation updates detailing AOT workflows and future expansion plans
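For illustration, the sketch below shows how the new AOT entry points might be used. Only the names wp.compile_aot_module(), wp.load_aot_module(), and the strip_hash=True option come from the release notes; the kernels module, the output_dir/path arguments, and passing a module object directly are assumptions made for this example.

```python
import warp as wp

# kernels.py is a hypothetical user module defining @wp.kernel functions to ship pre-compiled.
import kernels

wp.init()

# Producer side: compile the module ahead of time.
# Assumption: compile_aot_module() accepts the module object, an output directory,
# and strip_hash=True to drop the per-build hashes from module and function names.
wp.compile_aot_module(kernels, output_dir="build/aot", strip_hash=True)

# Consumer side: load the pre-built artifacts without needing the Python kernel source.
# Assumption: load_aot_module() takes the module name and the directory containing the artifacts.
wp.load_aot_module("kernels", path="build/aot")
```

After loading, kernels from the pre-compiled module should be launchable with wp.launch() as usual, which is what makes strip_hash useful for distributing binaries without source.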
Model and language enhancements
- More flexible indexing for composite types (vectors, matrices, quaternions, transforms)
- Direct IntEnum and IntFlag support inside Warp functions and kernels
- In-kernel views that support dynamic shapes and struct types via the ptr attribute
- wp.zeros() enables fixed-size local arrays allocated in registers (see the sketch after this list)
- Three new indexed tile operations for loading, storing, and atomic tile operations with custom index mappings
- Proper support for writing to matrix elements stored inside struct fields
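The snippet below sketches two of these additions together, direct IntEnum use and a register-backed local array created with wp.zeros() inside a kernel. It assumes the in-kernel wp.zeros() call mirrors the host-side signature and that enum members can be assigned directly to int array elements; the smoothing-and-labeling kernel itself is purely illustrative.

```python
import warp as wp
from enum import IntEnum

wp.init()

class Phase(IntEnum):
    SOLID = 0
    FLUID = 1

@wp.kernel
def classify(values: wp.array(dtype=float), labels: wp.array(dtype=int)):
    i = wp.tid()

    # New in 1.9 (assumed signature): fixed-size local array, expected to live in registers.
    window = wp.zeros(3, dtype=float)
    window[0] = values[wp.max(i - 1, 0)]
    window[1] = values[i]
    window[2] = values[wp.min(i + 1, values.shape[0] - 1)]

    avg = (window[0] + window[1] + window[2]) / 3.0

    # New in 1.9: IntEnum members usable directly inside kernels.
    if avg > 0.0:
        labels[i] = Phase.FLUID
    else:
        labels[i] = Phase.SOLID

n = 8
values = wp.array([-1.0, -0.5, 0.2, 0.8, 1.0, -0.1, -0.9, 0.3], dtype=float)
labels = wp.zeros(n, dtype=int)
wp.launch(classify, dim=n, inputs=[values, labels])
print(labels.numpy())
```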
Algorithms and performance
- Fully differentiable Marching Cubes implementation rewritten in Warp
- Iterative solvers (CG, BiCGSTAB, GMRES) now fully compatible with CUDA graph capture; device-side convergence with wp.capture_while() (see the sketch after this list)
- warp.sparse supports arbitrary-sized blocks and tile-based computations with heuristics to choose tiled vs non-tiled execution
- warp.fem.integrate leverages tile-based quadrature accumulation with automatic tile size selection
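As a hedged sketch of the graph-capture workflow, the example below captures a conjugate-gradient solve with wp.ScopedCapture and replays it via wp.capture_launch(). It assumes the warp.optim.linear.cg signature from recent Warp releases and uses a trivial diagonal test system built with warp.sparse; the device-side termination path through wp.capture_while() is not shown because its exact interface is specific to this release.

```python
import warp as wp
from warp.sparse import bsr_diag
from warp.optim.linear import cg

wp.init()

n = 1024
device = "cuda"

# Simple SPD system for illustration: A = diag(2.0), b = ones; solve A x = b.
diag = wp.full(n, 2.0, dtype=wp.float32, device=device)
A = bsr_diag(diag)
b = wp.full(n, 1.0, dtype=wp.float32, device=device)
x = wp.zeros(n, dtype=wp.float32, device=device)

# Warm-up run so all modules are compiled and loaded before capture begins
# (module compilation cannot happen inside a CUDA graph capture).
cg(A, b, x, maxiter=32)

# Capture the whole solve as a CUDA graph, then replay it with minimal launch overhead.
x.zero_()
with wp.ScopedCapture(device=device) as capture:
    cg(A, b, x, maxiter=32)

wp.capture_launch(capture.graph)
print(x.numpy()[:4])  # expect values near 0.5
```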
Stability, testing, and notes
- Early testing on NVIDIA Jetson Thor indicates occasional segmentation faults when CPU kernels are launched; GPU kernels are unaffected. The suggested workaround is to build Warp from source against LLVM/Clang v18 or newer
- Deprecated features from prior releases will be removed in v1.10 (early November)
- The release acknowledges community contributions and points readers to the v1.9.0 section of CHANGELOG.md for a full change list
Compatibility and transition
- CUDA Toolkit 13.0 compatibility is provided via two paths: Warp wheels built for CUDA 12.8 run on CUDA 13.x drivers thanks to backward compatibility, or users can build Warp against CUDA 13.x directly
- PyPI distributions of Warp wheels will continue to be built with CUDA 12.8 during a transition period
- The release notes underscore ongoing plans to expand AOT workflows in future updates
References and sources
- The official release notes and details are available at the NVIDIA Warp GitHub release page: https://github.com/NVIDIA/warp/releases/tag/v1.9.0
Why it matters (impact for developers/enterprises)
For developers, Warp 1.9 lowers the barrier to distributing high-performance kernels by enabling AOT workflows and stripping unique hashes from compiled modules, simplifying packaging and deployment in environments with restricted source access. CUDA 13.0 compatibility reduces upgrade friction for teams adopting newer GPUs and drivers, while maintaining backward compatibility for Warp wheels built under CUDA 12.8. The rewritten Marching Cubes algorithm opens opportunities for differentiable rendering and volume visualization pipelines to run efficiently on both CPU and GPU targets. The extended indexing capabilities, in-kernel array allocations, and new tile-based memory access patterns give developers finer control over memory behavior and performance, particularly for sparse matrices and finite element methods.

The CUDA graph capture improvements, including device-side convergence with wp.capture_while(), enable more scalable execution of iterative solvers such as CG, BiCGSTAB, and GMRES. These advances help enterprises deploying large-scale scientific computing and machine learning workloads capture entire workflows as graphs, reducing runtime overhead and improving reproducibility. The enhanced support for warp.sparse and warp.fem.integrate reflects Warp's continued emphasis on numerical linear algebra and finite element workflows, where adaptive tiling and heuristic-based execution choices can yield meaningful speedups on diverse matrix structures.

From a tooling perspective, the AOT enhancements reflect a broader industry shift toward offline compilation and distribution of pre-built kernels, which can simplify software packaging and improve start-up latency in production environments. Together, CUDA 13.0 readiness, AOT workflows, improved kernel-local memory usage, and robust graph-capture compatibility position Warp as a flexible platform for researchers and engineers optimizing performance across CPU and GPU deployments.
Technical details
| Feature area | Key items | Notes |
|---|---|---|
| Marching Cubes | Rewritten in Warp; fully differentiable; CPU and GPU execution | Replaces previous CUDA C++ implementation; fixes off-by-one bug (#324) |
| AOT tooling | wp.compile_aot_module(), wp.load_aot_module() | strip_hash=True option for distributing pre-compiled modules |
| CUDA toolkit | CUDA 13.0 compatibility; wheels built with CUDA 12.8 during transition | Dual-path deployment strategy for drivers |
| Indexing and types | Flexible indexing for composite types; direct IntEnum/IntFlag support | In-kernel views, dynamic shapes, ptr-based arrays |
| Local arrays | wp.zeros() in kernels | Allocates in registers for speed |
| Tile operations | Three new indexed tile operations | Enhanced memory access patterns beyond contiguous tiles |
| Sparse and FEM | warp.sparse with arbitrary-sized blocks; warp.fem.integrate with tile-based accumulation | Heuristic-based auto-tuning with optional tile_size override |
| Graph capture | Iterative solvers (CG, BiCGSTAB, GMRES) compatible; device-side convergence via wp.capture_while() | Full CUDA graph capture support for these solvers |
| Stability notes | Jetson Thor CPU kernel segfaults observed | Resolution via LLVM/Clang v18+; GPU kernels unaffected |
| Deprecations | Some features slated for removal in v1.10 | Forward-looking roadmap mentioned |
Key takeaways
- Warp 1.9 advances CUDA 13.0 readiness and enables robust AOT module workflows with strip_hash support.
- The Marching Cubes implementation is now a fully Warp-native, differentiable pipeline, portable across CPU and GPU.
- Memory access and kernel programming are more flexible thanks to enhanced indexing, in-kernel arrays, and new tile operations.
- CUDA graph capture is more broadly supported for iterative solvers, with device-side convergence checks available.
- Compatibility and transition plans give users a path to upgrade CUDA drivers while existing CUDA 12.8 wheel builds remain supported.
FAQ
- What is the central improvement of Warp 1.9? A fully Warp-based, differentiable Marching Cubes implementation, CUDA 13.0 compatibility, and new AOT module tooling to support flexible ahead-of-time workflows.
- How do I use AOT in Warp 1.9? Use wp.compile_aot_module() to compile and wp.load_aot_module() to load pre-compiled modules; the new strip_hash=True option enables distributing pre-compiled artifacts without Python source.
- How does CUDA 13.0 compatibility affect deployment? Warp wheels built with CUDA 12.8 can run on CUDA 13.x drivers due to backward compatibility, and there is also a path to build Warp against CUDA 13.x directly.
- What about GPU graph capture and convergence checks? Iterative solvers in warp.optim.linear are now fully compatible with CUDA graph capture, and device-side convergence can be checked using wp.capture_while().
- Are there any known issues with this release? Early testing on NVIDIA Jetson Thor indicated possible segmentation faults when launching CPU kernels; GPU kernel launches remain unaffected. Resolution guidance points to using LLVM/Clang v18+ when building Warp from source.