Warp 1.9 Released: CUDA 13.0 Support, AOT Modules, and a Fully Warp-based Marching Cubes
Sources: https://github.com/NVIDIA/warp/releases/tag/v1.9.0, NVIDIA Dev Blog
TL;DR
- Warp 1.9 introduces CUDA 13.0 toolkit compatibility, a fully Warp-based and differentiable Marching Cubes implementation, and fixes a long-standing off-by-one bug.
- New AOT tooling enables flexible ahead-of-time module compilation with wp.compile_aot_module() and wp.load_aot_module(), including strip_hash=True to distribute pre-compiled modules without shipping source.
- The programming model gains more flexible indexing for composite types, direct IntEnum/IntFlag support, in-kernel local arrays via wp.zeros(), and three new indexed tile operations for advanced memory access patterns.
- CUDA graph capture is now fully supported for many solvers, with device-side convergence via wp.capture_while() and options to choose between host-side or device-side termination.
- Ongoing improvements cover warp.sparse, warp.fem, and broader support for tile-based execution, with automatic heuristics and manual overrides through tile_size.
Context and background
Warp continues to evolve as a Python-based GPU programming framework that targets high-performance kernels across CPU and GPU devices. The v1.9 release aligns Warp with CUDA 13.x tooling and expands its ahead-of-time (AOT) capabilities to support distribution of pre-compiled modules. A key highlight is the fully re-implemented Marching Cubes algorithm in Warp, transforming a previously native CUDA C++ implementation into Warp-native code that runs on both CPU and GPU. This change also resolves a longstanding off-by-one issue identified in prior work, underscoring Warp’s ongoing focus on correctness and portability. For developers evaluating CUDA ecosystem shifts, CUDA Toolkit 13.0 was released in early August, and Warp provides compatibility options to ease upgrades and driver constraints. See the release notes for details on these changes.
What’s new
Warp 1.9 ships with a broad set of improvements and new capabilities. The most visible changes include the rewritten Marching Cubes pipeline, compatibility with CUDA 13.0, and a suite of AOT enhancements. In addition to the major feature work, the update delivers more flexible indexing for composite types, direct support for IntEnum, and the ability to allocate local arrays inside kernels using wp.zeros(), with those arrays stored in registers for fast access. The Marching Cubes implementation, contributed by community developers, is fully differentiable and runs entirely in Warp on both CPU and GPU devices. The new AOT workflow introduces wp.compile_aot_module() and wp.load_aot_module(), with a strip_hash=True option that removes unique hashes from module and function names so pre-compiled artifacts can be distributed without shipping Python source.
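To illustrate, here is a minimal sketch of extracting an isosurface with wp.MarchingCubes, assuming the class keeps the constructor and surface() interface of earlier Warp releases (grid dimensions plus vertex and triangle capacities); the signed-distance field, grid size, and threshold below are illustrative rather than taken from the release notes.

```python
import warp as wp

wp.init()

nx, ny, nz = 64, 64, 64

@wp.kernel
def sphere_sdf(field: wp.array3d(dtype=float), radius: float):
    # Fill a 3D grid with the signed distance to a sphere centered in the volume.
    i, j, k = wp.tid()
    p = wp.vec3(float(i), float(j), float(k)) - wp.vec3(32.0, 32.0, 32.0)
    field[i, j, k] = wp.length(p) - radius

field = wp.zeros(shape=(nx, ny, nz), dtype=float, device="cuda")
wp.launch(sphere_sdf, dim=(nx, ny, nz), inputs=[field, 20.0], device="cuda")

# Capacity arguments bound the output buffers; actual counts are produced by surface().
mc = wp.MarchingCubes(nx=nx, ny=ny, nz=nz, max_verts=100_000, max_tris=200_000, device="cuda")
mc.surface(field=field, threshold=0.0)

print(mc.verts.shape, mc.indices.shape)
```

With the 1.9 rewrite, the same pipeline is expected to run on CPU devices as well, since the implementation is now pure Warp code rather than CUDA C++.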
AOT and deployment
- wp.compile_aot_module() and wp.load_aot_module() for flexible ahead-of-time compilation (see the sketch after this list)
- strip_hash=True to remove hashes from names for distribution of pre-compiled modules
- Documentation updates detailing AOT workflows and future expansion plans
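For illustration, the sketch below shows how the new AOT entry points might be used. Only the names wp.compile_aot_module(), wp.load_aot_module(), and the strip_hash=True option come from the release notes; the kernels module, the output_dir/path arguments, and passing a module object directly are assumptions made for this example.

```python
import warp as wp

# kernels.py is a hypothetical user module defining @wp.kernel functions to ship pre-compiled.
import kernels

wp.init()

# Producer side: compile the module ahead of time.
# Assumption: compile_aot_module() accepts the module object, an output directory,
# and strip_hash=True to drop the per-build hashes from module and function names.
wp.compile_aot_module(kernels, output_dir="build/aot", strip_hash=True)

# Consumer side: load the pre-built artifacts without needing the Python kernel source.
# Assumption: load_aot_module() takes the module name and the directory containing the artifacts.
wp.load_aot_module("kernels", path="build/aot")
```

After loading, kernels from the pre-compiled module should be launchable with wp.launch() as usual, which is what makes strip_hash useful for distributing binaries without source.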
Model and language enhancements
- More flexible indexing for composite types (vectors, matrices, quaternions, transforms)
- Direct IntEnum and IntFlag support inside Warp functions and kernels
- In-kernel views that support dynamic shapes and struct types via the ptr attribute
- wp.zeros() enables fixed-size local arrays allocated in registers (see the sketch after this list)
- Three new indexed tile operations for loading, storing, and atomic tile operations with custom index mappings
- Proper support for writing to matrix elements stored inside struct fields
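The snippet below sketches two of these additions together, direct IntEnum use and a register-backed local array created with wp.zeros() inside a kernel. It assumes the in-kernel wp.zeros() call mirrors the host-side signature and that enum members can be assigned directly to int array elements; the smoothing-and-labeling kernel itself is purely illustrative.

```python
import warp as wp
from enum import IntEnum

wp.init()

class Phase(IntEnum):
    SOLID = 0
    FLUID = 1

@wp.kernel
def classify(values: wp.array(dtype=float), labels: wp.array(dtype=int)):
    i = wp.tid()

    # New in 1.9 (assumed signature): fixed-size local array, expected to live in registers.
    window = wp.zeros(3, dtype=float)
    window[0] = values[wp.max(i - 1, 0)]
    window[1] = values[i]
    window[2] = values[wp.min(i + 1, values.shape[0] - 1)]

    avg = (window[0] + window[1] + window[2]) / 3.0

    # New in 1.9: IntEnum members usable directly inside kernels.
    if avg > 0.0:
        labels[i] = Phase.FLUID
    else:
        labels[i] = Phase.SOLID

n = 8
values = wp.array([-1.0, -0.5, 0.2, 0.8, 1.0, -0.1, -0.9, 0.3], dtype=float)
labels = wp.zeros(n, dtype=int)
wp.launch(classify, dim=n, inputs=[values, labels])
print(labels.numpy())
```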
Algorithms and performance
- Fully differentiable Marching Cubes implementation rewritten in Warp
- Iterative solvers (CG, BiCGSTAB, GMRES) now fully compatible with CUDA graph capture; device-side convergence with wp.capture_while() (see the sketch after this list)
- warp.sparse supports arbitrary-sized blocks and tile-based computations with heuristics to choose tiled vs non-tiled execution
- warp.fem.integrate leverages tile-based quadrature accumulation with automatic tile size selection
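As a hedged sketch of the graph-capture workflow, the example below captures a conjugate-gradient solve with wp.ScopedCapture and replays it via wp.capture_launch(). It assumes the warp.optim.linear.cg signature from recent Warp releases and uses a trivial diagonal test system built with warp.sparse; the device-side termination path through wp.capture_while() is not shown because its exact interface is specific to this release.

```python
import warp as wp
from warp.sparse import bsr_diag
from warp.optim.linear import cg

wp.init()

n = 1024
device = "cuda"

# Simple SPD system for illustration: A = diag(2.0), b = ones; solve A x = b.
diag = wp.full(n, 2.0, dtype=wp.float32, device=device)
A = bsr_diag(diag)
b = wp.full(n, 1.0, dtype=wp.float32, device=device)
x = wp.zeros(n, dtype=wp.float32, device=device)

# Warm-up run so all modules are compiled and loaded before capture begins
# (module compilation cannot happen inside a CUDA graph capture).
cg(A, b, x, maxiter=32)

# Capture the whole solve as a CUDA graph, then replay it with minimal launch overhead.
x.zero_()
with wp.ScopedCapture(device=device) as capture:
    cg(A, b, x, maxiter=32)

wp.capture_launch(capture.graph)
print(x.numpy()[:4])  # expect values near 0.5
```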
Stability, testing, and notes
- Early testing on NVIDIA Jetson Thor indicates occasional segmentation faults when CPU kernels are launched; GPU kernels are unaffected. The suggested workaround is to build Warp from source against LLVM/Clang v18 or newer
- Deprecated features from prior releases will be removed in v1.10 (early November)
- The release acknowledges community contributions and points readers to the v1.9.0 section of CHANGELOG.md for a full change list
Compatibility and transition
- CUDA Toolkit 13.0 compatibility is provided via two paths: Warp wheels built for CUDA 12.8 run on CUDA 13.x drivers thanks to backward compatibility, or users can build Warp against CUDA 13.x directly
- PyPI distributions of Warp wheels will continue to be built with CUDA 12.8 during a transition period
- The release notes underscore ongoing plans to expand AOT workflows in future updates
References and sources
- The official release notes and details are available at the NVIDIA Warp GitHub release page: https://github.com/NVIDIA/warp/releases/tag/v1.9.0
Why it matters (impact for developers/enterprises)
For developers, Warp 1.9 lowers the barrier to distributing high-performance kernels by enabling AOT workflows and stripping unique hashes from compiled modules, simplifying packaging and deployment in environments with restricted source access. CUDA 13.0 compatibility reduces upgrade friction for teams adopting newer GPUs and drivers, while maintaining backward compatibility for Warp wheels built under CUDA 12.8. The rewritten Marching Cubes algorithm opens opportunities for differentiable rendering and volume visualization pipelines to run efficiently on both CPU and GPU targets. The extended indexing capabilities, in-kernel array allocations, and new tile-based memory access patterns give developers finer control over memory behavior and performance, particularly for sparse matrices and finite element methods.

The CUDA graph capture improvements, including device-side convergence with wp.capture_while(), enable more scalable execution of iterative solvers such as CG, BiCGSTAB, and GMRES. These advances help enterprises deploying large-scale scientific computing and machine learning workloads capture entire workflows as graphs, reducing runtime overhead and improving reproducibility. The enhanced support for warp.sparse and warp.fem.integrate reflects Warp's continued emphasis on numerical linear algebra and finite element workflows, where adaptive tiling and heuristic-based execution choices can yield meaningful speedups on diverse matrix structures.

From a tooling perspective, the AOT enhancements reflect a broader industry shift toward offline compilation and distribution of pre-built kernels, which can simplify software packaging and improve start-up latency in production environments. Together, CUDA 13.0 readiness, AOT workflows, improved kernel-local memory usage, and robust graph-capture compatibility position Warp as a flexible platform for researchers and engineers optimizing performance across CPU and GPU deployments.
Technical details
| Feature area | Key items | Notes |
|---|---|---|
| Marching Cubes | Rewritten in Warp; fully differentiable; CPU and GPU execution | Replaces previous CUDA C++ implementation; fixes off-by-one bug (#324) |
| AOT tooling | wp.compile_aot_module(), wp.load_aot_module() | strip_hash=True option for distributing pre-compiled modules |
| CUDA toolkit | CUDA 13.0 compatibility; wheels built with CUDA 12.8 during transition | Dual-path deployment strategy for drivers |
| Indexing and types | Flexible indexing for composite types; direct IntEnum/IntFlag support | In-kernel views, dynamic shapes, ptr-based arrays |
| Local arrays | wp.zeros() in kernels | Allocates in registers for speed |
| Tile operations | Three new indexed tile operations | Enhanced memory access patterns beyond contiguous tiles |
| Sparse and FEM | warp.sparse with arbitrary-sized blocks; warp.fem.integrate with tile-based accumulation | Heuristic-based auto-tuning with optional tile_size override |
| Graph capture | Iterative solvers (CG, BiCGSTAB, GMRES) compatible; device-side convergence via wp.capture_while() | Full CUDA graph capture support for these solvers |
| Stability notes | Jetson Thor CPU kernel segfaults observed | Resolution via LLVM/Clang v18+; GPU kernels unaffected |
| Deprecations | Some features slated for removal in v1.10 | Forward-looking roadmap mentioned |
Key takeaways
- Warp 1.9 advances CUDA 13.0 readiness and enables robust AOT module workflows with strip_hash support.
- The Marching Cubes implementation is now a fully Warp-native, differentiable pipeline, portable across CPU and GPU.
- Memory access and kernel programming are more flexible thanks to enhanced indexing, in-kernel arrays, and new tile operations.
- CUDA graph capture is more broadly supported for iterative solvers, with device-side convergence checks available.
- Compatibility and transition plans give users a path to upgrade CUDA drivers while existing CUDA 12.8 wheel builds remain supported.
FAQ
- What is the central improvement of Warp 1.9? A fully Warp-based, differentiable Marching Cubes implementation, CUDA 13.0 compatibility, and new AOT module tooling to support flexible ahead-of-time workflows.
- How do I use AOT in Warp 1.9? Use wp.compile_aot_module() to compile and wp.load_aot_module() to load pre-compiled modules; the new strip_hash=True option enables distributing pre-compiled artifacts without Python source.
- How does CUDA 13.0 compatibility affect deployment? Warp wheels built with CUDA 12.8 can run on CUDA 13.x drivers due to backward compatibility, and there is also a path to build Warp against CUDA 13.x directly.
- What about GPU graph capture and convergence checks? Iterative solvers in warp.optim.linear are now fully compatible with CUDA graph capture, and device-side convergence can be checked using wp.capture_while().
- Are there any known issues with this release? Early testing on NVIDIA Jetson Thor indicated possible segmentation faults when launching CPU kernels; GPU kernel launches remain unaffected. Resolution guidance points to using LLVM/Clang v18+ when building Warp from source.