NVIDIA RAPIDS 25.08 Adds New Profiler for cuML, Polars GPU Engine Enhancements, and Expanded Algorithm Support
Source: https://developer.nvidia.com/blog/nvidia-rapids-25-08-adds-new-profiler-for-cuml-updates-to-the-polars-gpu-engine-additional-algorithm-support-and-more/ (NVIDIA Dev Blog)
TL;DR
- The 25.08 RAPIDS release adds two new profiling options for cuml.accel, a function-level profiler and a line-level profiler, both usable in notebooks and from the CLI to show which operations ran on GPU vs CPU and per-function or per-line timings.
- The Polars GPU engine streaming execution mode is now the default, enabling processing of datasets larger than GPU memory via data partitioning, with in-memory fallback for unsupported operations.
- The Polars GPU engine gains struct data support in columns, expanded string operators, and broader datatype support to accelerate common end-user workflows.
- cuML adds a new Spectral Embedding algorithm for dimensionality reduction and manifold learning, and cuml.accel gains LinearSVC, LinearSVR, and KernelRidge with zero-code-change accelerations; all estimators in the SVM family are now supported.
- CUDA 11 support has been dropped as of 25.08; users needing CUDA 11 can pin RAPIDS to version 25.06.
Context and background
RAPIDS continues its mission to accelerate and scale data science workloads by expanding GPU-accelerated capabilities and reducing friction for developers. The 25.08 release builds on a trajectory that began with profiling enhancements for cuDF/Pandas and extends it to cuML, providing deeper visibility into which operations run on the GPU versus the CPU and how long they take. The Polars GPU engine has evolved since its early experimental streaming mode introduced in 25.06, and 25.08 marks the point at which streaming execution becomes the default, reflecting its maturation and the need to handle datasets larger than VRAM through partitioned execution.
This release also broadens the set of supported datatypes and operators in the Polars GPU engine, notably adding struct-column support and expanding string operators. Together with the new cuML algorithm, these changes target both performance and end-user productivity by reducing the need to move data back to the CPU or to rewrite code for GPU acceleration. Spectral Embedding was added to cuML with an API aligned to the scikit-learn implementation, while cuml.accel now accelerates several additional algorithms with zero code changes, including LinearSVC, LinearSVR, and KernelRidge. With these improvements, developers can build end-to-end GPU-accelerated ML pipelines with broader algorithm coverage and fewer integration steps.
Finally, NVIDIA notes the removal of CUDA 11 support in the 25.08 release, affecting containers, packages, and build-from-source workflows. Users who require CUDA 11 can pin to RAPIDS 25.06 and continue operating in that environment.
What’s new
- New cuml.accel profilers: function-level and line-level profilers, available in notebooks and from the CLI.
- Polars GPU engine streaming executor: default mode, enabling large-scale data processing via partitioning, with in-memory fallback for unsupported operations.
- Expanded Polars datatype and string support: struct column operations accelerated on GPU; broader suite of string operators.
- cuML Spectral Embedding: new dimensionality-reduction algorithm with an API matching scikit-learn’s SpectralEmbedding.
- Zero-code-change accelerations for additional algorithms: LinearSVC, LinearSVR, and KernelRidge added to cuml.accel; all SVM estimators now supported.
- CUDA 11 support dropped: users who still need CUDA 11 can pin RAPIDS to 25.06.
For a deeper dive, see the RAPIDS documentation and release notes.
Why it matters (impact for developers/enterprises)
Profiling visibility directly within cuml.accel helps data scientists and engineers locate performance bottlenecks in their ML workflows, speeding optimization cycles without leaving their preferred environment. By making the Polars GPU engine streaming mode the default, users gain the ability to handle datasets that exceed GPU memory, unlocking more scalable GPU-accelerated data processing with substantial speedups as data grows. The expanded datatype support and new string operators reduce the need for CPU-backed fallbacks, delivering more end-to-end GPU execution for common workflows. The Spectral Embedding addition broadens cuML’s capability in dimensionality reduction and manifold learning, enabling more diverse ML pipelines entirely on the GPU. The zero-code-change accelerations for LinearSVC, LinearSVR, and KernelRidge mean teams can upgrade to RAPIDS 25.08 and immediately benefit from GPU-accelerated performance without rewriting estimators or ML code. Finally, the CUDA 11 deprecation represents a necessary evolution for RAPIDS users, aligning with newer CUDA toolchains and hardware capabilities. Organizations planning long-term GPU acceleration should plan to operate on RAPIDS versions aligned with CUDA 12+ to maintain compatibility and receive ongoing optimizations.
Technical details or Implementation
- Profilers in cuml.accel:
- Function-level profiler: reports all GPU- vs CPU-executed operations within a script or notebook cell, including per-function timing.
- Line-level profiler: reports execution time by code line, enabling fine-grained performance debugging.
- Usage: in notebooks, run %%cuml.accel.profile after cuml.accel is loaded to profile a cell; in scripts, use the --profile flag on the CLI. The line profiler follows the same pattern with %%cuml.accel.line_profile and the --line-profile CLI flag.
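As a sketch of the CLI workflow, the script below is plain scikit-learn with no cuML imports; per the notes above, launching it as `python -m cuml.accel --profile script.py` would report which operations were dispatched to the GPU and their timings (the exact flag spelling is taken from this summary, so verify it against your install):

```python
# fit_kmeans.py -- ordinary scikit-learn code, no cuML imports needed.
# To profile GPU vs CPU dispatch under cuml.accel, run it as:
#   python -m cuml.accel --profile fit_kmeans.py
# (invocation per this release summary; check your installed version)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (4, 2)
```

Run without the cuml.accel launcher, the script simply executes on the CPU, which makes before/after comparisons straightforward.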
- Polars GPU engine streaming executor:
- Default mode as of 25.08, leveraging data partitioning to process datasets larger than VRAM.
- Can fall back to in-memory execution for unsupported operations.
- As of 25.08, streaming supports nearly all operators available for in-memory GPU execution, with smaller datasets incurring only minor overhead when using streaming on a single GPU.
- Polars datatype and operators:
- Struct data in columns is now GPU-accelerated; previous behavior fell back to CPU for struct-based operations.
- Expanded string operators broaden the GPU-accelerated feature set and improve performance for string-heavy workloads.
- cuML advancements:
- Spectral Embedding added, providing a GPU-accelerated option for spectral methods; API aligns with scikit-learn’s SpectralEmbedding.
- cuml.accel now accelerates LinearSVC, LinearSVR, and KernelRidge with zero code changes; when combined, this means all estimators in the SVM family are supported within cuml.accel.
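To illustrate the zero-code-change pattern, the snippet below is unmodified scikit-learn code using two of the estimators named in this release (LinearSVC and SpectralEmbedding). With cuml.accel active, for example by launching the script as `python -m cuml.accel script.py`, the same code is intercepted and GPU-accelerated; the launch incantation is taken from the release summary and may differ by version:

```python
# Plain scikit-learn code; with cuml.accel active it is GPU-accelerated
# with no code changes, per the 25.08 notes.
from sklearn.datasets import make_classification
from sklearn.manifold import SpectralEmbedding
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = LinearSVC().fit(X, y)  # LinearSVC newly covered by cuml.accel in 25.08
print(round(clf.score(X, y), 2))

# Spectral Embedding: cuML's new estimator matches this scikit-learn API.
emb = SpectralEmbedding(n_components=2, random_state=0).fit_transform(X)
print(emb.shape)  # (200, 2)
```

Because the estimator classes and signatures are unchanged, existing pipelines and tests keep working; only the execution backend moves to the GPU.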
- API compatibility and deprecations:
- CUDA 11 support has been dropped in 25.08; for CUDA 11, users should pin to RAPIDS 25.06.
- Documentation and API references are updated to reflect the new capabilities and the API parity with scikit-learn in Spectral Embedding.
Key takeaways
- New profilers provide both function-level and line-level insights for cuml.accel workloads.
- Streaming execution in the Polars GPU engine enables scalable processing beyond GPU memory.
- Struct data and broader string operator support enhance Polars GPU performance.
- cuML expands with Spectral Embedding and zero-code-change accelerations for LinearSVC, LinearSVR, and KernelRidge.
- CUDA 11 support is dropped in this release; plan upgrades accordingly.
FAQ
- What are the new profiling options for cuml.accel?
There are function-level and line-level profilers. The function-level profiler shows GPU vs CPU execution and time per function, while the line-level profiler shows timing per line of code. In notebooks, use %%cuml.accel.profile or %%cuml.accel.line_profile; in scripts, use the --profile or --line-profile CLI flags.
- What does it mean that the Polars GPU engine streaming executor is the default?
It enables processing datasets larger than GPU memory by partitioning data, with a fallback to in-memory execution for unsupported operations; it now covers nearly all operators available for in-memory GPU execution.
- Which new algorithms are accelerated with zero code changes in cuml.accel?
LinearSVC, LinearSVR, and KernelRidge were added, bringing the full SVM family under cuml.accel’s zero-code-change acceleration.
- What should I know about CUDA version support in 25.08?
CUDA 11 support has been dropped; if you need CUDA 11, you can pin to RAPIDS version 25.06.
- Is the Spectral Embedding API the same as scikit-learn's?
Yes, the Spectral Embedding API in cuML matches the scikit-learn SpectralEmbedding implementation.