Scikit-learn on GPUs with Array API
11-03, 11:45–12:25 (America/New_York), Winter Garden (Room 5412)

Scikit-learn was originally designed to run machine learning algorithms on CPUs using NumPy arrays. As of version 1.3, scikit-learn can run a subset of its functionality with other array libraries, such as CuPy and PyTorch, using the Array API standard. By adopting the standard, scikit-learn can operate on accelerators such as GPUs. In this talk, we learn about scikit-learn's API for enabling the feature and the benefits and challenges of adopting the Array API standard.


For most of its development, scikit-learn could only run machine learning algorithms on CPUs using NumPy arrays. In version 1.3, scikit-learn expanded its experimental support for accelerators such as GPUs by operating on CuPy arrays or PyTorch tensors. This was made possible by the Consortium for Python Data API Standards, which released the Array API standard. The standard defines a Python-level specification for array libraries such as NumPy, CuPy, and PyTorch.

In this talk, we explore scikit-learn's user-facing API for enabling Array API support, which allows computation to run with CuPy arrays and PyTorch tensors. We then compare benchmarks to see how performance differs between array libraries. Next, we dive into the Array API standard and the strategies scikit-learn used to write code compliant with it. In particular, scikit-learn uses the array_api_compat compatibility library, which is critical for Array API adoption. Finally, we learn about the challenges of migrating existing NumPy code to comply with the standard and the actions taken to overcome them. By the end of this talk, you will know how to enable Array API support in scikit-learn and how to use the Array API standard in your own code.
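
To give a flavor of what writing code against the standard can look like, here is a minimal sketch, not scikit-learn's actual implementation. The helpers get_namespace and standardize are hypothetical names introduced only for illustration. The sketch assumes the input array either exposes the standard's __array_namespace__ method or is a plain NumPy array; the array_api_compat library mentioned above exists to cover libraries that do not expose that method themselves.

```python
import numpy as np


def get_namespace(x):
    """Hypothetical helper: return the array namespace that `x` belongs to.

    Arrays that implement the Array API standard expose
    `__array_namespace__`; older NumPy arrays do not, so fall back to NumPy.
    """
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np  # assumption: anything else is treated as a NumPy array


def standardize(x):
    """Center and scale each column using only functions from the standard."""
    xp = get_namespace(x)
    mean = xp.mean(x, axis=0)
    std = xp.std(x, axis=0)
    # Arithmetic operators are part of the standard, so no NumPy-specific
    # calls are needed here.
    return (x - mean) / std


X = np.asarray([[1.0, 2.0], [3.0, 4.0]])
X_scaled = standardize(X)
```

Because standardize only calls functions through xp, the same function body could in principle run on any array object whose namespace follows the standard. In scikit-learn itself, the feature is switched on through the array_api_dispatch configuration option, for example with sklearn.config_context(array_api_dispatch=True).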


Prior Knowledge Expected

No previous knowledge expected

Thomas J. Fan is a maintainer for scikit-learn, an open-source machine learning library for Python. He led the development of scikit-learn's set_output API, which allows transformers to return pandas DataFrames. Previously, Thomas worked at Columbia University to improve interoperability between scikit-learn and AutoML systems. He is a maintainer for skorch, a neural network library that wraps PyTorch. Thomas holds a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University.