Spark, Dask, DuckDB, and Polars: Benchmarks
11-03, 13:30–14:10 (America/New_York), Winter Garden (Room 5412)

Spark, Dask, DuckDB, and Polars are popular dataframe tools used at large scale. We benchmark these across a variety of scales (10 GiB to 10 TiB) on both local and cloud architectures with the standard TPC-H benchmark. No project emerges unscathed.


Users have many options for big dataframes today, including Apache Spark, Dask DataFrame, DuckDB, and Polars. These libraries have different architectures, languages, data models, and communities, but they’re all useful in this space. Which should you choose?

The answer to this question is complex and involves application area, familiarity, personal style, thoughts on typing, and so on. But for the purposes of this talk, we’ll consider only performance, and in particular performance on the TPC-H benchmark suite, which focuses on database-style queries with heavy joins and groupbys.
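To make "heavy joins and groupbys" concrete, here is a minimal sketch of the general shape of such a query. It is not one of the official TPC-H queries and not the benchmark code used in the talk: it uses Python's built-in sqlite3 module rather than any of the four libraries, and the two tiny tables and their rows are made up for illustration (only the column names follow the TPC-H schema).

```python
import sqlite3

# Toy stand-ins for two TPC-H tables. The real tables have many more columns
# and far more rows; the rows below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (o_orderkey INTEGER, o_orderpriority TEXT);
    CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL, l_discount REAL);
    INSERT INTO orders VALUES (1, 'HIGH'), (2, 'LOW'), (3, 'HIGH');
    INSERT INTO lineitem VALUES
        (1, 100.0, 0.0), (1, 200.0, 0.5), (2, 50.0, 0.0), (3, 80.0, 0.25);
    """
)

# Join the large fact table to its parent table, then aggregate per group.
query = """
    SELECT o.o_orderpriority AS priority,
           SUM(l.l_extendedprice * (1 - l.l_discount)) AS revenue
    FROM lineitem AS l
    JOIN orders AS o ON l.l_orderkey = o.o_orderkey
    GROUP BY o.o_orderpriority
    ORDER BY priority
"""
rows = conn.execute(query).fetchall()
print(rows)
```

At benchmark scale the same join-then-aggregate pattern runs over tables far too large for a toy like this, which is where the differences between the libraries show up.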

We’ll run this benchmark suite across compute environments ranging from a MacBook Pro to hundreds of cores on the cloud. We’ll also explore data sizes from 10 GiB up to 10 TiB, and see how the different libraries respond at different scales.

No library performs perfectly across this range, and by comparing performance we’ll be able to gain insight into what to use when. We'll dive more deeply into Dask performance in order to learn what generally matters at large scale. Finally, we'll talk about deploying these projects in the cloud with Coiled.


Prior Knowledge Expected

No previous knowledge expected

Matthew is an open source software developer in the numeric Python ecosystem. He maintains several PyData libraries, but today focuses mostly on Dask, a library for scalable computing. Matthew worked for Anaconda Inc for several years, then built out the Dask team at NVIDIA for RAPIDS, and most recently founded Coiled to improve Python's scalability with Dask for large organizations.

Matthew holds a bachelor's degree from UC Berkeley in physics and mathematics, and a PhD in computer science from the University of Chicago.