Patrick Hoefler PyData NYC 2023

Patrick Hoefler

Patrick Hoefler is a member of the pandas core team and a Dask maintainer. He currently works at Coiled, where he focuses on Dask development and on integrating a logical query planning layer into Dask. He holds an MSc in Mathematics and is working towards an MSc in Software Engineering at the University of Oxford.

Sessions

Learn how to parallelize or distribute your Python code. We will start by parallelizing it on your local laptop before we move on to a distributed cluster.

This tutorial shows how to use Dask, a popular open source framework for parallel computing, to scale Python code. We start with parallelizing simple for loops, and move on to scaling out pandas code.

Along the way we will learn about concepts like partitioning data and why a good chunksize matters, parallel performance tracking and general metrics that are built into Dask, and managing exceptions and debugging on remote machines.
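To make the point about partitioning and chunk size concrete, here is a minimal sketch. It is an assumption-laden illustration, not material from the tutorial: it uses only Python's standard library (`concurrent.futures`) as a stand-in for Dask, and the helper names `partition`, `chunk_sum_of_squares`, and `parallel_sum_of_squares` are hypothetical. It shows how the chunk size determines the number of tasks that have to be scheduled for the same amount of work.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, chunksize):
    """Split a list into consecutive chunks of at most `chunksize` items."""
    return [values[i:i + chunksize] for i in range(0, len(values), chunksize)]


def chunk_sum_of_squares(chunk):
    """Work done per partition: one task handles a whole chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, chunksize, max_workers=4):
    """Run one task per partition, then combine the partial results."""
    chunks = partition(values, chunksize)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(chunk_sum_of_squares, chunks))
    # Return the number of tasks too, so the effect of chunksize is visible.
    return sum(partials), len(chunks)


data = list(range(1_000))

# Small chunks: many tiny tasks, so scheduling overhead dominates.
total_small, tasks_small = parallel_sum_of_squares(data, chunksize=10)

# Larger chunks: few tasks, each doing a meaningful amount of work.
total_large, tasks_large = parallel_sum_of_squares(data, chunksize=250)

print(tasks_small, tasks_large)
```

Both calls compute the same total; only the task count differs. In a real parallel framework the trade-off runs in both directions: chunks that are too small spend most of the time on per-task overhead, while chunks that are too large leave workers idle or exhaust memory.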

This will be a hands-on tutorial with Jupyter notebooks.

Central Park West (Room 6501)

11-03

10:15

40min

The Arrow revolution in pandas and Dask

Patrick Hoefler

The pandas library for data manipulation and data analysis is the most widely used open source data science software library. Dask is the natural extension for scaling pandas workloads to more than a single machine. The continuing integration and adoption of Apache Arrow alleviates long-standing bottlenecks in both libraries.

Winter Garden (Room 5412)
