Data of an Unusual Size: A practical guide to analysis and interactive visualization of massive datasets
11-01, 13:20–14:50 (America/New_York), Radio City (Room 6604)

While most scientists aren't at the scale of the black-hole imaging research teams that analyze petabytes of data every day, you can easily land in a situation where your laptop doesn't have quite enough power for the analysis you need.

In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets through real-world examples on genuinely powerful cloud machines provided by the presenters, from how the data is stored and read to how it is processed and visualized.


"Big data" refers to any data that is too large to handle comfortably with your current tools and infrastructure. As the leading language for data science, Python has many mature options that allow you to work with datasets that are orders of magnitudes larger than what can fit into a typical laptop's memory.

This tutorial will help you understand how large-scale analysis differs from local workflows, the unique challenges associated with scale, and some best practices to work productively with your data.

By the end, you will be able to answer:

  • What makes some data formats more efficient than others at scale?
  • Why, how, and when (and when not) should you leverage parallel and distributed computation, primarily with Dask?
  • How do you manage cloud storage, resources, and costs effectively?
  • How can interactive visualization, primarily with hvPlot, make large and complex data more understandable?

The tutorial focuses on the reasoning, intuition, and best practices around big data workflows, while covering the practical details of Python libraries like Dask and hvPlot that are great at handling large data. It includes plenty of exercises to help you build a foundational understanding.
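As a taste of the first question above, here is a minimal sketch of why columnar formats such as Parquet tend to beat plain CSV at scale. The file and column names are hypothetical placeholders, not the tutorial's actual dataset:

    import pandas as pd

    # Convert a hypothetical CSV file to Parquet once (requires pyarrow or
    # fastparquet to be installed)...
    df = pd.read_csv("trips.csv")       # row-oriented text: every byte must be parsed
    df.to_parquet("trips.parquet")      # columnar, typed, and compressed

    # ...then later analyses can read back only the columns they need; the
    # rest of the file is never touched.
    subset = pd.read_parquet("trips.parquet", columns=["pickup_time", "fare"])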

Workshop outline

Introduction
* Get participants set up with tutorial material
* Tutorial overview

Analyze and visualize a subset of data
* Introduce the dataset
* Read and explore the data, and create a processing workflow with pandas
* Create interactive visualizations with hvPlot and the Bokeh backend (a minimal sketch of this pattern follows this list)
* Exercises that highlight best practices for creating interactive visualizations (participants have some options so that we have a range of different visualizations at the end)
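The pandas-plus-hvPlot pattern covered in this section looks roughly like the sketch below; the file name and column names are placeholders for illustration only:

    import pandas as pd
    import hvplot.pandas  # noqa: registers the .hvplot accessor (Bokeh backend by default)

    # A hypothetical, laptop-sized subset of the dataset.
    df = pd.read_parquet("trips_subset.parquet")

    # One small processing step, then an interactive Bokeh plot with
    # pan, zoom, and hover for free.
    hourly = df.assign(hour=df["pickup_time"].dt.hour).groupby("hour")["fare"].mean()
    hourly.hvplot.line(xlabel="hour of day", ylabel="mean fare")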

Scale to the full dataset
* Introduction to parallel and distributed computing
* Adapt the pandas workflow to use Dask (and Dask Gateway); see the sketch after this section
* Exercises that highlight differences compared to pandas and provide scale-friendly alternatives
* Deep-dive into Dask’s diagnostic dashboard plots
* Exercises to understand performance and memory issues using the diagnostic dashboard
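In spirit, the adaptation looks like the following sketch. It uses a local Dask cluster as a stand-in for the Dask Gateway cluster provided in the tutorial, and the same hypothetical paths and columns as above:

    import dask.dataframe as dd
    from dask.distributed import Client

    # A local cluster stands in for the cloud cluster provisioned through
    # Dask Gateway in the tutorial; its dashboard is the diagnostic UI
    # explored in the exercises.
    client = Client()
    print(client.dashboard_link)

    # The pandas workflow carries over almost unchanged: dd.read_parquet
    # returns a lazy, partitioned DataFrame, and .compute() triggers the
    # actual (parallel) work.
    df = dd.read_parquet("trips_full/*.parquet")  # hypothetical path
    hourly = df.assign(hour=df["pickup_time"].dt.hour).groupby("hour")["fare"].mean()
    result = hourly.compute()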

Visualize full dataset
* Adapt the previous interactive visualization to the full large-scale dataset with Dask & hvPlot (sketched below)
* Examples that share best practices for better performance and readability
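hvPlot understands Dask objects directly. A hedged sketch of the idea, with the same hypothetical path and columns as before:

    import dask.dataframe as dd
    import hvplot.dask  # noqa: registers the .hvplot accessor on Dask objects

    df = dd.read_parquet("trips_full/*.parquet")  # hypothetical path

    # rasterize=True hands aggregation to Datashader, so the full dataset is
    # reduced to an image on the Python side and the browser receives a
    # fixed-size payload instead of millions of individual points.
    df.hvplot.scatter(x="trip_distance", y="fare", rasterize=True)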

Conclusion
* Mention other tools for large-scale analysis: xarray, zarr, intake, and more (see the brief sketch below)
* Brief discussion around machine learning and array-based workflows
* Resources to learn more
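For a flavor of the array side, a minimal sketch of how xarray and zarr pair up; the store path and variable name are made up:

    import xarray as xr

    # Open a hypothetical zarr store (reading from S3 assumes s3fs is
    # installed); the chunks map onto Dask arrays, so the reduction below
    # is lazy and parallel.
    ds = xr.open_zarr("s3://example-bucket/climate.zarr")
    mean_temp = ds["temperature"].mean(dim="time").compute()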

Pre-requisites: We expect attendees to have some familiarity with Python programming in a data science context. If you know how to create and import Python functions and have some experience doing exploratory data analysis with pandas or NumPy, you will be able to follow along comfortably.


Prior Knowledge Expected

Previous knowledge expected (see the pre-requisites above)

Kim Pevey is a Senior Software Engineer at Quansight. She has an extensive background in big data analytics using Dask and delivering those computational insights to clients. She has architected a wide variety of packages to analyze client data, build and automate ML workflows, parallelize existing codebases, and everything in between. She also has a special interest in visualization and building dashboards.

Pavithra Eswaramoorthy is a Developer Advocate at Quansight, where she works to improve the developer experience and community engagement for several open source projects in the PyData community. Currently, she maintains the Bokeh visualization library and contributes to the Nebari (adjacent to the Jupyter community) and conda-store (part of the conda ecosystem) projects. Pavithra has been involved in the open source community for over 5 years, notably as a maintainer of the Dask library and an administrator for Wikimedia's OSS programs. In her spare time, she enjoys a good book and hot coffee. :)

Dharhas Pothina is the CTO at Quansight, where he helps clients wrangle their data using the PyData stack. He leads the development teams for the Nebari, conda-store, and Ragna open source projects.
His background includes expertise in computational modeling, big data/high-performance computing, visualization, and geospatial analysis. Prior to his current position, he worked for 15 years in state and federal research labs, where he led large multi-disciplinary, multi-agency research projects.

He holds a PhD in Civil Engineering and an MS in Aerospace Engineering from the University of Texas at Austin and a BTech in Aerospace Engineering from the Indian Institute of Technology Madras.

Dharhas is passionate about equipping scientists and engineers with tools that let them scale and share their analyses. He loves woodworking, photography, and teaching his daughters to love science.
