08:00
08:00
60min
Breakfast & Registration
Central Park West (Room 6501)
08:00
60min
Breakfast & Registration
Radio City (Room 6604)
08:00
60min
Breakfast & Registration
Winter Garden (Room 5412)
08:00
60min
Breakfast & Registration
Music Box (Room 5411)
09:00
09:00
90min
From Jupyter Notebooks to websites with Quarto
Mine Çetinkaya-Rundel

Quarto is an innovative, open-source scientific and technical publishing system compatible with Jupyter Notebooks and other popular mediums. Quarto provides data scientists with a seamless way to publish their work in a high-quality format that is easily accessible and shareable. In this workshop, we will demonstrate how Quarto enables data scientists to turn their Jupyter products into a variety of high-quality documents and combine them into well-organized, striking websites.

Central Park West (Room 6501)
09:00
90min
Half hour of labeling power: Can we beat GPT?
Ines Montani, Ryan Wesslen

Large Language Models (LLMs) offer a lot of value for modern NLP and can typically achieve surprisingly good accuracy on predictive NLP tasks with a reasonably structured prompt and pretty much no labelled examples. But can we do even better than that? It’s much more effective to use LLMs to create classifiers, instead of using them as classifiers. By using LLMs to assist with annotation, we can quickly create labelled data and systems that are much faster and much more accurate than using LLM prompts alone. In this workshop, we'll show you how to use LLMs at development time to create high-quality datasets and train specific, smaller, private and more accurate fine-tuned models for your business problems.

Winter Garden (Room 5412)
09:00
90min
Shipping Like a Spy: Enabling continuous deployment with stealth, security, and smooth moves in mind
Erin Mikail Staples

We’re moving faster than ever before, with approximately 328.77 million terabytes of data created each day. For those managing these databases, a secure and sustainable process for building and releasing elements around the core database is more important than ever. In this tutorial, we’ll cover not only the importance and best practices of thinking about this from an architectural perspective, but also how to get our hands dirty building feature flags and a process for enabling continuous iteration and deployment with stealth, security, and smooth moves in mind.

Music Box (Room 5411)
09:00
90min
Simple Simulators with pandas and Generator Coroutines
James Powell

In this tutorial, we will review some of the parts of the Python programming language we don't use every day… but should! Using the motivating example of a portfolio construction backtester, we will build the rough outline of a library that allows users to design strategies to be automatically executed and evaluated for constructing a portfolio.
We will design this tool such that:
- it has moderate performance for low-frequency strategies (e.g., no intraday trading, no real-time, more mutual fund than hedge fund)
- it supports high generality in strategy construction
- it supports “foreign” data (i.e., it minimizes assumptions about pandas.Series or pandas.DataFrame representing market information or trades)
- it is reasonably easy for a pandas user to construct a strategy without knowing too many esoteric details of Python or pandas (except, of course, the rules of index alignment)
Along the way, we will look to answer the following questions:
- what are generators, generator coroutines, and decorators, and where do they actually show up in analytical code?
- why are generators and generator coroutines so well-suited to the design of simulators, backtesters, model training, &c.?
- how do we write libraries that accept and return pandas.Series that do not lose generality?
- why is pandas often considered a “tail-end” analytical tool, and how might we solve the problem of writing libraries that may grow a pandas.Series?
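
The generator-coroutine pattern at the heart of this tutorial can be sketched in a few lines. The names and the toy trading rule below are ours for illustration, not the tutorial's actual library:

```python
def strategy():
    """A strategy as a generator coroutine: it receives each step's price
    via .send() and yields a trade decision in response."""
    trade = 0
    while True:
        price = yield trade                 # pause; resume with the next price
        trade = 1 if price < 100 else -1    # toy rule: buy below 100, else sell

def backtest(make_strategy, prices):
    """Drive a strategy coroutine over a price series, collecting decisions."""
    coro = make_strategy()
    next(coro)                              # prime the coroutine to its first yield
    return [coro.send(p) for p in prices]

trades = backtest(strategy, [95, 105, 98])  # → [1, -1, 1]
```

The simulator owns the loop while each strategy keeps its own local state between steps, which is what makes coroutines such a natural fit for backtesters.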

Radio City (Room 6604)
10:30
10:30
20min
Break
Central Park West (Room 6501)
10:30
20min
Break
Radio City (Room 6604)
10:30
20min
Break
Winter Garden (Room 5412)
10:30
20min
Break
Music Box (Room 5411)
10:50
10:50
90min
Covalent - A New Paradigm for High Compute Cloud Workloads in the Age of LLM and Generative AI
Santosh Kumar Radha, Ara Ghukasyan

As the technological landscape evolves from being data-centric to compute-intensive, the challenges in resource allocation, scalability, cost, and workflow complexity have become more pronounced. Traditional cloud tools often fall short in efficiently managing resources like GPUs, and the transition from local to cloud-based environments often involves cumbersome code changes and configurations. Additionally, the high costs and complex workflows associated with compute-intensive tasks are exacerbated by the scarcity and high demand for specialized computing resources. Covalent emerges as a Pythonic framework that addresses these multifaceted challenges. It simplifies the development of compute-intensive products, making them feel like a direct extension of one's local laptop rather than a complex cloud architectural exercise. Moreover, Covalent aids in cost reduction by efficiently managing and allocating resources, thereby optimizing the overall operational expenses. This tutorial will explore how Covalent is uniquely positioned to meet the computational and operational demands of a broad range of high-compute developments, including but not limited to Large Language Models and Generative AI, offering a more efficient and streamlined approach to cloud-based tasks.

Radio City (Room 6604)
10:50
90min
Dask tutorial
Patrick Hoefler

Learn how to parallelize or distribute your Python code. We will start by parallelizing it on your local laptop before we move on to a distributed cluster.

This tutorial shows how to use Dask, a popular open source framework for parallel computing, to scale Python code. We start with parallelizing simple for loops, and move on to scaling out pandas code.

Along the way we will learn about concepts like partitioning data and why a good chunksize matters, parallel performance tracking and general metrics that are built into Dask, and managing exceptions and debugging on remote machines.

This will be a hands-on tutorial with Jupyter notebooks.
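
The loop-to-parallel move the tutorial starts from can be previewed with the standard library alone; Dask's `delayed` and `dask.dataframe` generalize this same idea with lazy task graphs and distributed schedulers. This stdlib sketch (not Dask code) shows the core pattern:

```python
from concurrent.futures import ThreadPoolExecutor

def costly(x):
    # Stand-in for an expensive computation whose iterations are independent
    return x * x

# Sequential version: a plain for loop
results_seq = [costly(x) for x in range(8)]

# Parallel version: map the same function across a pool of workers
with ThreadPoolExecutor(max_workers=4) as pool:
    results_par = list(pool.map(costly, range(8)))

assert results_seq == results_par   # same answers, computed concurrently
```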

Central Park West (Room 6501)
10:50
90min
Deep Learning with PyTorch for the Analysis of Time Series Neural Data
Manuel Illanes

Navigate the intersection of neuroscience and artificial intelligence as we explore the application of PyTorch-based deep learning models to time series neural data. This illuminating talk will equip researchers, data scientists, and machine learning engineers with the tools and techniques needed to tackle the unique challenges presented by neural data. From preprocessing to model architecture selection, gain actionable insights and hands-on experience that will elevate your data analysis capabilities. Don't miss this opportunity to advance your knowledge in one of the most promising fields of interdisciplinary research.

Music Box (Room 5411)
10:50
90min
Ibis: A fast, flexible, and portable tool for data analytics
Gil Forsyth, Phillip Cloud

Ibis provides a common dataframe-like interface to many popular databases and analytics tools (BigQuery, Snowflake, Spark, DuckDB, …). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL. No more painful rewrites of pandas code into something else when you run into performance issues; write your code once using Ibis and run it on any supported backend. In this tutorial users will get experience writing queries using Ibis on a number of local and remote database engines.

Winter Garden (Room 5412)
12:20
12:20
60min
Lunch
Central Park West (Room 6501)
12:20
60min
Lunch
Radio City (Room 6604)
12:20
60min
Lunch
Winter Garden (Room 5412)
12:20
60min
Lunch
Music Box (Room 5411)
13:20
13:20
90min
Build Simpler Production ML Systems using Feature/Training/Inference Pipelines
Fabio Buso

There is a wide array of tools available to simplify the process for data scientists to package their models and deploy them in production, ranging from serverless functions to Docker containers. However, deploying models in production remains a challenge, particularly when it comes to data access.
Real-time ML systems typically require low-latency access to precomputed features containing history or context data. The code used to create those features should be consistent with the code used to create features during model training. Similarly, batch ML systems should use the same logic to compute features for training and batch inference.
The FTI (Feature, Training, Inference) pipeline architecture is a unified pattern for building batch and real-time ML systems. It enables the independent development and operation of feature pipelines (that transform raw data into features/labels), training pipelines (that take features/labels as input and produce models as output), and inference pipelines (that take model(s) and features as input and produce predictions as output). The pipelines have clear inputs and outputs, and can even be implemented using different technologies (e.g., Spark for feature pipelines, and Python for training and inference pipelines).
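
A minimal sketch of the FTI contract described above (hypothetical names and a deliberately trivial "model"; real pipelines would use a feature store and a proper training framework):

```python
def feature_pipeline(raw_rows):
    """Raw data in, features/labels out."""
    return [({"spend": row["spend"] / 100}, row["churned"]) for row in raw_rows]

def training_pipeline(examples):
    """Features/labels in, model out (here just a threshold on spend)."""
    churned = [f["spend"] for f, label in examples if label]
    return {"threshold": min(churned)}

def inference_pipeline(model, features):
    """Model + features in, predictions out."""
    return [f["spend"] >= model["threshold"] for f in features]

raw = [{"spend": 120, "churned": 1}, {"spend": 300, "churned": 0}]
examples = feature_pipeline(raw)
model = training_pipeline(examples)
preds = inference_pipeline(model, [f for f, _ in examples])  # → [True, True]
```

Because each stage's inputs and outputs are explicit, the feature pipeline could run in Spark while training and inference stay in Python, as described above.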

Music Box (Room 5411)
13:20
90min
Building an Expert Question/Answer Bot with Open Source Tools and LLMs
Chris Hoge

In this workshop, we'll explore how LangChain, Chroma, Gradio, and Label Studio can be employed as tools for continuous improvement, specifically in building a Question-Answering (QA) system trained to answer questions about an open-source project using domain-specific knowledge from GitHub documentation.

Winter Garden (Room 5412)
13:20
90min
Data of an Unusual Size: A practical guide to analysis and interactive visualization of massive datasets
Kim Pevey, Pavithra Eswaramoorthy, Dharhas Pothina

While most scientists aren't at the scale of black hole imaging research teams that analyze Petabytes of data every day, you can easily fall into a situation where your laptop doesn't have quite enough power to do the analytics you need.

In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples on actual powerful machines on a cloud provided by the presenter – starting from how the data is stored and read, to how it is processed and visualized.

Radio City (Room 6604)
13:20
90min
Plotting with Matplotlib: Telling Static, Animated, & Interactive Stories
Hannah Aizenman

In this tutorial we will explore the Palmer Penguins dataset, starting with exploratory charts, then creating a linked animation, and concluding with converting the animation into a simple interactive visualization. In doing so, this tutorial will unpack some of the fundamental concepts that underlie the architecture of Matplotlib, hopefully providing attendees with the foundation for creating effective visualizations using Matplotlib.

Central Park West (Room 6501)
14:50
14:50
20min
Break
Central Park West (Room 6501)
14:50
20min
Break
Radio City (Room 6604)
14:50
20min
Break
Winter Garden (Room 5412)
14:50
20min
Break
Music Box (Room 5411)
15:10
15:10
90min
Adding Your Own Data Apps to JupyterLab
Daniel Goldfarb

In this practical talk about how to extend JupyterLab, we focus on understanding the underlying extension support infrastructure. As we walk through a step-by-step example of creating an app in JupyterLab, we'll learn, among other things, how to launch that app from different places within JupyterLab, how to style our app, and how to pass parameters to our app to modify its behavior. This talk is for anyone who finds themselves doing complex or repetitive tasks and thinks that they, and others, may benefit from integrating those tasks into JupyterLab.

Winter Garden (Room 5412)
15:10
90min
Building an Interactive Network Graph to Understand Communities
Lucas Durand

People are hard to understand, developers doubly so! In this tutorial, we will explore how communities form in organizations to develop a better solution than "The Org Chart". We will walk through using a few key Python libraries in the space, develop a toolkit for Clustering Attributed Graphs (more on that later) and build out an extensible interactive dashboard application that promises to take your legacy HR reporting structure to the next level.

Central Park West (Room 6501)
15:10
90min
Empowering Data Exploration: Creating Interactive, Animated Reports in Streamlit with Vizzu
Peter Vidos, Zachary Blackwood

Data scientists strive to bridge the gap between raw data and actionable insights. Yet, the actual value of data lies in its accessibility to non-data experts who can unlock its potential independently. Join us in this hands-on tutorial hosted by experts from Vizzu and Streamlit to discover how to transform data analysis into a dynamic, interactive experience.

Streamlit, celebrated for its user-friendly data app development platform, has recently integrated with Vizzu's ipyvizzu - an innovative open-source data visualization tool that emphasizes animation and storytelling. This collaboration empowers you to craft and share interactive, animated reports and dashboards that transcend traditional static presentations.

To maximize our learning time, please come prepared by following the setup steps listed at the end of the tutorial description, allowing us to focus solely on skill-building and progress.

Radio City (Room 6604)
15:10
90min
No frontend, no notebook rewriting, no callbacks: Turning notebooks into web apps with Mercury
Piotr Płoński, Aleksandra Płońska

In this tutorial, you'll discover how to leverage the MERCURY framework to effortlessly transform your computed notebooks, such as those created in Jupyter Notebook, into interactive web apps, insightful reports, and dynamic dashboards. With no need for frontend expertise, you can effectively communicate your work to non-technical team members. Dive into the world of Python and enhance your notebooks to make your insights accessible and engaging.

Music Box (Room 5411)
08:00
08:00
50min
Breakfast & Registration
Central Park West (Room 6501)
08:00
50min
Breakfast & Registration
Radio City (Room 6604)
08:00
50min
Breakfast & Registration
Winter Garden (Room 5412)
08:00
50min
Breakfast & Registration
Music Box (Room 5411)
08:00
50min
Breakfast & Registration
Central Park East (Room 6501a)
08:50
08:50
10min
Welcome and Housekeeping Notes
Central Park West (Room 6501)
09:00
09:00
10min
Opening Remarks: Martha Norrick
Martha Norrick

Opening remarks by Martha Norrick, the Chief Analytics Officer of New York City.

Central Park West (Room 6501)
09:10
09:10
35min
Keynote: AI and the stuff built for AI -- are they actually useful for data science?
Soumith Chintala

Keynote by Soumith Chintala

Central Park West (Room 6501)
09:45
09:45
25min
Break
Central Park West (Room 6501)
09:45
25min
Break
Radio City (Room 6604)
09:45
25min
Break
Winter Garden (Room 5412)
09:45
25min
Break
Music Box (Room 5411)
09:45
25min
Break
Central Park East (Room 6501a)
10:10
10:10
40min
Fireside Chat with Soumith Chintala
Soumith Chintala, Erin Mikail Staples

Have questions after a keynote? Excited to learn more from one of our keynote speakers? Fireside chats are a great opportunity to meet our speakers in a more intimate setting.

Central Park East (Room 6501a)
10:10
40min
Harnessing Test-Driven Development and CI/CD for Smarter Data Analysis
Liz Johnson

This talk will focus on applying Test-Driven Development principles and practices to data analysis and basic ML models.

We will talk about the benefits of unit testing your Python scripts and ML engines. We will live-code examples of how to build unit tests during data analysis and how to use fixtures to perform basic validation on our ML engines; once we have a script covering the process end-to-end, we will look at options for integration testing. Finally, we will set up a GitHub pipeline that runs our tests in different phases to give us visibility into the accuracy of our scripts and engines.
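
As a hypothetical flavour of this approach (our own toy example, not the talk's code), a data-cleaning step can be specified by its tests first:

```python
def clean_ages(rows):
    """Drop records with missing or impossible ages."""
    return [r for r in rows if r.get("age") is not None and 0 <= r["age"] <= 120]

def test_drops_missing_ages():
    assert clean_ages([{"age": None}, {"age": 34}]) == [{"age": 34}]

def test_drops_out_of_range_ages():
    assert clean_ages([{"age": -1}, {"age": 200}, {"age": 65}]) == [{"age": 65}]

# pytest would collect and run these automatically; calling them directly
# here shows they are plain functions with plain assertions.
test_drops_missing_ages()
test_drops_out_of_range_ages()
```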

Winter Garden (Room 5412)
10:10
40min
How to Empower Utility Vegetation Management: A Blend of AI and LiDAR Data
Kewen Gu, Anjani Prasad Atluri

The great majority (more than two-thirds) of power outages are caused by vegetation making contact with an active power line, with the added risks of fire and threats to public safety. The goal for utility companies is to manage the vegetation near their above-ground infrastructure in order to reduce this kind of contact during inclement weather. This session will discuss an AI-driven vegetation management solution that uses LiDAR and satellite imagery to determine the highest-risk areas of a utility's service territory. Our approach gives insights across a service territory, by line, down to the foot level, showing how close a branch may be to a power line. This work will enable utilities to shift from cycle-based to condition-based trimming. We will go over the data used, the various technical challenges, and our approach to scaling the solution.

Radio City (Room 6604)
10:10
40min
Innovation in the Age of Regulation: Federated Learning with Flower
Krishi Sharma

In the world of machine learning, more data and diverse data sets usually lead to better training, particularly with human-centered products such as self-driving cars, IoT devices, and medical applications. However, privacy and ethical concerns can make it difficult to effectively leverage many different datasets, particularly in medical and legal services. How can a data scientist or machine learning engineer leverage multiple data sources to train a model without centralizing the data in one place? How can one benefit from multiple datasets without the hassle of breaching data privacy and security?

Central Park West (Room 6501)
10:10
40min
Ten Years of Community Organizing
Alexander CS Hendorf

As a community organizer, I have had the privilege of being a part of the Python community for the past ten years.

In that time, I have seen the community grow and evolve in countless ways. Python evolved from a top-10 language to the number-one language. Many new people joined the community, and new topics such as data & AI became part of Python. I'm super proud that Python is now the preferred language to learn.

I have also learned a great deal about what it takes to be an effective organizer and how to build and sustain a healthy community. I have also experienced how not to do it, and what it takes to pull through in hard times.

I learned a lot about leadership. I learned a lot about myself, my strengths and my weaknesses.
This has also helped me grow professionally and has had a very positive impact on how I work and lead in my day job as partner in a data and AI consultancy.

Music Box (Room 5411)
10:55
10:55
40min
Comics in NumPy? More Likely than You Think!
Mars Lee

Reading documentation: what if you could flip through a comic book in your hands, instead of scrolling online? What if you read a story instead of a bullet-point list? What if comics were documentation...?

Is that possible? It sure is! In this talk, we'll dive into the NumPy Contributor Comics, which were completed for Google Season of Docs 2023 (GSOD). Let's get more comics in open source!

Music Box (Room 5411)
10:55
40min
Open Source Computational Economics: The State of the Art
Sebastian Benthall

Economics research has widespread policy and industrial applications. It is rapidly changing due to open source tools. Deep learning techniques have widened the range of models that it is feasible to solve, simulate, and estimate. This talk highlights recent contributions to computational economics and their Python packages. It is aimed at quantitative analysts and economists working in finance and public policy.

Winter Garden (Room 5412)
10:55
40min
Sciris: Simplifying scientific software in Python
Cliff Kerr

Sciris aims to streamline the development of scientific software by making it easier to perform common tasks. Sciris provides classes and functions that simplify access to core libraries of the scientific Python ecosystem (such as NumPy), as well as low-level libraries (such as pickle). Some of Sciris' key features include: ensuring consistent list/array types; allowing ordered dictionaries to be accessed by index; and simplifying parallelization, datetime arithmetic, and the saving and loading of complex objects. With Sciris, users can achieve the same functionality with fewer lines of code, reducing the need to copy-paste recipes from Stack Overflow or follow dubious advice from ChatGPT.

Central Park West (Room 6501)
10:55
40min
The Billion-Request Content Recommendation System Challenge
Elaine Liu, Akriti Dhasmana, Sinclair Target

For online publications and media sites, recommendation systems play an important role in engaging readers. At Chartbeat, we are actively developing a recommendation system that caters to billions of daily pageviews across thousands of global websites. While conventional discussions frequently highlight the data science and machine learning facets of the system, the cornerstone of a successful application is its system architecture. In this presentation, we will dissect our architectural decisions designed to meet high-performance requirements and share insights gleaned from our journey in scaling up the recommendation system.

Radio City (Room 6604)
11:40
11:40
40min
Faster SQL with pandas and Apache Arrow
Will Ayd

pandas has long been known for its robust I/O offerings, which include commonly used functions for interacting with SQL databases. With the new Apache ADBC drivers, users can expect even better performance and data type management.

Winter Garden (Room 5412)
11:40
40min
Lightweight, low-overhead, high-performance: machine learning directly in C++
Ryan Curtin

How big should a machine learning deployment be? What is a reasonable size for a microservice container that performs logistic regression? Although Python is the generally used ecosystem for data science work, it doesn't provide satisfactory answers to the size question. Size matters: if deploying to the edge or to low-resource devices, there's a strict upper bound on how large the deployment can be; if deploying to the cloud, size (and compute overhead) correspond directly to cost. In this talk, I will show how typical data science pipelines can be rewritten directly in C++ in a straightforward way, and that this can provide both performance improvements as well as massive size improvements, with total size of deployments sometimes in the single-digit MBs.

Central Park West (Room 6501)
11:40
40min
Polars: DataFrames in the multi-core era
Ritchie Vink

Polars is an OLAP query engine that focuses on the DataFrame use case. Machines have changed a lot in the last decade, and Polars is a query engine written from scratch in Rust to benefit from modern hardware.

Effective parallelism and cache-efficient data structures and algorithms are ingrained in its design. Thanks to those efforts, Polars is among the fastest single-node OSS query engines out there. Another goal of Polars is rethinking the way DataFrames should be interacted with. Polars comes with a very declarative and versatile API that enables users to write readable code. This talk will focus on how Polars can be used and what you gain from using it idiomatically.

Radio City (Room 6604)
11:40
40min
Using Generative AI and Foundation Models to Predict Above Ground Biomass for Nature Based Carbon Sequestration
Harini Srinivasan, Gurkanwar Singh

A key challenge in training AI models is the lack of labeled or ground-truth data. This is especially true in the remote sensing field, where seasonal changes and differences between label characteristics make it difficult to create a common labeled dataset. With the emergence of self-supervised learning, the requirements on the amount and quality of labeled data can be relaxed, but model performance is still of paramount importance. This is especially true for quantifying the sources and sinks of the greenhouse gases that drive climate change. In this talk, we present how state-of-the-art AI technologies such as generative AI and foundation models can be used to estimate changes in Above Ground Biomass (AGBD) due to the extraction of CO2 from the atmosphere by vegetation. We demonstrate how these tools can be used by companies with a NetZero pledge to quantify, monitor, validate, and report their offsetting methodologies and sustainability practices.

Music Box (Room 5411)
12:20
12:20
65min
Lunch
Central Park West (Room 6501)
12:20
65min
Lunch
Radio City (Room 6604)
12:20
20min
Grab food and bring to Winter Garden
Winter Garden (Room 5412)
12:20
65min
Lunch
Music Box (Room 5411)
12:40
12:40
90min
Pub Quiz
James Powell

Join us for the traditional PyData NYC Pub Quiz during lunch on Thursday. Grab some food and bring it to Winter Garden.

Winter Garden (Room 5412)
13:25
13:25
40min
Lightning Talks
Erin Mikail Staples

Lightning Talks will be in Central Park East/West after lunch. Sign up for a talk Thursday morning at the NumFOCUS table on the 5th floor.

Central Park West (Room 6501)
14:10
14:10
25min
Break
Central Park West (Room 6501)
14:10
25min
Break
Radio City (Room 6604)
14:10
25min
Break
Winter Garden (Room 5412)
14:10
25min
Break
Music Box (Room 5411)
14:35
14:35
40min
How to Use Python and Mathematical Modeling to Better Understand the Impact of Electricity Pricing on Consumption
Saba Nejad

Electricity is unique in that its storage is prohibitively costly; this makes it essential for supply to meet demand at every second of every day. In times of crisis, when demand exceeds its projected amount, system operators need to fall back on different methods to lower demand. One such method is dynamic pricing, which incentivizes customers to lower their consumption by increasing electricity prices during those times. There are trials that test this pricing model against a flat-rate pricing model. This talk uses mathematical and statistical techniques applied in Python to see the impact of dynamic pricing on consumption.
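
As a purely illustrative example of the kind of modeling involved (made-up numbers, not the talk's data), a constant-elasticity demand model lets you estimate how a peak-price event changes consumption:

```python
def demand_after_price_change(base_kwh, old_price, new_price, elasticity):
    """Constant-elasticity model: demand scales by (p_new / p_old) ** elasticity."""
    return base_kwh * (new_price / old_price) ** elasticity

# Price triples during a peak event; assume a short-run elasticity of -0.1
peak_kwh = demand_after_price_change(1000.0, 0.20, 0.60, -0.1)
# Consumption drops by roughly 10%, to about 896 kWh
```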

Winter Garden (Room 5412)
14:35
40min
Python in Excel
Keyur Patel, Timothy Hewitt

Unleash your data analytics superpowers with Python in Excel! Come learn from the Microsoft Excel product team as we share how you can do more with Python in Excel!

Music Box (Room 5411)
14:35
40min
Simplifying Data Analysis with GitHub Codespaces, Jupyter Notebooks & Open AI
Nitya Narasimhan

As web developers, we create large quantities of data (e.g., from test reports and analytics) that require insights for iterative optimization. But how does a JavaScript developer begin their journey into data science? Enter GitHub Codespaces, GitHub Copilot and Open AI. In this talk, I'll share my journey into creating a consistent development and runtime environment with GitHub Codespaces and Jupyter Notebooks, then activating it with Open AI to support an interactive "learn by exploring" process that helped me (as a JavaScript developer) skill up on Python and data analysis techniques in actionable ways. I'll walk through a couple of motivating use cases, and demonstrate some projects related to competitive programming and data visualization to showcase these insights in action.

Central Park West (Room 6501)
14:35
40min
The Causal Toolbox: A Practical Guide to Causality in Python (For The Perplexed)
Aleksander Molak

With an average of 3.2 new papers published on arXiv every day in 2022, causal inference has exploded in popularity, attracting a large amount of talent and interest from top researchers and institutions, including industry giants like Amazon and Microsoft.

There’s a very good reason for this upsurge in popularity. In our contemporary data culture, we have grown accustomed to thinking that traditional machine learning methods can answer any interesting business or scientific question.

This view turns out to be incorrect. Many interesting business and scientific questions are causal in nature, and traditional machine learning methods are not suitable to address them.

In this talk, aimed at data scientists and machine learning engineers with at least 3 years of experience, we’ll show why this is the case, introduce the fundamental tools for causal thinking, and show how to translate them into code.

We’ll discuss a popular use case of churn prevention and demonstrate why only causal models should be used to solve it.

All in Python, repo included!
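
To see why "causal in nature" matters, here is a classic Simpson's-paradox illustration with made-up churn numbers (ours, not the speaker's): a retention offer helps within every risk segment yet looks harmful when the segments are pooled, so only reasoning about the confounder (risk) gives the right answer:

```python
# (churned, total) counts per risk segment, with and without the offer
data = {
    "high_risk": {"offer": (24, 100), "no_offer": (3, 10)},
    "low_risk":  {"offer": (0, 10),   "no_offer": (5, 100)},
}

def rate(churned, total):
    return churned / total

# Within each segment, the offer lowers churn...
assert rate(*data["high_risk"]["offer"]) < rate(*data["high_risk"]["no_offer"])  # 0.24 < 0.30
assert rate(*data["low_risk"]["offer"]) < rate(*data["low_risk"]["no_offer"])    # 0.00 < 0.05

# ...yet pooled naively, it appears to raise churn, because high-risk
# customers were far more likely to receive the offer (confounding).
overall_offer = rate(24 + 0, 100 + 10)      # ≈ 0.218
overall_no_offer = rate(3 + 5, 10 + 100)    # ≈ 0.073
assert overall_offer > overall_no_offer
```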

Radio City (Room 6604)
15:20
15:20
40min
Deciphering Sales Drivers at PepsiCo: Exploring Bayesian and Frequentist Approaches to Media Mix Modeling
Suri Chen

In this talk, we will explore the world of Media Mix Modeling (MMM) and examine the application of Bayesian and Frequentist regression methods to derive valuable marketing insights. MMM plays a pivotal role in marketing strategy, enabling businesses to measure marketing effectiveness and optimize budget allocation across various channels. The presentation will offer an overview of both Bayesian and Frequentist approaches, comparing their principles and applications within the MMM context. We will present a real-world case, showcasing how these two models were used to deconstruct sales driven by several factors and generate actionable insights. Finally, we will share reflections and recommendations for marketers and data scientists regarding further exploration and adoption of different techniques in MMM, empowering businesses to make informed decisions and effectively allocate their marketing resources.

Radio City (Room 6604)
15:20
40min
How to Become a Successful Tech YouTuber?
Dhaval Patel

Numerous tech professionals are transitioning to full-time YouTube careers, signifying the platform’s viability beyond a side gig. In this talk, you will get a glimpse of YouTube's impact on data science education and learn a five-step framework for becoming a successful tech (or data science) YouTuber.

Music Box (Room 5411)
15:20
40min
Modern Data Pipelines Testing Techniques: A Visual Guide
Moussa Taifi

"Should I just run it in production to see if it works!" is a common starting point for many python data engineers. Don't let it be your end point. Any software product deteriorates rapidly without disciplined testing. However, testing data pipelines is a hellish experience for new data developers. Unfortunately, there are somethings that are only learnt on the job. It is a crude reality that data pipeline testing is one of those fundamental skills that gets glossed over during the training of new data engineers. It can be so much more fun to learn about the latest and greatest python data processing library. But how can a new python data engineer transform a patchwork of scripts, into a well engineered data product? Testing is the cornerstone of any iterative development to reach acceptable confidence in the outputs of any data pipeline. This talk will help with an overview of modern data pipelines testing techniques in a visual and coherent game plan.

Why bother testing data pipelines? Billions of budget dollars regularly rely on the excellence of the data scientists, data engineers, and machine learning engineers behind the countless software data pipelines that inform critical business decisions.

Winter Garden (Room 5412)
15:20
40min
Turning your Data/AI algorithms into full web applications in no time with Taipy
Florian Jacta

Numerous packages exist within the Python open-source ecosystem for algorithm building and data visualization. However, a significant challenge persists: over 85% of data science pilots fail to transition to the production stage.

This talk introduces Taipy, an open-source Python library for building web applications’ front-end and back-end. It enables Data Scientists and Python Developers to create production-ready applications for end-users.

Its simple syntax facilitates the creation of interactive, customizable, and multi-page dashboards without having to know about HTML or CSS. Taipy generates highly interactive interfaces, including charts and all sorts of widely used controls.

Additionally, Taipy is engineered to construct robust and tailored data-driven back-end applications. It models dataflows and orchestrates pipelines. Each pipeline execution is referred to as a scenario. Scenarios are stored, recorded, and actionable, enabling what-if analysis or KPI comparison.

Taipy now also provides a cloud platform to host, deploy, and share your Taipy applications easily. In addition, the platform can manage, store, and maintain the various states of your back end.

Central Park West (Room 6501)
16:05
16:05
40min
Basics of cloud computing for data scientists
Gastón Barbero

As data scientists, we are often unaware of what's happening under the hood when our models are deployed in the corporate cloud environment. If I had to deploy a model to a cloud provider from scratch, what options do I have? This talk introduces basic cloud computing concepts and services, so that next time you will be confident evaluating the different alternatives for taking your models to the cloud.

Winter Garden (Room 5412)
16:05
40min
Build Simple and Scalable Apps with Shiny
Tracy Teal, Gordon Shotwell

This talk explores the intuitive algorithm behind Shiny for Python and shows how it allows you to scale your apps from prototype to product. Shiny infers a reactive computation graph from your application and uses this graph to efficiently re-render components. This eliminates the need for data caching, state management, or callback functions, which lets you build scalable applications quickly.
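The core idea the abstract describes, re-rendering only the outputs whose inputs changed, can be illustrated with a toy dependency tracker. This is a deliberately simplified sketch of the concept, not Shiny's actual API or implementation:

```python
# A toy illustration of a reactive computation graph: an output is
# recomputed only when an input it depends on has changed.
class Reactive:
    def __init__(self):
        self.values = {}    # input name -> current value
        self.outputs = {}   # output name -> (dependency names, function)
        self.cache = {}
        self.dirty = set()

    def set(self, name, value):
        self.values[name] = value
        # Invalidate only the outputs that depend on this input.
        for out, (deps, _) in self.outputs.items():
            if name in deps:
                self.dirty.add(out)

    def output(self, name, deps, fn):
        self.outputs[name] = (deps, fn)
        self.dirty.add(name)

    def get(self, name):
        if name in self.dirty:
            deps, fn = self.outputs[name]
            self.cache[name] = fn(*(self.values[d] for d in deps))
            self.dirty.discard(name)
        return self.cache[name]

app = Reactive()
app.set("n", 5)
app.output("square", ("n",), lambda n: n * n)
assert app.get("square") == 25  # computed on first access
app.set("unrelated", 1)         # no dependency: "square" stays cached
app.set("n", 3)                 # invalidates "square"
assert app.get("square") == 9   # recomputed from the new input
```

Shiny infers this graph automatically from your app code, which is what removes the need for hand-written caching and callbacks.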

Central Park West (Room 6501)
16:05
40min
Taming the toxic python environment on your laptop
Dharhas Pothina

Have you experienced the frustration of installing a new Python package, only to discover that your existing code and Jupyter notebooks no longer function as expected? Have you found yourself uttering the phrase, "It works on my computer," to a colleague? Have you ever looked at Randall Munroe's five-year-old XKCD comic on Python environments (https://xkcd.com/1987/) and felt torn between laughter and despair?

Well, there's hope on the horizon. In this presentation, we will delve into an open-source tool that enables the creation and management of a set of stable, reproducible, and version-controlled environments right on your laptop, all through an intuitive graphical user interface (GUI).

Music Box (Room 5411)
16:05
40min
The dangers of storytelling with feature importance
Roni Kobrosly

It's common for machine learning practitioners to train a supervised learning model, generate feature importance metrics, and then attempt to use these values to tell a data story suggesting what interventions should be taken to drive the outcome variable in a favorable direction (e.g. "X was an important feature in our churn prediction model, so we should consider doing more X to reduce churn"). This simply does not work, and the idea that standard feature importance measures can be interpreted causally is one of data science's more enduring myths. In this session we'll talk through why this isn't the case, what feature importance is actually good for, and we'll give a brief overview of a simple causal feature importance approach: Meta Learners. This talk should be relevant to machine learning practitioners of any skill level who want to gain actionable, causal insights from their predictive models.

Radio City (Room 6604)
17:00
17:00
120min
PyData NYC Attendee Social Event at Printer's Alley

Everyone is invited to join us at Printer’s Alley (215 W 40 St, New York, NY 10018) on Thursday after the conference, 5-7pm. You will find the party area upstairs at the venue. Please bring your conference pass to claim your free drink tickets.

Central Park West (Room 6501)
08:00
08:00
60min
Breakfast & Registration
Central Park West (Room 6501)
08:00
60min
Breakfast & Registration
Radio City (Room 6604)
08:00
60min
Breakfast & Registration
Winter Garden (Room 5412)
08:00
60min
Breakfast & Registration
Music Box (Room 5411)
08:00
60min
Breakfast & Registration
Central Park East (Room 6501a)
09:00
09:00
45min
Keynote: Scientific Discovery: From the Lab Bench to the GPU
Michelle Gill

Keynote by Michelle Gill

Central Park West (Room 6501)
09:45
09:45
30min
Break
Central Park West (Room 6501)
09:45
30min
Break
Radio City (Room 6604)
09:45
30min
Break
Winter Garden (Room 5412)
09:45
30min
Break
Music Box (Room 5411)
09:45
30min
Break
Central Park East (Room 6501a)
10:15
10:15
40min
Adventures in not writing tests
Andy Fundinger

Developing reliable code without writing tests may be a far-off dream, but Hypothesis' ghostwriter function will generate tests from type hints. The resulting tests are powerful and often appropriate for data analysis. In this talk, I'll discuss how to add tests to your data analysis code that cover a wide range of inputs -- all while writing just a small amount of code yourself.
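To give a flavor of the idea, here is a toy sketch of "ghostwriting" a test: read a function's type hints, generate many inputs, and check a property. Hypothesis' real ghostwriter (`hypothesis.extra.ghostwriter` and the `hypothesis write` CLI) is far more capable; the function under test and the generator below are hypothetical illustrations only:

```python
# Toy sketch of deriving a test from type hints, illustrating the idea
# behind Hypothesis' ghostwriter (not its actual implementation).
import random
import typing

def slug(text: str) -> str:
    """Hypothetical function under test: strip, lowercase, spaces to dashes."""
    return text.strip().lower().replace(" ", "-")

def random_value(tp):
    """Produce a random example for a small set of supported types."""
    if tp is str:
        return "".join(random.choice("Ab c-") for _ in range(random.randint(0, 8)))
    if tp is int:
        return random.randint(-1000, 1000)
    raise NotImplementedError(tp)

def ghostwritten_test(func, prop, runs=200):
    """Read the function's type hints, generate inputs, check a property."""
    hints = typing.get_type_hints(func)
    hints.pop("return", None)
    for _ in range(runs):
        args = [random_value(tp) for tp in hints.values()]
        assert prop(func(*args))

# Property: slugifying is idempotent -- applying it twice changes nothing.
ghostwritten_test(slug, lambda out: out == slug(out))
```

The real library generates far richer inputs (edge cases, shrinking on failure) and writes the test source for you, but the shape is the same: wide input coverage from very little hand-written code.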

Music Box (Room 5411)
10:15
40min
Architecting Data: A Deep Dive Into the world of Synthetic Data
Ramon Perez

Finding good datasets to build data products with, or good web assets to build websites with, can be time-consuming. For instance, data professionals might require data from heavily regulated industries like healthcare and finance, and, in contrast, software developers might want to skip the tedious task of collecting images, text, and videos for a website. Luckily, both scenarios can now benefit from the same solution: synthetic data.

Synthetic Data is artificially generated data created with machine learning models, algorithms, and simulations, and this workshop is designed to show you how to enter that synthetic world by teaching you how to create a full-stack tech product with five interrelated projects. These projects include reproducible data pipelines, a dashboard, machine learning models, a web interface, and a documentation site. So, if you want to enhance your data projects or find great assets to build a website with, come and spend 3 fun and knowledge-rich hours in this workshop.

Central Park West (Room 6501)
10:15
40min
Building Robust Reactive ML Systems In A Multiplayer Setting With Metaflow
Valay Dave

Building ML systems in business settings isn't just about algorithms; it's also about navigating unpredictable data, fostering efficient collaboration, and promoting constant experimentation. This talk delves into Metaflow's solutions to these challenges, spotlighting its event-driven patterns that streamline model training while leveraging infrastructure shared by multiple data scientists and engineers. Through a hands-on demonstration, we'll showcase how Metaflow facilitates asynchronous training upon new data arrivals for fine-tuning a state-of-the-art LLM like LLaMAv2 (although the framework is general and widely applicable). Additionally, we'll highlight how multiple developers can seamlessly experiment with various models as new data becomes available.

Radio City (Room 6604)
10:15
40min
Fireside Chat with Michelle Gill
Michelle Gill, Erin Mikail Staples

Have questions after a keynote? Excited to learn more from one of our keynote speakers? Fireside chats are a great opportunity to meet our speakers in a more intimate setting.

Central Park East (Room 6501a)
10:15
40min
The Arrow revolution in pandas and Dask
Patrick Hoefler

The pandas library for data manipulation and data analysis is the most widely used open source data science software library. Dask is the natural extension for scaling pandas workloads to more than a single machine. The continuing integration and adoption of Apache Arrow accelerates historical bottlenecks in both libraries.

Winter Garden (Room 5412)
10:55
10:55
200min
Code Duels

Code Duels returns to PyData NYC for the second year in a row! Show up and check it out! This year, we have a friendly competition with three game modes.

Central Park East (Room 6501a)
11:00
11:00
40min
Bad data - anecdotes and examples from the real world
Avishek Panigrahi

Many data science initiatives fail because of the unavailability of good data.

In this talk, I go over examples of bad data I’ve encountered in real-life projects. I present hypotheses about the causes of bad data; tools, techniques, and patterns to debug and correct it; and simple workflows for validation, verification, and automation to identify bad data using commonly available tools from the PyData and Python ecosystem. Finally, I’ll go over some prescriptive techniques for approaching data projects with non-ideal data to improve the odds of success.
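A validation workflow of the kind the abstract mentions can be as simple as a function that scans rows and reports problems. This is a hedged sketch; the field names and rules are hypothetical, not the speaker's examples:

```python
# A lightweight "bad data" check: scan rows, report obvious problems.
# Fields and rules below are hypothetical illustrations.
def validate(rows, required=("id", "amount")):
    """Return a list of (row_index, problem) pairs for obviously bad rows."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
        if isinstance(row.get("amount"), (int, float)) and row["amount"] < 0:
            problems.append((i, "negative amount"))
        if row.get("id") in seen_ids:
            problems.append((i, "duplicate id"))
        seen_ids.add(row.get("id"))
    return problems

bad = validate([
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": -5},    # duplicate id and negative amount
    {"id": 2, "amount": None},  # missing amount
])
```

Running checks like this at the boundary of every ingestion step surfaces bad data early, before it silently corrupts downstream analyses.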

Radio City (Room 6604)
11:00
40min
Forecasting With Classical and Machine Learning Methods Using SKTime
Jonathan Bechtel

Traditional time series models such as ARIMA and exponential smoothing have typically been used to forecast time series data, but machine learning methods have set new benchmarks for accuracy in high-profile forecasting competitions such as M4 and M5.

However, machine learning models can easily produce inferior results under common conditions. This talk discusses how each family of methods can be used to model time series data and demonstrates how SKTime provides a unified framework for implementing both.
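The classical side of this comparison starts from baselines as simple as the seasonal-naive forecast: repeat the last observed season. A pure-Python sketch (sktime offers this and much more, e.g. via its `NaiveForecaster`; the sales figures below are made up):

```python
# Seasonal-naive forecast: predict each future point as the value from
# the same position in the most recent season. Data is hypothetical.
def seasonal_naive(history, horizon, season_length):
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]

# Two full weekly cycles of hypothetical daily sales:
sales = [10, 12, 14, 13, 15, 20, 22,
         11, 12, 15, 13, 16, 21, 23]
forecast = seasonal_naive(sales, horizon=7, season_length=7)
# The forecast simply repeats the most recent week.
```

Baselines like this are a useful yardstick: an ML model that cannot beat seasonal-naive on held-out data is one of the "inferior results under common conditions" the abstract warns about.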

Music Box (Room 5411)
11:00
40min
Self-Service Analytics using LLMs
Aaditya Bhat

Explore the intricacies of designing, implementing, and maintaining a production-ready LLM-based self-serve analytics platform. Learn about common pitfalls, essential design considerations, performance evaluation, and robust security measures like protection against prompt injection attacks.

Central Park West (Room 6501)
11:00
40min
The Beauty and the Beast: Python on GPUs
Andy Terrel

Python promises productivity and GPUs promise performance, but if you ever try to fire up a program on a GPU, you will find that it is often slower than on a CPU. Over the last decade, the Python ecosystem has embraced GPUs in numerous libraries and techniques. We survey what works with GPUs and some of the libraries that one can use to accelerate the Python workflow on a GPU.

Winter Garden (Room 5412)
11:45
11:45
40min
Artificial Rhythms: The Merger of Machine Learning, Music and Programming
Ramon Perez

From music composition with code to style transfer and music generation with machine learning (ML), the intersection of music, programming, and ML is opening up exciting avenues for innovation and creativity. This session will take you through a whirlwind tour of the latest techniques and applications that use code and ML to compose music, generate novel rhythms, and evolve the way we approach audio production. If you want to get started writing code and algorithms that create or enhance music, this session will give you the right tools to begin your journey and continue learning.

Radio City (Room 6604)
11:45
40min
Machine Learning in your Data Warehouse using Python
Megan Lieu

Moving data in and out of a warehouse is both tedious and time-consuming. In this talk, we will demonstrate a new approach using the Snowpark Python library. Snowpark for Python is a new interface for Snowflake warehouses with Pythonic access that enables querying DataFrames without having to use SQL strings, using open-source packages, and running your model without moving your data out of the warehouse. We will discuss the framework and showcase how data scientists can design and train a model end-to-end, upload it to a warehouse and append new predictions using notebooks.

Music Box (Room 5411)
11:45
40min
Scikit-learn on GPUs with Array API
Thomas J. Fan

Scikit-learn was traditionally built and designed to run machine learning algorithms on CPUs using NumPy arrays. In version 1.3, scikit-learn can now run a subset of its functionality with other array libraries, such as CuPy and PyTorch, using the Array API standard. By adopting the standard, scikit-learn can now operate on accelerators such as GPUs. In this talk, we learn about scikit-learn's API for enabling the feature and the benefits and challenges of adopting the Array API standard.

Winter Garden (Room 5412)
11:45
40min
Using Open Source LLM in ETL
Mei Chen

This session will provide a case study of using Llama2-70b to tackle a data transformation friction point in reinsurance underwriting; the final approach is industry-agnostic. We will walk through our framework for breaking a business problem down into LLM-able chunks, lay out the solutions we explored and the best-performing method, compare local versus at-scale inference, and explain how we evaluated the unstructured LLM responses to prevent hallucination and ambiguity when extracting structured output.

Central Park West (Room 6501)
12:30
12:30
60min
Lunch
Central Park West (Room 6501)
12:30
60min
Lunch
Radio City (Room 6604)
12:30
60min
Lunch
Winter Garden (Room 5412)
12:30
60min
Lunch
Music Box (Room 5411)
13:30
13:30
40min
Crafting Reliable Rules with Robust Control
Michael Zargham

This talk explores how we create smarter, more reliable economic policies using a technique called “Robust Control”. Robust control comes from control systems engineering and focuses on driving desired outcomes even when there is significant uncertainty about the circumstances and behaviors of the system these policies endeavor to regulate. The methodology assumes we have an incomplete structural model of a system: parts of the system have well-defined, enforceable rules, whereas others are models of phenomena outside our control, such as user behavior. This talk reviews the application of a Robust Control-informed workflow to select the parameters of a pricing algorithm. Code and data will be shared from the design and pre-launch tuning work, and we will also use data to demonstrate how the deployed system did and did not match our models. The goal of this talk is to demonstrate how practices from control engineering can be applied in data science, especially where algorithms make decisions intended to influence human behavior at the user-population level.

Radio City (Room 6604)
13:30
40min
From RAGs to riches: Build an AI document interrogation app in 30 mins
Dharhas Pothina, Pavithra Eswaramoorthy

As we descend from the peak of the hype cycle around Large Language Models (LLMs), chat-based document interrogation systems have emerged as a high value practical use case. The ability to ask natural language questions and get relevant answers from a large corpus of documents has the potential to fundamentally transform organizations and make institutional knowledge accessible.

Retrieval-augmented generation (RAG) is a technique to make foundational LLMs more powerful and accurate, and a leading way to implement a personal or company-level chat-based document interrogation system. In this talk, we’ll understand RAG by creating a personal chat application. We’ll use a new OSS project called Ragna that provides a friendly Python and REST API, designed for this particular use case. We’ll also demonstrate a web application, built with Panel, a powerful OSS Python application development framework, that leverages the REST API.

By the end of this talk, you will have an understanding of the fundamental components that form a RAG model as well as exposure to open source tools that can help you or your organization explore and build on your own applications.

Central Park West (Room 6501)
13:30
40min
Solving the problems in front of you
Randy Au

We're always told that early optimization is bad, not just for our code but our designs and processes. Yet very few of us are able to find that magical balance between getting things done and doing more background research and design. We're always lured by the siren's call of getting things "right" on the first try. This talk is about the process of finding the balance between doing the work and thinking about the work.

Music Box (Room 5411)
13:30
40min
Spark, Dask, DuckDB, and Polars: Benchmarks
Matthew Rocklin

Spark, Dask, DuckDB, and Polars are popular dataframe tools used at large scale. We benchmark these across a variety of scales (10 GiB to 10 TiB) on both local and cloud architectures with the standard TPC-H benchmark. No project emerges unscathed.

Winter Garden (Room 5412)
14:15
14:15
40min
Customer Lifetime Value Prediction with PyMC Marketing
Hajime Takeda

The Customer Lifetime Value (CLV) model is one of the major techniques of customer analytics, helping companies identify which customers are valuable. A high CLV indicates customers who deserve more marketing resources. If a company overlooks CLV, it might over-invest in short-term customers who buy just once.

To predict future CLV, we encounter sub-problems like forecasting the time until a customer's next purchase (“Buy Till You Die” modeling) and the probability of a customer's churn (Survival Analysis).

Recently, PyMC-Marketing was released, and it's becoming more feasible to implement these models with a Bayesian approach.

In this talk, I will show the key concepts of CLV prediction, a demonstration using PyMC-Marketing, and practical tips.
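For intuition, the simplest deterministic CLV formula assumes a constant margin m, retention rate r, and discount rate d, giving CLV = m · r / (1 + d − r). This textbook sketch is far cruder than the probabilistic BTYD and survival models the talk covers; the customer figures are hypothetical:

```python
# Simplest deterministic CLV, assuming constant annual margin, retention
# rate, and discount rate: CLV = m * r / (1 + d - r). Intuition only --
# PyMC-Marketing's Bayesian models are probabilistic and far richer.
def simple_clv(margin, retention, discount):
    return margin * retention / (1 + discount - retention)

# Hypothetical customer: $100 annual margin, 80% retention, 10% discount.
value = simple_clv(100.0, 0.80, 0.10)  # about 266.67
```

Even this crude version shows the lever that matters: raising retention from 80% to 90% roughly doubles the customer's lifetime value, which is why churn modeling is central to CLV work.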

Winter Garden (Room 5412)
14:15
40min
Fireside Chat with J.J. Allaire
JJ Allaire

Have questions after a keynote? Excited to learn more from one of our keynote speakers? Fireside chats are a great opportunity to meet our speakers in a more intimate setting.

Central Park East (Room 6501a)
14:15
40min
Leveraging Generative AI for Enhanced E-commerce
Kriti Kohli

We take a deep dive into the world of generative AI models and their transformative applications in e-commerce, specifically focusing on the implementation of these models for tasks such as text generation, dynamic and personalized content creation and AI assistants.

Central Park West (Room 6501)
14:15
40min
Low(er) Code ML Pipelines with Conduit
Piero Ferrante, Viren Bajaj

Conduit is a Python package and terminal user interface (TUI) designed to streamline the process of building, deploying, and managing machine learning (ML) pipelines on GCP’s Vertex AI platform. Its primary goal is to improve efficiency for data science teams by reducing the time it takes to deploy ML models while observing MLOps best practices.

Music Box (Room 5411)
14:15
40min
Transformative Power: Synthetic Image Generation in Rare Disease Diagnosis
Wenqi (Summer) Zhai, Theophilus Ijiebor, Andrea Dobrindt

Our talk aims to uncover the impact of synthetic image data in disease diagnosis. Delve into the domain of generative AI in medical imaging, discover the potential of synthetic data to revolutionize rare disease diagnosis, and explore the ethical and practical considerations surrounding this groundbreaking technology.

Radio City (Room 6604)
15:00
15:00
30min
Break
Central Park West (Room 6501)
15:00
30min
Break
Radio City (Room 6604)
15:00
30min
Break
Winter Garden (Room 5412)
15:00
30min
Break
Music Box (Room 5411)
15:30
15:30
40min
Keynote: Dashboards with Jupyter and Quarto
JJ Allaire

Keynote by JJ Allaire

Central Park West (Room 6501)