11-02, 10:10–10:50 (America/New_York), Winter Garden (Room 5412)
This talk focuses on applying Test-Driven Development (TDD) principles and practices to data analysis and basic ML models.
We will talk about the benefits of unit testing your Python scripts and ML engines. We will live-code examples of how to build unit tests during data analysis and how to use fixtures for basic validation of our ML engines, and once we have a script covering the process end to end, we will look at options for integration testing. Finally, we will set up a GitHub pipeline that runs our tests in different phases, giving us visibility into the accuracy of our scripts and engines.
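As a rough sketch of the fixture-based validation mentioned above, a pytest fixture can supply a small, hand-written sample of input data that a cleaning step must handle. The `clean_ratings` function, the 0–5 rating scale, and the sample rows here are illustrative assumptions, not material from the talk:

```python
import pytest


def clean_ratings(rows):
    """Drop rows whose rating is missing or outside the assumed 0-5 scale."""
    return [
        r for r in rows
        if r.get("rating") is not None and 0 <= r["rating"] <= 5
    ]


@pytest.fixture
def raw_rows():
    # Tiny hand-written sample standing in for real input data,
    # deliberately including the failure modes we want to catch.
    return [
        {"movie": "Alien", "rating": 4.5},
        {"movie": "Heat", "rating": None},  # missing value
        {"movie": "Big", "rating": 9.0},    # out of range
    ]


def test_clean_ratings_drops_bad_rows(raw_rows):
    cleaned = clean_ratings(raw_rows)
    assert [r["movie"] for r in cleaned] == ["Alien"]
```

Writing the test alongside the exploratory analysis means the expectations discovered while poking at the data (missing values, impossible ranges) are captured as executable checks rather than lost in a notebook.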
In this session we start with raw data and build a movie recommendation engine using TF-IDF and cosine similarity. Once we know how to process our input data and prepare it for the engine, we will create production-ready, tested, and modularized scripts. We will see tools we can use to package and push our modules to AWS, and how to set up GitHub pipelines that package our modules, run our tests, and deploy our scripts where they can be used by AWS Glue jobs. Along the way we will talk about the benefits of unit testing your Python scripts and ML engines. We will live-code examples of how to build unit tests during data analysis, use fixtures for basic validation of our ML engines, and build a pipeline that takes full advantage of the good tests we wrote.
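To give a feel for the TF-IDF plus cosine-similarity approach, here is a minimal pure-Python sketch. The talk presumably uses a library such as scikit-learn; the `tfidf_vectors`, `cosine`, and `recommend` names and the movie data below are illustrative assumptions:

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors (dicts of term -> weight) for each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append(
            {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend(title, titles, descriptions):
    """Rank the other titles by description similarity to `title`."""
    vecs = tfidf_vectors(descriptions)
    i = titles.index(title)
    scores = [
        (cosine(vecs[i], v), t)
        for t, v in zip(titles, vecs)
        if t != title
    ]
    return [t for _, t in sorted(scores, reverse=True)]


titles = ["Alien", "Aliens", "Heat"]
descriptions = [
    "alien creature hunts space crew",
    "space marines fight alien creatures",
    "detective chases bank robbers",
]
print(recommend("Alien", titles, descriptions))  # → ['Aliens', 'Heat']
```

Because the engine reduces to small pure functions, each piece (vectorization, similarity, ranking) can be unit tested in isolation before the end-to-end script is wired together.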
No previous knowledge expected
Liz is a full-stack software engineer with a passion for learning new things. She is passionate about data analysis and machine learning, and even more passionate about ways to improve them through good testing practices and by using existing tools for all the awesome things they provide. Liz has worked as a software consultant for the last couple of years, building complex rules engines in Python and then doing extensive data analysis on their performance and possible optimizations. She has also worked on ETL pipelines for large data migrations.