Modern Data Pipelines Testing Techniques: A Visual Guide
11-02, 15:20–16:00 (America/New_York), Winter Garden (Room 5412)

"Should I just run it in production to see if it works?" is a common starting point for many Python data engineers. Don't let it be your end point. Any software product deteriorates rapidly without disciplined testing, yet testing data pipelines is a hellish experience for new data developers. Unfortunately, some things are only learned on the job, and it is a crude reality that data pipeline testing is one of those fundamental skills that gets glossed over during the training of new data engineers. It can be so much more fun to learn about the latest and greatest Python data processing library. But how can a new Python data engineer transform a patchwork of scripts into a well-engineered data product? Testing is the cornerstone of any iterative development process that aims to reach acceptable confidence in the outputs of a data pipeline. This talk will help with an overview of modern data pipeline testing techniques in a visual and coherent game plan.

Why bother testing data pipelines? Billions of budget dollars regularly rely on the excellence of the data scientists, data engineers, and machine learning engineers behind the countless software data pipelines that inform critical business decisions.


New data devs do not realize that data pipelines have transitive failure modes, invisible to the untrained eye. Only after the first few months of fighting debugging battles against questions like:

"Is this number correct?"
"Why is the dashboard not updated for yesterday's data?"
"How come we didn't catch this bug upstream?"

do they finally realize that there are many types of failures that were not covered in the six-week bootcamp that got them this job. On top of that, new data devs find that data producers and consumers are continuously dealing with business change. Unfortunately, these new data devs are ill-equipped to deal with that change, just as we all were when we started our careers in this data thing.
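One answer to "Is this number correct?" and "How come we didn't catch this bug upstream?" is to assert basic data expectations at pipeline boundaries, so bad data fails loudly where it enters rather than silently in a dashboard. A minimal sketch, assuming a pandas DataFrame; the column names and rules here are hypothetical illustrations, not a prescribed schema:

```python
import pandas as pd

def validate_daily_sales(df: pd.DataFrame) -> None:
    """Fail fast at the pipeline boundary instead of in a downstream dashboard."""
    # Hypothetical expectations, for illustration only
    assert not df.empty, "upstream produced zero rows"
    assert df["order_id"].is_unique, "duplicate order ids"
    assert df["order_date"].notna().all(), "missing order dates"
    assert df["amount"].ge(0).all(), "negative sale amounts"

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": pd.to_datetime(["2024-01-01"] * 3),
    "amount": [10.0, 25.5, 7.25],
})
validate_daily_sales(df)  # passes silently; raises AssertionError on bad data
```

Checks like these are cheap to write and turn vague "the number looks wrong" reports into a precise failure at a known stage.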

"Time for some Shorts/Email/Instagram/Twitter/Youtube/Reddit/Tiktok" is the typical reaction to waiting for a long data pipeline to run to completion. "Why wait and babysit this pipeline when I could be doing something else," said every new data dev ever.

Sadly, this is the state of affairs for any new data dev, and it stems from the lack of modern, humane, and productive techniques for developing data pipelines.

The Bad Data Devs Lifestyle diagram shows how your average data dev spends time working on a data pipeline. This includes coding, waiting for feedback, and deciphering the feedback.

What we want is to maximize fast feedback. Unfortunately, when run end to end, data pipelines usually take "longer-than-average-human-attention-span" durations to finish. This makes it hard for the data dev to use their time productively, leading to lost productivity for the business.
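The usual way out of the long-feedback trap is to test each transformation in isolation against a tiny in-memory fixture, so feedback arrives in seconds instead of hours. A sketch of that idea, assuming the transformation is written as a pure function over a pandas DataFrame (the function and column names are made up for illustration):

```python
import pandas as pd

def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """The unit under test: a pure transformation, kept free of I/O."""
    out = df.copy()
    out["revenue"] = out["units"] * out["unit_price"]
    return out

def test_add_revenue():
    # A three-row fixture runs in milliseconds; no warehouse connection needed.
    fixture = pd.DataFrame({"units": [1, 2, 0], "unit_price": [9.99, 5.0, 3.5]})
    result = add_revenue(fixture)
    assert list(result["revenue"]) == [9.99, 10.0, 0.0]
    # The input fixture must not be mutated by the transformation.
    assert "revenue" not in fixture.columns

test_add_revenue()
```

Keeping I/O (reads from the warehouse, writes to storage) out of the transformation is what makes this kind of fast test possible in the first place.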

In addition, you can bet there will be redundant resource usage from rerunning the same parts of the pipeline that we already know work fine. Burning shared resources to check that the new code the data dev added to the end of the pipeline didn't break anything seems excessive.
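One way to stop burning shared resources on known-good stages is to cache each stage's output keyed by its inputs, so only changed stages actually rerun. A minimal local sketch of that idea; the cache location, function names, and keying scheme are all hypothetical, and real tools handle this far more robustly:

```python
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".pipeline_cache")  # hypothetical local cache location

def cached_stage(name, fn, *inputs):
    """Run a pipeline stage only if its inputs changed since the last run."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(pickle.dumps((name, inputs))).hexdigest()
    path = CACHE_DIR / f"{name}-{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())  # reuse the known-good result
    result = fn(*inputs)
    path.write_bytes(pickle.dumps(result))
    return result

calls = []
def expensive_clean(rows):
    calls.append(1)  # track invocations to show the cache working
    return [r.strip().lower() for r in rows]

first = cached_stage("clean", expensive_clean, [" A ", "B "])
second = cached_stage("clean", expensive_clean, [" A ", "B "])
# expensive_clean ran once; the second call was served from the cache
```

With this in place, appending a new stage to the end of the pipeline only pays for the new stage, not for everything upstream of it.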

Software is a special kind of beast. Only the developers closest to the code know how the system works. So much design is in the hands of developers and so much trust needs to be given to the devs to do the right thing. Culture is almost the only way to get honest estimates and honest work from your developers. Data products are even harder to police. It is one thing to police a login screen, but quite another thing to police a 10-step data pipeline split over three teams.

Giving autonomy to the data devs and equipping them with the best tools the business can afford is a good starting point. This talk provides an overview of modern data pipeline testing techniques in a visual and coherent game plan, helping new Python data devs escape the evil grip of TDD+CI/CD theater.


Prior Knowledge Expected

Previous knowledge expected

Data science platform architect focused on data science productivity, reliability, performance, and cost.

Working on designing and implementing large scale AI products through data collection, analysis, and warehousing.

Passionate about building scalable machine learning pipeline architectures with high business impact.

Aspiring author.