Bad data - anecdotes and examples from the real world
11-03, 11:00–11:40 (America/New_York), Radio City (Room 6604)

Many data science initiatives fail because good data is not available.

In this talk, I go over examples of bad data I’ve encountered in real-life projects. I present hypotheses about why bad data arises; tools, techniques, and patterns to “debug and correct” it; and simple workflows for validation, verification, and automation that identify bad data using commonly available tools from the PyData and Python ecosystem. Finally, I go over prescriptive techniques for approaching data projects with non-ideal data to improve the odds of success.


Many data projects start with a simple hypothesis - “We have all this data. Can we use it to do X?” - where X is a goal that increases revenue or reduces costs. The success of these initiatives hinges on the availability of high-quality data, which is often missing because of historical reasons and/or legacy data sources.

In this presentation, I delve into the real-world challenges posed by bad data, offering concrete examples that illustrate just how annoying and complex it can be (a short code sketch follows the list):

  • Large Excel repositories with years of historical data
  • VARCHAR gotchas in databases
  • Joins across tables when there are "slight" mismatches
  • Problems with incorrect categorical data
  • And everyone's favorite - datetimes
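To make the flavor concrete: here is a minimal sketch, with made-up table and column names, of two of the gotchas above - a join whose keys “almost” match, and datetimes stored in mixed formats.

    import pandas as pd

    # Made-up data: customer keys that "almost" match, dates in mixed formats.
    orders = pd.DataFrame({
        "customer": ["Acme Corp", "acme corp ", "Widget Co"],
        "ordered_on": ["2023-01-05", "01/06/2023", "Jan 7, 2023"],
    })
    accounts = pd.DataFrame({
        "customer": ["Acme Corp", "Widget Co"],
        "region": ["East", "West"],
    })

    # A naive join silently drops the trailing-space/casing variant.
    merged = orders.merge(accounts, on="customer", how="left")
    print(merged["region"].isna().sum())  # 1 row failed to match

    # Normalize the keys before joining, parse the dates leniently
    # (format="mixed" needs pandas >= 2.0), and fail loudly on leftovers.
    for df in (orders, accounts):
        df["customer"] = df["customer"].str.strip().str.title()
    orders["ordered_on"] = pd.to_datetime(
        orders["ordered_on"], format="mixed", errors="coerce")
    assert orders["ordered_on"].notna().all(), "unparseable dates remain"

The same normalize-then-assert pattern carries over to miscoded categorical columns and to VARCHAR keys pulled out of a database.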

I explore the underlying causes of poor data and some hypotheses about why it is so prevalent. I go over simple tools, techniques, and patterns for identifying and correcting data quality issues, and introduce simple workflows for validation, verification, and automation that pinpoint and mitigate bad data using readily accessible tools from the PyData and Python ecosystems (a short sketch follows the list). Examples:
  • pandas, SQLite, Datasette
  • Fuzzy string matching with thefuzz
  • read_excel patterns
  • The power of assert statements
  • Makefiles for workflows
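As a taste of how these fit together, here is a minimal sketch - the file, sheet, and column names are hypothetical - combining read_excel with explicit dtypes, fuzzy repair of category labels via thefuzz, and assert statements as lightweight validation:

    import pandas as pd
    from thefuzz import process

    # Hypothetical legacy workbook; force str dtype so ZIP codes like
    # "02139" keep their leading zeros, and map common junk to NaN.
    df = pd.read_excel("legacy_report.xlsx", sheet_name="2019",
                       dtype={"zip": str}, na_values=["N/A", "-"])

    # Map free-text department names onto a known-good category list.
    canonical = ["Engineering", "Marketing", "Finance"]

    def best_match(name, min_score=85):
        match, score = process.extractOne(str(name), canonical)
        return match if score >= min_score else None

    df["department"] = df["department"].map(best_match)

    # Cheap, loud checks: fail fast instead of letting bad rows flow
    # downstream.
    assert df["department"].notna().all(), "unmapped department names"
    assert df["zip"].str.len().eq(5).all(), "malformed ZIP codes"

Run as a script, for example from a Makefile target on every data refresh, the asserts turn silent data drift into a loud, immediate failure.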

Finally, I go over prescriptive strategies for navigating data projects when confronted with less-than-ideal conditions - both technical and organizational.

This is a beginner/intermediate talk for domain experts trying to make better use of data in their organizations. Basic knowledge of the PyData ecosystem is helpful but not required. The audience will find some comfort in knowing "they're not alone" and will leave with techniques that can help them overcome these challenges.


Prior Knowledge Expected

No previous knowledge expected

Avishek Panigrahi is the founder and CEO of Logarithm Labs, a Y Combinator-backed company that helps companies accelerate their data efforts.

Prior to Logarithm Labs, Avishek worked at Google on analytics and infrastructure as part of the TPU (Tensor Processing Unit) team. Before that, he worked at Xilinx, Samsung and MIPS on various aspects of hardware design and automation.

He enjoys skiing, camping, and low-frequency quantitative trading.