The flashy machine learning models get all the attention, but they won’t get you anywhere without a good data set. Even a sophisticated model will produce bad output if you give it bad input. So in their Tech Talk, data engineers Frances Dreyer and Andre Marques make the case for clean data sets. There are easy ways to give your data a spring cleaning, and there are hard ways to do it. In this blog post, Frances and Andre walk you through the basics.

That simple principle, bad input leads to bad output, can help everyone understand the importance of good data. But in the real world, we’ve noticed that a lot of scientists don’t pay much attention to it. They usually focus more on the model they’re feeding with the data set. And those who do want to check their data sets, but don’t know where to start, won’t find much help online. We found some articles that described parts of the process, but none that discussed the entire process in detail. So we thought it was high time to create a place that explains how to get a good data set in just a few steps.
We talk about those steps in detail in our Tech Talk, and this blog post gives you a good overview. And you’ll see: it’s really quite simple.

1. Visualize what’s missing

This is by far the easiest step if you use a notebook, such as a Jupyter notebook or Google Colab. You just run a short piece of code, and bam: you literally see all the holes in the data set. You can also play around in the notebook to create other visualizations, like heat maps.
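As a rough illustration of this step, here is a minimal sketch using pandas. The data frame, its column names and its values are made up for this example; they are not from the Tech Talk. It builds a true/false mask of the missing values, counts them per column, and prints a crude text picture of the holes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: vet visits for four cats, with some gaps (NaN).
visits = pd.DataFrame({
    "name": ["Tom", "Misty", "Felix", "Luna"],
    "sick": [False, True, False, True],
    "weight_kg": [4.2, np.nan, 5.1, np.nan],
    "age_years": [3.0, 7.0, np.nan, 2.0],
})

# Boolean mask: True wherever a value is missing.
missing_mask = visits.isna()

# Number and share of missing values per column.
missing_count = missing_mask.sum()
missing_share = missing_mask.mean()

# Crude text picture of the holes: X = missing, . = present.
picture = pd.DataFrame(
    np.where(missing_mask, "X", "."),
    index=visits.index,
    columns=visits.columns,
)

print(picture.to_string())
print(missing_share)
```

In a notebook, the same boolean mask can be handed to a plotting function to get a heat map, for example `seaborn.heatmap(visits.isna())`, assuming seaborn is installed.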

2. Identify why data are missing

Now that you’ve identified the holes in your data, you want to know why those values are missing. There are three possible reasons. We use the example of taking a cat to a veterinarian to be weighed. The data can be incomplete in one of the following three ways:
- Missing completely at random: the cat can’t be weighed due to external factors, like the scale’s battery being empty or recharging.
- Missing at random: the cat was sick and didn’t come to the appointment. The data have a column for ‘sick’, and if that column is true, then the ‘weight’ column is missing a value. The missing value can be explained by another column that is filled in: ‘sick’.
- Missing not at random: the cat’s weight hasn’t been filled in, because the owner was embarrassed at how fat it is. The value is missing because of the value itself, so the data are missing on purpose.
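One way to look for the second case is to compare how often ‘weight’ is missing for each value of ‘sick’. The sketch below does that with pandas on made-up records; the column names and values are assumptions for illustration only. A clear difference between the groups hints at missing at random, while similar rates are compatible with missing completely at random. Missing not at random usually can’t be confirmed this way, because it depends on the very values that were never recorded.

```python
import numpy as np
import pandas as pd

# Hypothetical records: weight is missing only for cats flagged as sick.
visits = pd.DataFrame({
    "sick": [True, True, False, False, False, True],
    "weight_kg": [np.nan, np.nan, 4.2, 5.1, 3.8, np.nan],
})

# Flag the rows where the weight is missing.
weight_missing = visits["weight_kg"].isna()

# Share of missing weights within each value of the observed 'sick' column.
missing_rate_by_sick = weight_missing.groupby(visits["sick"]).mean()

print(missing_rate_by_sick)
```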

3. Eliminate or fill the gaps

Depending on the reason from step two, you can decide whether to delete data, for example an entire column. You can often make that call based on the visualization. For example, if it shows that more than 95% of the data in a column are missing, then you can just delete the column entirely. You can also decide to do that if a smaller share of the data is missing.
Or you can fill the gaps with the mean, the mode or the median of the rest of the data. We’ll explain which of these options is most suitable and why in our Tech Talk. You can use either simple or advanced statistical methods, but remember that you can keep it as simple as you want. Even simple methods can make a difference in the quality of your data set.
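Both options can be sketched in a few lines of pandas. The data frame below is hypothetical, and using the median for the numeric column and the mode for the text column is just a choice made for this example, not a recommendation from the Tech Talk. The 95% threshold is the one mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with gaps; 'chip_id' is missing everywhere.
cats = pd.DataFrame({
    "weight_kg": [4.0, np.nan, 5.0, 9.0],
    "breed": ["siamese", None, "siamese", "persian"],
    "chip_id": [np.nan, np.nan, np.nan, np.nan],
})

# Eliminate: drop columns where more than 95% of the values are missing.
threshold = 0.95
missing_share = cats.isna().mean()
kept = cats.loc[:, missing_share <= threshold]

# Fill: replace the remaining gaps with a simple statistic of the known values.
filled = kept.copy()
filled["weight_kg"] = filled["weight_kg"].fillna(filled["weight_kg"].median())
filled["breed"] = filled["breed"].fillna(filled["breed"].mode().iloc[0])

print(filled)
```

Swapping `median()` for `mean()` gives the mean-based fill instead. The mean and the median only make sense for numeric columns, while the mode also works for categories.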

4. Check

Finally, you want to test whether the completed data set is actually valuable. You can do that using a training set and a test set. We’ll explain exactly how to do that in our Tech Talk.
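The details are in the Tech Talk, so the following is only one possible way to set up such a check, not necessarily the one Frances and Andre use. The idea in this sketch: take values you do know, set a few aside as a test set, compute the fill value from the remaining training rows, and measure how far the fill value is from the truth. The weights and the choice of hold-out rows are made up.

```python
import pandas as pd

# Hypothetical, fully known weights, so the true values are available.
weights = pd.Series([3.0, 4.0, 4.0, 5.0, 4.0, 20.0, 4.0, 5.0], name="weight_kg")

# Split the rows: the test rows play the role of "missing" values.
test_idx = [1, 6]  # hypothetical hold-out rows
train = weights.drop(index=test_idx)
test = weights.loc[test_idx]


def mean_abs_error(fill_value, truth):
    """Average distance between one fill value and the true values."""
    return float((truth - fill_value).abs().mean())


# The fill values are computed from the training rows only.
mae_mean = mean_abs_error(train.mean(), test)
mae_median = mean_abs_error(train.median(), test)

print(f"error when filling with the mean:   {mae_mean:.2f}")
print(f"error when filling with the median: {mae_median:.2f}")
```

In this toy data the single very heavy cat pulls the mean up, so the median lands closer to the held-out values. On real data the outcome can differ, which is exactly why the check is worth running.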
But the most important takeaway is that we need to be aware of our responsibility for the quality of our data. We’re responsible not just for developing models, but also for the cleanliness of our data sets. Even a minor data cleanup can be a huge step forward for our results.

Tech Talk: Data cleaning – Techniques for identifying and filling in missing values

When you work with data sets, the question you face is always whether or not to fill in the gaps. If you don’t fill them in, should you just delete the missing data? And if you do want to fill them in, how would you do it? Should you even be worrying about missing data? In their Tech Talk, Frances and Andre answer all of these questions and more. But to get you started: yes, you should always be worried about missing data. What you do about it, though, is up to you.

About Frances Dreyer

When she’s not diving into data sets, you can find Frances practicing her French horn to play in one of her several orchestras. Or she might be knitting something warm, a pandemic pastime that she hasn’t let go of. And even after five years working as a data engineer, she’s still looking for new challenges, especially when it comes to techniques and technology related to Big Data.

About Andre Marques

Data has the power to improve our lives, and Andre has been working to do that every day for the past 10 years. The most rewarding part of the job to him is when the data he’s worked on change how we think about things. But life is more than just numbers, and Andre enjoys discovering new food with his wife, listening to podcasts and running just as much as working with data.