Missing values are a common speed bump in data analysis, but not all missing values are alike. Depending on your dataset and how it was collected, you will need to choose an appropriate way to interpret and handle the missing data. Let’s go over some of the common approaches.
1. Dropping Data
The easiest way to deal with missing values is to simply get rid of them. Getting rid of rows or columns is easy with pandas:
# Remove rows with missing values
clean_data = original_data.dropna(axis=0)

# Remove columns with missing values
clean_data = original_data.dropna(axis=1)
Whether you should remove rows or columns depends on your specific dataset. If the majority of a column contains missing data, for instance, you may want to consider removing that column. But keep in mind, sometimes those values are missing for a reason.
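One way to make the row-versus-column decision less of a guess is to measure how much of each column is actually missing. The sketch below uses a made-up DataFrame and an arbitrary 50% cutoff; the column names and the `max_missing` threshold are assumptions for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "mostly_empty" is missing in 4 of 5 rows.
original_data = pd.DataFrame({
    "price": [250_000, 310_000, np.nan, 199_000, 420_000],
    "sqft": [1400, 1800, 1250, np.nan, 2300],
    "mostly_empty": [np.nan, np.nan, 3.0, np.nan, np.nan],
})

# Fraction of missing values in each column (0.0 = complete, 1.0 = empty).
missing_fraction = original_data.isna().mean()

# Keep only the columns whose missing fraction is at or below the cutoff.
max_missing = 0.5  # assumed threshold; tune it for your own data
clean_data = original_data.loc[:, missing_fraction <= max_missing]

# Then drop the remaining rows that still contain a missing value.
complete_rows = clean_data.dropna(axis=0)
```

Here the sparse column is dropped first, so fewer rows are lost in the final `dropna` than if rows had been dropped from the original table.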
We can imagine a housing dataset with 3 columns, [“Bedroom1”, “Bedroom2”, “Bedroom3”], where each column represents the square footage of a room. If a house has only 2 bedrooms, though, the value for Bedroom3 would be left empty on purpose. That column would still be useful, even though many of the rows might contain a null value.
A more sophisticated approach is to impute the missing data. This is when you replace missing values with something like the average, median, or most common value in that column. Although this avoids throwing data away, you need to be careful of data leakage: if the fill value is computed from the full dataset, information from the rows you later use for testing influences the machine learning model during training.
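As a rough sketch of the median and most-common-value options mentioned above (the mean works the same way), the example below fills a numeric column with its median and a categorical column with its mode. The "Blood Type" column and all of the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical patient data with one gap in each column.
df = pd.DataFrame({
    "Body Temperature": [98.6, np.nan, 99.1, 104.0, 98.2],
    "Blood Type": ["A", "O", None, "O", "B"],
})

# The median is less sensitive to outliers (such as 104.0) than the mean.
temp_median = df["Body Temperature"].median()
df["Body Temperature"] = df["Body Temperature"].fillna(temp_median)

# mode() returns a Series because ties are possible; take the first entry.
most_common = df["Blood Type"].mode().iloc[0]
df["Blood Type"] = df["Blood Type"].fillna(most_common)
```

The median is often the safer choice for skewed numeric columns, while the mode is the usual option for categorical ones.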
As an example of where this might be useful, consider a dataset with medical patient information, and a column for the patient’s Body Temperature. You might be able to replace the missing values with the average body temperature like this:
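# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df["Body Temperature"] = df["Body Temperature"].fillna(df["Body Temperature"].mean())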
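If the data will later be split into training and test sets, one way to limit the data leakage mentioned earlier is to compute the fill value from the training rows only and reuse it for the test rows. This is a minimal sketch with invented values and a simple positional split; a real project would typically use a proper random or time-based split.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the split below is positional for simplicity.
df = pd.DataFrame({
    "Body Temperature": [98.6, np.nan, 99.1, 97.9, 98.2, 98.7, np.nan, 100.4],
})

train = df.iloc[:6].copy()  # first six rows
test = df.iloc[6:].copy()   # last two rows

# Compute the statistic from the training rows only.
train_mean = train["Body Temperature"].mean()

# Apply the same training-derived value to both splits.
train["Body Temperature"] = train["Body Temperature"].fillna(train_mean)
test["Body Temperature"] = test["Body Temperature"].fillna(train_mean)
```

The test rows never contribute to `train_mean`, so the model is not indirectly shown data it is supposed to be evaluated on.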
Using imputation in a Machine Learning Model
If you use imputation in a machine learning model, you may want to consider adding an additional feature to your dataset to record which rows were changed. For instance, you can create a column titled Body_Temperature_Was_Missing that has True values for those that were changed by imputation, and False for those that were left alone. By adding this as a feature in your model, the model might be able to compensate for biases you introduced during imputation.
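A minimal sketch of that indicator column, using invented temperature values, might look like this. The important detail is that the flag has to be recorded before the gaps are filled; afterwards there is nothing left to detect.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing readings.
df = pd.DataFrame({"Body Temperature": [98.6, np.nan, 99.1, np.nan, 98.2]})

# Record which rows are missing BEFORE imputing.
df["Body_Temperature_Was_Missing"] = df["Body Temperature"].isna()

# Now fill the gaps with the column mean.
df["Body Temperature"] = df["Body Temperature"].fillna(
    df["Body Temperature"].mean()
)
```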
Back-filling and Forward-filling
An alternative to using the mean or median of a dataset is to fill in missing values with the previous non-null value (forward fill) or the next non-null value (back fill). This works best when the rows have a meaningful order, as in a time series. pandas makes this easy with two dedicated methods, ffill() and bfill(), that perform the operation for you.
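The short sketch below, on an invented series of readings, shows how the two methods differ and where each one leaves a gap behind.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps at the start, middle, and end.
readings = pd.Series([np.nan, 20.0, np.nan, np.nan, 23.0, np.nan])

# Forward fill copies the last seen value downward.
# The leading NaN stays, because nothing comes before it.
forward_filled = readings.ffill()

# Back fill copies the next seen value upward.
# The trailing NaN stays, because nothing comes after it.
back_filled = readings.bfill()

# Chaining the two fills every gap, as long as one real value exists.
fully_filled = readings.ffill().bfill()
```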
A smarter way to impute data
In addition to the manual imputation methods we’ve talked about, you can use a library like DataWig, which uses machine learning to automatically impute missing values for you. This approach might be useful if you have a large dataset, with enough information in the other columns for a machine learning model to accurately predict the imputed column.
Interpolation is similar to imputation, but it estimates each missing value from the surrounding data points rather than from a single column-wide statistic. For instance, take a look at this dataset:
It clearly follows a linear pattern, so we should be able to estimate values for the missing sections using interpolation. Pandas gives us a function to easily accomplish this:
df["Temperature"] = df["Temperature"].interpolate()
And when we graph the results, we see that our estimations are likely fairly accurate:
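By default, the interpolate() function from pandas uses a linear method that treats the values as equally spaced, but there are more options if that doesn’t suit your data. These include time, quadratic, spline, and others, so you can choose the method that best fits your data. Note that some of them, such as quadratic and spline, rely on SciPy being installed, and spline also requires an order argument.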
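To illustrate how the choice of method can matter, the sketch below compares the default linear method with method="time" on an invented series whose timestamps are unevenly spaced. The dates and temperatures are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical readings on days 1, 2, 5, and 6 (note the 3-day jump).
timestamps = pd.to_datetime(
    ["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"]
)
temperature = pd.Series(
    [10.0, np.nan, np.nan, 20.0], index=timestamps, name="Temperature"
)

# Default linear method: ignores the index and spaces the points evenly.
linear = temperature.interpolate()

# Time method: weights each estimate by the actual time elapsed.
by_time = temperature.interpolate(method="time")
```

With these values, the linear method splits the rise from 10 to 20 into three equal steps, while the time method accounts for the three-day jump and assigns 12 and 18 to the two missing days.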
Hopefully you found this overview of missing values helpful. There is no perfect solution to lost data, but there are usually a variety of options available to help reduce its impact. Experiment with the approaches above and see which strategy makes sense for your data. Check out our video tutorial on how to handle missing values in your dataset! The notebook is linked in the video description.