It’s been said before that most of a data scientist’s time is spent cleaning data. That’s likely true, especially if that data comes from a messy source. If your source is a well-organized database, you might not need very much pre-processing before analysis. However, if you are using a web-scraping tool to gather data online, the returned data will likely be extremely messy, and need a substantial amount of cleaning before it’s useful.
Let’s take a look at a dataset that can be found on Kaggle.com:
This data was scraped from IMDB’s website using beautifulsoup. A cursory glance at the data will show us that there are a few columns that need some significant string parsing to get useful results. For instance, the YEAR column can be in the format of 1 year, a range of years, a roman numeral and then a range of years, etc. And the GENRE column has some extra whitespace and newline characters included in each string.
Importing the Data
Let’s start by importing the data using pandas. We are going to focus on the GENRE and YEAR columns in this article, not on the ONE-LINE and STARS, so we’ll drop those for now. You’ll need to download the data from Kaggle using the link above, and then change “movies.csv” to the file location on your computer in order for this code to work.
We can get some basic info about the columns by using the command df.info():
We can see that almost all our columns have some null values, which might be a problem for certain types of analysis. But first, let’s parse some strings.
Cleaning the YEAR Column
By searching through the data in the data explorer, I can see that the YEAR column contains values in the following variations:
- (2021- )
- (2021 TV Special)
- (II) (2013-2016)
I want to split the start and end years into two different columns, and I would like to preserve the roman numerals, which indicate whether the title is/has a sequel. We can do this by using the str.extract() method in pandas to parse each value in the column.
The str.extract() method takes a regex pattern and extracts the matching string from each row in a column. The first line in this code finds the “sequel” information (the roman numerals), and the second line gets the year(s). We can then split the years on the “–” into two different columns using the str.split()
And finally, we can add those columns to our dataframe and drop the original column. What we should end up with is three new columns:
- “sequel”: roman numerals representing the film’s order in a series
- “year_1”: the first year of a TV show, or the release year of a Movie
- “year_2”: the last year of a TV show
Cleaning the GENRE Column
If we use df.iat[0,1] to look at the first value of the GENRE column, we see that it has the following format:
'\nAction, Horror, Thriller '
Note the extra whitespace and the \n newline character at the beginning. These will likely be a problem when trying to analyze the data, so let’s get rid of them using the str.strip() command. Then, we can split the text on the commas, getting a separate column for each genre in the list.
We should end up with the following format:
A Note About Null Values
Depending on your form of analysis, null values may or may not be a problem. For simple graphing, it probably won’t be an issue, but for machine learning, null values might be a show stopper. Depending on the data, you may want to fill those in with 0’s, or use the average, or just completely remove those rows/columns. There are numerous ways to handle missing values, and it all depends on your specific data, and your specific form of analysis.
There are some tools to help with finding null values, though, and we will make a separate article detailing those soon.
Now we can enjoy our clean data and use seaborn to graph a simple line chart to view the relationship between the first year of a film and the rating:
Enjoy your clean data!