Data cleaning might not be the reason you got interested in data science, but if you’re going to be a data scientist, no skill is more crucial. By some survey estimates, working data scientists spend around 60% of their time cleaning and organizing data, and dirty data is often ranked the single biggest barrier data scientists face at work.
That’s why we’ve just added a brand new course to our Python Data Analyst and Data Scientist paths called Data Cleaning and Analysis. If you’re a Dataquest Premium subscriber, you can start learning right now.
Why Learn Data Cleaning?
Data scientists can end up doing a wide variety of things across a wide variety of industries, but almost every data science job shares at least one thing in common: data cleaning. The real world is messy, after all, and that means real-world datasets tend to be messy, too. Incomplete entries, inconsistent formatting, entry errors – these are things you’ll encounter in almost every dataset you work with.
Even if you’re working with perfect data, though, data cleaning skills are still necessary. You’ll almost always want to make changes to your data and its formatting to facilitate your analysis, and that means doing the same sorts of things you do to messy data: dropping irrelevant entries, reformatting columns, etc.
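To make that concrete, here is a minimal sketch of those routine changes in pandas. The table, its column names, and its values are all invented for illustration; they are not taken from any course dataset.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
raw = pd.DataFrame({
    "Country Name": ["Norway", "Chile", "Test Row"],
    "GDP (USD bn)": ["400", "250", "0"],
    "Notes": ["", "", "ignore"],
})

# Drop an irrelevant entry (a placeholder row) and an irrelevant column.
tidy = raw[raw["Country Name"] != "Test Row"].drop(columns=["Notes"])

# Reformat the column labels and convert the text values to numbers.
tidy = tidy.rename(columns={"Country Name": "country", "GDP (USD bn)": "gdp_usd_bn"})
tidy["gdp_usd_bn"] = pd.to_numeric(tidy["gdp_usd_bn"])
```

Nothing in the original table was wrong as such; the changes simply make it easier to analyze.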
Learning data cleaning is particularly important if you aspire to work with any kind of machine learning. As the Harvard Business Review put it:
Poor data quality is enemy number one to the widespread, profitable use of machine learning. […] The quality demands of machine learning are steep, and bad data can rear its ugly head twice — first in the historical data used to train the predictive model and second in the new data used by that model to make future decisions.
Simply put, there’s no doing data science without doing data cleaning.
What Does This Course Cover?
In Data Cleaning and Analysis, you’ll learn key data cleaning techniques in Python using the popular pandas data analysis library (if you’d like to learn data cleaning in R, we have a separate R data cleaning course). Throughout the course you’ll work with real-world data from the World Happiness Report, cleaning and analyzing a large dataset that includes a variety of metrics for world nations like GDP and average life expectancy.
In the first three lessons of Data Cleaning, you’ll learn to aggregate, combine, and transform data efficiently using pandas to get it ready for analysis. Then you’ll dig into slightly more complex topics, like how to work with strings in pandas, how to use regular expressions, and how to handle missing and duplicate data.
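For a rough taste of what those topics look like in practice, here is a short sketch. It is not material from the course itself, and the survey-style data, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical survey-style data, invented for illustration only.
df = pd.DataFrame({
    "country": [" norway", "Norway ", "CHILE", "Chile", "Peru"],
    "year": ["Year: 2015", "Year: 2015", "Year: 2015", "Year: 2016", "Year: 2016"],
    "score": [7.5, 7.5, 6.6, None, 5.8],
})

# String methods: strip stray whitespace and normalize capitalization.
df["country"] = df["country"].str.strip().str.title()

# Regular expression: pull the four-digit year out of the text.
df["year"] = df["year"].str.extract(r"(\d{4})", expand=False).astype(int)

# Duplicate and missing data: drop repeated rows, then rows with no score.
df = df.drop_duplicates().dropna(subset=["score"])

# Aggregation: mean score per year.
mean_by_year = df.groupby("year")["score"].mean()
```

Note that the two Norway rows only register as duplicates after the string cleanup, which is one reason the order of cleaning steps matters.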
Once you’ve worked through the teaching lessons, you’ll put your data cleaning skills to the test in a new guided project. It will also teach you some additional pandas and data presentation skills as you clean and analyze real-world datasets of employee exit surveys from two Australian government bureaus.
And of course, all the material is presented in Dataquest’s split-screen presentation style so that you can get your hands dirty and start coding right off the bat.
Grab Your Mop
Data cleaning may not sound as sexy as machine learning, but the often-ignored reality of data science is that your analysis can only ever be as good as your data. If your data’s a mess, your analysis is going to be a mess, too.
Thankfully, with the power of Python and pandas, you don’t have to let that happen, so grab your mop and dive into our new Data Cleaning and Analysis course right now!