Data cleaning. Data cleansing. Data munging. Whatever you call it, there’s no denying that the process of taking raw data and turning it into a data set that’s ready for analysis is one of the most important tasks of any data job.
In fact, it might be the most important task in data science. In surveys, data scientists report that they spend most of their time cleaning and preparing data. Surveys also show that “dirty data” is the number one challenge data workers face in their jobs.
Data cleaning may not be the “sexiest” part of the “sexiest job of the 21st century,” but it’s an absolutely crucial skill because it’s the foundation upon which all subsequent analysis is based. No matter how good your analytical skills or your machine learning algorithms are, they won’t be able to generate anything of value if you’re feeding them dirty data. As the saying goes: garbage in, garbage out.
Even if you don’t work in data science, data cleaning can be a highly valuable skill. For example, one Dataquest student in Guyana was tasked with working with dozens of huge, unwieldy Excel data sets. Learning to combine and clean this data with Python changed his life, he says. And it turned what was once a week-long task into something that just takes a few minutes!
Chances are you’ve already got some Python data cleaning skills. It’s such a fundamental part of the data science workflow that our free beginner courses incorporate some data cleaning. We also offer a more focused course on data cleaning and analysis. Now, to help you build even more skill in this critical area, we’re launching a new course: Data Cleaning Advanced.
This course requires a Basic or Premium subscription, and assumes familiarity with concepts introduced in our earlier courses, including basic data science Python programming skills, and some experience with libraries like
Why Should I Take This Course?
Data cleaning is a critical skill for any data analyst or data scientist, and there are many ways you could learn these skills. However, Dataquest offers a unique learning platform that integrates excellent instruction with hands-on coding practice in a way that makes our courses highly effective.
You don’t have to take our word for this; students who beta-tested this course agree. 100% of testers agreed that after completing the course, they understood regular expressions. 100% of testers also agreed that after completing the course, they could use the concepts they had learned (like list comprehensions, lambda functions, working with JSON data, etc.) in their own projects.
In particular, students praised the explanations in this course as clear and thoughtful, and reported they’d learned valuable professional skills. “I’m going to be able to improve my work using regex,” one student wrote after completing the regex-focused mission in this course. These kinds of advanced skills are “often useful in real work settings,” another student said.
In fact, one student tester told us that this course contains “one of the best missions on Dataquest!”
What Will I Learn in This Data Cleaning Advanced Course?
In Data Cleaning Advanced, you’ll dig in and get your hands dirty, cleaning real-world data sets using a variety of common techniques to root out faulty data, handle missing data, combine data sets, and prepare incoming data for analysis. Using our interactive, in-browser coding platform, you’ll write Python code to turn dirty data about Hacker News headlines and New York City traffic accidents into analysis-ready clean data.
To do that, you’ll learn about regular expressions (often called regex), a powerful way of matching and manipulating strings, and how to use them to clean dirty text data with Python’s re module. You’ll learn to work with data in the JSON format, which is common when working with data from web APIs, using Python’s json module. And you’ll learn to use lambda functions to create small, anonymous functions inline, often inside calls to other functions, for faster workflows.
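To give a flavor of how these pieces fit together, here is a minimal sketch (not taken from the course itself) that parses a JSON string with the json module and tidies the text with the re module. The sample records, field names, and the helper names clean_title and extract_year are all hypothetical, made up for illustration.

```python
import json
import re

# Hypothetical sample payload, loosely styled after Hacker News headlines.
raw = (
    '[{"title": "Ask HN: Best way to learn  Python?", "points": "120"},'
    ' {"title": "Show HN: My new sql tool (2019)", "points": "45"}]'
)

# json.loads turns a JSON string into Python lists and dictionaries.
records = json.loads(raw)


def clean_title(title):
    """Collapse repeated whitespace and standardize the spelling of SQL."""
    title = re.sub(r"\s+", " ", title).strip()
    title = re.sub(r"\bsql\b", "SQL", title, flags=re.IGNORECASE)
    return title


def extract_year(title):
    """Return a four-digit year found in parentheses, or None if absent."""
    match = re.search(r"\((\d{4})\)", title)
    return int(match.group(1)) if match else None


cleaned = []
for record in records:
    title = clean_title(record["title"])
    cleaned.append(
        {
            "title": title,
            "points": int(record["points"]),  # numbers arrived as strings
            "year": extract_year(title),
        }
    )

for row in cleaned:
    print(row)
```

Real API responses are rarely this tidy, so treat the patterns here as a starting point to adapt rather than a general-purpose cleaner.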
You’ll also learn to work with list-like data more easily using list comprehensions.
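As a rough illustration of these two tools, the sketch below uses a list comprehension to filter and normalize a list in one line, and a lambda function as a sort key. The values are invented for the example and are not real New York City figures.

```python
# Hypothetical raw values with stray whitespace, mixed case, and blanks.
raw_boroughs = [" manhattan", "BROOKLYN ", "", "queens", None, "Bronx"]

# List comprehension: skip empty or missing entries, then strip and title-case.
boroughs = [b.strip().title() for b in raw_boroughs if b and b.strip()]

# Hypothetical (borough, accident count) pairs.
accident_counts = [("Manhattan", 12), ("Brooklyn", 30), ("Queens", 18)]

# Lambda function: an inline, unnamed function used here as the sort key,
# so the pairs are ordered by their count, highest first.
ranked = sorted(accident_counts, key=lambda pair: pair[1], reverse=True)

print(boroughs)
print(ranked)
```

The same results could be produced with an explicit loop and a named function; the comprehension and the lambda simply keep short transformations close to where they are used.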
Once you’ve grasped these important techniques for cleaning and preparing data, you’ll start filling in the holes in your data set, handling missing data in a variety of ways. You’ll learn:
- How to find missing data using tables and plots
- How to impute missing data using a variety of statistical methods
- How to augment an incomplete data set using data from other sources
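The three steps in the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows, the borough_lookup table, and the choice of the median as the fill value are all hypothetical, and in practice you would more likely do this with a data science library than with bare dictionaries.

```python
from statistics import median

# Hypothetical rows; None marks a missing value.
rows = [
    {"id": 1, "borough": "Queens", "injured": 2},
    {"id": 2, "borough": None, "injured": 0},
    {"id": 3, "borough": "Bronx", "injured": None},
    {"id": 4, "borough": "Queens", "injured": 1},
]

# Hypothetical second data source that knows the borough for some ids.
borough_lookup = {2: "Brooklyn"}

# 1. Find missing data: count the missing values in each column.
columns = ["borough", "injured"]
missing_counts = {c: sum(1 for r in rows if r[c] is None) for c in columns}

# 2. Impute: fill missing numeric values with the median of the observed ones.
observed = [r["injured"] for r in rows if r["injured"] is not None]
fill_value = median(observed)
for r in rows:
    if r["injured"] is None:
        r["injured"] = fill_value

# 3. Augment: fill missing boroughs from the other source where possible.
for r in rows:
    if r["borough"] is None:
        r["borough"] = borough_lookup.get(r["id"])

# Count what is still missing after both fixes.
remaining = sum(1 for r in rows for c in columns if r[c] is None)

print(missing_counts)
print(rows)
print(remaining)
```

The median is just one of several reasonable fill values; which statistic suits a column depends on how its data is distributed and why the values are missing.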
As you work through these challenges, you’ll also get additional practice working with popular Python data science libraries you’re probably already familiar with, including
By the end of this course, you’ll have acquired the skills to clean data from a variety of data sources, including JSON data you’re pulling from APIs. You’ll be able to handle missing values in a number of different ways, so you no longer have to drop rows and shrink the size and value of your data set. And you’ll have the skills to manipulate your source data into the formats you need for analysis, using the power of regex to make messy text strings much easier to work with.
When it comes to the most-used skill in all of data science, there’s no time like the present to start sharpening your skills and upgrade your data cleaning abilities to “advanced.”