Nobody signs up for a career in data science because they’re really passionate about cleaning data. But the reality is that data cleaning is a huge part of almost every data science job. It’s something you need to do for virtually every project you undertake.
What is data cleaning?
In the early stages of learning data science, you’re often handed pristine data sets that you can dive into, manipulate, and analyze without spending any time on preparation. But in the real world, data sets are almost never this neat.
Exactly what data cleaning entails depends a lot on what your data looks like and what you’re trying to do with it. But often, you’ll need to look through it to remove irrelevant or erroneous data points. You may need to create new variables, or join multiple data sets together to get what you need for your analysis. And you’ll also have to determine how to handle missing values, as very few real-world data sets are going to be 100% complete.
Completing these tasks efficiently requires technical skill and good professional judgement.
What will you learn in this course?
In Data Cleaning in R, we’ll build on our R skills by learning to analyze and clean some messy testing and demographic data from the New York City school system.
We’ll learn to identify and remove irrelevant data, and create new variables to aid in our analysis. We’ll use R to join related data frames and reshape the data for more effective and efficient analysis. And of course, we’ll learn how to deal with missing values and other gaps in the data set.
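To give a rough flavor of what those steps can look like, here is a minimal sketch in base R. The data frames, column names, and values are made up for illustration (they are not the actual New York City school data), and the course itself may use different functions or packages. Reshaping is left out to keep the sketch short.

```r
# Hypothetical toy data; NOT the real New York City school data sets.
scores <- data.frame(
  school_id = c("A1", "B2", "C3", "D4"),
  math      = c(410, NA, 520, 480),
  reading   = c(400, 450, NA, 470),
  notes     = c("ok", "ok", "check", "ok"),
  stringsAsFactors = FALSE
)

demographics <- data.frame(
  school_id  = c("A1", "B2", "C3", "E5"),
  enrollment = c(350, 1200, 800, 640),
  stringsAsFactors = FALSE
)

# 1. Remove a column that is irrelevant to the analysis.
scores$notes <- NULL

# 2. Create a new variable. The sum is NA whenever either score is missing.
scores$total <- scores$math + scores$reading

# 3. Join the two data frames on their shared key column.
combined <- merge(scores, demographics, by = "school_id")

# 4. Inspect missing values, then keep only the fully observed rows.
missing_counts <- colSums(is.na(combined))
complete_rows  <- combined[complete.cases(combined), ]

print(missing_counts)
print(complete_rows)
```

Note that `merge()` performs an inner join by default, so schools that appear in only one of the two data frames are dropped. Whether that is what you want, and whether dropping incomplete rows is better than filling them in, is exactly the kind of judgment call that data cleaning involves.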
This course isn’t just about the programming skills, though. Part of data cleaning is learning to identify what needs to be done, and in some cases, doing some subject-area research to better understand the data set. We’ll dive into these aspects of data cleaning, too, so that when you complete the course you won’t just know how to clean data with R; you’ll also know how to identify what sorts of cleaning a given data set needs before you can start your analysis.
At the end of the course, we’ll dive into a guided project that allows you to apply your new data cleaning skills to some messy survey data, and that also gets you up and running using R Notebooks, a popular tool for data scientists who use R.
Time to start cleaning
It would be hard to overstate the importance of data cleaning skills. You’ll hear the number “80%” a lot, and it’s not just a figure of speech: surveys of data scientists suggest that the average data scientist spends about 80% of their working time preparing data (collecting and cleaning it).
So while it’s not likely to be your favorite part of the data science workflow (nearly 60% of data scientists said data cleaning was their least favorite task), it’s something you absolutely need to know. And of course, the pain of cleaning dirty data can be lessened somewhat if you know how to do things quickly and efficiently with the modern, production-ready R that we’re teaching in our new R courses.
If you think you’re ready to start scrubbing, grab your mop and click here to learn data cleaning in R for free.
(If you need to brush up on the basics first, we’ve got free courses for that, too.)