# Data Cleaning With R

Analyzing messy data is daunting at best and impossible at worst. Most datasets you encounter will need some cleaning before you can analyze them and make sense of the results. Over a career as a data analyst or data scientist, a large share of your time (a commonly cited estimate is around 80%) goes to cleaning data so that analysis becomes easier, or possible at all. This lesson teaches you what you need to know to clean data using R and the tidyverse. Throughout this lesson and the ones that follow, you will steadily build the skills you need to land your first job in data!

In this lesson, you will learn how to clean data using dplyr and other tidyverse packages. You’ll start with why data cleaning matters and why it’s a critical skill for data scientists. From there, you’ll learn how to create new variables, simplify a data frame, perform string manipulation, and more! Along the way, you’ll be introduced to packages such as dplyr, purrr, and stringr.

While learning how to clean data in R, you’ll work with New York City high school data and begin analyzing which factors influence SAT scores the most. For each concept, you’ll use our code-running system with answer checking, so you can make sure you’ve mastered it before moving on to the next.

## Objectives

• Learn to identify data cleaning needs prior to analysis.
• Learn to simplify data frames to contain only necessary variables and observations.
• Learn to change data types of multiple variables at once.
• Learn to create new variables by calculating summary statistics from existing variables.
• Learn to use functionals to check for duplicated observations.
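
To give a feel for what these objectives look like in practice, here is a minimal sketch in R. It is not the lesson's actual code: the `sat` data frame, its column names (`school_id`, `reading`, `math`, `writing`, `notes`), and its values are all made up for illustration. It assumes dplyr 1.0 or later (for `across()`) and purrr are installed.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data; names and values are illustrative only.
# Scores are stored as text, and one row is duplicated on purpose.
sat <- data.frame(
  school_id = c("A001", "B002", "B002"),
  reading   = c("355", "383", "383"),
  math      = c("404", "423", "423"),
  writing   = c("363", "366", "366"),
  notes     = c("x", "y", "y"),
  stringsAsFactors = FALSE
)

sat_clean <- sat %>%
  # Simplify: keep only the variables needed for analysis
  select(school_id, reading, math, writing) %>%
  # Change the data type of several variables at once
  mutate(across(c(reading, math, writing), as.numeric)) %>%
  # Create a new variable from existing ones
  mutate(avg_score = (reading + math + writing) / 3)

# Use a functional (map_int) to count duplicated keys
# in every data frame of a named list
dup_counts <- list(sat = sat_clean) %>%
  map_int(~ sum(duplicated(.x$school_id)))

# Remove rows that are exact duplicates
sat_dedup <- distinct(sat_clean)
```

Keeping the data frames in a named list means the same duplicate check can be run on several data sets in one call, which is handy when you later need to confirm that each one is ready for joining.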

## Lesson Outline

1. The Importance of Data Cleaning
2. Cleaning the New York City Schools Data
3. SAT Data: Changing Data Types and Creating New Variables
4. AP Exam Data: Changing Data Types and Creating a New Variable
5. Class Size Data: Simplifying the Data Frame
6. Class Size Data: Calculating School Averages
7. Class Size Data: Creating a Key Using String Manipulation
8. Graduation Data: Simplifying the Data Frame
9. Demographics Data: Simplifying the Data Frame
10. High School Directory: Simplifying the Data Frame
11. Confirm That Data Frames Are Prepared for Joining
12. Removing Duplicate Rows
13. Next Steps
14. Takeaways