Data Cleaning Walkthrough: Combining the Data

In the previous lesson, you began investigating possible relationships between SAT scores and demographic factors. To do this, you acquired several datasets about New York City public schools, manipulated them, and found that you could combine them all using the DBN column.

Data scientists rarely start out with tidy datasets, which makes cleaning and combining them one of the most critical skills a data professional can learn. In fact, Forbes estimates that data scientists spend about 60% of their time cleaning and combining data, so it’s essential to be able to manipulate data quickly and efficiently.

In this lesson, you’ll clean the data a bit more, then work on combining it. To do this, you’ll learn how to merge data using the `pd.merge()` function from the pandas library. You’ll also learn about the different types of merges, how to handle missing values, how to condense datasets, and how to compute averages across dataframes.
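As a rough preview of what merging on the DBN column can look like, here is a minimal sketch. The two small dataframes, their column names other than `DBN`, and all of the values are made up for illustration; they are not the lesson's actual data.

```python
import pandas as pd

# Hypothetical toy data: DBN values, scores, and class sizes are invented.
sat = pd.DataFrame({
    "DBN": ["01M001", "01M002", "02M003"],
    "sat_score": [1200, 1100, 1300],
})
class_size = pd.DataFrame({
    "DBN": ["01M001", "02M003", "03M004"],
    "avg_class_size": [22.5, 25.0, 27.0],
})

# Left join: keep every row of `sat`; schools with no match in
# `class_size` get NaN in the columns that come from it.
left = pd.merge(sat, class_size, on="DBN", how="left")

# Inner join: keep only the DBNs that appear in both dataframes.
inner = pd.merge(sat, class_size, on="DBN", how="inner")

print(left)
print(inner)
```

In this sketch the left join keeps all three SAT rows and leaves a missing class size for the unmatched school, while the inner join drops that school entirely. Which behavior is preferable depends on how much data you can afford to lose.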

Objectives

• Learn to combine multiple datasets.
• Learn to perform joins in pandas.

Lesson Outline

1. Introduction
2. Condensing the Class Size Data Set
3. Condensing the Class Size Data Set
4. Computing Average Class Sizes
5. Computing Average Class Sizes
6. Condensing the Demographics Data Set
7. Condensing the Demographics Data Set
8. Condensing the Graduation Data Set
9. Condensing the Graduation Data Set
10. Converting AP Test Scores
11. Left, Right, Inner, and Outer Joins
12. Performing the Left Joins
13. Performing the Inner Joins
14. Filling in Missing Values
15. Filling in Missing Values
16. Adding a School District Column for Mapping
17. Next Steps
18. Takeaways
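
The condensing, averaging, and missing-value steps in the outline could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is invented, the column name `school_dist` is hypothetical, filling gaps with the column mean is one possible strategy rather than necessarily the one the lesson uses, and the sketch assumes the first two characters of a DBN identify the school district.

```python
import pandas as pd

# Hypothetical rows: several class records per school (DBN), some missing.
class_size = pd.DataFrame({
    "DBN": ["01M001", "01M001", "02M003", "02M003", "03M004"],
    "avg_class_size": [20.0, 24.0, 26.0, None, None],
})

# Condense to one row per DBN by averaging the numeric columns.
# Missing values are skipped; a school with no values at all stays NaN.
condensed = class_size.groupby("DBN", as_index=False).mean(numeric_only=True)

# Fill the remaining gaps with each column's mean (one possible choice).
condensed = condensed.fillna(condensed.mean(numeric_only=True))

# Assumption: the first two characters of the DBN encode the district.
condensed["school_dist"] = condensed["DBN"].str[:2]

print(condensed)
```

After condensing, each DBN appears exactly once, which is what makes it usable as a merge key.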