Data Cleaning Walkthrough: Combining the Data

In the previous lesson, you began investigating possible relationships between SAT scores and demographic factors. To do this, you acquired several datasets about New York City public schools, manipulated them, and found that you could combine them all using the DBN column.

Data scientists rarely start out with tidy datasets, which makes cleaning and combining them one of the most critical skills a data professional can learn. In fact, Forbes estimates that data scientists spend about 60% of their time cleaning and combining data, so it’s essential to be able to manipulate data quickly and efficiently.

In this lesson, you’ll clean the data a bit more, then work on combining it. To do this, you will learn how to merge data using the `pd.merge()` function from the pandas library. In addition, you will learn how to handle missing values, how the different types of merges behave, how to condense datasets, and how to compute averages across dataframes.
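As a rough preview of how the merge types differ, here is a minimal sketch using `pd.merge()` on the DBN column. The two small dataframes, their DBN values, and the `sat_score` and `frl_percent` columns are made up for illustration and are not the lesson's actual data.

```python
import pandas as pd

# Hypothetical toy data: DBN codes and values are invented for this sketch.
sat = pd.DataFrame({
    "DBN": ["01M001", "01M002", "01M003"],
    "sat_score": [1200, 1100, 1300],
})
demographics = pd.DataFrame({
    "DBN": ["01M001", "01M003", "01M004"],
    "frl_percent": [80.0, 65.5, 72.1],
})

# Left join: keep every row of `sat`; schools with no match get NaN.
left = pd.merge(sat, demographics, on="DBN", how="left")

# Inner join: keep only schools whose DBN appears in both dataframes.
inner = pd.merge(sat, demographics, on="DBN", how="inner")

# Outer join: keep every DBN that appears in either dataframe.
outer = pd.merge(sat, demographics, on="DBN", how="outer")

print(len(left), len(inner), len(outer))
```

A left join preserves all rows of the first dataframe at the cost of missing values, while an inner join avoids missing values at the cost of dropping unmatched rows. Which trade-off is acceptable depends on how much of each dataset you can afford to lose.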


  • Learn to combine multiple datasets.
  • Learn to perform joins in pandas.

Lesson Outline

  1. Introduction
  2. Condensing the Class Size Data Set
  3. Condensing the Class Size Data Set
  4. Computing Average Class Sizes
  5. Computing Average Class Sizes
  6. Condensing the Demographics Data Set
  7. Condensing the Demographics Data Set
  8. Condensing the Graduation Data Set
  9. Condensing the Graduation Data Set
  10. Converting AP Test Scores
  11. Left, Right, Inner, and Outer Joins
  12. Performing the Left Joins
  13. Performing the Inner Joins
  14. Filling in Missing Values
  15. Filling in Missing Values
  16. Adding a School District Column for Mapping
  17. Next Steps
  18. Takeaways
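
The condensing, averaging, and missing-value steps in the outline can be sketched roughly as follows. This is only an illustration under assumed data: the dataframes, DBN values, and the `average_class_size` column name are hypothetical, and filling gaps with column means is one possible strategy rather than the only one.

```python
import pandas as pd

# Hypothetical class size data with several rows per school (DBN).
class_size = pd.DataFrame({
    "DBN": ["01M001", "01M001", "01M002", "01M002"],
    "average_class_size": [20.0, 24.0, 30.0, 26.0],
})

# Condense to one row per DBN by averaging within each school.
condensed = (
    class_size.groupby("DBN")["average_class_size"]
    .mean()
    .reset_index()
)

# Hypothetical SAT data, including a school with no class size rows.
sat = pd.DataFrame({
    "DBN": ["01M001", "01M002", "01M003"],
    "sat_score": [1200, 1100, 1300],
})

# A left join keeps all SAT rows; the unmatched school gets NaN.
combined = sat.merge(condensed, on="DBN", how="left")

# Fill missing numeric values with the mean of each column.
filled = combined.fillna(combined.mean(numeric_only=True))

print(filled)
```

Condensing before merging matters because a dataset with several rows per DBN would otherwise duplicate rows from the other dataframe in the merged result.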
