As we near the end of the Data Cleaning and Analysis course, we'll cover a topic that's essential to any data cleaning workflow — handling missing and duplicate data.
In the Pandas Fundamentals course, you learned that there are various ways to handle missing data:
- Remove any rows that have missing values.
- Remove any columns that have missing values.
- Fill the missing values with some other value.
- Leave the missing values as is.
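As a quick refresher, the four options above can be sketched in pandas roughly as follows. The small dataframe here is a hypothetical example, not data from the World Happiness Reports, and filling with the column mean is just one possible choice of replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the World Happiness Reports.
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "score": [7.5, np.nan, 6.1],
    "region": ["North", "South", "East"],
})

# Option 1: remove any rows that have missing values.
rows_dropped = df.dropna()

# Option 2: remove any columns that have missing values.
cols_dropped = df.dropna(axis=1)

# Option 3: fill the missing values with some other value
# (here, the mean of the column, as one possible choice).
filled = df.fillna({"score": df["score"].mean()})

# Option 4: leave the missing values as is, but count them per column
# so we know what we are leaving in place.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

None of these calls modifies `df` itself; each returns a new object, so the options can be compared side by side before committing to one.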
In this lesson, you'll explore each of these options in detail and learn when to use them. You'll work with the 2015, 2016, and 2017 World Happiness Reports again. More specifically, you'll combine them and clean missing values as you start to define a more complete data cleaning workflow.
Missing or duplicate data may exist in a dataset for a number of different reasons. Sometimes, missing or duplicate data is introduced as we perform cleaning and transformation tasks such as combining data, reindexing data, and reshaping data.
Other times, it exists in the original dataset for reasons such as user input error or data storage or conversion issues.
In the case of missing values, they may also exist in the original dataset to purposely indicate that data is unavailable.
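To make the first case concrete, here is a minimal sketch of how combining data can introduce missing and duplicate values. The two small dataframes and their column names are hypothetical stand-ins for yearly reports whose columns are labeled slightly differently.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly reports. The score column is
# named differently in each one, which is an assumption for illustration.
report_a = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.5, 7.4]})
report_b = pd.DataFrame({"Country": ["A", "B"], "Happiness.Score": [7.3, 7.2]})

# Combining them keeps both score columns, so each row is missing
# a value in the column that came from the other report.
combined = pd.concat([report_a, report_b], ignore_index=True)
missing_counts = combined.isnull().sum()
print(missing_counts)

# Appending a row that is already present creates a duplicate.
with_repeat = pd.concat([report_a, report_a.iloc[[0]]], ignore_index=True)
duplicate_mask = with_repeat.duplicated()      # True for repeated rows
deduplicated = with_repeat.drop_duplicates()   # keeps the first occurrence
print(duplicate_mask)
```

In this sketch, the missing values are a side effect of mismatched column names rather than a sign that the data is really unavailable, so the better fix is to correct the names before combining.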
As you work through each concept, you'll apply what you've learned directly in your browser, so there's no need to use your own machine for the exercises. The Python environment in this course includes answer-checking to ensure you've fully mastered each concept before moving on to the next.
2. Identifying Missing Values
3. Correcting Data Cleaning Errors that Result in Missing Values
4. Visualizing Missing Data
5. Using Data From Additional Sources to Fill in Missing Values
6. Identifying Duplicate Values
7. Correcting Duplicate Values
8. Handling Missing Values by Dropping Columns
9. Handling Missing Values by Dropping Columns Continued
10. Analyzing Missing Data
11. Handling Missing Values with Imputation
12. Dropping Rows
13. Next steps