MISSION 347

Working With Missing And Duplicate Data

As we near the end of the Data Cleaning and Analysis course, we'll cover a topic that's essential to any data cleaning workflow: handling missing and duplicate data.

In the Pandas Fundamentals course, you learned that there are various ways to handle missing data:

  • Remove any rows that have missing values.
  • Remove any columns that have missing values.
  • Fill the missing values with some other value.
  • Leave the missing values as is.
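As a rough illustration of these four options, here is a minimal pandas sketch. The DataFrame and its column names are made up for this example and are not taken from the World Happiness Reports; filling with the column mean is just one possible choice of fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "score": [7.5, np.nan, 6.1],
    "region": ["North", "South", None],
})

# Option 1: remove any rows that have missing values.
rows_dropped = df.dropna()

# Option 2: remove any columns that have missing values.
cols_dropped = df.dropna(axis=1)

# Option 3: fill the missing values with some other value
# (here, the mean of the column, as one possible choice).
filled = df.fillna({"score": df["score"].mean()})

# Option 4: leave the missing values as is, but count them per column
# so the decision is an informed one.
missing_counts = df.isnull().sum()
```

Which option is appropriate depends on how much data is missing and why, which is what the rest of the mission explores.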

In this mission, you'll explore each of these options in detail and learn when to use them. You'll work with the 2015, 2016, and 2017 World Happiness Reports again. More specifically, you'll combine them and clean missing values as you start to define a more complete data cleaning workflow.

Missing or duplicate data may exist in a dataset for several reasons. Sometimes, missing or duplicate data is introduced as we perform cleaning and transformation tasks such as combining data, reindexing data, and reshaping data.

Other times, it exists in the original dataset for reasons such as user input error or data storage or conversion issues.

Missing values may also exist in the original dataset to deliberately indicate that data is unavailable.
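To see how a cleaning step can introduce missing values on its own, consider the sketch below. It assumes two small, hypothetical DataFrames whose score columns are named slightly differently; the names and values are illustrative and are not the actual report data.

```python
import pandas as pd

# Hypothetical frames standing in for two yearly reports.
# The score column is spelled differently in each one.
report_a = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.5, 7.4]})
report_b = pd.DataFrame({"Country": ["A", "B"], "Happiness.Score": [7.6, 7.3]})

# Stacking the frames keeps both spellings as separate columns,
# so each row is missing a value in one of them.
combined = pd.concat([report_a, report_b], ignore_index=True)

# Count the missing values that the combination introduced.
missing_per_column = combined.isnull().sum()
```

Neither input had any missing values, yet the combined result does, because the mismatched column names were never reconciled.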

As you work through each concept, you'll apply what you've learned directly in your browser; there's no need to use your own machine to do the exercises. The Python environment in this course includes answer-checking to ensure you've fully mastered each concept before learning the next.

Objectives

  • Learn techniques for dropping rows and columns with missing data.
  • Learn how to impute values to replace missing data.
  • Learn how to identify and drop duplicate rows.
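As a preview of the last objective, the following is a minimal sketch of identifying and dropping duplicate rows with pandas. The data and column names are hypothetical, and treating a repeated country and year pair as a duplicate is an assumption made for this example.

```python
import pandas as pd

# Hypothetical data in which the first and last rows repeat
# the same country and year.
df = pd.DataFrame({
    "country": ["A", "B", "A"],
    "year": [2015, 2015, 2015],
})

# Flag rows that repeat an earlier country/year combination.
dup_mask = df.duplicated(["country", "year"])

# Keep only the first occurrence of each combination.
deduped = df.drop_duplicates(["country", "year"])
```

By default, both methods treat the first occurrence as the original and mark later ones as duplicates.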

Mission Outline

1. Introduction
2. Identifying Missing Values
3. Correcting Data Cleaning Errors that Result in Missing Values
4. Visualizing Missing Data
5. Using Data From Additional Sources to Fill in Missing Values
6. Identifying Duplicate Values
7. Correcting Duplicate Values
8. Handle Missing Values by Dropping Columns
9. Handle Missing Values by Dropping Columns Continued
10. Analyzing Missing Data
11. Handling Missing Values with Imputation
12. Dropping Rows
13. Next Steps
14. Takeaways


Course Info:

Beginner

The median completion time for this course is 7.2 hours.

This course requires a basic subscription and includes five missions and one guided project. It is the sixth course in the Data Analyst in Python and Data Scientist in Python paths.
