MISSION 293

Data Cleaning Basics

In our first Python course, the data you worked with didn't have many quirks: all the values were in a consistent format. Data with a consistent format is often described as "clean." Not all the data we encounter as data scientists is clean, so we often need to prepare it in a process called data cleaning. In Cleaning and Preparing Data in Python, you got familiar with some data cleaning techniques; now we’re going to do some data cleaning using pandas.

So far in this course, we've learned how to select, assign, and analyze data with pandas using pre-cleaned data. In reality, data is rarely in the format needed to perform analysis. Data scientists commonly spend over half their time cleaning data, so knowing how to clean "messy" data is an extremely important skill. And since you’ll often be using pandas for your analysis, it’s helpful to know how to clean data that’s being stored in pandas formats like Series and DataFrames.

In this mission, we will learn the basics of data cleaning with pandas as we work with a CSV file containing information for about 1,300 laptops. By the end of the mission, you’ll have created a clean data set so you can answer questions such as which laptop has the most storage space, which laptop is the best value, and more.

As with every mission at Dataquest, you'll be given an opportunity to practice each concept using our code editor, which has built-in answer checking to ensure that you've mastered a concept before moving on to the next.

Objectives

  • Learn about different encodings.
  • Learn how to extract and convert numeric values from string values.
  • Learn how to work with missing values.
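
As a rough preview of these three objectives, here is a minimal sketch in pandas. The sample rows, the column names (manufacturer, ram, weight), and the choice of Latin-1 as the encoding are all made up for illustration; they are not taken from the mission's actual data set, and the mission itself may approach each step differently.

```python
import io

import pandas as pd

# Hypothetical sample data, encoded as Latin-1 to mimic a CSV file
# that is not stored as UTF-8. The second row has no weight value.
raw_bytes = (
    "manufacturer,ram,weight\n"
    "Acmé,8GB,1.37kg\n"
    "Acmé,16GB,\n"
    "Globex,4GB,2.1kg\n"
).encode("latin-1")

# Objective 1: tell pandas which encoding to use when reading the file.
laptops = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

# Objective 2: strip the unit text from each string, then convert to a number.
laptops["ram_gb"] = (
    laptops["ram"].str.replace("GB", "", regex=False).astype(int)
)
laptops["weight_kg"] = (
    laptops["weight"].str.replace("kg", "", regex=False).astype(float)
)

# Objective 3: two common ways to handle the missing weight.
dropped = laptops.dropna(subset=["weight_kg"])  # remove the incomplete row
filled = laptops["weight_kg"].fillna(laptops["weight_kg"].mean())  # use the mean

print(laptops.dtypes)
print(dropped)
print(filled)
```

Whether to drop or fill a missing value depends on the question being asked; both options are shown here only to illustrate the idea.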

Mission Outline

1. Reading CSV Files with Encodings
2. Cleaning Column Names
3. Cleaning Column Names Continued
4. Converting String Columns to Numeric
5. Removing Non-Digit Characters
6. Converting Columns to Numeric Dtypes
7. Renaming Columns
8. Extracting Values from Strings
9. Correcting Bad Values
10. Dropping Missing Values
11. Filling Missing Values
12. Challenge: Clean a String Column
13. Next Steps
14. Takeaways

Course Info:

Intermediate

The median completion time for this course is 6.77 hours.

This course requires a basic subscription and includes five missions and one guided project. It is the third course in both the Data Analyst in Python path and the Data Scientist in Python path.
