MISSION 351

Cleaning and Preparing Data in Python

In this first mission, you will learn to work with and prepare text data in Python. You will learn string manipulation techniques such as replacing substrings, capitalizing strings, and parsing numbers from complex strings. Techniques like these are critical for taking messy text data and converting it to a uniform format for easier analysis.
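As a rough preview, the three techniques named above might look like the sketch below. The sample values are hypothetical, invented for illustration rather than taken from the mission's data set, and the mission itself may approach each step differently.

```python
# Hypothetical messy values, made up for this sketch.
raw_nationality = "(american)"
raw_date = "c. 1912"

# Replacing substrings: remove the surrounding parentheses.
nationality = raw_nationality.replace("(", "").replace(")", "")

# Capitalizing strings: give each word a leading capital letter.
nationality = nationality.title()

# Parsing a number from a complex string: drop the "c. " prefix,
# then convert the remaining digits to an integer.
year = int(raw_date.replace("c. ", ""))

print(nationality, year)
```

Each step uses only built-in string methods (`str.replace` and `str.title`) and the built-in `int` function, so no extra libraries are needed.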

In the Python for Data Science: Fundamentals course, we explored the basics of working with data using the Python programming language. In this course, you'll continue your Python journey with a focus on strings, object-oriented programming, dates and times, and many other concepts that are essential for a data scientist. In later courses, you'll build on this programming knowledge to learn data visualization, statistics, machine learning, and more.

In the first course, the data you worked with didn't have many quirks: all the values were in a consistent format. Data with a consistent format is often described as "clean." But in real life, not all the data we encounter is going to be clean; we often need to prepare it first, in a process called data cleaning (or sometimes "data munging").

In this course, we'll introduce some basic data cleaning techniques. We cover more advanced data cleaning techniques in our Data Cleaning in Python: Advanced course, but for now, we'll dig into basic data cleaning using a real-world data set about all of the artwork contained in the Museum of Modern Art (MoMA).

Objectives

  • Learn how to work with and prepare text data in Python.
  • Learn basic data cleaning techniques.
  • Learn how to handle errors during data cleaning.
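To give a feel for the last objective, here is a minimal sketch of one way an error can come up during cleaning and be handled. The `parse_year` helper and the sample values are hypothetical, and the mission may teach a different approach (for example, checking a value before converting it).

```python
def parse_year(text):
    """Return the year as an int, or None if the text is not a plain number."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for strings such as "Unknown".
        return None

# Hypothetical raw values, made up for this sketch.
raw_years = ["1912", "Unknown", "1896"]
years = [parse_year(value) for value in raw_years]

print(years)
```

Returning `None` for values that can't be converted keeps the loop running and marks the rows that still need attention, instead of stopping at the first bad value.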

Mission Outline

1. Introducing Data Cleaning
2. Reading our MoMA Data Set
3. Replacing Substrings with the replace Method
4. Cleaning the Nationality and Gender Columns
5. String Capitalization
6. Errors During Data Cleaning
7. Parsing Numbers from Complex Strings, Part One
8. Parsing Numbers from Complex Strings, Part Two
9. Next Steps
10. Takeaways


Course Info:

Intermediate

The median completion time for this course is 6 hours.

This course is free and includes four missions and one guided project. It is the second course in both the Data Analyst in Python path and the Data Scientist in Python path.

