MISSION 369

Advanced Regular Expressions

In our Advanced Regular Expressions with Python mission, you will build on what you learned in the Regular Expressions Basics mission and expand your string manipulation skills for advanced data cleaning tasks, using regex techniques like lookarounds, backreferences, substitution, and more.

In addition to learning regular expression concepts, you will learn how to use regular expressions in your Python and pandas code. Python's re module lets you search for text patterns, replace them with other text, and perform other regex operations. In addition to the re module, you'll use pandas string methods that accept regular expressions, so you can apply a pattern to an entire column of data at once.
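As a rough illustration of the two approaches described above, here is a minimal sketch that uses the re module on single strings and a pandas string method on a whole column. The sample titles and variable names are invented for this example and are not taken from the mission's dataset.

```python
import re
import pandas as pd

# Hypothetical sample titles, invented for illustration only.
titles = pd.Series([
    "Show HN: A Python regex tester",
    "Why SQL still matters",
    "Learning python the hard way",
])

# re module: search a single string for a pattern, ignoring case.
match = re.search(r"python", titles[0], flags=re.IGNORECASE)
first_hit = match.group() if match else None

# re module: replace every match of a pattern with other text.
normalized = re.sub(r"[Pp]ython", "Python", titles[2])

# pandas: test the same kind of pattern against every row of a column.
mentions_python = titles.str.contains(r"[Pp]ython", regex=True)
python_count = int(mentions_python.sum())

print(first_hit)
print(normalized)
print(python_count)
```

The re functions work on one string at a time, while the pandas `.str` methods run the pattern across every value in a Series.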

In this mission, you will work with data from Hacker News, which gives you a thorough overview of regular expressions and how powerful they can be in your data cleaning tasks. Because you'll be working with real-world data, you will get the opportunity to think like a data analyst or data scientist as you explore a dataset.

By the end of this mission, you will have a better working knowledge of regular expressions and how to use them to do some powerful string manipulation. 

Objectives

  • Learn to clean data by substituting regular expression matches.
  • Use lookarounds to include and exclude text before/after your regex match.
  • Learn to use capture groups to extract new columns from text data.
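To give a feel for the three objectives above, the following sketch shows one possible use of each technique: substitution, lookarounds, and capture groups. The sample titles and the exact patterns are assumptions made for this example; the mission itself may use different data and patterns.

```python
import re
import pandas as pd

# Hypothetical sample titles, invented for illustration only.
titles = pd.Series([
    "Email vs. e-mail: a style guide [2015]",
    "Series C funding explained",
    "C and C++ memory models [2012]",
])

# Substitution: normalize spelling variants of "email" to a single form.
cleaned = titles.str.replace(
    r"\be[\-\s]?mail", "email", flags=re.IGNORECASE, regex=True
)

# Lookarounds: match a standalone "C" that is not preceded by "Series "
# and not followed by "+" or ".", so "Series C" and "C++" are excluded.
c_pattern = r"(?<!Series\s)\bC\b(?![+.])"
mentions_c = titles.str.contains(c_pattern, regex=True)

# Capture group: pull the four-digit year inside square brackets
# into a new Series (missing values where no year is present).
years = titles.str.extract(r"\[(\d{4})\]", expand=False)

print(cleaned[0])
print(mentions_c.tolist())
print(years.tolist())
```

Note that the third title still counts as a mention of C because its first "C" stands alone, even though the "C" in "C++" is rejected by the negative lookahead.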

Mission Outline

1. Introduction
2. Capture Groups
3. Using Capture Groups to Extract Data
4. Counting Mentions of the 'C' Language
5. Using Lookarounds to Control Matches Based on Surrounding Text
6. Backreferences: Using Capture Groups in a Regex Pattern
7. Substituting Regular Expression Matches
8. Extracting Domains from URLs
9. Extracting URL Parts Using Multiple Capture Groups
10. Using Named Capture Groups to Extract Data
11. Next Steps
12. Takeaways
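
Two of the topics in the outline above, backreferences and named capture groups, can be sketched briefly with the re module. The sample strings, the URL, and the group names below are hypothetical and chosen only for illustration; the URL pattern is deliberately simple and does not handle every valid URL.

```python
import re

# Backreference: \1 matches the same text that group 1 captured,
# so this pattern finds a word repeated twice in a row.
doubled = re.search(r"\b(\w+)\s\1\b", "This is is a test")
doubled_word = doubled.group(1) if doubled else None

# Named capture groups: label each part of a URL so the pieces
# can be retrieved by name instead of by position.
url_pattern = re.compile(
    r"(?P<protocol>https?)://(?P<domain>[\w.\-]+)/?(?P<path>.*)"
)
m = url_pattern.search("https://www.example.com/a/b.html")
parts = m.groupdict() if m else {}

print(doubled_word)
print(parts)
```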


Course Info:

Intermediate

The median completion time for this course is 7 hours.

This course requires a basic subscription and includes four missions. It is the sixth course in both the Data Analyst in Python path and the Data Scientist in Python path.
