MISSION 400

Advanced Regular Expressions

In this Advanced Regular Expressions with R mission, you will build on what you learned in the Regular Expressions Basics mission to expand your knowledge of string manipulation for advanced data cleaning tasks.

You'll dive into regex techniques like lookarounds, backreferences, substitution, and more. As in the previous mission, you'll use R's stringr package to work with regular expressions inside your R code, doing things like searching for text patterns and replacing them with other text. In addition to the stringr package, you'll also learn to use string functions with regular expressions to get a better grasp of how regex fits into a typical data cleaning workflow.

In this mission, you will continue working with real text data from Hacker News submissions, using your new regex skills to further clean the data set. Because you'll be working with real-world data, you will get the opportunity to think like a data scientist. By the end of this mission, you will have a solid grasp of regular expressions and how to use them for powerful string manipulation with R and stringr.

Objectives

  • Learn to clean data by substituting regular expression matches.
  • Use lookarounds to include and exclude text before/after your regex match.
  • Learn to use capture groups to extract new columns from text data.

Mission Outline

1. Introduction
2. Capture Groups
3. Using Capture Groups to Extract Data
4. Counting Mentions of the 'C' Language
5. Using Lookarounds to Control Matches Based on Surrounding Text
6. Backreferences: Using Capture Groups in a Regex Pattern
7. Challenge: Cleaning Our Data Set
8. Substituting Regular Expression Matches
9. Extracting Domains from URLs
10. Extracting URL Parts Using Multiple Capture Groups
11. Next Steps
12. Takeaways

Course Info:

Intermediate

The median completion time for this course is 7 hours.

This course requires a basic subscription and includes four missions. It is the sixth course in the Data Analyst in R path.
