In this lesson, you will learn to work with and prepare text data in Python. You will learn string manipulation techniques such as replacing substrings, capitalizing strings, and parsing numbers from complex strings. Techniques like these are critical for taking messy text data and converting it to a uniform format for easier analysis.
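As a rough preview, the three techniques named above can be sketched with Python's built-in string methods. The sample values below are hypothetical, invented for illustration rather than taken from the lesson's data set, and the lesson itself may approach each step differently.

```python
# Hypothetical messy values, invented for illustration only.
raw_nationality = "(american)"
raw_date = "c. 1912"

# Replacing substrings: swap each parenthesis for an empty string.
nationality = raw_nationality.replace("(", "").replace(")", "")

# Capitalizing: str.title() uppercases the first letter of each word.
nationality = nationality.title()

# Parsing a number from a complex string: drop the "c. " prefix,
# then convert the remaining digits to an integer.
year = int(raw_date.replace("c. ", ""))

print(nationality)  # American
print(year)         # 1912
```

Each step returns a new string rather than modifying the original, because Python strings are immutable. That is why the results are assigned back to a variable.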
In the Python for Data Science: Fundamentals course, we explored the basics of working with data using the Python programming language. In this course, you'll continue your Python journey with a focus on strings, object-oriented programming, dates and times, and many other concepts that are essential for a data scientist. In later courses, you'll build on this programming knowledge to learn data visualization, statistics, machine learning, and more.
In the first course, the data you worked with didn't have many quirks: all the values were in a consistent format. Data with a consistent format is often described as "clean." But in real life, not all the data we encounter is going to be clean; we often need to prepare it in a process called data cleaning (or sometimes, "data munging").
In this course, we'll introduce some basic data cleaning techniques. We cover more advanced techniques in our Data Cleaning in Python: Advanced course, but for now, we'll dig into basic data cleaning using a real-world data set about the artwork in the Museum of Modern Art (MoMA).
1. Introducing Data Cleaning
2. Reading our MoMA Data Set
3. Replacing Substrings with the replace Method
4. Cleaning the Nationality and Gender Columns
5. String Capitalization
6. Errors During Data Cleaning
7. Parsing Numbers from Complex Strings, Part One
8. Parsing Numbers from Complex Strings, Part Two
9. Next Steps