MISSION 171

Quickly Analyzing Data With Parallel Processing

In the past few lessons, we learned about I/O-bound and CPU-bound programs. We also learned about threads and processes, and in which situations to use each. In this lesson, we'll bring everything together and use those tools to process a dataset.

To do this, we'll be using a dataset of movie quotes throughout this lesson. The quotes are taken from the scripts of 1,068 movies, constitute 894,014 lines altogether, and take up 56 megabytes of disk space. We limited the dataset size so the examples in this lesson execute quickly, but the techniques we'll use scale easily to much larger datasets.

While processing the data, you'll use MapReduce, a paradigm employed by many data processing tools. If you're not familiar with MapReduce and Spark, you can learn about them in our Spark and MapReduce course.
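To make the paradigm concrete, here is a minimal sketch of MapReduce in plain Python, using a hypothetical list of quote lines in place of the real dataset: a map step transforms each line into an intermediate value, and a reduce step combines those values into one result.

```python
from functools import reduce

# A few sample quote lines standing in for the real dataset.
lines = [
    "may the force be with you",
    "i'll be back",
    "here's looking at you kid",
]

# Map step: transform each line into an intermediate value (its word count).
mapped = list(map(lambda line: len(line.split()), lines))

# Reduce step: combine the intermediate values into a single result.
total_words = reduce(lambda acc, count: acc + count, mapped, 0)
print(total_words)  # → 14
```

The same two-step shape applies whether the "map" runs in one process or is fanned out across a pool of workers.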

As you work through processing data with parallel processing, you’ll get to apply what you’ve learned from within your browser; there's no need to use your own machine to do the exercises. The Python environment inside of this course includes answer-checking to ensure you've fully mastered each concept before learning the next.

Objectives

  • How to choose which library to use for parallel computing.
  • Why and how to use process pools.
  • How to debug parallel processing code.
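As a preview of the process-pool objective, here is a small sketch using the standard library's `concurrent.futures` module. The workload (summing squares) is a stand-in chosen because it is CPU bound, which is the situation where processes beat threads.

```python
from concurrent.futures import ProcessPoolExecutor

def sum_of_squares(n):
    # CPU-bound work: sum the squares of the first n integers.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # The executor spreads the inputs across a pool of worker processes.
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(sum_of_squares, [10, 100, 1000]))
    print(results)  # → [285, 328350, 332833500]
```

Note the `if __name__ == "__main__":` guard: on platforms that spawn worker processes by importing the main module, omitting it is a common source of the debugging errors this lesson covers.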

Mission Outline

1. Movie Quotes Data
2. The Concurrent Futures Package
3. Reading In Files
4. Finding The Longest Lines
5. Finding The Most Commonly Used Word
6. Debugging Errors
7. Debugging Errors
8. Removing Punctuation
9. Finding Word Frequencies
10. Next Steps
11. Takeaways
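Steps 8 and 9 of the outline (removing punctuation, then finding word frequencies) can be sketched with the standard library alone; the sample line below is a hypothetical stand-in for the real quote data.

```python
import string
from collections import Counter

line = "Here's looking at you, kid. Here's to you!"

# Remove punctuation so that "kid." and "kid" count as the same word.
cleaned = line.lower().translate(str.maketrans("", "", string.punctuation))

# Count how often each word appears.
frequencies = Counter(cleaned.split())
print(frequencies.most_common(2))
```

In the lesson itself, a step like this would be mapped over many files in parallel and the per-file counters merged in a reduce step.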


Course Info:

Intermediate

The median completion time for this course is 4.7 hours.

This course requires a premium subscription and includes four missions and one guided project. It is the fourth course in the Data Engineer path.

