Processing Data with MapReduce

In the past few lessons, we learned about I/O-bound and CPU-bound programs. We also learned about threads, processes, and in which situations to use each. In this lesson, we'll bring everything together and learn how to process a dataset using what we've learned.

In this lesson, we'll work with a dataset of movie quotes. The quotes are taken from the scripts of 1,068 movies, constitute 894,014 lines altogether, and take up 56 megabytes of disk space. We've limited the dataset's size so the examples in this lesson execute quickly for learning, but the techniques we'll use scale easily to much larger datasets.

To process the data, you'll use a paradigm called MapReduce, which is widely used in data processing tools. If you're not yet familiar with MapReduce and Spark, you can get up to speed in our Spark and MapReduce course.
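At its core, the MapReduce paradigm has two steps: a "map" step that transforms each input independently, and a "reduce" step that combines the mapped results into a single answer. Here's a minimal sketch in plain Python (the quotes are made up for illustration; they aren't from the course dataset):

```python
from functools import reduce

# A few sample quote lines (illustrative only).
lines = [
    "may the force be with you",
    "i'll be back",
    "here's looking at you kid",
]

# Map step: compute the length of each quote independently.
lengths = list(map(len, lines))

# Reduce step: combine the mapped values into a single result.
total_characters = reduce(lambda acc, n: acc + n, lengths, 0)
print(total_characters)
```

Because each map call is independent, the map step is exactly the part that can be spread across multiple processes, which is what we'll do in this lesson.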

As you work through processing data with parallel processing, you’ll get to apply what you’ve learned from within your browser; there’s no need to use your own machine to do the exercises. The Python environment inside of this course includes answer-checking to ensure you’ve fully mastered each concept before learning the next.


Lesson Objectives

  • How to choose a library for parallel computing.
  • Why and how to use process pools.
  • How to debug parallel processing code.
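As a preview of the process-pool idea, here's a hedged sketch (not the course's own code) of how `concurrent.futures.ProcessPoolExecutor` parallelizes a CPU-bound map step; the `count_words` helper and sample quotes are assumptions for illustration:

```python
from concurrent.futures import ProcessPoolExecutor

def count_words(line):
    # CPU-bound work to run in a worker process; it must be a
    # top-level function so it can be pickled and sent to the pool.
    return len(line.split())

if __name__ == "__main__":
    quotes = ["may the force be with you", "i'll be back"]
    with ProcessPoolExecutor() as pool:
        # pool.map distributes the calls across worker processes
        # and returns results in input order.
        counts = list(pool.map(count_words, quotes))
    print(counts)  # [6, 3]
```

The `if __name__ == "__main__":` guard matters here: on platforms that start worker processes by re-importing the main module, omitting it can cause the pool to spawn recursively.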

Lesson Outline

  1. Movie Quotes Data
  2. The Concurrent Futures Package
  3. Reading In Files
  4. Finding The Longest Lines
  5. Finding The Most Commonly Used Word
  6. Debugging Errors
  7. Removing Punctuation
  8. Finding Word Frequencies
  9. Next Steps
  10. Takeaways
