At the beginning of this path, you learned that data can be stored in pandas DataFrames, and later that it can also be stored in a SQL database. These storage structures work well for many datasets, but they're not ideal for massive amounts of data, because processing becomes very slow as datasets grow.

In the previous two lessons, we covered the basics of PySpark, the MapReduce paradigm, transformations and actions, and how to do basic data cleanup in Spark. 

In this data science online challenge, you'll use the techniques you've learned to transform the text of Hamlet into a format that's more useful for data analysis. Knowing how to use Spark to analyze big data is a critical skill for a data scientist, and one that many employers actively look for when evaluating candidates.

At Dataquest, we're huge believers in learning through doing, and we hope this shows in your experience with the lessons. While lessons focus on introducing concepts, challenges allow you to perform deliberate practice by completing structured problems. Challenges will feel similar to lessons, but with little instructional material and a larger focus on exercises.

Objectives

  • Transforming data from text files into RDD objects.
  • Cleaning data using lambda functions.

Lesson Outline

1. Introduction
2. Extract Line Numbers
3. Remove Blank Values
4. Remove Pipe Characters
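
The three cleanup steps in the outline can be sketched with plain Python lambdas before moving to Spark. This is a minimal illustration only: the sample lines, the tab-delimited layout, the `hamlet@N` id format, and the `clean_lines` helper are all assumptions made for the example, not the actual challenge data or solution. In PySpark, the same lambdas would typically be passed to an RDD's `map` method after loading the file with `SparkContext.textFile`.

```python
# Hypothetical sample lines; the real file layout is an assumption here.
RAW_LINES = [
    "hamlet@0\t\tHAMLET",
    "hamlet@8\t|\tACT I",
    "hamlet@22\tBERNARDO\tWho's there?|",
]


def clean_lines(raw_lines):
    """Apply the outline's three cleanup steps to a list of text lines."""
    # Split each line into fields on tab characters.
    # (PySpark equivalent: rdd.map(lambda line: line.split("\t")))
    split = map(lambda line: line.split("\t"), raw_lines)

    # Extract line numbers: keep only the part after "@" in the first field.
    numbered = map(lambda f: [f[0].split("@")[1]] + f[1:], split)

    # Remove blank values: drop fields that are empty strings.
    no_blanks = map(lambda f: [v for v in f if v != ""], numbered)

    # Remove pipe characters: strip "|" from every field, then drop any
    # field that consisted only of pipes.
    no_pipes = map(
        lambda f: [v.replace("|", "") for v in f if v.replace("|", "") != ""],
        no_blanks,
    )
    return list(no_pipes)


cleaned = clean_lines(RAW_LINES)
for row in cleaned:
    print(row)
```

Each step takes the output of the previous one, which mirrors how a chain of RDD transformations is built up lazily and only evaluated when an action such as `take` or `collect` is called.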