Challenge: Transforming Hamlet into a Data Set

At the beginning of this path, you learned that data can be stored in pandas dataframes, and later that it can also be stored in a SQL database. These storage structures work well for some data, but they're not ideal for massive datasets, where processing with these tools can become very slow.

In the previous two missions, we covered the basics of PySpark, the MapReduce paradigm, transformations and actions, and how to do basic data cleanup in Spark. 

In this data science online challenge, you'll use the techniques you've learned to transform the text of Hamlet into a format that's more useful for data analysis. For a data scientist, knowing how to use Spark to analyze big data is a critical skill, and one that many employers actively look for when evaluating candidates.

At Dataquest, we're huge believers in learning through doing, and we hope this shows in your experience with the missions. While missions focus on introducing concepts, challenges allow you to perform deliberate practice by completing structured problems. Challenges will feel similar to missions, but with little instructional material and a larger focus on exercises.


This challenge covers:

  • Transforming data from text files into RDD objects.
  • Cleaning data using lambda functions.
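
As a rough illustration of the first skill, the sketch below splits raw text lines into fields with a lambda. It is a plain-Python stand-in, not the challenge's own code: the sample lines, the tab delimiter, and the `hamlet@N` identifiers are all assumptions made for illustration. In PySpark, the same lambda would typically be passed to `map` on an RDD created with `sc.textFile`.

```python
# Hypothetical sample lines; the layout of the real Hamlet file is an
# assumption here (tab-separated fields, an id such as "hamlet@0" first).
raw_lines = [
    "hamlet@0\t\tHAMLET",
    "hamlet@1",
    "hamlet@2\tACT I",
]

# Plain-Python stand-in for an RDD transformation. With PySpark this would
# look roughly like: sc.textFile(path).map(lambda line: line.split("\t"))
split_lines = list(map(lambda line: line.split("\t"), raw_lines))

for record in split_lines:
    print(record)
```

Each line becomes a list of fields, which is the shape the later cleaning steps work on. Empty strings appear wherever two tabs sit next to each other.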

Mission Outline

1. Introduction
2. Extract Line Numbers
3. Remove Blank Values
4. Remove Pipe Characters
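
The three cleaning steps in the outline could be sketched as follows. This is a minimal plain-Python approximation under stated assumptions, not the mission's solution: the sample records, the `hamlet@N` id format, and the helper names `extract_line_number` and `remove_pipes` are hypothetical. In PySpark, the same functions would typically be passed to `map` and `filter` on an RDD.

```python
# Hypothetical records, already split into fields (assumed format).
records = [
    ["hamlet@0", "", "HAMLET"],
    ["hamlet@1"],
    ["hamlet@2", "|", "ACT I"],
    ["hamlet@3", "Who's there?|"],
]

def extract_line_number(record):
    # Assumes the first field looks like "hamlet@12"; keep only the "12".
    return [record[0].split("@")[-1]] + record[1:]

def remove_pipes(record):
    # Strip pipe characters, then drop fields that end up empty.
    stripped = [field.replace("|", "") for field in record]
    return [field for field in stripped if field != ""]

# Step 2 of the outline: extract line numbers.
numbered = list(map(extract_line_number, records))

# Step 3: drop blank fields, then records that hold nothing but an id.
non_blank = list(map(lambda r: [f for f in r if f != ""], numbered))
non_blank = list(filter(lambda r: len(r) > 1, non_blank))

# Step 4: remove pipe characters.
cleaned = list(map(remove_pipes, non_blank))

for record in cleaned:
    print(record)
```

Keeping each step as its own `map` or `filter` call mirrors how chained RDD transformations are usually written, so the sketch should carry over with little change.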


Course Info:


The median completion time for this course is 6 hours.

This course requires a premium subscription and includes five missions and one installation tutorial. It is the 31st course in the Data Scientist In Python path.

