At the beginning of this path, you learned that data can be stored in pandas dataframes, and later that it can also be stored in a database. These storage structures work well for many datasets, but they're not ideal for massive amounts of data, because processing becomes very slow as the dataset grows.

In the last lesson, we introduced the Spark cluster computing framework and explored some basic PySpark methods, all within the Dataquest interface. 

In this project, we'll walk through how to install Spark on your own computer and integrate PySpark with Jupyter Notebook. We can use Spark in two modes:

  • Local mode - The entire Spark application runs on a single machine. Local mode is what you'll use to prototype Spark code on your own computer. It's also easier to set up.
  • Cluster mode - The Spark application runs across multiple machines. Cluster mode is what you'll use to run your Spark application in a cloud environment like Amazon Web Services, Microsoft Azure, or DigitalOcean.
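
To make the difference between the two modes more concrete, here is a minimal sketch of how the choice typically appears in code: as the "master URL" that a Spark application is given when it starts. The helper name `spark_master_url` and the host `master.example.com` are made up for illustration, and the cluster form shown assumes a Spark standalone cluster listening on its default port, 7077. Other cluster managers use different URL forms.

```python
def spark_master_url(mode, host=None, port=7077):
    """Return a Spark master URL for the given mode (hypothetical helper)."""
    if mode == "local":
        # local[*] runs Spark on this machine, with one worker thread per core
        return "local[*]"
    if mode == "cluster":
        if host is None:
            raise ValueError("cluster mode needs the host of the cluster's master node")
        # spark://HOST:PORT is the URL form for a Spark standalone cluster
        # (assumed here; 7077 is the standalone master's default port)
        return f"spark://{host}:{port}"
    raise ValueError(f"unknown mode: {mode!r}")


print(spark_master_url("local"))
print(spark_master_url("cluster", host="master.example.com"))
```

Once PySpark is installed, a string like this is what you would pass to `SparkSession.builder.master(...)` before calling `getOrCreate()`. The sketch itself doesn't need Spark to run.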

For this project, we'll walk through the instructions for installing Spark in local mode. Afterwards, you'll be able to use Spark with Jupyter Notebook for personal projects or technical interviews, which can help you land a job in big data!


  • Learn how to install Spark and PySpark.
  • Learn to integrate PySpark with Jupyter Notebook.

Lesson Outline

1. Introduction
2. Java
3. Spark
4. PySpark Shell
5. Jupyter Notebook
6. Testing your Installation