Tutorial: Installing and Integrating PySpark with Jupyter Notebook

At Dataquest, we’ve released an interactive course on Spark, with a focus on PySpark. We explore the fundamentals of Map-Reduce and how to use PySpark to clean, transform, and munge data. In this post, we’ll dive into how to install PySpark locally on your own computer and how to integrate it into the Jupyter Notebook workflow. Some familiarity with the command line will be necessary to complete the installation.
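As a refresher, the Map-Reduce pattern the course builds on can be sketched in plain Python. This is a toy, single-machine sketch; Spark applies the same map and reduce steps across a cluster:

```python
from functools import reduce

# Toy single-machine Map-Reduce: total up word lengths.
words = ["spark", "jupyter", "pyspark", "notebook"]

mapped = [len(w) for w in words]            # map: word -> length
total = reduce(lambda a, b: a + b, mapped)  # reduce: combine the lengths

print(total)  # 5 + 7 + 7 + 8 = 27
```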


At a high level, these are the steps to install PySpark and integrate it with Jupyter notebook:

  1. Install the required packages below
  2. Download and build Spark
  3. Set your environment variables
  4. Create a Jupyter profile for PySpark

Required packages

  • Java SE Development Kit
  • Scala Build Tool
  • Spark 1.5.1 (at the time of writing)
  • Python 2.6 or higher (we prefer to use Python 3.4+)
  • Jupyter Notebook


Java

Spark requires Java 7 or higher, which you can download from Oracle’s website.
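To check whether a suitable JDK is already installed, you can run a quick, safe check (the exact version string varies by vendor):

```shell
# Print the installed Java version, if any; Spark needs 7+.
if command -v java >/dev/null 2>&1; then
    java -version
else
    echo "Java not found -- install the JDK first."
fi
```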


Spark

Head to the Spark downloads page, keep the default options in steps 1 to 3, and download the zipped version (.tgz file) of Spark from the link in step 4. Once you’ve downloaded Spark, we recommend unzipping the folder and moving the unzipped folder to your home directory.
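The unzip-and-move step looks roughly like this. The archive path and name below are assumptions; match them to the file you actually downloaded (the guard just makes the snippet safe to run as-is):

```shell
# Adjust ARCHIVE to the .tgz you downloaded from the Spark site.
ARCHIVE="$HOME/Downloads/spark-1.5.1.tgz"
if [ -f "$ARCHIVE" ]; then
    tar -xzf "$ARCHIVE" -C "$HOME"   # unzip into your home directory
else
    echo "Archive not found at $ARCHIVE -- adjust the path above."
fi
```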

Scala build tool

  1. To build Spark, you’ll need the Scala build tool (sbt), which you can install from its website.
  2. Navigate to the directory you unzipped Spark to and run sbt assembly within that directory (this should take a while!).
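Concretely, assuming Spark was unzipped to ~/spark-1.5.1 (adjust the path if yours differs), the build step is:

```shell
# Build Spark with sbt (this takes a while the first time).
if [ -d "$HOME/spark-1.5.1" ]; then
    cd "$HOME/spark-1.5.1" && sbt assembly
else
    echo "Spark directory not found -- adjust the path to where you unzipped it."
fi
```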


To test that Spark was built properly, run the following command in the same folder (where Spark resides):

./bin/pyspark

The interactive PySpark shell should start up. It’s similar to Jupyter: if you run sc in the shell, you’ll see that the SparkContext object has already been initialized, and you can write and run commands interactively just as you can in a notebook cell.
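Here’s a quick smoke test you can type at the PySpark prompt, where sc is already defined. The small stand-in classes below are not part of Spark; they only mimic enough of the RDD API so the snippet also runs outside Spark for illustration:

```python
# At the PySpark prompt, `sc` already exists; the stand-ins below just
# let this example run anywhere.
class _FakeRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return _FakeRDD(map(f, self.data))

    def sum(self):
        return sum(self.data)

class _FakeContext:
    def parallelize(self, data):
        return _FakeRDD(data)

try:
    sc  # provided for you inside the PySpark shell
except NameError:
    sc = _FakeContext()  # stand-in so the example runs anywhere

# Distribute 0..9, square each element, then add them up.
total = sc.parallelize(range(10)).map(lambda x: x * x).sum()
print(total)  # 0 + 1 + 4 + ... + 81 = 285
```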

Environment variables

Environment variables are global variables that any program on your computer can access; they hold settings and pieces of information you want every program to see. In our case, we need to specify the location of Spark and add some special arguments, which we’ll reference later. Use nano or vim to open ~/.bash_profile and add the following lines at the end:

export SPARK_HOME="$HOME/spark-1.5.1"
export PYSPARK_SUBMIT_ARGS="--master local[2]"

Replace "$HOME/spark-1.5.1" with the location of the folder you unzipped Spark to (and also make sure the version numbers match!).
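For the new variables to take effect in your current shell session, reload the profile and confirm they are set (the values printed will depend on your machine; blank output means the export didn’t take):

```shell
# Reload the profile so the exports take effect in this session.
[ -f "$HOME/.bash_profile" ] && . "$HOME/.bash_profile"
# Confirm the variables are set (prints blank lines if they aren't).
echo "$SPARK_HOME"
echo "$PYSPARK_SUBMIT_ARGS"
```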

Jupyter profile

The last step is to create a profile specifically for PySpark with some custom settings. Profiles are managed by IPython, which powers the kernel behind Jupyter, so to create this profile, run:

ipython profile create pyspark

Use nano or vim to create a Python script in the pyspark profile’s startup folder (~/.ipython/profile_pyspark/startup/ — any .py file there runs automatically when the kernel starts; a name like 00-pyspark-setup.py works) and add the following to it:

import os
import sys

# Find Spark using the SPARK_HOME environment variable we set earlier.
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.path.join(spark_home, 'python'))
# Complete this filename with the py4j archive bundled with your Spark
# release (look in $SPARK_HOME/python/lib).
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-'))

# Spark 1.4+ expects a "pyspark-shell" entry in PYSPARK_SUBMIT_ARGS, and
# shell.py reads that variable when it starts, so set it first.
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.5" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Run Spark's interactive shell setup, which initializes the SparkContext (sc).
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))

If you’re using a version of Spark later than 1.5, replace “Spark 1.5” in the script with the version you’re using.


To start Jupyter Notebook with the pyspark profile, run:

jupyter notebook --profile=pyspark

To test that PySpark was loaded properly, create a new notebook and run sc in one of the code cells to make sure the SparkContext object was initialized properly.

Next Steps

If you’d like to learn Spark in more detail, you can take our interactive Spark course on Dataquest.
