Tutorial: Installing and Integrating PySpark with Jupyter Notebook
Overview
At a high level, these are the steps to install PySpark and integrate it with Jupyter notebook:
- Install the required packages below
- Download and build Spark
- Set your environment variables
- Create a Jupyter profile for PySpark
Required packages
- Java SE Development Kit
- Scala Build Tool
- Spark 1.5.1 (at the time of writing)
- Python 2.6 or higher (we prefer to use Python 3.4+)
- Jupyter Notebook
Java
Spark requires Java 7+, which you can download from Oracle's website.
Spark
Head to the Spark downloads page, keep the default options in steps 1 to 3, and download a zipped version (.tgz file) of Spark from the link in step 4. Once you've downloaded Spark, we recommend unzipping it and moving the unzipped folder to your home directory.
Scala build tool
To build Spark, you'll need the Scala build tool, which you can install:
- Mac:
brew install sbt
- Linux: instructions

Navigate to the directory you unzipped Spark to and run
sbt assembly
within that directory (this should take a while!).
Test
To test that Spark was built properly, run the following command in the same folder (where Spark resides):
bin/pyspark
and the interactive PySpark shell should start up. This shell is similar to Jupyter: if you run
sc
in it, you'll see the SparkContext object already initialized, and you can write and run commands interactively just like you can with Jupyter.
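As a quick sanity check inside the shell, you can run a small job end to end. This snippet assumes the shell's preinitialized sc object; the numbers are purely illustrative (the evens in 0-99 number exactly 50):

```
>>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
50
```

If that returns a count rather than an error, Spark's Python bindings and the JVM backend are wired up correctly.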
Environment variables
Environment variables are global variables that any program on your computer can access; they contain settings and pieces of information that you want all programs to have access to. In our case, we need to specify the location of Spark and add some special arguments, which we reference later. Use nano or vim to open ~/.bash_profile and add the following lines at the end:
export SPARK_HOME="$HOME/spark-1.5.1"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
Replace "$HOME/spark-1.5.1" with the location of the folder you unzipped Spark to (and make sure the version numbers match!). The --master local[2] argument tells Spark to run locally using two worker threads. Finally, run source ~/.bash_profile so the new variables take effect in your current shell session.
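To see how these settings get consumed, here is a minimal standalone sketch: it seeds the two variables with illustrative values mirroring the exports above (in a real session the shell sets them before Python starts), then reads them back the way any launched program would.

```python
import os

# Illustrative values mirroring the ~/.bash_profile exports above;
# on a real setup the shell exports these before Python starts.
os.environ.setdefault("SPARK_HOME", os.path.expanduser("~/spark-1.5.1"))
os.environ.setdefault("PYSPARK_SUBMIT_ARGS", "--master local[2]")

# Any program you launch from this shell (including Jupyter) can now
# read the same settings from its environment.
spark_home = os.environ.get("SPARK_HOME")
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")

print(spark_home)
print(submit_args)
```

This is exactly the mechanism the Jupyter startup script below relies on: it reads SPARK_HOME from the environment rather than hard-coding a path.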
Jupyter profile
The last step is to create a profile for Jupyter specifically for PySpark with some custom settings. To create this profile, run:
jupyter profile create pyspark
Use nano or vim to create a Python script at the following location:
~/.jupyter/profile_pyspark/startup/00-pyspark-setup.py
and add the following to it:
import os
import sys

# Locate the Spark installation using the environment variable set earlier.
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Spark 1.4+ expects PYSPARK_SUBMIT_ARGS to end with "pyspark-shell".
# This must be set before shell.py runs below, since shell.py reads it
# when it starts the SparkContext.
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.5" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Run Spark's own shell setup, which initializes the SparkContext as sc.
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))
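The version-check branch above only appends "pyspark-shell" when it isn't already present, so running the script twice won't duplicate the token. Here is that argument-handling logic isolated into a standalone sketch, with an illustrative starting value in place of your real environment:

```python
import os

# Illustrative starting value; in practice this comes from ~/.bash_profile.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2]"

# Append "pyspark-shell" only if it's missing, then write the result back.
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

print(os.environ["PYSPARK_SUBMIT_ARGS"])  # --master local[2] pyspark-shell
```

Making the append idempotent matters because startup scripts can run once per notebook kernel, not once per machine.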
If you're using a version later than Spark 1.5, replace "Spark 1.5" in the script with the version you're using.
Run
To start Jupyter Notebook with the pyspark profile, run:
jupyter notebook --profile=pyspark
To test that PySpark was loaded properly, create a new notebook and run
sc
in one of the code cells to confirm that the SparkContext object was initialized.
Next Steps
If you'd like to learn Spark in more detail, you can take our interactive Spark course on Dataquest.