At Dataquest, we've released an interactive course on Spark, with a focus on PySpark. We explore the fundamentals of Map-Reduce and how to use PySpark to clean, transform, and munge data. In this post, we'll dive into how to install PySpark locally on your own computer and how to integrate it into the Jupyter Notebook workflow.
Some familiarity with the command line will be necessary to complete the installation.
At a high level, these are the steps to install PySpark and integrate it with Jupyter notebook:
- Install the required packages below
- Download and build Spark
- Set your environment variables
- Create a Jupyter profile for PySpark
The required packages are:
- Java SE Development Kit
- Scala Build Tool
- Spark 1.5.1 (at the time of writing)
- Python 2.6 or higher (we prefer to use Python 3.4+)
- Jupyter Notebook
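Before going further, it can help to confirm which of these tools are already installed. Here's a minimal sketch that looks for them on your PATH. The command names (java, sbt, jupyter) are our assumption about how each tool is exposed on your system, and shutil.which needs Python 3.3 or later.

```python
import shutil

# Command names are assumptions about how each tool is exposed on PATH.
REQUIRED_COMMANDS = ["java", "sbt", "jupyter"]


def find_missing(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_COMMANDS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required commands were found.")
```

Anything reported as missing still needs to be installed using the steps in this post.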
Spark requires Java 7+, which you can download from Oracle's website:
Head to the Spark downloads page, keep the default options in steps 1 to 3, and download a zipped version (.tgz file) of Spark from the link in step 4. Once you've downloaded Spark, we recommend unzipping the folder and moving the unzipped folder to your home directory.
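If you'd rather script the unzipping step, the following sketch uses Python's standard tarfile module to extract the archive into your home directory. The function name and the archive filename in the usage comment are hypothetical, and the sketch assumes the archive contains a single top-level folder and comes from a source you trust.

```python
import tarfile
from pathlib import Path


def extract_spark(archive_path, dest_dir=None):
    """Extract a .tgz archive into dest_dir and return the top-level folder."""
    dest = Path(dest_dir) if dest_dir is not None else Path.home()
    with tarfile.open(archive_path, "r:gz") as archive:
        # Assumes every member lives under one top-level folder.
        top_level = archive.getnames()[0].split("/")[0]
        archive.extractall(dest)
    return dest / top_level


# Hypothetical usage; adjust the filename to the archive you downloaded:
# spark_dir = extract_spark(Path.home() / "Downloads" / "spark-1.5.1.tgz")
```

The returned path is the folder you'll build Spark in and later point SPARK_HOME at.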
Scala build tool
- To build Spark, you'll need the Scala build tool, which you can install:
- Mac: brew install sbt
- Linux: instructions
- Navigate to the directory you unzipped Spark to and run sbt assembly within that directory (this should take a while!).
To test that Spark was built properly, run the following command in the same folder (where Spark resides):
and the interactive PySpark shell should start up. The shell is similar to Jupyter, and if you run sc in it, you'll see the SparkContext object already initialized. You can write and run commands interactively in this shell just as you can in Jupyter.
Environment variables are global variables that any program on your computer can access. They hold settings and pieces of information that you want all programs to be able to read. In our case, we need to specify the location of Spark and add some special arguments that we reference later.
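To illustrate how a program reads these settings, here's a minimal Python sketch that looks up the two variables this section sets, SPARK_HOME and PYSPARK_SUBMIT_ARGS, and reports anything that looks wrong. The function name and the specific checks are our own assumptions, not part of Spark. Variables added to ~/.bash_profile are only picked up by new terminal sessions, so run it from a freshly opened terminal once you've completed this section.

```python
import os


def check_spark_env(environ=os.environ):
    """Return a list of human-readable problems with the Spark settings."""
    problems = []
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(os.path.join(spark_home, "python")):
        # A Spark folder is expected to contain a python/ subfolder.
        problems.append("SPARK_HOME has no python/ subfolder: " + spark_home)
    if "--master" not in environ.get("PYSPARK_SUBMIT_ARGS", ""):
        problems.append("PYSPARK_SUBMIT_ARGS does not set --master")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env():
        print(problem)
```

If the script prints nothing, both variables are visible to Python and SPARK_HOME points at a folder that looks like a Spark directory.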
Use a text editor such as vim to open ~/.bash_profile and add the following lines at the end:
export SPARK_HOME="$HOME/spark-1.5.1"
export PYSPARK_SUBMIT_ARGS="--master local"
Replace "$HOME/spark-1.5.1" with the location of the folder you unzipped Spark to (and also make sure the version numbers match!).
The last step is to create a profile for Jupyter specifically for PySpark with some custom settings. To create this profile, run:
jupyter profile create pyspark
Use a text editor such as vim to create a Python script at the following location:
and then add the following to it:
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))

spark_release_file = spark_home + "/RELEASE"

if os.path.exists(spark_release_file) and "Spark 1.5" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if not "pyspark-shell" in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
If you're using a version later than Spark 1.5, replace "Spark 1.5" in the script with the version you're using.
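The script also hardcodes the py4j version (0.8.2.1) in the path to the py4j zip file, and that filename can differ between Spark releases. As a hedged alternative, here's a small helper that searches for the zip instead. The function name is hypothetical, and the sketch assumes your Spark folder keeps the same python/lib layout used in the script.

```python
import glob
import os


def find_py4j_zip(spark_home):
    """Return the path of the py4j source zip under spark_home, or None."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    # If several versions are present, take the last one in sorted order.
    return matches[-1] if matches else None
```

In the startup script, you could then pass the result of find_py4j_zip(spark_home) to sys.path.insert in place of the hardcoded path, after checking that it isn't None.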
To start Jupyter Notebook with the pyspark profile, run:
jupyter notebook --profile=pyspark
To test that PySpark was loaded properly, create a new notebook and run sc in one of the code cells to make sure the SparkContext object was initialized properly.
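Seeing the SparkContext object is a good sign, but running a tiny job confirms that Spark can actually do work. Here's a minimal sketch of such a check. The function name is our own, and it assumes sc is the SparkContext created by the pyspark profile. It only uses the standard parallelize, map, and collect calls.

```python
def spark_smoke_test(sc, n=10):
    """Square the numbers 0..n-1 on Spark and return them as a list."""
    rdd = sc.parallelize(range(n))
    return rdd.map(lambda x: x * x).collect()


# In a notebook cell started with the pyspark profile, where sc already exists:
# spark_smoke_test(sc, 5)   # expected: [0, 1, 4, 9, 16]
```

If the call returns the list of squares, PySpark is wired into Jupyter correctly.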
If you'd like to learn Spark in more detail, you can take our interactive Spark course on Dataquest.