Big data is all around us, and Spark is quickly becoming one of the big data tools employers most want to see.
In this course, you’ll learn the advantages of Apache Spark and explore core concepts such as Resilient Distributed Datasets (RDDs), Spark SQL, Spark DataFrames, and the differences between pandas and Spark DataFrames.
You’ll also learn how to install Spark and PySpark, the Python API that lets you drive Spark from Python, and how to integrate PySpark with Jupyter Notebook so you can analyze large datasets interactively.
Best of all, you’ll learn by doing — you’ll practice and get feedback directly in the browser. You’ll work with a variety of real-world datasets, including the text of Hamlet, census data, and guest data from The Daily Show.
- Breaking down tasks using the map-reduce framework
- Processing and transforming large, raw files using Spark
- Working with large, unstructured datasets using Spark SQL and Spark DataFrames
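To give a flavor of the first skill above, here is a minimal sketch of the map-reduce pattern in plain Python (not Spark itself): the task is broken into a map step that emits key-value pairs and a reduce step that combines them, which is the same thinking you’ll apply to Spark RDD jobs. The sample lines are illustrative, not from the course datasets.

```python
from functools import reduce

# Sample input: a tiny "dataset" of text lines (illustrative only).
lines = ["to be or not to be", "that is the question"]

# Map step: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce step: combine the pairs by summing the counts for each word.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts["to"])  # → 2
```

In Spark, the same job would be expressed with `map` and `reduceByKey` on an RDD, but the decomposition into map and reduce steps is identical.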
Analyzing Large Datasets in Spark and Map-Reduce [6 lessons]
- Install Spark and PySpark
- Integrate PySpark with Jupyter Notebook
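As a sketch of what the two lessons above involve, one common way to install PySpark and wire it into Jupyter is via pip and the `PYSPARK_DRIVER_PYTHON` environment variables; exact steps vary by operating system and environment, and the course walks through them in detail.

```shell
# Install PySpark (which bundles Spark for local use) and Jupyter.
pip install pyspark jupyter

# Tell PySpark to use Jupyter Notebook as its driver, so that running
# `pyspark` opens a notebook with a SparkContext ready to use.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
pyspark
```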
The Dataquest guarantee
Dataquest has helped thousands of people start new careers in data. If you put in the work and follow our path, you’ll master data skills and grow your career.
We believe so strongly in our paths that we offer a full satisfaction guarantee. If you complete a career path on Dataquest and aren’t satisfied with your outcome, we’ll give you a refund.