Big data is all around us, and Apache Spark is quickly becoming one of the in-demand big data tools that employers look for.
In this course, you’ll learn the advantages of Apache Spark, along with core concepts such as Resilient Distributed Datasets (RDDs), Spark SQL, Spark DataFrames, and the difference between pandas and Spark DataFrames.
You’ll also learn how to install Spark and PySpark, the Python API for Spark, and how to integrate PySpark with Jupyter Notebook so you can analyze large datasets.
Best of all, you’ll learn by doing: you’ll practice and get feedback directly in the browser. You’ll work with a variety of real-world datasets, including the text of Hamlet, census data, and guest data from The Daily Show.
In this course, you’ll practice:
- Breaking down tasks using the map-reduce framework
- Processing and transforming larger, raw files using Spark
- Working with large, unstructured datasets using Spark SQL and Spark DataFrames
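To give a feel for what "breaking down tasks using the map-reduce framework" means, here is a minimal sketch in plain Python rather than Spark. The function names (`map_phase`, `reduce_phase`, `word_count`) and the sample lines are hypothetical and are not taken from the course materials. Each line is mapped to a partial word count, and the partial counts are then reduced into one total.

```python
from collections import Counter
from functools import reduce


def map_phase(line):
    """Map step: turn one line of text into a Counter of its words."""
    return Counter(line.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(lines):
    """Run the map step on every line, then fold the partial results together."""
    partial_counts = map(map_phase, lines)
    return reduce(reduce_phase, partial_counts, Counter())


# Hypothetical sample input, not taken from any course dataset.
sample_lines = ["to be or not to be", "that is the question"]
counts = word_count(sample_lines)
print(counts["to"], counts["be"], counts["question"])
```

Because each line is counted independently in the map step, the lines could in principle be processed on different machines before the results are merged, which is the idea that Spark applies at scale.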
Analyzing Large Datasets in Spark and Map-Reduce [6 lessons]
Introduction to Spark (1h)
Lesson Objectives
- Explain the history of big data
- Define how the RDD object works in Spark
- Count using Spark
Project: Spark Installation and Jupyter Notebook Integration (1h)
Lesson Objectives
- Install Spark and PySpark
- Integrate PySpark with Jupyter Notebook
Transformations and Actions (1h)
Lesson Objectives
- Read TSV files into Spark
- Apply lambda functions over RDD objects
Challenge: Transforming Hamlet into a Data Set (1h)
Lesson Objectives
- Transform data from text files into RDD objects
- Clean data using lambda functions
Spark DataFrames (1h)
Lesson Objectives
- Employ Spark DataFrames
- Identify the difference between pandas and Spark DataFrames
- Perform basic filters with Spark DataFrames
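Several of the objectives above involve applying lambda functions to tab-separated text and filtering rows. The sketch below shows that pattern in plain Python using the built-in `map` and `filter`, as a stand-in for the `map` and `filter` methods on a PySpark RDD. It does not use Spark itself, and the rows and column names are made up for illustration rather than taken from the course datasets.

```python
# Hypothetical tab-separated rows (year, occupation, name); made up for illustration.
raw_lines = [
    "YEAR\tOCCUPATION\tNAME",
    "1999\tactor\tGuest A",
    "1999\tcomedian\tGuest B",
    "2000\tactor\tGuest C",
]

# Split each line on tabs, much as a lambda passed to an RDD's map would.
rows = list(map(lambda line: line.split("\t"), raw_lines))

# Drop the header row, much as a lambda passed to an RDD's filter would.
records = list(filter(lambda row: row[0] != "YEAR", rows))

# Keep only the 1999 rows and pull out the occupation column.
occupations_1999 = [row[1] for row in records if row[0] == "1999"]
print(occupations_1999)
```

One difference worth keeping in mind: the built-ins here run on a single machine over an in-memory list, whereas Spark evaluates transformations lazily and distributes the work across partitions.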
Projects in this course
Project: Spark Installation and Jupyter Notebook Integration
Learn how to set up PySpark and integrate it with Jupyter Notebook.
The Dataquest guarantee
Dataquest has helped thousands of people start new careers in data. If you put in the work and follow our path, you’ll master data skills and grow your career.
We believe so strongly in our paths that we offer a full satisfaction guarantee. If you complete a career path on Dataquest and aren’t satisfied with your outcome, we’ll give you a refund.