Find yourself working with massive data sets regularly? Learn how to use Apache Spark and the map-reduce technique to clean and analyze “big data” in this Apache Spark and PySpark course.
Big data is all around us, and Spark is quickly becoming an in-demand tool that employers expect of candidates who will work with large data sets. If you want to build cutting-edge skills that stand out to employers, this introductory Spark course is a great place to start.
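The map-reduce technique mentioned above can be sketched in plain Python before ever touching Spark: map each record into key-value pairs, then reduce the pairs by key. Here is a minimal word-count sketch using only the standard library (the sample lines are made up for illustration; in Spark, the same pattern runs in parallel across a cluster):

```python
# A plain-Python sketch of map-reduce -- no Spark required.
from functools import reduce

lines = ["to be or not to be", "that is the question"]

# Map: turn each line into (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce: sum the counts for each word.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts["to"])  # 2
```

Spark's RDD API exposes this same pattern through methods like `map` and `reduceByKey`, but distributes the work across many machines.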
In this course, you will learn what Apache Spark is and when it is advantageous to use. You'll learn such concepts as Resilient Distributed Datasets (RDDs), Spark SQL, Spark DataFrames, and the difference between pandas and Spark DataFrames.
You will also learn how to install Spark and PySpark, the Python API that lets you interact with Spark from Python code. We will also walk you through integrating PySpark with Jupyter Notebook so you can analyze large datasets from the comfort of a notebook.
In this course, you'll be working with a variety of real-world data sets, including the text of Hamlet, census data, and guest data from The Daily Show.
We also offer a free tutorial on Apache Spark that you can check out as well.
Throughout this course, you'll learn Spark and map-reduce through the following lessons:
Introduction to Spark
Learn the basics of Spark by analyzing guests on The Daily Show.
Project: Spark Installation and Jupyter Notebook Integration
Learn how to set up PySpark and integrate it with Jupyter Notebook.
Transformations and Actions
Learn more about transformations and actions while cleaning up the text of Hamlet.
Challenge: Transforming Hamlet into a Data Set
Practice using Spark to transform the text of Hamlet into a usable data set.
Spark DataFrames
Learn the basics of Spark DataFrames by working with census data.
Spark SQL
Learn the basics of Spark SQL by working with census data.