MISSION 93

Spark SQL

Earlier in this path, you learned that data can be stored in pandas dataframes and in SQL databases. These structures work well for many datasets, but they're not ideal for massive amounts of data: processing can get very slow when these tools are used on big datasets.

In the previous mission, we learned how to read JSON into a Spark DataFrame, as well as some basic techniques for interacting with DataFrames. In this mission, we'll learn how to use Spark's SQL interface to query and interact with the data. Later on, we'll add other files to demonstrate how to take advantage of SQL to work with multiple data sets.

To help you learn about Spark DataFrames, you'll work with a JSON file containing data from the 2010 U.S. Census. You'll apply what you learn directly in your browser, so there's no need to use your own machine for the exercises. The course's Python environment includes answer checking, so you can confirm that you've mastered each concept before moving on to the next.

Objectives

  • Learn to query Spark DataFrames using SQL.
  • Learn how to work with multiple tables in Spark SQL.

Mission Outline

1. Overview
2. Register the DataFrame as a Table
3. Querying
4. Filtering
5. Mixing Functionality
6. Multiple Tables
7. Joins
8. SQL Functions
9. Takeaways


Course Info:

Intermediate

The median completion time for this course is 6 hours.

This course requires a premium subscription and includes five missions and one installation tutorial. It is the 31st course in the Data Scientist In Python path.


Take a Look Inside