At the beginning of this path, you learned that data can be stored in pandas DataFrames, and later that it can also be stored in a SQL database. While these storage structures work well for some data, they’re not ideal for massive datasets, where processing with these tools can become very slow.
In the previous lesson, we learned how to read JSON into a Spark DataFrame, as well as some basic techniques for interacting with DataFrames. In this lesson, we’ll learn how to use Spark’s SQL interface to query and interact with the data. Later on, we’ll add other files to demonstrate how to take advantage of SQL to work with multiple data sets.
To facilitate your learning about Spark DataFrames, you will work with a JSON file containing data from the 2010 U.S. Census. You’ll apply what you’ve learned directly in your browser, so there’s no need to use your own machine for the exercises. The Python environment in this course includes answer checking, so you can make sure you’ve mastered each concept before moving on to the next.
- Learn to query Spark DataFrames using SQL.
- Learn how to work with multiple tables in Spark SQL.
- Register the DataFrame as a Table
- Mixing Functionality
- Multiple Tables
- SQL Functions