At the beginning of this path, you learned that data can be stored in pandas dataframes, and later that it can also be stored in a SQL database. These structures work well for many datasets, but they are a poor fit for massive amounts of data, because processing with these tools slows down considerably as a dataset grows.
In this online Spark dataframes lesson, you’ll continue using PySpark and learn about Spark dataframes and their advantages over pandas dataframes. The Spark dataframe was inspired by pandas and combines the scale and speed of Spark with the familiar query, filter, and analysis capabilities of pandas. Spark dataframes also let you adapt and reuse existing pandas code so that it scales to much larger datasets.
To help you learn about Spark dataframes, you will work with a JSON file containing data from the 2010 U.S. Census. You’ll apply what you learn directly in your browser, so there’s no need to use your own machine to do the exercises. The Python environment in this course includes answer checking, so you can confirm that you’ve fully mastered each concept before moving on to the next.
- Learn how to work with Spark dataframes.
- Learn the difference between pandas and Spark dataframes.
- Learn to perform basic filters with Spark dataframes.
- The Spark DataFrame: An Introduction
- Reading in Data
- Pandas vs Spark DataFrames
- Row Objects
- Selecting Columns
- Filtering Rows
- Using Column Comparisons as Filters
- Converting Spark DataFrames to pandas DataFrames
- Next Steps