MISSION 60

Introduction to Spark

In the Spark online lesson, you'll learn several core concepts, such as Spark's core data structure (the resilient distributed dataset, or RDD), common Spark methods, and more! You'll also learn PySpark, a wonderful toolkit that lets us interface with Spark using Python. We'll work with a dataset containing the names of all the guests who have appeared on The Daily Show.
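To give a sense of what that looks like, here's a minimal PySpark sketch that creates a SparkContext and loads a text file as an RDD. The file name daily_show_guests.csv and the app name are hypothetical stand-ins for illustration, not necessarily the lesson's exact code:

    # Minimal sketch, assuming PySpark is installed and the dataset
    # lives in a file named daily_show_guests.csv (hypothetical name).
    from pyspark import SparkContext

    sc = SparkContext(appName="DailyShowGuests")

    # textFile() reads each line of the file into an RDD of strings.
    raw_data = sc.textFile("daily_show_guests.csv")

    # take() is an action: it triggers the computation and returns
    # the first five lines to the driver.
    print(raw_data.take(5))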

At the beginning of this path, you learned that data can be stored in pandas dataframes, and later you learned that data can also be stored in a database. While these storage structures are ideal for some data, they're not ideal for massive datasets, because data processing workloads with these tools can become very slow at that scale.

To work around this, UC Berkeley's AMPLab spearheaded the development of Spark, which uses distributed, in-memory data structures to speed up many data processing workloads by several orders of magnitude.

As you work through each concept, you'll get to apply what you've learned right in your browser; there's no need to use your own machine to do the exercises. The Python environment inside this course includes answer checking, so you can ensure that you've fully mastered each concept before learning the next.

Objectives

  • Learn a brief history of Big Data.
  • Learn about RDD objects and how they work in Spark.
  • Learn the basics of counting in Spark.

Mission Outline

1. A Brief History of Big Data
2. The Spark Revolution
3. Resilient Distributed Datasets (RDDs)
4. SparkContext
5. Lazy Evaluation
6. Pipelines
7. Python and Scala, Friends Forever
8. reduceByKey()
9. Explanation
10. Filter
11. Practice with Pipelines
12. Next Steps
13. Takeaways
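
To preview how these pieces fit together, here's a minimal, hedged sketch of the kind of pipeline this mission builds toward: it chains map(), reduceByKey(), and filter() to count guest appearances per year. The tab-separated layout, the "YEAR" header value, and the file name are assumptions for illustration:

    # Sketch of a Spark counting pipeline, assuming a tab-separated
    # file whose first column is the year and whose header row starts
    # with "YEAR" (both assumptions for illustration).
    from pyspark import SparkContext

    sc = SparkContext(appName="GuestCounts")

    raw = sc.textFile("daily_show_guests.tsv")
    lines = raw.map(lambda line: line.split("\t"))

    # Build (year, 1) pairs, then sum the counts for each year.
    counts = lines.map(lambda cols: (cols[0], 1)) \
                  .reduceByKey(lambda a, b: a + b)

    # Drop the header row's key, keeping only real years.
    filtered = counts.filter(lambda pair: pair[0] != "YEAR")

    # Because Spark evaluates lazily, nothing above runs until an
    # action such as take() is called.
    print(filtered.take(5))

Notice that the transformations (map, reduceByKey, filter) only describe the pipeline; Spark defers all the work until the final action, which is the lazy evaluation covered in step 5.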


Course Info:

Intermediate

The median completion time for this course is 6 hours.

This course requires a premium subscription and includes five missions and one installation tutorial. It is the 31st course in the Data Scientist In Python path.
