MISSION 266

Multiple Dependency Pipeline

At the end of the previous lesson on building a pipeline class, we discussed some drawbacks of our initial pipeline implementation. One of them was that tasks could only run linearly, each one feeding the next. Using our own tasks as an example, we'll show why this is a major drawback that we must address when building a data pipeline.

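In our last lesson's task pipeline, the final task was to summarize logs output from a parsed CSV file. But suppose we also wanted to run a second summary over multiple columns. This seems doable, since the only requirement should be the parsed CSV, but with our linear pipeline it will not work: each task receives only the output of the task before it, so the second summary would receive the first summary rather than the parsed CSV.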
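To make the problem concrete, here is a minimal sketch of a strictly linear runner. The function names (`parse_csv`, `summarize_first_column`, `summarize_columns`, `run_linear`) and the sample data are hypothetical stand-ins, not the code from the previous lesson.

```python
# Hypothetical stand-ins for the tasks described above.
def parse_csv(lines):
    # Split each raw line into a list of column values.
    return [line.split(",") for line in lines]


def summarize_first_column(rows):
    # Count the distinct values in the first column.
    return {"first_column_values": len({row[0] for row in rows})}


def summarize_columns(rows):
    # Count how many columns the parsed rows have.
    return {"column_count": len(rows[0])}


def run_linear(tasks, data):
    # A linear pipeline: each task receives the previous task's output.
    for task in tasks:
        data = task(data)
    return data


lines = ["a,1", "b,2", "a,3"]
rows = parse_csv(lines)

# Both summaries work when each is given the parsed rows directly.
summary_a = summarize_first_column(rows)
summary_b = summarize_columns(rows)

# Chained linearly, the second summary receives the first summary
# (a dict) instead of the parsed rows, so it fails.
try:
    run_linear([parse_csv, summarize_first_column, summarize_columns], lines)
    linear_failed = False
except (KeyError, TypeError):
    linear_failed = True
```

Both summaries depend only on the parsed rows, yet the linear runner has no way to express that, which is the gap a dependency graph is meant to fill.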

In this lesson, we will solve the linear dependency mapping problem of the pipeline created in the previous lesson. We will discuss the directed acyclic graph (DAG) and how we can use it as a task scheduler. Then, we will implement the DAG and add it to our pipeline.
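As a preview of the idea, the sketch below shows one way a DAG could order tasks so that every task runs after the tasks it depends on. It uses Kahn's algorithm for the topological sort. The class layout, the method names (`add`, `in_degrees`, `sort`), and the task names are assumptions made for illustration; the implementation built in the lesson may differ.

```python
from collections import deque


class DAG:
    """Illustrative directed acyclic graph; not the lesson's exact class."""

    def __init__(self):
        # Maps each node to the list of nodes that depend on it.
        self.graph = {}

    def add(self, node, to=None):
        # Register the node, and optionally an edge node -> to.
        self.graph.setdefault(node, [])
        if to is not None:
            self.graph.setdefault(to, [])
            if to not in self.graph[node]:
                self.graph[node].append(to)

    def in_degrees(self):
        # Count the incoming edges (unmet dependencies) of every node.
        degrees = {node: 0 for node in self.graph}
        for targets in self.graph.values():
            for target in targets:
                degrees[target] += 1
        return degrees

    def sort(self):
        # Kahn's algorithm: repeatedly emit nodes with no unmet dependencies.
        degrees = self.in_degrees()
        ready = deque(node for node, count in degrees.items() if count == 0)
        order = []
        while ready:
            node = ready.popleft()
            order.append(node)
            for target in self.graph[node]:
                degrees[target] -= 1
                if degrees[target] == 0:
                    ready.append(target)
        if len(order) != len(self.graph):
            # Some nodes never reached zero in-degree, so there is a cycle.
            raise ValueError("graph contains a cycle")
        return order


dag = DAG()
dag.add("parse_csv", "summarize_logs")
dag.add("parse_csv", "summarize_columns")
order = dag.sort()
```

Here `order` places `parse_csv` first and both summaries after it, so a scheduler could hand the parsed CSV to each summary independently.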

As you get familiar with the concepts in these multiple dependency pipelines, you’ll get to apply what you’ve learned from within your browser; there's no need to use your own machine to do the exercises. The Python environment inside of this course includes answer-checking to ensure you've fully mastered each concept before learning the next.

Objectives

  • Learn the basics of graph theory.
  • Learn to implement a directed acyclic graph in Python.
  • Write a scheduler for the pipeline class.

Mission Outline

1. Overview
2. Intro to DAGs
3. The DAG Class
4. Sorting the DAG
5. Finding Number of In Degrees
6. Challenge: Sorting Dependencies
7. Enhance the Add Method
8. Adding DAG to the Pipeline
9. Challenge: Running the Pipeline
10. Next Steps
11. Takeaways


Course Info:

Intermediate

The median completion time for this course is 6.3 hours.

This course requires a premium subscription. It has four missions and one guided project, and it is the seventh course in the Data Engineer Path.
