Hacker News Pipeline
In this course on building a data pipeline, we began with the concepts of functional programming and then built our own data pipeline class in Python. We learned about advanced Python concepts such as decorators, closures, and good API design. In the last lesson, we also learned how to implement a directed acyclic graph as the scheduler for our pipeline.
In this guided project, we will apply the pipeline we have been building to a real-world project. Starting with data from a JSON API, we will filter, clean, aggregate, and summarize it in a sequence of pipeline tasks.
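As a reminder of what "a sequence of tasks" can look like, here is a minimal sketch. It is not the pipeline class built in the course: the `Pipeline` class, its `task` decorator, and the `depends_on` parameter are hypothetical stand-ins, and the sketch assumes tasks are registered in dependency order rather than scheduled with a directed acyclic graph.

```python
class Pipeline:
    """Hypothetical minimal pipeline: tasks run in registration order."""

    def __init__(self):
        self.tasks = []

    def task(self, depends_on=None):
        # Decorator factory: registers the function together with the task
        # whose output it consumes (None means it takes the pipeline input).
        def decorator(func):
            self.tasks.append((func, depends_on))
            return func
        return decorator

    def run(self, initial_input):
        # Assumption: every task was registered after the task it depends on.
        outputs = {}
        for func, depends_on in self.tasks:
            arg = initial_input if depends_on is None else outputs[depends_on]
            outputs[func] = func(arg)
        return outputs


pipeline = Pipeline()


@pipeline.task()
def keep_positive(numbers):
    # Filtering step: drop values that are zero or negative.
    return [n for n in numbers if n > 0]


@pipeline.task(depends_on=keep_positive)
def total(numbers):
    # Summarizing step: reduce the filtered values to one number.
    return sum(numbers)


results = pipeline.run([3, -1, 4, -1, 5])
print(results[total])  # 12
```

Each task only needs to know which task feeds it, so steps can be added or reordered without touching the others.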
The data we will use comes from a Hacker News (HN) API that returns JSON data for the top stories of 2014. If you’re unfamiliar with Hacker News, it’s a link aggregator website where users vote up stories that are interesting to the community. It is similar to Reddit, but its community focuses on computer science and entrepreneurship posts.
Guided projects are meant to be challenging to better prepare you for the real world, so don’t be discouraged if you have to refer back to previous lessons. If you haven’t worked with Jupyter Notebook before or need a refresher, we recommend completing our Jupyter Notebook Guided Project before continuing.
As with all guided projects, we encourage you to experiment and extend your project, taking it in unique directions to make it a more compelling addition to your portfolio!
- Learn to work with JSON API data in Python.
- Learn to build a real-world data pipeline from raw data to summarization.
- Introduction to the Data
- Loading the JSON Data
- Filtering the Stories
- Convert to CSV
- Extract Title Column
- Clean the Titles
- Create the Word Frequency Dictionary
- Sort the Top Words
- Next Steps
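The steps in this outline can be sketched end to end with plain functions. This is only an illustration under stated assumptions: the sample records are made up, the `title` and `points` field names and the `min_points` threshold are hypothetical, and the project itself may structure each step differently.

```python
import csv
import io
import json
import string
from collections import Counter

# Made-up sample records; the field names are assumptions, not the real schema.
RAW_JSON = json.dumps({"stories": [
    {"title": "Show HN: A Python data pipeline", "points": 120},
    {"title": "Python tips for data cleaning!", "points": 80},
    {"title": "Low scoring post", "points": 3},
]})


def load_stories(raw):
    # Loading the JSON data: parse the text and return the list of stories.
    return json.loads(raw)["stories"]


def filter_stories(stories, min_points=50):
    # Filtering the stories: keep those at or above a hypothetical threshold.
    return [s for s in stories if s["points"] >= min_points]


def to_csv(stories):
    # Convert to CSV: write the rows to an in-memory file and rewind it.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "points"])
    for s in stories:
        writer.writerow([s["title"], s["points"]])
    buffer.seek(0)
    return buffer


def extract_titles(csv_file):
    # Extract the title column from the CSV file.
    return [row["title"] for row in csv.DictReader(csv_file)]


def clean_titles(titles):
    # Clean the titles: lowercase them and strip ASCII punctuation.
    table = str.maketrans("", "", string.punctuation)
    return [t.lower().translate(table) for t in titles]


def word_frequencies(titles):
    # Create the word frequency dictionary (Counter is a dict subclass).
    return Counter(word for t in titles for word in t.split())


def top_words(freq, n=2):
    # Sort the top words: the n most frequent (word, count) pairs.
    return freq.most_common(n)


stories = filter_stories(load_stories(RAW_JSON))
titles = clean_titles(extract_titles(to_csv(stories)))
print(top_words(word_frequencies(titles)))
```

Because each function takes the previous step's output as its only input, the same functions could be registered as tasks in a pipeline instead of being chained by hand.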