MISSION 227

Guided Project: Analyzing Wikipedia Pages

In this guided project, you'll work with data scraped from Wikipedia, a popular online encyclopedia. Wikipedia is maintained by volunteer content contributors and editors who continuously improve content. Anyone can edit Wikipedia entries, and because Wikipedia is crowd-sourced, it's been able to rapidly assemble a huge library of articles.

Using this scraped data, we'll process and analyze the articles to find patterns in Wikipedia's writing and content presentation style. The articles were collected by requesting random Wikipedia pages and downloading their contents with the requests package. If you need a refresher on web scraping and HTML, you may want to check out our APIs and Web Scraping course before trying this guided project.

Working on guided projects will give you hands-on experience with real-world examples, so we encourage you to not only complete them, but to take the time to really understand the concepts.

These projects are meant to be challenging to better prepare you for the real world, so don't be discouraged if you have to refer back to previous missions. If you haven't worked with Jupyter Notebook before or need a refresher, we recommend completing our Jupyter Notebook Guided Project before continuing.

As with all guided projects, we encourage you to experiment and extend your project, taking it in unique directions to make it a more compelling addition to your portfolio!

Objectives

  • How to use parallel computing to quickly analyze Wikipedia pages.
  • How to process and strip HTML pages in Python.
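
To make the two objectives above concrete, here is a minimal sketch that strips markup from several pages at once. It is an illustration under stated assumptions, not the project's own solution: it relies only on the standard library (html.parser and concurrent.futures), and the names TextExtractor, strip_html, and strip_many are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def strip_many(pages, max_workers=4):
    """Strip several pages concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(strip_html, pages))
```

Threads are used here to keep the sketch short. Because parsing is CPU-bound, a process pool (concurrent.futures.ProcessPoolExecutor, which has the same map interface) may give a larger speedup on real data; which one the project expects is not stated on this page.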

Mission Outline

1. Introducing Wikipedia Data
2. Reading In The Data
3. Remove Extraneous Markup
4. Finding Common Tags
5. Finding Common Words
6. Next Steps
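
Steps 4 and 5 of the outline come down to counting things across pages. The sketch below shows one way that counting could look, assuming standard-library tools only; count_tags, count_words, and the min_length cutoff are hypothetical choices made for illustration and may differ from what the project asks for.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def count_words(text, min_length=5):
    """Count lowercase alphabetic words of at least min_length letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if len(word) >= min_length)


def merge_counts(counters):
    """Combine per-page Counters into a single total."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total
```

Per-page Counters can be produced independently and merged afterwards, which is what makes this kind of analysis a natural fit for parallel processing. Calling most_common(10) on the merged Counter then lists the ten most frequent tags or words.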


Course Info:

Intermediate

The median completion time for this course is 4.7 hours.

This course requires a premium subscription and includes four missions and one guided project. It is the fourth course in the Data Engineer path.
