
Visualizing Women's Marches: Part 1

In celebration of Women's History Month, I wanted to better understand the scale of the Women's Marches that occurred in January 2017. Shortly after the marches, Vox published a map visualizing the estimated turnout across the entire country.

Vox Visualization

This map is excellent at displaying:

  • locations with the highest relative turnouts
  • hubs and clusters of where marches occurred

I immediately had lots of follow-up questions, so I dove into the data set further to find other potential data sources and create my own visualizations. I'll walk through my exploration in this series of posts, focusing only on the marches that happened in the United States.

In this post, I'll go over data collection and cleaning. In Part 2, we'll look at data enrichment and visualization.

My main goals in this series are to:

  • highlight how much of data science is really just data collection and data cleaning
  • use numbers and visualizations to appreciate the sheer scale of these marches
  • ask more questions and approximate their answers

If you want to follow along and recreate the process, you'll need to be comfortable with Python (specifically the pandas and matplotlib libraries) and a spreadsheet tool (Microsoft Excel or Google Sheets). I also recommend keeping our pandas cheatsheet open as you follow along.

Data collection

Both of the data sets I used in this blog post were crowdsourced by volunteers. Thankfully, most of the contributed data has links to its sources. Unfortunately, some of those source links no longer exist, and others are unreliable (self-reported counts). This means we should avoid drawing strong conclusions from the data and instead use it to identify general trends and insights.

Social media crowdsourced data set

The map from above was generated from this crowdsourced data effort, led by professors Erica Chenoweth and Jeremy Pressman.

The spreadsheet contains a few worksheets that are mostly different views of the same data. The first worksheet (By Town/City) seems to have the largest number of rows on U.S. marches, so we'll use that one.

Because the spreadsheet is a public file, it's locked for editing. Let's copy and paste the rows and columns we're interested in into a new spreadsheet and download it as a CSV file. For columns with a formula, it's important to rewrite the formula after copying.

Keep only the following columns:

spreadsheet_preview

We'll revisit this data set later in the data cleaning section.

Wikipedia crowdsourced data set

Another good resource I discovered was a Wikipedia page listing all of the marches, crowdsourced by the Wikipedia volunteer community.

wiki_preview

Downloading this data set was a lot more painful and required much more experimentation. Here are two techniques that I tried and then abandoned halfway through (both are sketched in code after this list):

  • Fetched and parsed the page using requests and BeautifulSoup.
    • Problem: The HTML formatting for the table was inconsistent and I wanted to try some other approaches before committing to perusing lots of HTML code.
  • Used the pandas.read_html() function to read the entire webpage and try to parse the relevant table as a DataFrame object.
    • Problem: The same issue of inconsistent HTML formatting, specifically around the number of columns in each row. The resulting DataFrame would have required a tremendous amount of hand cleaning (almost to the level of row-by-row transcription).
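For reference, here's a minimal sketch of those two abandoned attempts. The URL and the table selection are assumptions on my part, and the live page may have changed since:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Wikipedia page listing the marches (URL is an assumption; the page may have moved).
url = "https://en.wikipedia.org/wiki/List_of_2017_Women%27s_March_locations"

# Attempt 1: fetch the page and pull out the table with requests + BeautifulSoup.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table", {"class": "wikitable"})  # first wikitable on the page
# (parsing rows out of `table` is where things got messy, so I stopped here)

# Attempt 2: let pandas parse every <table> element on the page directly.
tables = pd.read_html(url)
print(len(tables))       # how many tables pandas found
print(tables[0].head())  # the relevant table's index has to be found by inspection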

The challenge with trying to generate a DataFrame from HTML code is that the formatting has to be perfect. Every row needs to have the same number of columns, or the resulting DataFrame can come out wrong.
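Here's a made-up toy example of that failure mode: the second data row below is missing a cell, and pandas still parses the table, but the data ends up misaligned (in this toy case, Boston's turnout slides into the State column and Turnout becomes NaN).

import pandas as pd
from io import StringIO

# A deliberately ragged table: the Boston row is missing its State cell.
ragged_html = """
<table>
  <tr><th>City</th><th>State</th><th>Turnout</th></tr>
  <tr><td>Austin</td><td>TX</td><td>50000</td></tr>
  <tr><td>Boston</td><td>25000</td></tr>
</table>
"""

df = pd.read_html(StringIO(ragged_html))[0]
print(df)  # Boston's 25000 lands under State -- the kind of silent misalignment described above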

On the other hand, Wikipedia renders the underlying HTML code into a nice table that's easy to interact with. We can manually select the well-formatted rows and copy and paste them elsewhere.

This is the workflow I ultimately used (the pandas side of it is sketched in code after the list):

  • Sorted the table by the State column to force each row to contain the same number of columns (a requirement for pandas), except for the row for Washington DC.

    wikipedia_sort

  • Used my mouse to select and copy the rows above the Washington DC row, including the header row with the column names. Then, I used pandas.read_clipboard() to read my selection into a DataFrame. This function was implemented by the pandas developers specifically for copying small tables:

    pandas_read_clipboard

  • Used my mouse to select and copy the rows below the Washington DC row. Pandas unfortunately returned an error when I tried to parse my clipboard this time, because there were inconsistencies in the number of columns. Instead, I used the free tool HTML Table Generator to build an HTML table (File > Paste table data), then used its editor to quickly clean up the data and export the HTML.

    html_table_generator

  • I then generated new HTML code from this cleaned table, copied it into a Python string object, and then used pandas.read_html() to read the string into a DataFrame.

    pandas_string_html

  • Finally, I manually added a row to the first DataFrame containing the data on Washington DC.
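Here's a rough code sketch of the pandas side of that workflow. The column names and all of the values in it are placeholders I made up for illustration, not data from the actual Wikipedia table:

import pandas as pd
from io import StringIO

# Rows above the Washington DC row (plus the header row) copied to the clipboard.
# sep="\t" because browsers copy table cells as tab-separated text.
above_dc = pd.read_clipboard(sep="\t")

# Cleaned HTML exported from the table editor, pasted into a Python string.
# (Placeholder markup; the real string is the full exported <table> element.)
cleaned_html = """
<table>
  <tr><th>State</th><th>City</th><th>Approximate Attendance</th></tr>
  <tr><td>Wisconsin</td><td>Madison</td><td>75000</td></tr>
</table>
"""
below_dc = pd.read_html(StringIO(cleaned_html))[0]

# Manually append a Washington DC row to the first DataFrame (placeholder values).
dc_row = {"State": "District of Columbia", "City": "Washington", "Approximate Attendance": 500000}
above_dc = pd.concat([above_dc, pd.DataFrame([dc_row])], ignore_index=True)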

While this workflow was definitely cobbled together and isn't really scalable, it let me get the data I needed pretty quickly without having to do any premature optimization. Also, this is more of a casual project, so a hacky approach is just fine! Ultimately a more rigorous analysis would require setting up a robust data pipeline, strict procedures for maintaining data quality, and more reliable data sets as well.

Data Cleaning

Now for the fun part, cleaning the data sets!

Social media crowdsourced data set

Because this data set is small (only contains a few hundred rows and six columns), it's really convenient to use a direct manipulation spreadsheet tool like Microsoft Excel or Google Sheets to perform some of the data cleaning. As you scroll through the spreadsheet, you'll notice a few issues with the data:

  • Some rows have missing values for the turnout numbers.
  • Several issues with the State column:
    • Two rows have -- for the State column.
    • One row has CA/NV for the state column.
  • Some of the values in the City_State column aren't formatted in the "City, State" style.
  • Some rows are missing values for both the State and Country columns.
  • Two rows describe marches that happened in the same city on different dates (Salt Lake City, UT (1/20) and Salt Lake City, UT (1/23)).

We can handle the last four issues first using a spreadsheet tool. (If you'd rather stay in code, a pandas sketch of a couple of these fixes follows after the list.)

  • Remove the rows with -- in the State column, as those marches don't seem to have happened in a U.S. state or territory.
  • For the row with CA/NV in the State column (South Lake Tahoe), edit the City_State value to South Lake Tahoe, NV and the State value to NV.
  • For values in the City_State column that are missing the state (e.g. Cambridge), add the state back in using the State value.
  • For values in the City_State column that don't have a state abbreviation after the comma, manually look them up and fix them (e.g. Christiansted, St. Croix should be Christiansted, VI for the U.S. Virgin Islands).
  • For values in the City_State column containing extra information, shorten them (e.g. Encinitas, CA (Seacrest) to Encinitas, CA).
  • For rows missing values in the State and Country columns, manually add them back in.
  • Combine the two rows for Salt Lake City, UT (1/20) and Salt Lake City, UT (1/23).
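If you'd rather stay in code, a few of these fixes can also be done in pandas. Here's a rough sketch; the input filename and the exact replacement values are assumptions, so adjust them to match your copy of the data:

import pandas as pd

social = pd.read_csv("social_media_raw.csv")  # the CSV you exported earlier (filename is a placeholder)

# Drop the rows whose State value is --
social = social[social["State"] != "--"]

# Fix the CA/NV row: file South Lake Tahoe under NV
mask = social["State"] == "CA/NV"
social.loc[mask, ["City_State", "State"]] = ["South Lake Tahoe, NV", "NV"]

# The remaining fixes (filling in states, shortening values, combining the two
# Salt Lake City rows) are easier to eyeball directly in the spreadsheet.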

Here's a link to the final spreadsheet. Download it as a CSV file and load it into a pandas DataFrame, to handle the first issue of rows containing missing values for turnout numbers.

import pandas as pd
all_marches = pd.read_csv("FinalSocialMedia.csv")
all_marches
City_State State Country Low Average High
0 Abilene, TX TX US 200 200 200
1 Accident, MD MD US 54 54 54
2 Adak, AK AK US 10 10 10
3 Adrian, MI MI US 150 150 150
4 Ajo, AZ AZ US 250 250 250
... ... ... ... ... ... ...
667 Yellow Springs, OH OH US 250 250 250
668 Ypsilanti, MI MI US 1,200 1,200 1,200
669 Yucca Valley, CA CA US 76 138 200
670 Yuma, AZ AZ US 10 10 10
671 Zebulon, GA GA US 35 35 35

Let's count the number of rows missing values for each column:

all_marches.isnull().sum()
City_State     0
State          2
Country        1
Low           44
Average        0
High          44
dtype: int64

Then, investigate the two rows missing values for the State column:

all_marches[all_marches['State'].isnull()]
City_State State Country Low Average High
149 Disability (online) NaN US NaN 0 NaN
665 Worcester, MA NaN NaN NaN 0 NaN

Looks like we also uncovered the row with a missing value for the Country column. Let's drop both rows: both are missing turnout counts, and one happened online (which means we can't place it on a map).

all_marches = all_marches.drop([149, 665], axis=0)
all_marches.isnull().sum()
City_State     0
State          0
Country        0
Low           42
Average        0
High          42
dtype: int64

Now we just need to handle the rows containing missing values for the Low and High columns. It's very likely that rows containing missing values for one column contain a missing value for the other column as well. Try dropping all rows with missing values for the Low column and see if that handles the ones with missing High values:

low_null = all_marches[~all_marches['Low'].isnull()]
low_null.isnull().sum()
City_State    0
State         0
Country       0
Low           0
Average       0
High          0
dtype: int64

Looks like our hypothesis was correct! We can go ahead and drop those rows from the main DataFrame and assign the result to a new DataFrame.

sm_marches = all_marches[~all_marches['Low'].isnull()]
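As a quick sanity check, you can confirm that no missing values remain and save the result so Part 2 can pick it up from a file. The filename here is just a suggestion, not part of the original workflow:

# Confirm the cleaned DataFrame has no missing values and note its size.
print(sm_marches.isnull().sum().sum())  # expect 0
print(sm_marches.shape)

# Save for the next post (filename is a suggestion).
sm_marches.to_csv("sm_marches_clean.csv", index=False)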

Wikipedia crowdsourced data set

Let's start by exporting both of the DataFrames into separate CSV files, above_dc.csv and below_dc.csv. Then, import both of them into a spreadsheet tool. Scroll through both spreadsheets and start to look for data quality issues. Here are some of the things we need to address:

  • The rows containing photos spill over to two or three rows.

    wiki_spillover

  • Extra citation-link text (e.g. [6]) attached to many values in the Cities and Approximate Attendance columns.

    numerous_issues

  • Many attendance values that can't be parsed as numbers by pandas, for example:

    • 5,000 - 10,000
    • 70+
    • 100s (hundreds)
  • Repetitive or extra information in some of the State values (e.g. Georgia (U.S. state) Georgia).

Let's use the spreadsheet tool to fix all of these issues (it should take approximately 30 minutes; a pandas alternative for the attendance values is sketched after the list):

  • Find all of the marches that spill over into two or three rows, and format them manually.
  • Replace all of the numerical values with poor formatting using the following rules:
    • Replace ranges like 5,000 - 10,000 with the average value. It would be nice to preserve the ranges by creating two columns (one for low and one for high), but that would add a lot of manual cleaning time.
    • Replace values like 70+ with just 70.
    • Replace values like 100s (hundreds) with just 100.
    • For any small extraneous issues, use your best judgment.
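If you'd prefer to normalize these values in pandas rather than by hand, here's a rough sketch. The column name Approximate Attendance is an assumption based on the Wikipedia table, so double-check it against your CSV, and treat the parsing rules as a starting point rather than a complete solution:

import re
import pandas as pd

def normalize_attendance(value):
    """Turn values like '5,000 - 10,000[6]', '70+', or '100s (hundreds)' into numbers."""
    if pd.isna(value):
        return None
    text = str(value)
    text = re.sub(r"\[\d+\]", "", text)   # strip citation markers like [6]
    text = re.sub(r"\(.*?\)", "", text)   # drop parenthetical notes like (hundreds)
    text = text.replace(",", "").replace("+", "")
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)    # a range collapses to its average

# Example usage on a DataFrame loaded from the combined CSV:
# df["Approximate Attendance"] = df["Approximate Attendance"].apply(normalize_attendance)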

Finally, combine both of the spreadsheets into a single spreadsheet and export to a CSV file. Here's a link to the final version: Google Sheets Link

Next Steps

After a few hours of data collection and cleaning, we have two clean data sets to work with. In the next blog post, we'll focus on enriching the data sets with more information (e.g. geodata), calculating some summary statistics, and creating some simple visualizations.

Srini Kadamati

Director of Content at Dataquest.io. Based in Austin, TX.