Visualizing Women’s Marches: Part 1
In celebration of Women’s History Month, I wanted to better understand the scale of the Women’s Marches that occurred in January 2017. Shortly after the marches, Vox published a map visualizing the estimated turnout across the entire country.
This map is excellent at displaying:
- locations with the highest relative turnouts
- hubs and clusters of where marches occurred
I immediately had lots of follow-up questions, so I dove further into the data, looked for other potential data sources, and created my own visualizations. I’ll walk through my exploration in this series of posts, focusing only on the marches that happened in the United States.
In this post, I’ll go over data collection and cleaning. In Part 2, we’ll look at data enrichment and visualization.
My main goals in this series are to:
- highlight how much of data science really is data collection and data cleaning
- use numbers and visualizations to appreciate the sheer scale of these marches
- be able to ask more questions and approximate their answers
If you want to follow along and recreate the process, you’ll need to be comfortable with Python (specifically the pandas and matplotlib libraries) and a spreadsheet tool (Microsoft Excel or Google Sheets). I also recommend keeping our pandas cheatsheet open as you follow along.
Data collection
Both of the data sets I used in this blog post have been crowdsourced by volunteers. Thankfully, most of the contributed data has links to its sources. Unfortunately, some of these source links no longer exist or can be unreliable (self-reported counts). This means that we should avoid drawing substantive conclusions and instead use the data to surface general trends and insights.
Social media crowdsourced data set
The map above was generated from this crowdsourced data effort, led by professors Erica Chenoweth and Jeremy Pressman.
The spreadsheet contains a few worksheets that are mostly different views of the same data. It seems like the first worksheet (By Town/City) has the most rows on U.S. marches, so we’ll use that one.
Because the spreadsheet is a public file, it’s locked for editing. Let’s copy and paste the rows and columns we’re interested in into a new spreadsheet and download it as a CSV file. For columns with a formula, it’s important to rewrite the formula after copying.
Keep only the following columns: City_State, State, Country, Low, Average, and High.
We’ll revisit this data set later in the data cleaning section.
Wikipedia crowdsourced data set
Another good resource I discovered was a Wikipedia page listing all of the marches, crowdsourced by the Wikipedia volunteer community.
Downloading this data set was a lot more painful and required much more experimentation. Here are two different techniques I tried that I gave up on halfway through:
- Fetched and parsed the page using requests and BeautifulSoup.
  - Problem: The HTML formatting for the table was inconsistent, and I wanted to try some other approaches before committing to digging through lots of HTML code.
- Used the pandas.read_html() function to read the entire webpage and try to parse the relevant table as a DataFrame object (a rough sketch of this attempt follows this list).
  - Problem: Same issue of inconsistent HTML formatting, specifically around the number of columns in each row. The resulting DataFrame would require a tremendous amount of hand cleaning (almost to the level of row-by-row transcription).
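For reference, here’s roughly what that read_html() attempt looked like. Treat the URL and the table index as assumptions you’d need to verify against the actual Wikipedia article and its structure:

import pandas as pd

# Parse every <table> on the page into a DataFrame. The page URL below is my
# guess at the Wikipedia list article, and which table index holds the march
# data is something you'd confirm by inspecting the results.
url = "https://en.wikipedia.org/wiki/List_of_2017_Women%27s_March_locations"
tables = pd.read_html(url)
print(len(tables))   # how many tables were found on the page
marches = tables[0]  # pick whichever table actually contains the march list
print(marches.head())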
The challenge with trying to generate a DataFrame from HTML code is that the formatting has to be perfect. Every row needs to have the same number of columns, or the resulting DataFrame can come out wrong.
On the other hand, Wikipedia renders the underlying HTML code into a nice table that’s easy for us to interact with. We can manually select the well-formatted rows and copy and paste them elsewhere.
This is the workflow I ultimately used:
- Sorted the table by the State column to force each row to contain the same number of columns (a requirement for pandas), except for the row for Washington DC.
- Used my mouse to select and copy the rows above the Washington DC row, including the header row with the column names. Then, I used pandas.read_clipboard() to read my selection into a DataFrame. This function was implemented by the pandas developers specifically for copying small tables.
- Used my mouse to select and copy the rows below the Washington DC row. Pandas unfortunately returned an error when I tried to parse my clipboard this time, because there were inconsistencies in the number of columns. Instead, I used the free tool HTML Table Generator to generate an HTML table (File > Paste table data), and used the editor to quickly clean up the data and export the HTML.
- I then generated new HTML code from this cleaned table, copied it into a Python string object, and used pandas.read_html() to read the string into a DataFrame.
- Finally, I manually added a row to the first DataFrame containing the data on Washington DC. (A rough sketch of these pandas calls follows this list.)
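Here’s a minimal sketch of the two pandas calls in this workflow. The HTML snippet is only a placeholder for the markup exported from the table editor, and its values are not real data:

import pandas as pd

# Read the rows that were selected and copied in the browser (the rows above
# the Washington DC row, including the header row) straight from the clipboard.
above_dc = pd.read_clipboard()

# For the rows below Washington DC, parse the cleaned-up HTML exported from the
# table editor. read_html() returns a list of DataFrames, one per <table>.
html_string = """
<table>
  <tr><th>State</th><th>Cities</th><th>Approximate Attendance</th></tr>
  <tr><td>Example State</td><td>Example City</td><td>1000</td></tr>
</table>
"""  # placeholder markup, not real data
below_dc = pd.read_html(html_string)[0]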
While this workflow was definitely cobbled together and isn’t really scalable, it let me get the data I needed pretty quickly without having to do any premature optimization. Also, this is more of a casual project, so a hacky approach is just fine! Ultimately a more rigorous analysis would require setting up a robust data pipeline, strict procedures for maintaining data quality, and more reliable data sets as well.
Data Cleaning
Now for the fun part, cleaning the data sets!
Social media crowdsourced data set
Because this data set is small (only contains a few hundred rows and six columns), it’s really convenient to use a direct manipulation spreadsheet tool like Microsoft Excel or Google Sheets to perform some of the data cleaning. As you scroll through the spreadsheet, you’ll notice a few issues with the data:
- Some rows have missing values for the turnout numbers.
- Several issues with the State column:
  - Two rows have -- for the State column.
  - One row has CA/NV for the State column.
- Some rows have missing information on the state and country.
- Some of the values in the City_state column aren’t formatted in the “City, State” style.
- Some rows are missing values for both the State and Country columns.
- Two rows describe marches that happened in the same city on different dates (Salt Lake City, UT (1/20) and Salt Lake City, UT (1/23)).
We can handle the last five issues first using a spreadsheet tool.
- Remove the rows with -- in the State/Country column, as they don’t seem to have happened in a state or territory of the U.S.
- For the row with CA/NV in the State column, edit the City_state value to South Lake Tahoe, NV and the State value to NV.
- For values in the City_state column that are missing the state (e.g. Cambridge), add it back in using the State value.
- For values in the City_state column that don’t have a state value after the comma, manually look up and fix (e.g. Christiansted, St. Croix should be Christiansted, VI for the Virgin Islands).
- For values in the City_state column containing extra info, shorten them (e.g. Encinitas, CA (Seacrest) to Encinitas, CA).
- For rows missing values in the State and Country columns, manually add them back in.
- Combine the two rows for Salt Lake City, UT (1/20) and Salt Lake City, UT (1/23).
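If you’d rather script a few of these fixes instead of doing them in the spreadsheet, here’s a rough pandas sketch. The raw file name is a placeholder, and the column names are assumed to match the spreadsheet:

import pandas as pd

# File name is a placeholder for whatever you named the raw export.
raw = pd.read_csv("RawSocialMedia.csv")

# Drop rows whose State value is "--" (marches outside U.S. states/territories).
raw = raw[raw["State"] != "--"]

# Fix the CA/NV row: assign it to NV, per the rule above.
raw.loc[raw["State"] == "CA/NV", "State"] = "NV"

# Strip extra info in parentheses from City_State values,
# e.g. "Encinitas, CA (Seacrest)" becomes "Encinitas, CA".
raw["City_State"] = raw["City_State"].str.replace(r"\s*\(.*\)$", "", regex=True)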
Here’s a link to the final spreadsheet. Download it as a CSV file and load it into a pandas DataFrame, to handle the first issue of rows containing missing values for turnout numbers.
import pandas as pd
all_marches = pd.read_csv("FinalSocialMedia.csv")
all_marches
|     | City_State         | State | Country | Low   | Average | High  |
|-----|--------------------|-------|---------|-------|---------|-------|
| 0   | Abilene, TX        | TX    | US      | 200   | 200     | 200   |
| 1   | Accident, MD       | MD    | US      | 54    | 54      | 54    |
| 2   | Adak, AK           | AK    | US      | 10    | 10      | 10    |
| 3   | Adrian, MI         | MI    | US      | 150   | 150     | 150   |
| 4   | Ajo, AZ            | AZ    | US      | 250   | 250     | 250   |
| …   | …                  | …     | …       | …     | …       | …     |
| 667 | Yellow Springs, OH | OH    | US      | 250   | 250     | 250   |
| 668 | Ypsilanti, MI      | MI    | US      | 1,200 | 1,200   | 1,200 |
| 669 | Yucca Valley, CA   | CA    | US      | 76    | 138     | 200   |
| 670 | Yuma, AZ           | AZ    | US      | 10    | 10      | 10    |
| 671 | Zebulon, GA        | GA    | US      | 35    | 35      | 35    |
Let’s count the number of rows missing values for each column:
all_marches.isnull().sum()
City_State 0
State 2
Country 1
Low 44
Average 0
High 44
dtype: int64
Then, investigate the two rows missing values for the State column:
all_marches[all_marches['State'].isnull()]
|     | City_State          | State | Country | Low | Average | High |
|-----|---------------------|-------|---------|-----|---------|------|
| 149 | Disability (online) | NaN   | US      | NaN | 0       | NaN  |
| 665 | Worcester, MA       | NaN   | NaN     | NaN | 0       | NaN  |
Looks like we also uncovered the row with a missing value for the Country column. Let’s drop both rows, since one is missing turnout counts and one happened online (which prevents us from visualizing it on a map).
all_marches = all_marches.drop([149, 665], axis=0)
all_marches.isnull().sum()
City_State 0
State 0
Country 0
Low 42
Average 0
High 42
dtype: int64
Now we just need to handle the rows containing missing values for the Low and High columns. It’s very likely that rows missing a value for one of these columns are missing a value for the other as well. Let’s drop all rows with missing values for the Low column and see if that also handles the ones with missing High values:
# Keep only the rows where the Low column has a value.
low_null = all_marches[~all_marches['Low'].isnull()]
low_null.isnull().sum()
City_State 0
State 0
Country 0
Low 0
Average 0
High 0
dtype: int64
Looks like our hypothesis was correct! We can go ahead and drop those rows from the main DataFrame and assign the result to a new DataFrame.
sm_marches = all_marches[~all_marches['Low'].isnull()]
Wikipedia crowdsourced data set
Let’s start by exporting both of the DataFrames into separate CSV files, above_dc.csv and below_dc.csv.
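Assuming the two DataFrames from the scraping workflow above are still named above_dc and below_dc, the export step might look like this:

# Write each DataFrame out without the pandas index column.
above_dc.to_csv("above_dc.csv", index=False)
below_dc.to_csv("below_dc.csv", index=False)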
Then, import both of them into a spreadsheet tool. Scroll through both spreadsheets and look for data quality issues. Here are some of the things we need to address:
- The rows containing photos spill over into two or three rows.
- Extra citation-link text (e.g. [6]) on many values in the Cities and Approximate Attendance columns.
- Many instances of attendance values that can’t be parsed as numbers in pandas:
  - 5,000 - 10,000
  - 70+
  - 100s (hundreds)
- Repetitive or extra information in some of the State values (e.g. Georgia (U.S. State) Georgia).
Let’s use the spreadsheet tool to fix all of these issues (it should take approximately 30 minutes):
- Find all of the marches that spill over into two or three rows, and format them manually.
- Replace all of the poorly formatted numerical values using the following rules (a pandas sketch of these rules follows this list):
  - Replace ranges like 5,000 - 10,000 with the average value. It would be nice to preserve the ranges by creating two columns (one for the low value and one for the high), but that would add a lot to the manual cleaning time.
  - Replace values like 70+ with just 70.
  - Replace values like 100s (hundreds) with just 100.
  - For any small extraneous issues, use your best judgment.
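If you’d prefer to script this step rather than edit the spreadsheet by hand, here’s a rough sketch of the same rules in Python. The Approximate Attendance column name comes from the Wikipedia table; the parsing rules are just one way to encode the manual rules above:

import re

import pandas as pd

def parse_attendance(value):
    # Convert a messy attendance string into a single number using the same
    # rules as the manual cleanup: average ranges, strip "+" and citation
    # markers, and keep the leading number for values like "100s".
    if pd.isnull(value):
        return None
    text = str(value)
    text = re.sub(r"\[\d+\]", "", text)            # drop citation markers like [6]
    text = text.replace(",", "").replace("+", "")  # drop thousands separators and "+"
    text = re.sub(r"\(.*\)", "", text).strip()     # drop notes like "(hundreds)"
    if "-" in text:                                # ranges: use the average
        low, high = [float(part) for part in text.split("-")[:2]]
        return (low + high) / 2
    match = re.search(r"\d+", text)                # "100s" style: keep the number
    return float(match.group()) if match else None

# A few of the messy values from the table, parsed with the rules above:
print([parse_attendance(v) for v in ["5,000 - 10,000", "70+", "100s (hundreds)"]])
# [7500.0, 70.0, 100.0]

You could then apply this to the whole column with something like wiki_marches["Approximate Attendance"].apply(parse_attendance).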
Finally, combine both of the spreadsheets into a single spreadsheet and export to a CSV file. Here’s a link to the final version: Google Sheets Link
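If you’d rather do this last combining step in pandas instead of the spreadsheet tool, here’s a minimal sketch (the cleaned file names and the output name are placeholders):

import pandas as pd

# File names are placeholders for whatever you named the cleaned exports.
above_dc = pd.read_csv("above_dc_cleaned.csv")
below_dc = pd.read_csv("below_dc_cleaned.csv")

# Stack the two tables and reset the index so the row labels are continuous.
wiki_marches = pd.concat([above_dc, below_dc], ignore_index=True)
wiki_marches.to_csv("wikipedia_marches.csv", index=False)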
Next Steps
After a few hours of data collection and cleaning, we have two clean data sets to work with. In the next blog post, we’ll focus on enriching the data sets with more information (e.g. geodata), calculating some summary statistics, and creating some simple visualizations.