
Visualizing Women's Marches: Part 2

This post is the second in a series on visualizing the Women's Marches from January 2017. In the first post, we explored the intensive data collection and data cleaning process necessary to produce clean pandas dataframes.

Data Enrichment

Because we eventually want to build maps visualizing the marches, we need latitude and longitude coordinates. If we preview the two dataframes, you'll notice that this information is missing from both:

sm_marches.head()
City_State State Country Low Average High
0 Abilene, TX TX US 200 200 200
1 Accident, MD MD US 54 54 54
2 Adak, AK AK US 10 10 10
3 Adrian, MI MI US 150 150 150
4 Ajo, AZ AZ US 250 250 250
wiki_marches.head()
State City Turnout
0 Texas Abilene 200
1 Maryland Accident 54
2 Alaska Adak 10
3 Michigan Adrian 140
4 Colorado Alamosa 350

Another issue is that one dataframe uses state codes (like TX) while the other uses the full state names (like Texas). Thankfully, we can solve both the missing geodata and state representation problem using a geocoding service. Geocoding is the process of converting an address to a concrete pair of latitude and longitude coordinates. Using a geocoding service to improve the dataset is an example of data enrichment. Data enrichment is extremely common when working on data science projects, because the raw data rarely contains all of the information we're interested in.

While we don't have the specific addresses of where the marches occurred, geocoding services will accept a city and state combination (e.g. a city plus a state given as either TX or Texas). There are many geocoding services, but we'll use Geocodio, as it's one I'm familiar with and it has a free tier of 2,500 lookups.

[Image: Geocodio]

We can use Geocodio either through uploading a spreadsheet file or through their API. Since we only have two small spreadsheets, we'll use their upload tool. Note that using Geocodio does require you to create an account. Here's a list of some other geocoding services, if you don't want to use Geocodio: https://geoservices.tamu.edu/Services/Geocode/OtherGeocoders/
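
If you'd rather script the lookups instead of uploading spreadsheets, a single request to Geocodio's geocoding API looks roughly like the sketch below. The endpoint version and response shape here are based on Geocodio's public documentation, and YOUR_API_KEY is a placeholder, so treat this as a sketch rather than the exact code used for this post.

import requests

# Hypothetical single lookup against Geocodio's geocoding endpoint.
# Replace YOUR_API_KEY with the key from your own Geocodio account.
response = requests.get(
    "https://api.geocod.io/v1.7/geocode",
    params={"q": "Abilene, TX", "api_key": "YOUR_API_KEY"}
)
top_match = response.json()["results"][0]
print(top_match["location"])  # e.g. {'lat': ..., 'lng': ...}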

Let's export sm_marches as a CSV file using DataFrame.to_csv():

sm_marches.to_csv("sm_marches.csv", index=False)

Upload both sm_marches.csv and FinalWiki.csv to Geocodio. The service will ask you to specify which columns represent the address, state, country, zip, and so on (you don't need to have all of the fields). Geocodio will then append new columns with its best guesses for each row and let you download the new CSV files back to your computer. Finally, rename these files to FinalWiki_geocodio.csv and sm_csv_geocodio.csv.

Let's read both of these files back into dataframes and explore the columns that were added.

sm_gc = pd.read_csv("sm_csv_geocodio.csv")
wiki_gc = pd.read_csv("FinalWiki_geocodio.csv")
sm_gc.head()
City_State State Country Low Average High Latitude Longitude Accuracy Score Accuracy Type Number Street City State.1 County Zip Country.1 Source
0 Abilene, TX TX US 200 200 200 32.576489 -99.665323 1.0 place NaN NaN Abilene TX Taylor County 79601.0 US TIGER/Line® dataset from the US Census Bureau
1 Accident, MD MD US 54 54 54 39.628700 -79.319760 1.0 place NaN NaN Accident MD Garrett County 21520.0 US TIGER/Line® dataset from the US Census Bureau
2 Adak, AK AK US 10 10 10 51.829438 -176.629994 1.0 place NaN NaN Adak AK Aleutians West Census Area 99546.0 US TIGER/Line® dataset from the US Census Bureau
3 Adrian, MI MI US 150 150 150 41.889943 -84.065892 1.0 place NaN NaN Adrian MI Lenawee County 49221.0 US TIGER/Line® dataset from the US Census Bureau
4 Ajo, AZ AZ US 250 250 250 32.384890 -112.890110 1.0 place NaN NaN Ajo AZ Pima County 85321.0 US TIGER/Line® dataset from the US Census Bureau
wiki_gc.head()
State City Turnout Latitude Longitude Accuracy Score Accuracy Type Number Street City.1 State.1 County Zip Country Source
0 Texas Abilene 200 32.576489 -99.665323 1.0 place NaN NaN Abilene TX Taylor County 79601.0 US TIGER/Line® dataset from the US Census Bureau
1 Maryland Accident 54 39.628700 -79.319760 1.0 place NaN NaN Accident MD Garrett County 21520.0 US TIGER/Line® dataset from the US Census Bureau
2 Alaska Adak 10 51.829438 -176.629994 1.0 place NaN NaN Adak AK Aleutians West Census Area 99546.0 US TIGER/Line® dataset from the US Census Bureau
3 Michigan Adrian 140 41.889943 -84.065892 1.0 place NaN NaN Adrian MI Lenawee County 49221.0 US TIGER/Line® dataset from the US Census Bureau
4 Colorado Alamosa 350 37.479933 -105.790967 1.0 place NaN NaN Alamosa CO Alamosa County 81101.0 US TIGER/Line® dataset from the US Census Bureau

Geocodio added a few extra columns:

  • Latitude
  • Longitude
  • Accuracy Score
  • Accuracy Type
  • Number
  • Street
  • City (rendered in one of the dataframes as City.1 because City already existed)
  • State (rendered in both of the dataframes as State.1 because State already existed)
  • County
  • Zip
  • Country (rendered in one of the dataframes as Country.1 because Country already existed)
  • Source

Let's determine which rows Geocodio wasn't confident about and which rows it wasn't able to geocode at all. First, we'll count the number of rows with accuracy scores less than 1 (note the slight differences in column names between the two dataframes):

sm_count = len(sm_gc[sm_gc['Accuracy Score'] < 1.00][['City_State', 'City', 'State.1', 'Latitude', 'Longitude']])
wiki_count = len(wiki_gc[wiki_gc['Accuracy Score'] < 1.00][['State', 'City', 'City.1', 'State.1', 'Latitude', 'Longitude']])
print(sm_count)
print(wiki_count)
32
16
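
The filter above only catches low-confidence matches; rows that Geocodio couldn't geocode at all typically come back with empty coordinate fields, so they won't satisfy the < 1 comparison. A quick way to count those (a small sketch using the same dataframes) is to check for null latitudes:

# Rows Geocodio returned no coordinates for at all
print(sm_gc['Latitude'].isnull().sum())
print(wiki_gc['Latitude'].isnull().sum())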

Since there aren't too many rows, we can manually check each row that Geocodio wasn't confident about. Let's increase the number of rows pandas displays in the Jupyter notebook to make it easier to verify each row.

pd.options.display.max_rows = 50

Here's a workflow we can use:

  • Compare the original city and state values with the ones Geocodio returned and see if they match (or if Geocodio even found a match).
    • If they don't match or are missing entirely, use Google Maps to search for the location, click and hold in the center of the outlined region to drop a point, and copy the latitude and longitude coordinates. Create a dictionary that maps row index values to a list containing the coordinates: sm_replace_dict = { 15: [30.6266, 81.4609], ... }
    • For marches that don't have a clear location (e.g. Hospital Ward, CA), store the index value in a separate list and we'll drop those rows all at once.

Here's the resulting code for transforming sm_gc:

sm_replace_dict = {
    15: [30.6266, 81.4609],
    70: [37.191807, -108.078634],
    78: [42.914664, -78.864897],
    97: [37.555616, -76.304590],
    159: [35.047669, -108.323769],
    234: [48.199181, -120.773236],
    242: [39.245103, -76.914071],
    326: [39.956147, -74.922647],
    331: [41.379763, -70.649143],
    376: [30.013898, -90.013031],
    475: [17.716221, -64.831193],
    513: [46.486260, -84.355558],
    528: [32.459338, -93.769839],
    539: [38.978815, -119.932676],
    573: [44.224584, -74.462409],
    581: [43.095693, -75.229880]
}

sm_drop = [238]

for k,v in sm_replace_dict.items():
    sm_gc.at[k, 'Latitude'] = v[0]
    sm_gc.at[k, 'Longitude'] = v[1]

for i in sm_drop:
    sm_gc = sm_gc.drop(i, axis=0)
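
As an aside, if you'd rather not loop, the same fix-up can be written by building a small dataframe from the dictionary and letting pandas align on the index. This is a sketch equivalent to the loops above, assuming the sm_replace_dict and sm_drop objects we just defined:

# Build a dataframe indexed by row number with the corrected coordinates
fixes = pd.DataFrame.from_dict(sm_replace_dict, orient='index', columns=['Latitude', 'Longitude'])
# update() aligns on index and column names and overwrites the old values in place
sm_gc.update(fixes)
sm_gc = sm_gc.drop(sm_drop)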

Here's the equivalent code for transforming wiki_gc (wiki_replace_dict and wiki_drop are built using the same workflow):

for k,v in wiki_replace_dict.items():
    wiki_gc.at[k, 'Latitude'] = v[0]
    wiki_gc.at[k, 'Longitude'] = v[1]

for i in wiki_drop:
    wiki_gc = wiki_gc.drop(i, axis=0)

Let's select just the columns we want to use for analysis and visualization and assign them to a new dataframe:

sm_keep_cols = ['City_State', 'State', 'Average', 'Latitude', 'Longitude']
wiki_keep_cols = ['City', 'State', 'Turnout', 'Latitude', 'Longitude']

# .copy() so that later column assignments modify these new dataframes directly
sm_ds = sm_gc[sm_keep_cols].copy()
wiki_ds = wiki_gc[wiki_keep_cols].copy()

Lastly, let's convert both of the columns describing the turnout numbers to numeric columns.

wiki_ds['Turnout'] = pd.to_numeric(wiki_ds['Turnout'])
sm_ds['Average'] = pd.to_numeric(sm_ds['Average'].str.replace(",", ""))

Now we're ready for some visualization!

Data Visualization

We've done multiple rounds of data cleaning, so let's first understand the number of marches represented in each dataset:

print(len(sm_ds))
print(len(wiki_ds))
627
506

Visualizing The Most Popular Marches

We'll start by looking at the marches with the highest turnouts from both datasets. First, let's import matplotlib and set the style to fivethirtyeight. While the fivethirtyeight matplotlib style provides a baseline for the aesthetics, it doesn't replicate FiveThirtyEight's plots entirely. If you're interested in learning how to replicate their style completely, check out How To Generate FiveThirtyEight Graphs in Python.

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('fivethirtyeight')

We'll use the DataFrame.plot() method to generate bar plots, assign the returned matplotlib Axes object to a variable, and then call methods on it to modify each plot.

colors = [[0,158/255,115/255]]
sm_top_10_plot = sm_ds.sort_values(by='Average', ascending=False)[0:10].plot(x='City_State', y='Average', kind='bar', figsize = (10,6), fontsize=16, color=colors)
sm_top_10_plot.legend_.remove()

sm_top_10_plot.set_title("Top 10 Marches Crowdsourced From Social Media")
sm_top_10_plot.set_xlabel('City', fontsize=20)
sm_top_10_plot.set_ylabel('Turnout', fontsize=20)

[Image: bar plot of the top 10 marches crowdsourced from social media]

Let's replicate this for the dataset crowdsourced from Wikipedia:

wiki_top_10_plot = wiki_ds.sort_values(by='Turnout', ascending=False)[0:10].plot(x='City', y='Turnout', kind='barh', figsize = (10,6), fontsize=16, color=colors)
wiki_top_10_plot.legend_.remove()
wiki_top_10_plot.set_title("Top 10 Marches Crowdsourced From Wikipedia")
wiki_top_10_plot.set_ylabel('City', fontsize=18, labelpad=20)
wiki_top_10_plot.set_xlabel('Turnout', fontsize=18, labelpad=20)

[Image: bar plot of the top 10 marches crowdsourced from Wikipedia]

It looks like Los Angeles, Washington DC, New York, and Chicago stood out in terms of turnout (each over 200,000 people). To make these bar plots easier to compare, we'll:

  • Add state names to wiki_ds (currently only contains state codes) and add state codes to sm_ds (currently only contains state names).
  • Combine the City and State columns from the wiki_ds dataframe into a single City_State column.
state_codes = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

state_names = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
  "Connecticut","District Of Columbia", "Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
  "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
  "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
  "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
  "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
  "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
  "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]
  
# Which state codes aren't in the list?
sm_ds[~sm_ds['State'].isin(state_codes)]
City_State State Average Latitude Longitude
102 Christiansted, VI VI 450 17.734211 -64.734694
131 Cruz Bay, VI VI 200 18.331343 -64.793750
168 Esperanza, PR PR 325 18.097187 -65.470714
219 Hagatna, Guam GM 200 13.444257 144.786297
475 St. Croix, VI VI 250 17.716221 -64.831193
477 St. John, VI VI 60 18.328160 -64.740737
500 San Juan, PR PR 300 18.465901 -66.103568
591 Vieques, PR PR 200 18.123347 -65.460356

Now let's figure out which state names aren't in the list of state names:

wiki_ds[~wiki_ds['State'].isin(state_names)]
City State Turnout Latitude Longitude City_State
263 Mayaguez Puerto Rico 0 18.181938 -67.133802 Mayaguez, Puerto Rico
390 San Juan Puerto Rico 0 18.465901 -66.103568 San Juan, Puerto Rico
401 Santurce Puerto Rico 0 18.452679 -66.078113 Santurce, Puerto Rico
430 St. Croix United States Virgin Islands 500 17.718837 -64.807850 St. Croix, United States Virgin Islands
431 St. John United States Virgin Islands 200 18.337598 -64.735584 St. John, United States Virgin Islands
438 St. Thomas United States Virgin Islands 300 18.349667 -64.930083 St. Thomas, United States Virgin Islands
474 Vieques Puerto Rico 200 18.123347 -65.460356 Vieques, Puerto Rico
479 Washington DC Washington DC 500000 38.886988 -77.013516 Washington DC, Washington DC

Let's drop all of the flagged rows describing marches outside the mainland U.S. and manually handle the row describing the Washington DC march (because it's an edge case).

sm_ds = sm_ds.drop([102, 131, 168, 219, 475, 477, 500, 591])
wiki_ds = wiki_ds.drop([263, 390, 401, 430, 431, 438, 474])
wiki_ds.at[479, "City"] = "Washington"
wiki_ds.at[479, "State"] = "District Of Columbia"

Now, let's add a State_Codes and a State_Names column for both dataframes.

sm_ds['State_Codes'] = sm_ds['State']
sm_ds['State_Names'] = ""

for key, row in sm_ds.iterrows():
    list_index = 0
    current_code = row['State_Codes']
    for counter, name in enumerate(state_codes):
        if name == current_code:
            list_index = counter
    sm_ds.at[key, 'State_Names'] = state_names[list_index]
    
wiki_ds['State_Names'] = wiki_ds['State']
wiki_ds['State_Codes'] = ""

for key, row in wiki_ds.iterrows():
    list_index = 0
    current_name = row['State_Names']
    for counter, name in enumerate(state_names):
        if name == current_name:
            list_index = counter
    wiki_ds.at[key, 'State_Codes'] = state_codes[list_index]
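
If you prefer a more idiomatic pandas approach, the same lookup can be done with a pair of dictionaries and Series.map(). This is a sketch equivalent to the loops above, assuming the state_codes and state_names lists we defined earlier:

# Build lookup dictionaries once, then map each column in a single pass
code_to_name = dict(zip(state_codes, state_names))
name_to_code = dict(zip(state_names, state_codes))

sm_ds['State_Names'] = sm_ds['State_Codes'].map(code_to_name)
wiki_ds['State_Codes'] = wiki_ds['State_Names'].map(name_to_code)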

Here's what sm_ds looks like after this transformation:

City_State State Average Latitude Longitude State_Codes State_Names
0 Abilene, TX TX 200 32.576489 -99.665323 TX Texas
1 Accident, MD MD 54 39.628700 -79.319760 MD Maryland
... ... ... ... ... ... ... ...
626 Yuma, AZ AZ 10 32.701461 -114.657232 AZ Arizona
627 Zebulon, GA GA 35 33.077371 -84.321736 GA Georgia

While sm_ds contains a friendly column for displaying both the city and state code in our plots (City_State), the wiki_ds dataframe doesn't. We'll add it by concatenating two of the columns:

wiki_ds['City_State'] = wiki_ds['City'] + ', ' + wiki_ds['State_Codes']

Finally, let's create a grid of matplotlib subplots, 2 rows by 1 column and display both bar plots.

fig = plt.figure(figsize=(8,12))
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)

sm_ds.sort_values(by='Average', ascending=False)[0:10].plot(x='City_State', y='Average', kind='barh', ax=ax1, color=colors)
wiki_ds.sort_values(by='Turnout', ascending=False)[0:10].plot(x='City_State', y='Turnout', kind='barh', ax=ax2, color=colors)

ax1.legend_.remove()
ax1.set_title("Top 10 Marches Crowdsourced From Social Media \n")

ax2.legend_.remove()
ax2.set_title("Top 10 Marches Crowdsourced From Wikipedia \n")

[Image: combined top 10 bar plots]

Visualizing The Number Of Marches Per State

Let's now visualize the number of marches that occurred in each state as a histogram. By default, matplotlib uses float values for the y-axis tick labels, so we'll fetch those tick labels, cast them to integers, and then set them back. Let's start with the social media dataset.

sm_state_hist = sm_ds['State_Names'].value_counts().plot(kind='hist', figsize=(8,6), bins=15, title="Distribution Of Marches Per State")
sm_state_hist.set_xlabel("# Marches")
sm_state_hist.set_ylabel("# States")

y_labels = [int(item) for item in sm_state_hist.get_yticks().tolist()]
sm_state_hist.set_yticklabels(y_labels)

plt.savefig("sm_state_hist.png", bbox_inches="tight")

[Image: histogram of marches per state (social media dataset)]
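
Casting the tick values works, but matplotlib can also be told to place ticks only at integer positions, which avoids the labels drifting out of sync if the ticks are later recomputed. Here's a small alternative sketch using MaxNLocator (not part of the original code):

from matplotlib.ticker import MaxNLocator

# Ask matplotlib to place y-axis ticks only at integer positions
sm_state_hist.yaxis.set_major_locator(MaxNLocator(integer=True))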

More than half of the states had around 12 marches or fewer. There's also a clear outlier state, with around 72 marches. If we plot the Wikipedia dataset, we observe a pretty similar histogram.

wiki_state_hist = wiki_ds['State_Names'].value_counts().plot(kind='hist', figsize=(8,6), bins=15, title="Distribution Of Marches Per State. Source: Wikipedia")
wiki_state_hist.set_xlabel("# Marches")
wiki_state_hist.set_ylabel("# States")

y_labels = [int(item) for item in wiki_state_hist.get_yticks().tolist()]
wiki_state_hist.set_yticklabels(y_labels)

[Image: histogram of marches per state (Wikipedia dataset)]

Which state is the outlier?

print("State with the most # of marches:", sm_ds['State_Names'].value_counts().index[0])
State with the most # of marches: California

While number of marches per state is interesting, we can also dive into the total number of people who marched per state. This could potentially be a better indicator of the collective strength and spirit of the movement in those states.

Visualizing The Total Turnout Per State

Let's take advantage of pandas groupby and aggregation capabilities to group all of the marches in both dataframes by states and then sort by the total turnout per state.

sm_ds.groupby('State_Names').agg(['count', 'sum'])['Average'].sort_values('sum', ascending=False)[0:15]
count sum
State_Names
California 77 948661
District Of Columbia 1 750000
New York 28 502436
Illinois 9 262145
Massachusetts 15 183596
Washington 33 177439
Colorado 25 167470
Oregon 25 118820
Minnesota 12 98410
Texas 20 95415
Florida 21 93441
Wisconsin 17 91098
Pennsylvania 24 87018
Georgia 6 66185
North Carolina 16 61778

The states District of Columbia, Illinois, and Georgia stand out because of how few marches led to such high turnouts. We see the same pattern in the Wikipedia dataset as well:

wiki_ds.groupby('State_Names').agg(['count', 'sum'])['Turnout'].sort_values('sum', ascending=False)[0:15]
count sum
State_Names
California 62 1184930
District Of Columbia 1 500000
New York 25 452916
Illinois 9 260595
Washington 27 213893
Massachusetts 10 169553
Colorado 16 162349
Oregon 24 127856
Minnesota 12 98274
Texas 20 86335
Pennsylvania 17 83540
Florida 20 78370
Georgia 6 62535
North Carolina 11 59270
Arizona 6 38450

Visualizing On A Map

We'll end this post by visualizing the marches on a map of the U.S. We'll use the folium package, which provides a Python API for generating interactive maps (rendered with HTML, CSS, and JavaScript). Unfortunately, our blog doesn't currently support embedded JavaScript, so we'll include screenshots here instead.

To create a map in folium, we first need to import the relevant modules and then create a folium.Map object:

import folium
from folium.plugins import MarkerCluster
from folium.plugins import HeatMap

# Create the base map object
map_clusters = folium.Map()

Then, we extract the latitude and longitude values from the dataframe, convert them to a NumPy array, pass that array into the MarkerCluster() constructor, and use the MarkerCluster.add_to() method to attach the cluster of markers to the map object we created.

data = sm_ds[['Latitude', 'Longitude']].values
MarkerCluster(data).add_to(map_clusters)

map_clusters
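
Because the map is rendered as HTML and JavaScript, you can also write it out as a standalone file and open it in any browser, which is handy for sharing outside a notebook (the filename here is just an example):

# Write the interactive map to a standalone HTML file
map_clusters.save("womens_marches_clusters.html")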

Here's what the map looks like zoomed in twice, where only clusters of marches are displayed:

[Image: map zoomed out, showing clusters of marches]

As we zoom in more, individual marches are displayed as markers instead of clusters.

[Image: map zoomed in, showing individual markers]

If we zoom in one more time and focus just on the mainland U.S. (excluding Alaska, Hawaii, and the Virgin Islands), we can observe a wonderful distribution of activity.

[Image: map of marches across the mainland U.S.]

This map lets us visually understand the spectrum of marches:

[Image: gradient]

We highly recommend generating these interactive maps on your own so you can explore the different marches more effectively.
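
While you're exploring, note that the HeatMap plugin we imported earlier (but didn't use above) turns the same coordinates into a density map in a couple of lines. This is a sketch, not a visualization from the original post; the starting location is roughly the center of the mainland U.S.:

# Sketch: a density heat map built from the same coordinate pairs
map_heat = folium.Map(location=[39.8, -98.6], zoom_start=4)
HeatMap(sm_ds[['Latitude', 'Longitude']].values.tolist()).add_to(map_heat)
map_heat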

Next Steps

There's still a lot more you can explore from a visualization stand point. Here are some ideas for next steps:

  • Scale the radius of the clusters and markers on the map by the turnout values, just like in the original Vox visualization (see the sketch after this list).
  • Create grouped bar plots that visualize counts from both datasets.
  • Create a map that blends data from both datasets.
  • Add county level election results data to the maps to augment the marches.
  • Recreate the visualizations done on Vox here and here.
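
To get you started on the first idea, here's a rough sketch of scaling markers by turnout with folium.CircleMarker. The square-root scaling and the divisor are arbitrary choices, and the code assumes the sm_ds dataframe built above:

import math

map_scaled = folium.Map(location=[39.8, -98.6], zoom_start=4)
for _, row in sm_ds.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        # Arbitrary scaling so large marches stay readable and tiny ones stay visible
        radius=max(3, math.sqrt(row['Average']) / 25),
        fill=True
    ).add_to(map_scaled)
map_scaled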

Srini Kadamati

Director of Content at Dataquest.io. Based in Austin, TX.
