Data Enrichment
Because we eventually want to build maps visualizing the marches, we need latitude and longitude coordinates. If we preview the dataframes, you’ll notice that both are missing this information:
sm_marches.head()
| | City_State | State | Country | Low | Average | High |
|---|---|---|---|---|---|---|
| 0 | Abilene, TX | TX | US | 200 | 200 | 200 |
| 1 | Accident, MD | MD | US | 54 | 54 | 54 |
| 2 | Adak, AK | AK | US | 10 | 10 | 10 |
| 3 | Adrian, MI | MI | US | 150 | 150 | 150 |
| 4 | Ajo, AZ | AZ | US | 250 | 250 | 250 |
wiki_marches.head()
| | State | City | Turnout |
|---|---|---|---|
| 0 | Texas | Abilene | 200 |
| 1 | Maryland | Accident | 54 |
| 2 | Alaska | Adak | 10 |
| 3 | Michigan | Adrian | 140 |
| 4 | Colorado | Alamosa | 350 |
Another issue is that one dataframe uses state codes (like TX) while the other uses the full state names (like Texas). Thankfully, we can solve both the missing geodata and the state representation problem using a geocoding service. Geocoding is the process of converting an address to a concrete pair of latitude and longitude coordinates. Using a geocoding service to improve the dataset is an example of data enrichment. Data enrichment is extremely common when working on data science projects, because the raw data rarely contains all of the information we’re interested in. While we don’t have the specific addresses of where the marches occurred, geocoding services will accept a city and state combination (e.g. a city of Abilene and a state of TX or Texas). There are many geocoding services, but we’ll use Geocodio, as it’s one I’m familiar with and it has a free tier of 2,500 lookups. We can use Geocodio either by uploading a spreadsheet file or through their API. Since we only have two small spreadsheets, we’ll use their upload tool. Note that using Geocodio does require you to create an account. Here’s a list of some other geocoding services, if you don’t want to use Geocodio: https://geoservices.tamu.edu/Services/Geocode/OtherGeocoders/
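As an aside, if you’d rather geocode programmatically than through an upload tool, a library like geopy offers a similar city-and-state lookup. This is just a minimal sketch using the free Nominatim service (not the Geocodio workflow used in this post, and subject to its own usage limits):
from geopy.geocoders import Nominatim

# Look up a single "city, state" string and print its coordinates
geolocator = Nominatim(user_agent="womens-march-example")
location = geolocator.geocode("Abilene, TX")
print(location.latitude, location.longitude)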
Let’s export sm_marches as a CSV file using DataFrame.to_csv():
sm_marches.to_csv("sm_marches.csv", index=False)
Upload both sm_marches.csv and FinalWiki.csv to Geocodio. The service will ask you to specify which columns represent the address, state, country, zip, etc. (you don’t need to have all of the fields). Then, Geocodio will append new columns with the service’s best guesses for each row and let you download the new CSV files back to your computer. Finally, rename these files as FinalWiki_geocodio.csv and sm_csv_geocodio.csv. Let’s read both of these files back into dataframes and explore the columns that were added.
sm_gc = pd.read_csv("sm_csv_geocodio.csv")
wiki_gc = pd.read_csv("FinalWiki_geocodio.csv")
sm_gc.head()
| | City_State | State | Country | Low | Average | High | Latitude | Longitude | Accuracy Score | Accuracy Type | Number | Street | City | State.1 | County | Zip | Country.1 | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene, TX | TX | US | 200 | 200 | 200 | 32.576489 | -99.665323 | 1.0 | place | NaN | NaN | Abilene | TX | Taylor County | 79601.0 | US | TIGER/Line® dataset from the US Census Bureau |
| 1 | Accident, MD | MD | US | 54 | 54 | 54 | 39.628700 | -79.319760 | 1.0 | place | NaN | NaN | Accident | MD | Garrett County | 21520.0 | US | TIGER/Line® dataset from the US Census Bureau |
| 2 | Adak, AK | AK | US | 10 | 10 | 10 | 51.829438 | -176.629994 | 1.0 | place | NaN | NaN | Adak | AK | Aleutians West Census Area | 99546.0 | US | TIGER/Line® dataset from the US Census Bureau |
| 3 | Adrian, MI | MI | US | 150 | 150 | 150 | 41.889943 | -84.065892 | 1.0 | place | NaN | NaN | Adrian | MI | Lenawee County | 49221.0 | US | TIGER/Line® dataset from the US Census Bureau |
| 4 | Ajo, AZ | AZ | US | 250 | 250 | 250 | 32.384890 | -112.890110 | 1.0 | place | NaN | NaN | Ajo | AZ | Pima County | 85321.0 | US | TIGER/Line® dataset from the US Census Bureau |
wiki_gc.head()
| | State | City | Turnout | Latitude | Longitude | Accuracy Score | Accuracy Type | Number | Street | City.1 | State.1 | County | Zip | Country | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Texas | Abilene | 200 | 32.576489 | -99.665323 | 1.0 | place | NaN | NaN | Abilene | TX | Taylor County | 79601.0 | US | TIGER/Line® dataset from the US Census Bureau |
| 1 | Maryland | Accident | 54 | 39.628700 | -79.319760 | 1.0 | place | NaN | NaN | Accident | MD | Garrett County | 21520.0 | US | TIGER/Line® dataset from the US Census Bureau |
| 2 | Alaska | Adak | 10 | 51.829438 | -176.629994 | 1.0 | place | NaN | NaN | Adak | AK | Aleutians West Census Area | 99546.0 | US | TIGER/Line® dataset from the US Census Bureau |
| 3 | Michigan | Adrian | 140 | 41.889943 | -84.065892 | 1.0 | place | NaN | NaN | Adrian | MI | Lenawee County | 49221.0 | US | TIGER/Line® dataset from the US Census Bureau |
| 4 | Colorado | Alamosa | 350 | 37.479933 | -105.790967 | 1.0 | place | NaN | NaN | Alamosa | CO | Alamosa County | 81101.0 | US | TIGER/Line® dataset from the US Census Bureau |
Geocodio added a few extra columns:
- Latitude
- Longitude
- Accuracy Score
- Accuracy Type
- Number
- Street
- City (rendered in one of the dataframes as City.1 because City already existed)
- State (rendered in both of the dataframes as State.1 because State already existed)
- County
- Zip
- Country (rendered in one of the dataframes as Country.1 because Country already existed)
- Source
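If you want to confirm which columns are new without eyeballing the tables, a quick set difference against the original dataframes works. A minimal sketch, assuming sm_marches and wiki_marches are still in memory:
# Columns present after geocoding but not before
print(set(sm_gc.columns) - set(sm_marches.columns))
print(set(wiki_gc.columns) - set(wiki_marches.columns))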
Let’s determine which rows Geocodio wasn’t confident about and which rows it wasn’t able to geocode at all. First, let’s count the number of rows with confidence scores less than 1 (note the slight differences in column names between the two dataframes):
sm_count = len(sm_gc[sm_gc['Accuracy Score'] < 1.00][['City_State', 'City', 'State.1', 'Latitude', 'Longitude']])
wiki_count = len(wiki_gc[wiki_gc['Accuracy Score'] < 1.00][['State', 'City', 'City.1', 'State.1', 'Latitude', 'Longitude']])
print(sm_count)
print(wiki_count)
32
16
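The counts above only cover rows that Geocodio scored below 1. Rows it couldn’t geocode at all typically come back with empty coordinates, which we can count separately (a minimal sketch):
# Rows where Geocodio returned no coordinates at all
print(sm_gc['Latitude'].isnull().sum())
print(wiki_gc['Latitude'].isnull().sum())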
Since there aren’t too many rows, we can manually check each row that Geocodio wasn’t confident about. Let’s increase the number of rows that pandas displays in Jupyter Notebook to make it easier to verify each row.
pd.options.display.max_rows = 50
Here’s a workflow we can use:

- Compare the original city and state values with the ones Geocodio returned and see if they match (or if Geocodio even found a match).
- If they don’t match or are missing entirely, use Google Maps to search for the location, click and hold in the center of the outlined region to drop a point, and copy the latitude and longitude coordinates. Create a dictionary that maps row index values to a list containing the coordinates:

sm_replace_dict = { 15: [30.6266, -81.4609], ... }

- For marches that don’t have a clear location (e.g. Hospital Ward, CA), store the index value in a separate list and we’ll drop those rows all at once.
Here’s the resulting code for transforming sm_gc:
sm_replace_dict = {
    15: [30.6266, -81.4609],
    70: [37.191807, -108.078634],
    78: [42.914664, -78.864897],
    97: [37.555616, -76.304590],
    159: [35.047669, -108.323769],
    234: [48.199181, -120.773236],
    242: [39.245103, -76.914071],
    326: [39.956147, -74.922647],
    331: [41.379763, -70.649143],
    376: [30.013898, -90.013031],
    475: [17.716221, -64.831193],
    513: [46.486260, -84.355558],
    528: [32.459338, -93.769839],
    539: [38.978815, -119.932676],
    573: [44.224584, -74.462409],
    581: [43.095693, -75.229880]}
sm_drop = [238]

# Overwrite the coordinates Geocodio wasn't confident about
for k, v in sm_replace_dict.items():
    sm_gc.at[k, 'Latitude'] = v[0]
    sm_gc.at[k, 'Longitude'] = v[1]

# Drop the marches without a clear location
for i in sm_drop:
    sm_gc = sm_gc.drop(i, axis=0)
Here’s the code for transforming wiki_gc (wiki_replace_dict and wiki_drop are built the same way as sm_replace_dict and sm_drop above):
for k, v in wiki_replace_dict.items():
    wiki_gc.at[k, 'Latitude'] = v[0]
    wiki_gc.at[k, 'Longitude'] = v[1]

for i in wiki_drop:
    wiki_gc = wiki_gc.drop(i, axis=0)
Let’s select just the columns we want to use for analysis and visualization and assign them to a new dataframe:
sm_keep_cols = ['City_State', 'State', 'Average', 'Latitude', 'Longitude']
wiki_keep_cols = ['City', 'State', 'Turnout', 'Latitude', 'Longitude']
# .copy() avoids a SettingWithCopyWarning when we modify these columns below
sm_ds = sm_gc[sm_keep_cols].copy()
wiki_ds = wiki_gc[wiki_keep_cols].copy()
Lastly, let’s convert both of the columns describing the turnout numbers to numeric columns.
wiki_ds['Turnout'] = pd.to_numeric(wiki_ds['Turnout'])
sm_ds['Average'] = pd.to_numeric(sm_ds['Average'].str.replace(",", ""))
Now we’re ready for some visualization!
Data Visualization
We’ve done multiple rounds of data cleaning, so let’s first understand the number of marches represented in each dataset:
print(len(sm_ds))
print(len(wiki_ds))
627
506
Visualizing The Most Popular Marches
We’ll start by looking at the marches with the highest turnouts from both datasets. Let’s import matplotlib and set the plotting style to fivethirtyeight. While the fivethirtyeight matplotlib style provides a baseline for aesthetics, it doesn’t replicate FiveThirtyEight’s plots entirely. If you’re interested in learning how to replicate their style completely, you should check out How To Generate FiveThirtyEight Graphs in Python.
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('fivethirtyeight')
We’ll use the DataFrame.plot() method to generate bar plots, assign the matplotlib Axes objects to a variable, and then use methods on those objects to modify each plot.
colors = [[0,158/255,115/255]]
sm_top_10_plot = sm_ds.sort_values(by='Average', ascending=False)[0:10].plot(x='City_State', y='Average', kind='bar', figsize=(10,6), fontsize=16, color=colors)
sm_top_10_plot.legend_.remove()
sm_top_10_plot.set_title("Top 10 Marches Crowdsourced From Social Media")
sm_top_10_plot.set_xlabel('City', fontsize=20)
sm_top_10_plot.set_ylabel('Turnout', fontsize=20)
Let’s replicate this for the dataset crowdsourced from Wikipedia:
wiki_top_10_plot = wiki_ds.sort_values(by='Turnout', ascending=False)[0:10].plot(x='City', y='Turnout', kind='barh', figsize = (10,6), fontsize=16, color=colors)
wiki_top_10_plot.legend_.remove()
wiki_top_10_plot.set_title("Top 10 Marches Crowdsourced From Wikipedia")
wiki_top_10_plot.set_ylabel('City', fontsize=18, labelpad=20)
wiki_top_10_plot.set_xlabel('Turnout', fontsize=18, labelpad=20)
It looks like Los Angeles, Washington DC, New York, and Chicago stood out in terms of turnout (each over 200,000 people). To make these bar plots easier to compare, we’ll:
- Add state names to sm_ds (currently only contains state codes) and add state codes to wiki_ds (currently only contains state names).
- Combine the City and State columns from the wiki_ds dataframe into a single City_State column.
state_codes = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
state_names = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
"Connecticut","District Of Columbia", "Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
"Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
"Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
"Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
"North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
"Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
"Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]
# Which state codes aren't in the list?
sm_ds[~sm_ds['State'].isin(state_codes)]
| | City_State | State | Average | Latitude | Longitude |
|---|---|---|---|---|---|
| 102 | Christiansted, VI | VI | 450 | 17.734211 | -64.734694 |
| 131 | Cruz Bay, VI | VI | 200 | 18.331343 | -64.793750 |
| 168 | Esperanza, PR | PR | 325 | 18.097187 | -65.470714 |
| 219 | Hagatna, Guam | GM | 200 | 13.444257 | 144.786297 |
| 475 | St. Croix, VI | VI | 250 | 17.716221 | -64.831193 |
| 477 | St. John, VI | VI | 60 | 18.328160 | -64.740737 |
| 500 | San Juan, PR | PR | 300 | 18.465901 | -66.103568 |
| 591 | Vieques, PR | PR | 200 | 18.123347 | -65.460356 |
Now let’s figure out which state names aren’t in the list of state names:
wiki_ds[~wiki_ds['State'].isin(state_names)]
| | City | State | Turnout | Latitude | Longitude | City_State |
|---|---|---|---|---|---|---|
| 263 | Mayaguez | Puerto Rico | 0 | 18.181938 | -67.133802 | Mayaguez, Puerto Rico |
| 390 | San Juan | Puerto Rico | 0 | 18.465901 | -66.103568 | San Juan, Puerto Rico |
| 401 | Santurce | Puerto Rico | 0 | 18.452679 | -66.078113 | Santurce, Puerto Rico |
| 430 | St. Croix | United States Virgin Islands | 500 | 17.718837 | -64.807850 | St. Croix, United States Virgin Islands |
| 431 | St. John | United States Virgin Islands | 200 | 18.337598 | -64.735584 | St. John, United States Virgin Islands |
| 438 | St. Thomas | United States Virgin Islands | 300 | 18.349667 | -64.930083 | St. Thomas, United States Virgin Islands |
| 474 | Vieques | Puerto Rico | 200 | 18.123347 | -65.460356 | Vieques, Puerto Rico |
| 479 | Washington DC | Washington DC | 500000 | 38.886988 | -77.013516 | Washington DC, Washington DC |
Let’s drop all of the rows describing marches outside the 50 states (all but the D.C. one) and manually handle the row describing the Washington DC march (because it’s an edge case).
sm_ds = sm_ds.drop([102, 131, 168, 219, 475, 477, 500, 591])
wiki_ds = wiki_ds.drop([263, 390, 401, 430, 431, 438, 474])
wiki_ds.at[479, "City"] = "Washington"
wiki_ds.at[479, "State"] = "District Of Columbia"
Now, let’s add a State_Codes and a State_Names column to both dataframes.
# Look up the full state name for each state code in sm_ds
sm_ds['State_Codes'] = sm_ds['State']
sm_ds['State_Names'] = ""

for key, row in sm_ds.iterrows():
    list_index = 0
    current_code = row['State_Codes']
    for counter, name in enumerate(state_codes):
        if name == current_code:
            list_index = counter
    sm_ds.at[key, 'State_Names'] = state_names[list_index]
# Look up the state code for each full state name in wiki_ds
wiki_ds['State_Names'] = wiki_ds['State']
wiki_ds['State_Codes'] = ""

for key, row in wiki_ds.iterrows():
    list_index = 0
    current_name = row['State_Names']
    for counter, name in enumerate(state_names):
        if name == current_name:
            list_index = counter
    wiki_ds.at[key, 'State_Codes'] = state_codes[list_index]
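As an aside, the same lookups can be written more compactly with dictionaries and Series.map. This is just an equivalent sketch, assuming state_codes and state_names are defined as above:
# Build code <-> name lookup tables once
code_to_name = dict(zip(state_codes, state_names))
name_to_code = dict(zip(state_names, state_codes))

sm_ds['State_Names'] = sm_ds['State_Codes'].map(code_to_name)
wiki_ds['State_Codes'] = wiki_ds['State_Names'].map(name_to_code)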
Here’s what sm_ds looks like after this transformation:
| | City_State | State | Average | Latitude | Longitude | State_Codes | State_Names |
|---|---|---|---|---|---|---|---|
| 0 | Abilene, TX | TX | 200 | 32.576489 | -99.665323 | TX | Texas |
| 1 | Accident, MD | MD | 54 | 39.628700 | -79.319760 | MD | Maryland |
| … | … | … | … | … | … | … | … |
| 626 | Yuma, AZ | AZ | 10 | 32.701461 | -114.657232 | AZ | Arizona |
| 627 | Zebulon, GA | GA | 35 | 33.077371 | -84.321736 | GA | Georgia |
While sm_ds contains a friendly column for displaying both the city and state code in our plots (City_State), the wiki_ds dataframe doesn’t. We’ll add it by concatenating two of the columns:
wiki_ds['City_State'] = wiki_ds['City'] + ', ' + wiki_ds['State_Codes']
Finally, let’s create a grid of matplotlib subplots (2 rows by 1 column) and display both bar plots.
fig = plt.figure(figsize=(8,12))
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
sm_top_10_plot = sm_ds.sort_values(by='Average', ascending=False)[0:10].plot(x='City_State', y='Average', kind='barh', ax=ax1, color=colors)
wiki_ds.sort_values(by='Turnout', ascending=False)[0:10].plot(x='City_State', y='Turnout', kind='barh', ax=ax2, color=colors)
ax1.legend_.remove()
ax1.set_title("Top 10 Marches Crowdsourced From Social Media \n")
ax2.legend_.remove()
ax2.set_title("Top 10 Marches Crowdsourced From Wikipedia \n")
Visualizing The Number Of Marches Per State
Let’s now visualize the number of marches that occurred in each state as a histogram. By default, float values are used for the y-axis tick labels. We can fetch those tick labels, cast them to integer values, and then set them back. Let’s start with the social media dataset.
sm_state_hist = sm_ds['State_Names'].value_counts().plot(kind='hist', figsize=(8,6), bins=15, title="Distribution Of Marches Per State")
sm_state_hist.set_xlabel("# Marches")
sm_state_hist.set_ylabel("# States")
y_labels = [int(item) for item in sm_state_hist.get_yticks().tolist()]
sm_state_hist.set_yticklabels(y_labels)
plt.savefig("sm_state_hist.png", bbox_inches="tight")
More than half of the states had around 12 marches or fewer. There’s also a clear outlier state, with around 72 marches. If we plot the Wikipedia dataset, we observe a pretty similar histogram.
wiki_state_hist = wiki_ds['State_Names'].value_counts().plot(kind='hist', figsize=(8,6), bins=15, title="Distribution Of Marches Per State. Source: Wikipedia")
wiki_state_hist.set_xlabel("# Marches")
wiki_state_hist.set_ylabel("# States")
y_labels = [int(item) for item in wiki_state_hist.get_yticks().tolist()]
wiki_state_hist.set_yticklabels(y_labels)
Which state is the outlier?
print("State with the most # of marches:", sm_ds['State_Names'].value_counts().index[0])
California
While the number of marches per state is interesting, we can also dive into the total number of people who marched in each state. This could potentially be a better indicator of the collective strength and spirit of the movement in those states.
Visualizing The Total Turnout Per State
Let’s take advantage of pandas groupby and aggregation capabilities to group all of the marches in both dataframes by state and then sort by the total turnout per state.
sm_ds.groupby('State_Names').agg(['count', 'sum'])['Average'].sort_values('sum', ascending=False)[0:15]
| State_Names | count | sum |
|---|---|---|
| California | 77 | 948661 |
| District Of Columbia | 1 | 750000 |
| New York | 28 | 502436 |
| Illinois | 9 | 262145 |
| Massachusetts | 15 | 183596 |
| Washington | 33 | 177439 |
| Colorado | 25 | 167470 |
| Oregon | 25 | 118820 |
| Minnesota | 12 | 98410 |
| Texas | 20 | 95415 |
| Florida | 21 | 93441 |
| Wisconsin | 17 | 91098 |
| Pennsylvania | 24 | 87018 |
| Georgia | 6 | 66185 |
| North Carolina | 16 | 61778 |
The states District of Columbia, Illinois, and Georgia stand out because of how few marches led to such high turnouts. This is replicated in the Wikipedia dataset as well:
wiki_ds.groupby('State_Names').agg(['count', 'sum'])['Turnout'].sort_values('sum', ascending=False)[0:15]
| State_Names | count | sum |
|---|---|---|
| California | 62 | 1184930 |
| District Of Columbia | 1 | 500000 |
| New York | 25 | 452916 |
| Illinois | 9 | 260595 |
| Washington | 27 | 213893 |
| Massachusetts | 10 | 169553 |
| Colorado | 16 | 162349 |
| Oregon | 24 | 127856 |
| Minnesota | 12 | 98274 |
| Texas | 20 | 86335 |
| Pennsylvania | 17 | 83540 |
| Florida | 20 | 78370 |
| Georgia | 6 | 62535 |
| North Carolina | 11 | 59270 |
| Arizona | 6 | 38450 |
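To put a number on that observation, we could add a 'mean' aggregation and sort by average turnout per march. This is just an optional sketch, not part of the original analysis:
# Average turnout per march in each state, largest first
wiki_ds.groupby('State_Names')['Turnout'].agg(['count', 'sum', 'mean']).sort_values('mean', ascending=False)[0:15]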
Visualizing On A Map
We’ll end this post by visualizing the marches on a map of the U.S. We’ll use the folium package, which provides a Python API for generating interactive maps (which are rendered in HTML, CSS, and JavaScript). Our blog currently doesn’t support embedded JavaScript unfortunately, so we’ll include screenshots here instead. To create a map in folium, we first need to import the relevant modules and then create a folium.folium.Map object:
import folium
from folium import plugins
from folium.plugins import MarkerCluster
from folium.plugins import HeatMap

# Create the base map object
map_clusters = folium.Map()
Then, we extract the latitude and longitude values from a dataframe, convert them to a numpy array, pass that array into the MarkerCluster() constructor, and then use the MarkerCluster.add_to() method to link the cluster of markers to the map object we created.
data = sm_ds[['Latitude', 'Longitude']].values
MarkerCluster(data).add_to(map_clusters)
map_clusters
Here’s what the map looks like zoomed in twice, where only clusters of marches are displayed:
As we zoom in more, individual marches are displayed as markers instead of clusters.
If we zoom in one more time and focus just on the mainland U.S. (excluding Alaska, the Virgin Islands, and Hawaii), we can observe a wonderful distribution of activity.
This map lets us visually understand the spectrum of marches:
We highly recommend generating these interactive maps on your own so you can explore the different marches more effectively.
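Note that HeatMap was imported above but never used. If you’d like to see the marches as a heat map instead of marker clusters, a minimal sketch (weighting every march equally) looks like this:
# Build a separate map and overlay a heat map of march locations
map_heat = folium.Map()
HeatMap(sm_ds[['Latitude', 'Longitude']].values.tolist()).add_to(map_heat)
map_heat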
Next Steps
There’s still a lot more you can explore from a visualization standpoint. Here are some ideas for next steps:
- Scale the radius of the clusters and markers on the map by the turnout values, just like in the original Vox visualization (see the sketch after this list).
- Create grouped bar plots that visualize counts from both datasets.
- Create a map that blends data from both datasets.
- Add county level election results data to the maps to augment the marches.
- Recreate the visualizations done on Vox here and here.
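For the first idea, one possible starting point is folium.CircleMarker with the radius derived from the turnout. This is only a rough sketch (assuming sm_ds is still in memory), not a reconstruction of the Vox visualization:
# Draw one circle per march, with radius loosely tied to turnout
map_scaled = folium.Map()
for _, row in sm_ds.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=(row['Average'] ** 0.5) / 10,  # square-root scaling is arbitrary; it keeps big cities from dominating
        fill=True,
    ).add_to(map_scaled)
map_scaled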