Clustering is a powerful way to split up datasets into groups based on similarity. A very popular clustering algorithm is k-means clustering. In k-means clustering, we divide data up into a fixed number of clusters while trying to ensure that the items in each cluster are as similar as possible.
In this post, we'll cluster US Senators using an interactive Python environment. We'll use the voting history from the 114th Congress to split Senators into clusters.
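To make the idea behind k-means concrete, here is a minimal sketch of the usual iterative procedure (often called Lloyd's algorithm) written with NumPy. It is only an illustration, not the implementation used later in this post: the function name `kmeans_sketch`, its parameters, and the toy points are all made up for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Toy k-means (Lloyd's algorithm); for illustration only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Two made-up, well-separated blobs of 2-D points
rng = np.random.default_rng(42)
toy_points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])
toy_labels, toy_centers = kmeans_sketch(toy_points, k=2)
print(toy_labels)
```

The two steps inside the loop (assign points to the nearest center, then recompute each center) are what "making items in each cluster as similar as possible" amounts to in practice.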
Loading in the data
We have a csv file that contains all the votes from the 114th Senate. You can download the file here.
Each row contains the votes of an individual senator. Votes are coded as
0 for "No",
1 for "Yes", and
0.5 for "Abstain".
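As a rough sketch of how a coding like this could be produced, the snippet below maps text labels to numbers with pandas, assuming "No" maps to 0, "Yes" to 1, and "Abstain" to 0.5. The senator names, the raw labels, and the `vote_codes` dictionary are hypothetical; the csv file used in this post already contains the numeric values.

```python
import pandas as pd

# Hypothetical raw roll-call records; names and labels are made up for illustration
raw_votes = pd.DataFrame({
    "name": ["Senator A", "Senator B", "Senator C"],
    "00001": ["Yes", "No", "Abstain"],
    "00004": ["No", "No", "Yes"],
})

# Assumed numeric coding: abstaining sits halfway between "No" and "Yes"
vote_codes = {"No": 0.0, "Yes": 1.0, "Abstain": 0.5}

encoded = raw_votes.copy()
for column in ["00001", "00004"]:
    encoded[column] = raw_votes[column].map(vote_codes)
print(encoded)
```

Placing abstentions at 0.5 means a senator who abstains is treated as equally distant from a "Yes" voter and a "No" voter on that bill.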
Here are the first three rows of the data:
name,party,state,00001,00004,00005,00006,00007,00008,00009,00010,00020,00026,00032,00038,00039,00044,00047
Alexander,R,TN,0,1,1,1,1,0,0,1,1,1,0,0,0,0,0
Ayotte,R,NH,0,1,1,1,1,0,0,1,0,1,0,1,0,1,0
We can read the csv file into Python using pandas.
import pandas as pd

# Read in the csv file
votes = pd.read_csv("114_congress.csv")

# As you can see, there are 100 senators, and they voted on 15 bills (we subtract 3 because the first 3 columns aren't bills).
print(votes.shape)

# We have more "Yes" votes than "No" votes overall
# (pd.Series(...).value_counts() replaces the deprecated top-level pd.value_counts)
print(pd.Series(votes.iloc[:, 3:].values.ravel()).value_counts())
(100, 18)
1.0    803
0.0    669
0.5     28
dtype: int64
Initial k-means clustering
k-means clustering will try to make clusters out of the senators.
Each cluster will contain senators whose votes are as similar to each other as possible.
We'll need to specify the number of clusters we want upfront.
Let's try 2 to see how that looks.
import pandas as pd

# The kmeans algorithm is implemented in the scikit-learn library
from sklearn.cluster import KMeans

# Create a kmeans model on our data, using 2 clusters. random_state helps ensure that the algorithm returns the same results each time.
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(votes.iloc[:, 3:])

# These are our fitted labels for clusters -- the first cluster has label 0, and the second has label 1.
labels = kmeans_model.labels_

# The clustering looks pretty good!
# It's separated everyone into parties just based on voting history
print(pd.crosstab(labels, votes["party"]))
party   D  I   R
row_0
0      41  2   0
1       3  0  54
Exploring people in the wrong cluster
We can now find out which senators are in the "wrong" cluster.
These senators are in the cluster associated with the opposite party.
# Let's call these types of voters "oddballs" (why not?)
# There aren't any republican oddballs
democratic_oddballs = votes[(labels == 1) & (votes["party"] == "D")]

# It looks like Reid has abstained a lot, which changed his cluster.
# Manchin seems like a genuine oddball voter.
print(democratic_oddballs["name"])
42    Heitkamp
56     Manchin
74        Reid
Name: name, dtype: object
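One way to check whether abstentions explain an unexpected cluster assignment is to count the 0.5 values in each senator's row. The sketch below does this on a small made-up table that mimics the layout of the votes data (three identifying columns followed by vote columns); the names and values in `toy_votes` are invented, and the same expression could be applied to the real `votes` DataFrame.

```python
import pandas as pd

# Made-up stand-in for the votes table (same layout: name, party, state, then vote columns)
toy_votes = pd.DataFrame({
    "name": ["A", "B", "C"],
    "party": ["D", "D", "R"],
    "state": ["XX", "YY", "ZZ"],
    "00001": [0.5, 1.0, 0.0],
    "00004": [0.5, 0.0, 1.0],
    "00005": [1.0, 0.5, 1.0],
})

# Count how many vote columns hold 0.5 (abstain) for each senator
abstentions = (toy_votes.iloc[:, 3:] == 0.5).sum(axis=1)
toy_votes["abstentions"] = abstentions
print(toy_votes[["name", "party", "abstentions"]])
```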
Plotting out the clusters
Let's explore our clusters a little more by plotting them out.
Each column of data is a dimension on a plot, and we can't visualize 15 dimensions.
We'll use principal component analysis to compress the vote columns into two.
Then, we can plot out all of our senators according to their votes, and shade
them by their k-means cluster.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca_2 = PCA(2)

# Turn the vote data into two columns with PCA
plot_columns = pca_2.fit_transform(votes.iloc[:, 3:18])

# Plot senators based on the two dimensions, and shade by cluster label
# (labels is the array of k-means cluster labels computed earlier)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=labels)
plt.show()
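Compressing 15 columns into two necessarily discards some information, so it can be worth checking how much of the variance the two components retain. The sketch below uses scikit-learn's `explained_variance_ratio_` attribute on a made-up vote matrix (two blocs voting in opposite ways, with a little random noise); the data and variable names are invented for illustration and are not the Senate votes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up vote matrix: 40 "senators" by 15 "bills", two blocs voting in opposite ways
rng = np.random.default_rng(0)
bloc_pattern = rng.integers(0, 2, size=15)
bloc_a = np.tile(bloc_pattern, (20, 1))
bloc_b = np.tile(1 - bloc_pattern, (20, 1))
toy_matrix = np.vstack([bloc_a, bloc_b]).astype(float)

# Flip a few votes at random so the blocs are not perfectly uniform
flips = rng.random(toy_matrix.shape) < 0.05
toy_matrix[flips] = 1 - toy_matrix[flips]

pca = PCA(n_components=2)
pca.fit(toy_matrix)

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the two ratios sum to a value close to 1, the 2-D plot is a faithful summary; a low total suggests the scatter plot hides structure that exists in the full data.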
Trying even more clusters
While two clusters is interesting, it didn't tell us anything we don't already know.
More clusters could show wings of each party, or cross-party groups.
Let's try using 5 clusters to see what happens.
import pandas as pd
from sklearn.cluster import KMeans

kmeans_model = KMeans(n_clusters=5, random_state=1).fit(votes.iloc[:, 3:])
labels = kmeans_model.labels_

# The republicans are still pretty solid, but it looks like there are two democratic "factions"
print(pd.crosstab(labels, votes["party"]))
party   D  I   R
row_0
0       6  0   0
1       0  0  52
2      31  1   0
3       0  0   2
4       7  1   0
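Since the number of clusters has to be chosen by hand, one common heuristic (not used in the analysis above) is to fit the model for several values of k and compare the `inertia_` attribute, which scikit-learn defines as the sum of squared distances from each point to its nearest cluster center. The sketch below runs this on made-up 2-D blobs rather than the Senate data; `toy_data` and the range of k values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three made-up blobs of 2-D points centered at (0, 0), (3, 3) and (6, 6)
rng = np.random.default_rng(1)
toy_data = np.vstack([
    rng.normal(loc=center, scale=0.2, size=(30, 2))
    for center in (0.0, 3.0, 6.0)
])

# Fit k-means for several cluster counts and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(toy_data)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 2))
```

Inertia tends to drop sharply until k reaches the number of natural groups and then level off, so the "elbow" in these numbers is a rough guide rather than a definitive answer.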
More on k-means clustering
For more on k-means clustering, you can check out our Dataquest mission on k-means clustering.