
k-means clustering US Senators

Clustering is a powerful way to split up datasets into groups based on similarity. A very popular clustering algorithm is k-means clustering. In k-means clustering, we divide data up into a fixed number of clusters while trying to ensure that the items in each cluster are as similar as possible.
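
To make the idea concrete, here is a minimal sketch of the assign-then-update loop that k-means is built on. It is not the scikit-learn implementation used later in this post. The function name simple_kmeans, the toy vote matrix, and the hand-picked starting centers are all assumptions made for illustration.

```python
import numpy as np

def simple_kmeans(points, initial_centers, n_iter=10):
    """Toy k-means: alternate between assigning points and moving centers."""
    centers = np.asarray(initial_centers, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Assign each point to its nearest center
        labels = distances.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty)
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
    return labels, centers

# Made-up voting records: rows are voters, columns are votes (0 = No, 1 = Yes)
toy_votes = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

# Start from rows 0 and 2 so the run is deterministic (a simplification;
# real implementations pick starting centers more carefully)
toy_labels, toy_centers = simple_kmeans(toy_votes, toy_votes[[0, 2]])
print(toy_labels)
print(toy_centers)
```

Each pass can only keep or lower the total distance between points and their assigned centers, which is why the loop settles. The result can still depend on where the centers start.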

In this post, we'll cluster US Senators using an interactive Python environment. We'll use the voting history from the 114th Congress to split Senators into clusters.

Loading in the data

We have a csv file that contains all the votes from the 114th Senate. You can download the file here.

Each row contains the votes of an individual senator. Votes are coded as 0
for "No", 1 for "Yes", and 0.5 for "Abstain".

Here are the first three lines of the file (the header row and two senators):

name,party,state,00001,00004,00005,00006,00007,00008,00009,00010,00020,00026,00032,00038,00039,00044,00047
Alexander,R,TN,0,1,1,1,1,0,0,1,1,1,0,0,0,0,0
Ayotte,R,NH,0,1,1,1,1,0,0,1,0,1,0,1,0,1,0

We can read the csv file into Python using pandas.


import pandas as pd
# Read in the csv file
votes = pd.read_csv("114_congress.csv")

# As you can see, there are 100 senators, and they voted on 15 bills (we subtract 3 because the first 3 columns aren't bills).
print(votes.shape)

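# We have more "Yes" votes than "No" votes overall
# (the top-level pd.value_counts is deprecated in recent pandas, so we count through a Series)
print(pd.Series(votes.iloc[:, 3:].values.ravel()).value_counts())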

(100, 18)
1.0    803
0.0    669
0.5     28
dtype: int64

Initial k-means clustering

k-means clustering will try to make clusters out of the senators.

Each cluster will contain senators whose votes are as similar to each other as possible.

We'll need to specify the number of clusters we want upfront.

Let's try 2 to see how that looks.


import pandas as pd

# The k-means algorithm is implemented in the scikit-learn library
from sklearn.cluster import KMeans

# Create a kmeans model on our data, using 2 clusters.  random_state helps ensure that the algorithm returns the same results each time.
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(votes.iloc[:, 3:])

# These are our fitted labels for clusters -- the first cluster has label 0, and the second has label 1.
labels = kmeans_model.labels_

# The clustering looks pretty good!
# It's separated everyone into parties just based on voting history
print(pd.crosstab(labels, votes["party"]))

party   D  I   R
row_0           
0      41  2   0
1       3  0  54

Exploring people in the wrong cluster

We can now find out which senators are in the "wrong" cluster.

These senators are in the cluster associated with the opposite party.


# Let's call these types of voters "oddballs" (why not?)
# There aren't any Republican oddballs
democratic_oddballs = votes[(labels == 1) & (votes["party"] == "D")]

# It looks like Reid has abstained a lot, which changed his cluster.
# Manchin seems like a genuine oddball voter.
print(democratic_oddballs["name"])

42    Heitkamp
56     Manchin
74        Reid
Name: name, dtype: object
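
The remark about abstentions can be checked by counting the 0.5 entries in each row. The sketch below runs on a small made-up frame with the same column layout as the votes file. The names, parties, and votes in it are hypothetical, not real senators.

```python
import pandas as pd

# Hypothetical stand-in for the votes frame: same first three columns, made-up votes
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "party": ["D", "D", "R"],
    "state": ["XX", "YY", "ZZ"],
    "00001": [0.5, 1.0, 0.0],
    "00004": [0.5, 0.0, 1.0],
    "00005": [1.0, 0.0, 0.5],
})

# Count the 0.5 ("Abstain") entries in each row of the vote columns
abstentions = (toy.iloc[:, 3:] == 0.5).sum(axis=1)

# Put the counts next to the names, most abstentions first
toy_summary = toy[["name", "party"]].assign(abstentions=abstentions)
print(toy_summary.sort_values("abstentions", ascending=False))
```

Running the same two lines on the real votes frame should show whether a given oddball's row is dominated by 0.5 values.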

Plotting out the clusters

Let's explore our clusters a little more by plotting them out.

Each column of data is a dimension on a plot, and we can't visualize 15
dimensions.

We'll use principal component analysis to compress the vote columns into
two.

Then, we can plot out all of our senators according to their votes, and shade
them by their k-means cluster.


import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca_2 = PCA(2)

# Turn the vote data into two columns with PCA
plot_columns = pca_2.fit_transform(votes.iloc[:,3:18])

# Plot senators based on the two dimensions, and shade by the k-means cluster labels
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
plt.show()
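
Compressing 15 columns into two discards some information. One way to gauge how much is the explained_variance_ratio_ attribute of a fitted PCA object. The sketch below uses randomly generated votes for two made-up voting blocs rather than the real file, so the numbers it prints say nothing about the Senate.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up vote matrix: 40 "senators", 15 votes, two blocs that mostly vote opposite ways
rng = np.random.default_rng(0)
bloc_a = (rng.random((20, 15)) < 0.9).astype(float)
bloc_b = (rng.random((20, 15)) < 0.1).astype(float)
toy_votes = np.vstack([bloc_a, bloc_b])

pca = PCA(n_components=2)
projected = pca.fit_transform(toy_votes)

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(projected.shape)
```

If the two ratios add up to a small fraction, points that look close together on the 2D plot may still be far apart in the original vote space.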


Trying even more clusters

While splitting into two clusters is interesting, it didn't tell us anything we
didn't already know.

More clusters could show wings of each party, or cross-party groups.

Let's try using 5 clusters to see what happens.


import pandas as pd
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=5, random_state=1).fit(votes.iloc[:, 3:])
labels = kmeans_model.labels_

# The Republicans are still pretty solid, but it looks like there are two Democratic "factions"
print(pd.crosstab(labels, votes["party"]))

party   D  I   R
row_0           
0       6  0   0
1       0  0  52
2      31  1   0
3       0  0   2
4       7  1   0
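
Five clusters was an arbitrary pick. A common heuristic, which is an addition here rather than part of the analysis above, is to fit several values of k and compare the inertia_ attribute (the sum of squared distances from each point to its cluster center). The sketch below again uses randomly generated votes for two made-up blocs, not the real data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up vote matrix: two blocs of 20 "senators" voting on 15 bills
rng = np.random.default_rng(0)
bloc_a = (rng.random((20, 15)) < 0.9).astype(float)
bloc_b = (rng.random((20, 15)) < 0.1).astype(float)
toy_votes = np.vstack([bloc_a, bloc_b])

# Fit k-means for several cluster counts and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(toy_votes)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 2))
```

Inertia tends to fall as k grows, so the useful signal is where the drop flattens out rather than where the value is smallest.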

More on k-means clustering

For more on k-means clustering, you can check out our Dataquest mission on k-means clustering.