February 15, 2015

# Tutorial: K-Means Clustering US Senators

Clustering is a powerful way to split up datasets into groups based on similarity. A very popular clustering algorithm is K-means clustering. In K-means clustering, we divide data up into a fixed number of clusters while trying to ensure that the items in each cluster are as similar as possible. In this post, we’ll cluster US Senators using an interactive Python environment. We’ll use the voting history from the 114th Congress to split Senators into clusters.
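To make the idea concrete, here is a minimal sketch of the standard K-means loop (often called Lloyd's algorithm) in NumPy. It is not how scikit-learn implements it: the `kmeans` function, the `toy_votes` matrix, and the choice of starting centroids are all made up for illustration.

```python
import numpy as np

def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and moving centroids until labels stop changing."""
    centroids = centroids.astype(float).copy()
    labels = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so we have converged
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Made-up data: 6 hypothetical senators x 4 votes (1 = Yes, 0 = No, 0.5 = Abstain)
toy_votes = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0.5],
    [1, 0.5, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0.5],
    [0.5, 0, 1, 1],
])

# Start from the first two rows as centroids (an arbitrary choice for this sketch)
labels, centroids = kmeans(toy_votes, toy_votes[[0, 1]])
print(labels)
```

K-means can settle on a poor grouping when the starting centroids are unlucky, which is why library implementations typically use smarter initialization and several restarts.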

We have a CSV file that contains all the votes from the 114th Senate. You can download the file here. Each row contains the votes of an individual senator. Votes are coded as `0` for “No”, `1` for “Yes”, and `0.5` for “Abstain”. Here are the first three lines of the file (the header row and the first two senators):

``````
name,party,state,00001,00004,00005,00006,00007,00008,00009,00010,00020,00026,00032,00038,00039,00044,00047
Alexander,R,TN,0,1,1,1,1,0,0,1,1,1,0,0,0,0,0
Ayotte,R,NH,0,1,1,1,1,0,0,1,0,1,0,1,0,1,0
``````
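If your own source records votes as text, the numeric coding described above could be produced along these lines. This is only a sketch: the `raw` table, the senator names, and the `VOTE_CODES` mapping name are hypothetical, and the file used in this post is already encoded.

```python
import pandas as pd

# Hypothetical mapping that mirrors the coding used in this post
VOTE_CODES = {"No": 0.0, "Yes": 1.0, "Abstain": 0.5}

# Made-up raw data for one vote column
raw = pd.DataFrame({
    "name": ["Senator A", "Senator B", "Senator C"],
    "00001": ["No", "Yes", "Abstain"],
})

encoded = raw.copy()
encoded["00001"] = raw["00001"].map(VOTE_CODES)
print(encoded)
```

Coding an abstention as `0.5` places it halfway between “Yes” and “No”, so an abstaining senator is equally distant from both sides on that vote.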

We can read the csv file into Python using `pandas`.

``````
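import pandas as pd
# Read in the csv file (the filename is an assumption; point this at your downloaded copy)
votes = pd.read_csv("114_congress.csv")

# As you can see, there are 100 senators, and they voted on 15 bills (we subtract 3 because the first 3 columns aren't bills).
print(votes.shape)

# Count how often each vote value appears across all of the vote columns
print(pd.Series(votes.iloc[:, 3:].values.ravel()).value_counts())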
``````
``````
(100, 18)
1.0    803
0.0    669
0.5    28
dtype: int64
``````

## Initial k-means clustering

K-means clustering will try to make clusters out of the senators. Each cluster will contain senators whose votes are as similar to each other as possible. We’ll need to specify the number of clusters we want up front. Let’s try `2` to see how that looks.

``````
import pandas as pd

# The kmeans algorithm is implemented in the scikit-learn library
from sklearn.cluster import KMeans

# Create a kmeans model on our data, using 2 clusters. random_state helps ensure that the algorithm returns the same results each time.
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(votes.iloc[:, 3:])

# These are our fitted labels for clusters -- the first cluster has label 0, and the second has label 1.
labels = kmeans_model.labels_

# The clustering looks pretty good!
# It's separated everyone into parties just based on voting history
print(pd.crosstab(labels, votes["party"]))
``````
``````
party   D  I   R
row_0
0      41  2   0
1       3  0  54
``````

## Exploring people in the wrong cluster

We can now find out which senators are in the “wrong” cluster. These senators are in the cluster associated with the opposite party.

``````
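# Let's call these types of voters "oddballs" (why not?)
# There aren't any republican oddballs
# Cluster 1 is the mostly Republican cluster, so Democrats with label 1 are the oddballs
democratic_oddballs = votes[(labels == 1) & (votes["party"] == "D")]

# It looks like Reid has abstained a lot, which changed his cluster.
# Manchin seems like a genuine oddball voter.
print(democratic_oddballs["name"])
``````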
``````
42    Heitkamp
56     Manchin
74        Reid
Name: name, dtype: object
``````
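The filter above relies on knowing which label belongs to which party. A more general sketch, on made-up data, looks up the majority party of each cluster and flags anyone who doesn't match it. The `toy` table and its senator names are hypothetical.

```python
import pandas as pd

# Made-up data: six hypothetical senators with a party and a cluster label
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "party": ["D", "D", "D", "R", "R", "R"],
    "cluster": [0, 0, 1, 1, 1, 1],
})

# Count parties per cluster, then take the most common party in each cluster
counts = pd.crosstab(toy["cluster"], toy["party"])
majority_party = counts.idxmax(axis=1)

# A senator is an "oddball" if their party differs from their cluster's majority party
expected_party = toy["cluster"].map(majority_party)
oddballs = toy[toy["party"] != expected_party]
print(oddballs["name"].tolist())
```

Note that with this approach, members of a party that never forms a majority in any cluster (such as the two independents here) would always be flagged, so you may want to exclude them first.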

## Plotting out the clusters

Let’s explore our clusters a little more by plotting them out. Each column of data is a dimension on a plot, and we can’t visualize 15 dimensions. We’ll use principal component analysis to compress the vote columns into two. Then, we can plot out all of our senators according to their votes, and shade them by their K-means cluster.

``````
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca_2 = PCA(2)

# Turn the vote data into two columns with PCA
plot_columns = pca_2.fit_transform(votes.iloc[:, 3:])

# Plot senators based on the two dimensions, and shade by cluster label
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=labels)
plt.show()
``````
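Compressing 15 columns into two always loses some information. One way to check how much survives is PCA's `explained_variance_ratio_` attribute, sketched here on made-up data. The `toy_votes` matrix and the two voting blocs are invented for illustration and are not the Senate data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up data: 20 hypothetical senators x 15 votes, in two blocs that mostly vote opposite ways
bloc_a = (rng.random((10, 15)) < 0.9).astype(float)
bloc_b = (rng.random((10, 15)) < 0.1).astype(float)
toy_votes = np.vstack([bloc_a, bloc_b])

pca = PCA(n_components=2)
coords = pca.fit_transform(toy_votes)

# One (x, y) pair per senator
print(coords.shape)

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
```

If the two ratios add up to a small fraction of the total, the 2D plot is hiding a lot of the structure, and clusters that look overlapping on screen may still be well separated in the full vote space.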

## Trying even more clusters

While two clusters is interesting, it didn’t tell us anything we don’t already know. More clusters could show wings of each party, or cross-party groups. Let’s try using 5 clusters to see what happens.

``````
import pandas as pd
from sklearn.cluster import KMeans

kmeans_model = KMeans(n_clusters=5, random_state=1).fit(votes.iloc[:, 3:])
labels = kmeans_model.labels_

# The republicans are still pretty solid, but it looks like there are two democratic "factions"
print(pd.crosstab(labels, votes["party"]))
``````
``````
party   D  I   R
row_0
0       6  0   0
1       0  0  52
2      31  1   0
3       0  0   2
4       7  1   0
``````
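When trying different cluster counts, one rough signal is the model's `inertia_` (the sum of squared distances from each point to its cluster center), which shrinks as clusters are added. The sketch below compares a few values of `n_clusters` on made-up data. The three voting blocs and the `toy_votes` matrix are assumptions for illustration, not the Senate data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Made-up data: three hypothetical blocs of 10 senators, each with a different chance of voting Yes
yes_probabilities = [0.9, 0.5, 0.1]
toy_votes = np.vstack([
    (rng.random((10, 15)) < p).astype(float) for p in yes_probabilities
])

inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(toy_votes)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 2))
```

Inertia tends to keep dropping as clusters are added, so a lower value alone does not mean the extra clusters are meaningful. Looking at where the improvement levels off, and at who actually lands in each cluster, is usually more informative.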

## More on k-means clustering

For more on K-means clustering, you can check out our Dataquest lesson on K-means clustering.