K-Nearest Neighbors in Python
Ever wondered how your favorite streaming service knows exactly what to recommend next, or how online retailers seem to predict your shopping needs? At the core of these intelligent systems lies a simple yet powerful algorithm: K-Nearest Neighbors (KNN).
In this post, we'll demystify the KNN algorithm by taking a closer look at its mechanics and applying it to predict NBA players' scoring performances from the 2013-2014 season. Along the way, we'll learn about how we can use Euclidean distance to discover which players are "closest" to LeBron James. If you want to code along with me, you can grab the NBA dataset we'll be using in csv format here.
A Look at Our Data
Before we get into the particulars of the KNN algorithm, let's take a quick look at our NBA data. Each row in our dataset contains information on how an individual player performed in the 2013-2014 NBA season.
Here are some of the important columns from our dataset:
- `player` -- name of the player
- `pos` -- the position of the player
- `g` -- number of games the player was in
- `gs` -- number of games the player started
- `pts` -- total points the player scored
There are many more columns in the dataset, mostly containing information about average player game performance over the course of the season. See this site for a detailed explanation of the columns.
Next, we can read in our dataset and check out all the columns:
import pandas
with open("nba_2013.csv", 'r') as csvfile:
    nba = pandas.read_csv(csvfile)
print(f"Column names: {nba.columns.values}")
Column names: ['player' 'pos' 'age' 'bref_team_id' 'g' 'gs' 'mp' 'fg' 'fga' 'fg.' 'x3p' 'x3pa' 'x3p.' 'x2p' 'x2pa' 'x2p.' 'efg.' 'ft' 'fta' 'ft.' 'orb' 'drb' 'trb' 'ast' 'stl' 'blk' 'tov' 'pf' 'pts' 'season' 'season_end']
Based on the columns we're seeing, we clearly have a lot of data on each NBA player! So how are we going to use all this data to find players that are most similar to LeBron James? Let's take a step back for a moment and look at a more basic example before we go too far with our NBA data.
KNN Overview
The k-nearest neighbors algorithm is based around the simple idea of predicting unknown values by identifying observations (rows) in the dataset that are most similar to the one we're trying to predict.
For example, let's say that we have 3 different cars, and for each one we know the name (Car), the Horsepower, whether the car has Racing Stripes, and whether the car Is Fast.
Car | Horsepower | Racing Stripes | Is Fast |
---|---|---|---|
Honda Accord | 180 | False | False |
Yugo | 500 | True | True |
Delorean DMC-12 | 200 | True | True |
Now let's assume we have another car, but we don't know if it is fast or not:
Car | Horsepower | Racing Stripes | Is Fast |
---|---|---|---|
Chevrolet Camaro | 400 | True | Unknown |
1-Nearest Neighbors
In order to predict whether the Camaro is fast or not, we begin by finding the most similar known car in our dataset. In this case, we compare its `horsepower` and `racing_stripes` values to find the most similar car, which is the Yugo. Since the Yugo is fast, we would predict that the Camaro is also fast. This is an example of 1-nearest neighbors -- we only looked at the most similar car; in other words, we used a k of 1.
2-Nearest Neighbors
If we performed a 2-nearest neighbors prediction, we'd get two `True` values for Is Fast (one from the Delorean and one from the Yugo). Averaging these gives us `True`. Since the Delorean and Yugo are the two most similar cars, we used a k of 2 to predict that the Camaro is fast.
3-Nearest Neighbors
If we performed a 3-nearest neighbors prediction, we'd still have two `True` values (from the Delorean and Yugo) and one `False` value (from the Honda Accord) for Is Fast. Since the average of these values is still `True`, using a k of 3 also leads us to predict that the Camaro is fast.
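To make the averaging step concrete, here's a minimal sketch in Python (the neighbor ordering and the 1/0 encoding of True/False are just the toy values from the tables above):
# Illustrative only: the "Is Fast" labels of the Camaro's neighbors, ordered from
# most to least similar (Yugo, Delorean, Honda Accord), with True/False encoded as 1/0.
neighbor_labels = [1, 1, 0]
for k in (1, 2, 3):
    prediction = sum(neighbor_labels[:k]) / k
    print(f"k={k}: average of neighbor labels = {prediction:.2f}")
# Every average here is above 0.5, so all three values of k predict the Camaro is fast.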
Choosing a K for K-Nearest Neighbors
The number of neighbors (k) we use for a k-nearest neighbors prediction can be any value less than the number of rows in our dataset. In practice, sticking to just a few neighbors makes the algorithm work better, since adding less similar ones will ultimately hurt our predictions. A good starting point is to set k to a small odd number, like 3 or 5, to avoid ties, and adjust based on how well the model performs.
Euclidean Distance
Before we can predict using KNN, we need to find some way to figure out which data rows are "closest" to the row we're trying to predict on.
A simple way to do this is to find the Euclidean distance between two observations (rows) using the formula:
$$\sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_n-p_n)^2}$$
Where:
- $\boldsymbol{q_1}$ is the value in column $\boldsymbol{1}$ for row $\boldsymbol{q}$
- $\boldsymbol{p_n}$ is the value in column $\boldsymbol{n}$ for row $\boldsymbol{p}$.
Note: we'll need to convert any categorical variables into numeric values if we want to use them in this formula.
For example, let's say we have these two rows (where `True`/`False` have been converted to `1`/`0`), and we want to find the distance between these two cars:
Car | Horsepower | Is Fast |
---|---|---|
Honda Accord | 180 | 0 |
Chevrolet Camaro | 400 | 1 |
Then the distance between them is:
$$\sqrt{(180-400)^2 + (0-1)^2}$$
When we crunch the numbers, we find that the "distance" between the Honda Accord and the Chevrolet Camaro is about 220.
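We can double-check that arithmetic with a quick one-off calculation in Python:
import math
# Distance between the Honda Accord (180 horsepower, 0) and the Chevrolet Camaro (400 horsepower, 1)
print(math.sqrt((180 - 400) ** 2 + (0 - 1) ** 2))  # roughly 220.002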
In a similar way, we can use Euclidean distance to find the most similar NBA players to LeBron James by finding the players with the shortest distance to him.
# Select Lebron James from our dataset
selected_player = nba[nba["player"] == "LeBron James"].iloc[0]
# Choose only the numeric columns (we'll use these to compute Euclidean distance)
distance_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']
import math
def euclidean_distance(row):
"""
A simple Euclidean distance function
"""
inner_value = 0
for k in distance_columns:
inner_value += (row[k] - selected_player[k]) ** 2
return math.sqrt(inner_value)
# Find the distance from each player in the dataset to lebron.
lebron_distance = nba.apply(euclidean_distance, axis=1)
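If you'd like to peek at what this produces, you can look up the players with the smallest distances (LeBron himself appears first, at a distance of 0). Keep in mind these raw distances are dominated by columns with large values, which is exactly the issue we tackle next:
# Quick peek (illustrative): players with the smallest unnormalized distance to LeBron.
closest_raw = lebron_distance.nsmallest(6)
print(nba.loc[closest_raw.index, "player"])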
Normalizing Columns
You may have noticed that `horsepower` in the cars example above had a much larger impact on the final distance than `racing_stripes` did. This is because `horsepower` values are much larger in absolute terms, and thus hide the impact of `racing_stripes` values in the Euclidean distance calculations.
This can be an issue because just having larger values doesn’t mean a variable is more important or better at predicting which rows are similar—it just ends up dominating the calculation.
A simple way to deal with this problem is to normalize all the numeric columns to have a mean of 0 and a standard deviation of 1. This will level the playing field and ensure that no single column overpowers the Euclidean distance calculations.
How to Normalize
To set the mean to 0, we find the mean of the column, then subtract it from each value in that column. To set the standard deviation to 1, we then divide every value in the column by the column's standard deviation. The formula for normalizing a single value looks like this:
$$x_{norm}=\frac{x-\mu}{\sigma}$$
Where:
- $x$ is the original numeric value
- $\mu$ is the column's mean
- $\sigma$ is the column's standard deviation
Let's do that now for our NBA data by taking advantage of vectorized operations in pandas:
# Select only the numeric columns from the NBA dataset
nba_numeric = nba[distance_columns]
# Normalize all of the numeric columns
nba_normalized = (nba_numeric - nba_numeric.mean()) / nba_numeric.std()
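To confirm the transformation worked, we can check that each normalized column now has a mean of roughly 0 and a standard deviation of roughly 1:
# Each normalized column should now have a mean near 0 and a standard deviation near 1.
print(nba_normalized.mean().round(2).head())
print(nba_normalized.std().round(2).head())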
Finding the Nearest Neighbor
We now know enough to find the nearest neighbor of a given row in the NBA dataset. To make our Euclidean distance calculations much faster and easier, we can use the `distance.euclidean` function from `scipy.spatial`.
from scipy.spatial import distance
# Fill in NA values in nba_normalized
nba_normalized.fillna(0, inplace=True)
# Find the normalized vector for lebron james.
lebron_normalized = nba_normalized[nba["player"] == "LeBron James"].iloc[0]
# Find the distance between lebron james and everyone else.
euclidean_distances = nba_normalized.apply(lambda row: distance.euclidean(row, lebron_normalized), axis=1)
# Create a new dataframe with distances.
distance_frame = pandas.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
# The shortest distance to lebron is lebron, so the second shortest is the most similar non-lebron player
second_shortest = distance_frame.iloc[1]["idx"]
most_similar_to_lebron = nba.loc[int(second_shortest)]["player"]
print(f"Player most similar to LeBron James: {most_similar_to_lebron}")
Player most similar to LeBron James: Carmelo Anthony
Generating Training and Testing Sets
Now that we know how to find the nearest neighbors, we're ready to make predictions. Our goal is to predict how many points a player scored by using their 5 closest neighbors, based on similarity across all numeric columns in the dataset.
Before we do anything, we need to split the data into training and testing sets. This step can't be skipped because we don't want to test our predictions on the same data we used to train the model, which would lead to overfitting and misleadingly good results. To create these sets, we'll randomly shuffle the rows in the `nba` dataframe and split them into two groups: one for training the model and one for testing it.
While cross-validation is another option for splitting the data, it’s a bit more complex and not strictly necessary for this example. For now, random sampling will do the job while keeping things straightforward.
from numpy.random import permutation
# Fill NA values with column mean
nba = nba.fillna(nba.mean(numeric_only=True))
# Randomly shuffle the index of nba.
random_indices = permutation(nba.index)
# Set a cutoff for how many items we want in the test set (in this case 1/3 of the items)
test_cutoff = math.floor(len(nba)/3)
# Generate the test set by taking the first 1/3 of the randomly shuffled indices.
test = nba.loc[random_indices[:test_cutoff]]
# Generate the train set with the rest of the data.
train = nba.loc[random_indices[test_cutoff:]]
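As an aside, scikit-learn (which we'll use in the next section) has a helper that performs this kind of split in one line; here's an optional, roughly equivalent sketch:
# Optional alternative: let scikit-learn shuffle and split the data for us.
from sklearn.model_selection import train_test_split
train_alt, test_alt = train_test_split(nba, test_size=1/3, random_state=1)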
Using sklearn for K-Nearest Neighbors
Instead of having to do it all ourselves, we can use the k-nearest neighbors implementation in scikit-learn. Check out the official scikit-learn documentation for more details. There's a regressor and a classifier available, but we'll be using the regressor, as we have continuous values to predict on.
Scikit-learn handles the distance calculations automatically and lets us specify how many neighbors we want to look at. Note that it does not normalize the features for us, so if we want scaled inputs we still need to handle that step ourselves (or chain a scaler in front of the model).
# The columns that we will be making predictions with.
x_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf']
# The column that we want to predict.
y_column = ["pts"]
from sklearn.neighbors import KNeighborsRegressor
# Create the knn model.
# Look at the five closest neighbors.
knn = KNeighborsRegressor(n_neighbors=5)
# Fit the model on the training data.
knn.fit(train[x_columns], train[y_column])
# Make point predictions on the test set using the fit model.
predictions = knn.predict(test[x_columns])
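Since the regressor won't scale the features on its own, a common pattern is to chain a StandardScaler with KNeighborsRegressor in a pipeline. This is an optional sketch rather than something the rest of the post depends on:
# Optional: standardize the features and fit the regressor in a single pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
knn_scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_scaled.fit(train[x_columns], train[y_column])
scaled_predictions = knn_scaled.predict(test[x_columns])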
Computing Error
Now that we can make predictions about how many points a player scores, we can compute the error involved with our predictions by comparing them to our test set. We can compute the mean squared error using this formula:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y_{i}})^{2}$$
Where:
- $n$ is the number of predictions
- $y_{i}$ are the actual values from the test set
- $\hat{y_{i}}$ are our predictions
# Get the actual values from the test set.
actual = test[y_column]
# Compute the mean squared error of our predictions.
mse = (((actual - predictions) ** 2).sum()) / len(predictions)
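As a sanity check, scikit-learn's built-in helper computes the same quantity:
# Cross-check the manual calculation against scikit-learn's implementation.
from sklearn.metrics import mean_squared_error
print(mean_squared_error(actual, predictions))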
Next Steps
Try experimenting with different values for k to see if the MSE goes up (bad) or down (good). Also, we chose to fill missing values with column means; try dropping rows instead to see if you get better results.
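For example, a small loop like this (the particular values of k are arbitrary) refits the model for several neighbor counts and prints the test-set MSE for each:
# Refit the model for several values of k and compare the resulting test-set MSE.
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
for k in (1, 3, 5, 7, 9, 15):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(train[x_columns], train[y_column])
    preds = model.predict(test[x_columns])
    print(f"k={k}: MSE={mean_squared_error(test[y_column], preds):.1f}")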
For more on k-nearest neighbors, you can check out our five-part interactive Introduction to Supervised Machine Learning in Python course where you'll learn to train, validate, and improve various machine learning models.