# K nearest neighbors in python: A tutorial

In this post, we’ll be using the K-nearest neighbors algorithm to predict how many points NBA players scored in the 2013-2014 season. Along the way, we’ll learn about euclidean distance and figure out which NBA players are the most similar to Lebron James. If you want to follow along, you can grab the dataset in csv format here.

## A look at the data

Before we dive into the algorithm, let’s take a look at our data. Each row in the data contains information on how a player performed in the 2013-2014 NBA season.

Here are some selected columns from the data:

`player`

– name of the player`pos`

– the position of the player`g`

– number of games the player was in`gs`

– number of games the player started`pts`

– total points the player scored

There are many more columns in the data, mostly containing information about average player game performance over the course of the season. See this site for an explanation of the rest of them.

We can read our dataset in and figure out which columns are present:

```
['player' 'pos' 'age' 'bref_team_id' 'g' 'gs' 'mp' 'fg' 'fga' 'fg.' 'x3p'
'x3pa' 'x3p.' 'x2p' 'x2pa' 'x2p.' 'efg.' 'ft' 'fta' 'ft.' 'orb' 'drb'
'trb' 'ast' 'stl' 'blk' 'tov' 'pf' 'pts' 'season' 'season_end']
```

## KNN overview

The k-nearest neighbors algorithm is based around the simple idea of predicting unknown values by matching them with the most similar known values.

Let’s say that we have 3 different types of cars. We know the name of the car, its horsepower, whether or not it has racing stripes, and whether or not it’s fast.:

```
car,horsepower,racing_stripes,is_fast
Honda Accord,180,False,False
Yugo,500,True,True
Delorean DMC-12,200,True,True
```

Let’s say that we now have another car, but we don’t know how fast it is:

```
car,horsepower,racing_stripes,is_fast
Chevrolet Camaro,400,True,Unknown
```

We want to figure out if the car is fast or not. In order to predict if it is with k nearest neighbors, we first find the most similar known car. In this case, we would compare the `horsepower`

and `racing_stripes`

values to find the most similar car, which is the `Yugo`

. Since the Yugo is fast, we would predict that the Camaro is also fast. This is an example of 1-nearest neighbors – we only looked at the most similar car, giving us a k of 1.

If we performed a 2-nearest neighbors, we would end up with 2 `True`

values (for the Delorean and the Yugo), which would average out to `True`

. The Delorean and Yugo are the two most similar cars, giving us a k of 2.

If we did 3-nearest neighbors, we would end up with 2 `True`

values and a `False`

value, which would average out to `True`

.

The number of neighbors we use for k-nearest neighbors (k) can be any value less than the number of rows in our dataset. In practice, looking at only a few neighbors makes the algorithm perform better, because the less similar the neighbors are to our data, the worse the prediction will be.

## Euclidean distance

Before we can predict using KNN, we need to find some way to figure out which data rows are “closest” to the row we’re trying to predict on.

A simple way to do this is to use Euclidean distance. The formula is

Let’s say we have these two rows (True/False has been converted to 1/0), and we want to find the distance between them:

```
car,horsepower,is_fast
Honda Accord,180,0
Chevrolet Camaro,400,1
```

We would first only select the numeric columns. Then the distance becomes , which is about equal to `220`

.

We can use the principle of euclidean distance to find the most similar NBA players to Lebron James.

### Enjoying this post? Learn data science with Dataquest!

#####
- Learn from the comfort of your browser.
- Work with real-life data sets.
- Build a portfolio of projects.

Start for Free
## Normalizing columns

You may have noticed that `horsepower`

in the cars example had a much larger impact on the final distance than `racing_stripes`

did. This is because `horsepower`

values are much larger in absolute terms, and thus dwarf the impact of `racing_stripes`

values in the euclidean distance calculations.

This can be bad, because a variable having larger values doesn’t necessarily make it better at predicting what rows are similar.

A simple way to deal with this is to normalize all the columns to have a mean of 0, and a standard deviation of 1. This will ensure that no single column has a dominant impact on the euclidean distance calculations.

To set the mean to 0, we have to find the mean of a column, then subtract the mean from every value in the column. To set the standard deviation to 1, we divide every value in the column by the standard deviation. The formula is .

## Finding the nearest neighbor

We now know enough to find the nearest neighbor of a given row in the NBA dataset. We can use the `distance.euclidean`

function from `scipy.spatial`

, a much faster way to calculate euclidean distance.

## Generating training and testing sets

Now that we know how to find the nearest neighbors, we can make predictions on a test set. We’ll try to predict how many points a player scored using the `5`

closest neighbors. We’ll find neighbors by using all the numeric columns in the dataset to generate similarity scores.

First, we have to generate test and train sets. In order to do this, we’ll use random sampling. We’ll randomly shuffle the index of the `nba`

dataframe, and then pick rows using the randomly shuffled values.

If we didn’t do this, we’d end up predicting and training on the same data set, which would overfit. We could do cross validation also, which would be slightly better, but slightly more complex.

## Using sklearn for k nearest neighbors

Instead of having to do it all ourselves, we can use the k-nearest neighbors implementation in scikit-learn. Here’s the documentation. There’s a regressor and a classifier available, but we’ll be using the regressor, as we have continuous values to predict on.

Sklearn performs the normalization and distance finding automatically, and lets us specify how many neighbors we want to look at.

## Computing error

Now that we know our point predictions, we can compute the error involved with our predictions. We can compute mean squared error. The formula is .