January 12, 2023

A Gentle Introduction to Supervised Machine Learning (2023)

When we go to the grocery store, it doesn't take us long to identify the fruits we like among the many varieties on display. Our brains quickly process the incoming visual data and identify which fruit matches one we enjoy eating.

But how would a computer do it?

That's where Machine Learning comes in. We build machine learning models, feed them some input data and the model's algorithm processes that data to make a decision or a prediction. There are different types of machine learning:

  • Supervised Machine Learning: the model learns from labeled data.
  • Unsupervised Machine Learning: the model learns from unlabeled data.
  • Reinforcement Learning: the model learns by interacting with its environment and receiving a reward based on that interaction.

For this tutorial, we'll focus on Supervised Machine Learning.

In Supervised Machine Learning, models are given data that is already labeled. The models are then trained to learn which features of the data correspond to which label. When the trained model is given some new, unseen data, the model relies on what it has learned so far to make a prediction. There are two kinds of supervised machine learning models:

  • Classification
  • Regression

In our grocery store example, we would input data containing features for different fruits--such as their colors, shapes, and sizes. Each fruit in that data would have a corresponding label stating whether we like it. When we train the model, it would learn which of those features belong to a fruit we like and which belong to ones we don't like. The next time we show that trained model a fruit, it should predict for us, with some accuracy, whether it's a fruit we like. It would try to classify the fruit into a category. Such a model is called a classification model.

If we wanted to predict the price of a fruit, the labels the model would rely on to make a prediction would be different. The labels would not be categories or classes; instead they would be numbers. The model would then take in those features and try to learn what the price of a fruit might be. This model is called a regression model.
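To make the distinction concrete, here is a minimal, hypothetical sketch using scikit-learn. The fruit features and labels below are made up purely for illustration, and KNeighborsClassifier / KNeighborsRegressor are just one possible choice of estimator (we'll look at the K-Nearest Neighbors algorithm in detail later in this tutorial):

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical fruit data: [sweetness, size_cm] -- made up for illustration
features = [[7, 8], [6, 7], [2, 4], [3, 5]]

# Classification: labels are categories (1 = we like the fruit, 0 = we don't)
likes = [1, 1, 0, 0]
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(features, likes)
print(classifier.predict([[6, 6]]))   # predicts a class, e.g. 1

# Regression: labels are numbers (e.g. price per kg)
prices = [3.5, 3.0, 1.2, 1.5]
regressor = KNeighborsRegressor(n_neighbors=3)
regressor.fit(features, prices)
print(regressor.predict([[6, 6]]))    # predicts a number, e.g. roughly 2.7

The only difference between the two cases is the kind of label we train on: categories for classification, numbers for regression.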

Let's learn about a classification model based on the K-Nearest Neighbors algorithm.

The Machine Learning Workflow

The machine learning workflow doesn't just include building and training a model. There are several steps, from exploring and preparing the data to evaluating the trained model, that help ensure we're building a model that yields good results.

There are several resources we can rely on to find real-world data sets. Let's use one of these data sets and train a classifier on it!

We'll use the Bank Marketing data set to try to predict if a bank's customer will subscribe to one of the bank's products.

import pandas as pd

# load the data
banking_df = pd.read_csv("subscription_prediction.csv")
num_classes = len(banking_df["y"].unique())
print(f"The dataset has {banking_df.shape[1]} features and {banking_df.shape[0]} observations")
print(f"The dataset has {num_classes} classes")
banking_df.head()
The dataset has 21 features and 10122 observations
The dataset has 2 classes
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 40 admin. married basic.6y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 56 services married high.school no no yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 41 blue-collar married unknown unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 57 housemaid divorced basic.4y no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 39 management single basic.9y unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no

5 rows × 21 columns

Data Exploration and Wrangling

This step helps ensure we use relevant features to train our model on. Exploring and cleaning the data set can allow us to find connections between different features and the output classes. We can then select some of those relevant features to train our model on.

For example, we can look at how well the features are correlated to the output.

# Convert output categories into binary labels
banking_df["y"] = banking_df["y"].apply(lambda x: 1 if x=="yes" else 0)

# Calculate correlation between the numerical features
# (numeric_only=True avoids errors from non-numeric columns in newer pandas versions)
correlations = abs(banking_df.corr(numeric_only=True))

# Identify top 5 features, excluding y itself, that correlate strongly with y.
top_5_features = correlations["y"].sort_values(ascending=False)[1:6].index

print(correlations["y"].sort_values(ascending=False)[1:6])
nr.employed     0.468524
duration        0.468197
euribor3m       0.445328
emp.var.rate    0.429680
pdays           0.317997
Name: y, dtype: float64

Relative to the other features, the numerical features above correlate most strongly with the output label. We can use some or all of them to train our model.

Data Preparation

We need to transform our features so they can be effectively used to train the model. This process of transforming those features is called feature engineering.

Numerical features can have a wide range of values. One feature with larger values could impact our model's performance a lot more than intended. We can normalize our features by rescaling their values to a specific range, such as [0, 1]; this is called min-max scaling or min-max normalization.
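As a quick sketch of what min-max scaling does (the toy ages below are made up for illustration), each value is rescaled as (x - min) / (max - min):

# Minimal sketch of min-max scaling: x_scaled = (x - min) / (max - min)
ages = [25, 40, 55, 70]

min_age, max_age = min(ages), max(ages)
scaled_ages = [(age - min_age) / (max_age - min_age) for age in ages]

print(scaled_ages)  # [0.0, 0.333..., 0.666..., 1.0]

After scaling, every value falls between 0 and 1, so no single feature dominates simply because it is measured on a larger scale.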

Categorical features need to be transformed as well. A string value representing the color of a fruit can't be interpreted directly by a model. However, we can create a new binary column for each category and mark it with a 1 if the observation belongs to that category and a 0 otherwise. This process is known as one-hot encoding, and the new columns are referred to as dummy variables. The following table depicts this transformation:

Marital     Divorced   Married   Single   Unknown
Divorced    1          0         0        0
Married     0          1         0        0
Single      0          0         1        0
Unknown     0          0         0        1

The Marital column lists the category for each observation. The rest of the columns store a 0 or 1, depending on the category for that observation.
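In pandas, this kind of transformation can be done with get_dummies. A minimal sketch (the sample values below mirror the marital column in our data set):

import pandas as pd

# One-hot encode the marital column into dummy variables
# (dtype=int gives 0/1 columns instead of booleans)
marital = pd.DataFrame({"marital": ["divorced", "married", "single", "unknown"]})
dummies = pd.get_dummies(marital, columns=["marital"], dtype=int)
print(dummies)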

Once we have our features ready, we can split the data into training, validation, and test sets. We train the model on the training set and then evaluate it on the validation set. Based on that evaluation, we fine-tune the model to try to improve its performance. We then make a final evaluation on the test set.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Divide dataset into features and label columns
X = banking_df.drop(["y"], axis=1)
y = banking_df["y"]

# Split the dataset: first hold out 20% for validation, then carve out another
# 20% of the original data for testing, leaving roughly 60% for training
X_train, X_val, y_train, y_val = train_test_split(X[top_5_features], y, test_size=0.20, random_state=417)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.20*X.shape[0]/X_train.shape[0], random_state=417)

# Normalize the dataset
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

Building and Training a Model: K-Nearest Neighbors

Let's imagine a plot of some data points corresponding to two features, Campaign and Age. Each data point is a customer who has either subscribed to a product (in purple) or one who hasn't (in blue).

The point in red is a new customer. We want to predict whether the new customer will subscribe to the product, based on the provided information. How can we do so with information from just two features?

One way is to calculate the distance of that uncategorized customer from all the other points and look at the ones closest to it. If a majority of the points, or customers, closest to it have subscribed to the product, we can classify the new customer as one who is likely to subscribe as well. If the majority of customers closest to it are not subscribed, we can say that the new customer is unlikely to subscribe. The distance between the data points can be calculated using a distance metric such as the Euclidean or the Manhattan distance.
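As a minimal sketch (the two points below are made up), the Euclidean distance is the straight-line distance between two points, while the Manhattan distance sums the absolute differences along each feature:

import numpy as np

# Two hypothetical customers described by two features (e.g. campaign, age)
a = np.array([2.0, 35.0])
b = np.array([5.0, 41.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(3^2 + 6^2), about 6.71
manhattan = np.sum(np.abs(a - b))           # |3| + |6| = 9

print(euclidean, manhattan)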

We can try to classify new data points by looking at how close they are to other data points and what labels those nearby points have. This is the K-Nearest Neighbors (KNN) algorithm, where K is the number of neighbors we look at in relation to the new data point.
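To make the idea concrete, here is a minimal from-scratch sketch of the prediction step. The toy points and labels are made up for illustration; scikit-learn's implementation, used below, is what we'll actually rely on:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(np.sum((X_train - new_point) ** 2, axis=1))
    # Labels of the k closest training points
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote among those neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two features per customer, label 1 = subscribed, 0 = not subscribed
X_toy = np.array([[1, 30], [2, 35], [8, 60], [9, 65]])
y_toy = np.array([1, 1, 0, 0])
print(knn_predict(X_toy, y_toy, np.array([2, 32])))  # likely predicts 1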

We will build and train a KNN-based classifier on our training data.

from sklearn.neighbors import KNeighborsClassifier

num_neighbors = 3

# Instantiate the model
knn = KNeighborsClassifier(n_neighbors = num_neighbors)

# Train or fit the model to training data
knn.fit(X_train_scaled, y_train)
KNeighborsClassifier(n_neighbors=3)

Evaluating and Fine-Tuning the Model

We can now evaluate our model on the validation set and then fine-tune it!

Since we are evaluating a classifier, we need to know how accurately it predicts whether a customer will subscribe to a product. We'll use accuracy as the metric to evaluate our model's performance.
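Accuracy is simply the fraction of predictions the model gets right. The knn.score method below computes it for us, but as a quick sketch using scikit-learn's accuracy_score (assuming the knn, scaler, and X_val objects defined above):

from sklearn.metrics import accuracy_score

# Accuracy = correct predictions / total predictions
y_pred = knn.predict(scaler.transform(X_val))
print(accuracy_score(y_val, y_pred))   # same value knn.score() reports
print((y_pred == y_val).mean())        # equivalent manual calculation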

# Normalize the validation set
X_val_scaled = scaler.transform(X_val)

# Evaluate the model on validation set
val_accuracy = knn.score(X_val_scaled, y_val)
print(val_accuracy)
0.8632098765432099

That's about 86% accuracy on the validation set. Let's see if we can improve on it by fine-tuning the model.

Fine-tuning can involve selecting new features, trying out different feature engineering approaches, or experimenting with model hyperparameters to get the model to perform better.

Model hyperparameters are certain parameters that we can set or input ourselves when training machine learning models. There are different hyperparameters that we can play around with for KNNs, such as:

  • What value to select for K.
  • What distance metric to use.
Let's try a few values of K, using the Euclidean distance as our metric:

num_neighbors = [num for num in range(1, 6)]

# Iterate over different Ks
for neighbors in num_neighbors:

    # Instantiate the model
    knn = KNeighborsClassifier(n_neighbors = neighbors, metric = "euclidean")

    # Train or fit the model to training data
    knn.fit(X_train_scaled, y_train)

    # Evaluate the model on validation set
    val_accuracy = knn.score(X_val_scaled, y_val)
    print(f"Model accuracy when K = {neighbors}: {val_accuracy}")
Model accuracy when K = 1: 0.8385185185185186
Model accuracy when K = 2: 0.8093827160493827
Model accuracy when K = 3: 0.8632098765432099
Model accuracy when K = 4: 0.8612345679012345
Model accuracy when K = 5: 0.8671604938271605

We only see a marginal improvement in our accuracy: roughly 86.7% with K = 5, compared to 86.3% with K = 3.

This is often an iterative and experimental process, and there are a large number of permutations and combinations we can try. We can streamline the search with approaches like grid search, in which we specify a subset of the hyperparameter space to search across, and the grid search algorithm automatically finds the hyperparameters that yield the best results.
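A minimal sketch with scikit-learn's GridSearchCV (the parameter grid below is just an illustrative choice):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameter values to search over (an illustrative subset)
param_grid = {
    "n_neighbors": range(1, 11),
    "metric": ["euclidean", "manhattan"],
}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy", cv=5)
grid_search.fit(X_train_scaled, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)

Note that GridSearchCV evaluates each combination with cross-validation on the training data rather than on our separate validation set, which is a common alternative way to organize the tuning step.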

Evaluate the Model on a Test Set

We identified the hyperparameters that resulted in the best-performing model on the validation set. We'll use those same hyperparameters to train the model and then evaluate it on the test set.

# Normalize the test set
X_test_scaled = scaler.transform(X_test)

num_neighbors = 5

# Instantiate the model
knn = KNeighborsClassifier(n_neighbors = num_neighbors, metric = "euclidean")

# Train or fit the model to training data
knn.fit(X_train_scaled, y_train)

# Evaluate the model on test set
test_accuracy = knn.score(X_test_scaled, y_test)

print(test_accuracy)
0.865679012345679

Our model achieved about 86.6% accuracy on the test set, in line with its performance on the validation set.

Conclusion

This tutorial gave us a brief overview of Supervised Machine Learning, specifically a classification model based on the K-Nearest Neighbors algorithm. We implemented it on a real-world data set while following a workflow designed for machine learning projects.

If you’d like to explore more on this particular topic, please check out Dataquest's Introduction to Supervised Machine Learning in Python course. Alternatively, you can take our Machine Learning in Python Path, which will help you master the skills in approximately two months.

About the author

Sahil Juneja

Sahil is a content developer with experience in creating courses on topics related to data science, deep learning and robotics. You can connect with him on LinkedIn.
