October 11, 2022

Regularization in Machine Learning (with Code Examples)

In this tutorial, we'll learn what regularization is and why we use it. We'll also discuss regularization techniques and how to use them.

After you finish this tutorial, you'll understand the following:

  • Regularization in machine learning
  • L1 regularization (lasso regression)
  • L2 regularization (ridge regression)
  • Elastic Net
  • How to use these regularization techniques

In this tutorial, we assume you know the fundamentals of machine learning, including the basic concepts of linear regression. If you're not familiar with machine learning or are eager to refresh your machine learning skills, you might like to try our Data Scientist in Python Career Path.

Introduction

We use regularization techniques to prevent overfitting in our machine learning models. Before discussing regularization in more detail, let's discuss overfitting.

Overfitting happens when a machine learning model fits the training data too tightly and tries to learn every detail in it; as a result, the model cannot generalize well to unseen data. The following illustration, called the generalization curve, shows that the training loss keeps decreasing as the number of training iterations increases:

Although minimizing the training loss is a good thing, the validation loss starts to increase after a specific number of iterations. The increasing trend of the validation loss means that while we are trying to reduce the training loss, we increase the model's complexity, so it cannot generalize to new data points.

In other words, a high-variance machine learning model captures all the details of the training data along with the noise in it. So, as you've seen in the generalization curve, the gap between training loss and validation loss becomes more and more noticeable. In contrast, a high-bias machine learning model fits the training data only loosely, which leads to a small difference between training loss and validation loss.

So far, we've learned that preventing overfitting is crucial to improve the performance of our machine learning model. In the following sections, we'll learn about regularization and its techniques.

What Is Regularization?

Regularization means restricting a model to avoid overfitting by shrinking the coefficient estimates toward zero. When a model suffers from overfitting, we should control the model's complexity. Technically, regularization avoids overfitting by adding a penalty to the model's loss function:

$$\text{Regularized Cost Function} = \text{Loss Function} + \text{Penalty}$$

There are three commonly used regularization techniques to control the complexity of machine learning models, as follows:

  • L2 regularization
  • L1 regularization
  • Elastic Net

Let’s discuss these standard techniques in detail.

L2 Regularization

A linear regression that uses the L2 regularization technique is called ridge regression. In other words, in ridge regression, a regularization term is added to the cost function of linear regression to keep the magnitude of the model's weights (coefficients) as small as possible. The L2 penalty pushes the weights close to zero, but not exactly zero, which means each feature keeps a small influence on the output while the model's accuracy stays as high as possible.

$$\text{Ridge Regression Cost Function} = \text{Loss Function} + \dfrac{1}{2} \lambda\sum_{j=1}^m w_j^2$$

Where $\lambda$ controls the strength of regularization, and $w_j$ are the model's weights (coefficients).

Increasing $\lambda$ shrinks the weights further and can make the model underfit. On the other hand, decreasing $\lambda$ lets the model fit the training data more closely and can lead to overfitting, and with $\lambda = 0$, the regularization term is eliminated entirely.
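To make the formula concrete, here's a minimal NumPy sketch of the ridge cost for a linear model. The ridge_cost function and its arguments are purely illustrative, not part of any library, and the intercept is omitted for simplicity:

import numpy as np

def ridge_cost(X, y, w, lam):
    # Mean squared error loss for predictions X @ w (no intercept for simplicity)
    mse = np.mean((X @ w - y) ** 2)
    # L2 penalty: (1/2) * lambda * sum of squared weights
    l2_penalty = 0.5 * lam * np.sum(w ** 2)
    return mse + l2_penalty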

L1 Regularization

Least Absolute Shrinkage and Selection Operator (lasso) regression is an alternative to ridge for regularizing linear regression. Lasso regression also adds a penalty term to the cost function, but a slightly different one, called L1 regularization. L1 regularization drives some coefficients to exactly zero, meaning the model ignores those features. Ignoring the least important features helps emphasize the model's essential features.

$$\text{Lasso Regression Cost Function} = \text{Loss Function} + \lambda \sum_{j=1}^m |w_j|$$

Where $\lambda$ controls the strength of regularization, and $w_j$ are the model's weights (coefficients).

Lasso regression automatically performs feature selection by eliminating the least important features.
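Following the same illustrative sketch as before (again, the lasso_cost function is just for demonstration, not library code), the lasso cost differs only in the penalty term, which sums absolute values instead of squares:

import numpy as np

def lasso_cost(X, y, w, lam):
    # Mean squared error loss for predictions X @ w (no intercept for simplicity)
    mse = np.mean((X @ w - y) ** 2)
    # L1 penalty: lambda * sum of absolute weights
    l1_penalty = lam * np.sum(np.abs(w))
    return mse + l1_penalty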

Elastic Net

The Elastic Net is a regularized regression technique combining ridge and lasso's regularization terms. The $r$ parameter controls the combination ratio. When $r=1$, the L2 term is eliminated, and when $r=0$, the L1 term is removed.

$$\text{Elastic Net Cost Function} = \text{Loss Function} + r \lambda \sum_{j=1}^m |w_j| + \dfrac{(1-r)}{2} \lambda \sum_{j=1}^m w_j^2$$

Although combining the penalties of lasso and ridge usually works better than only using one of the regularization techniques, adjusting two parameters, $\lambda$ and $r$, is a little tricky.
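Continuing the same illustrative sketch, the elastic net cost simply mixes the two penalties with the ratio r:

import numpy as np

def elastic_net_cost(X, y, w, lam, r):
    # Mean squared error loss for predictions X @ w (no intercept for simplicity)
    mse = np.mean((X @ w - y) ** 2)
    # r=1 gives the pure L1 (lasso) penalty, r=0 gives the pure L2 (ridge) penalty
    penalty = r * lam * np.sum(np.abs(w)) + 0.5 * (1 - r) * lam * np.sum(w ** 2)
    return mse + penalty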

Demonstrating Regularization Techniques with Python

In this section, we're going to apply the L2, L1, and elastic net regularization techniques to the Boston Housing dataset and compare the training set score and test set score before and after using these techniques.

First, let's load the dataset and split it into a training set and a test set, as follows:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from numpy import genfromtxt

# Load the extended Boston Housing dataset; the last column is the target
dataset = genfromtxt('https://raw.githubusercontent.com/m-mehdi/tutorials/main/boston_housing.csv', delimiter=',')
X = dataset[:,:-1]
y = dataset[:,-1]

# Hold out 25% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

Now, we can train the linear regression model and then print the training set score and the test set score:

lr = LinearRegression().fit(X_train, y_train)

print(f"Linear Regression-Training set score: {lr.score(X_train, y_train):.2f}")
print(f"Linear Regression-Test set score: {lr.score(X_test, y_test):.2f}")
    Linear Regression-Training set score: 0.95
    Linear Regression-Test set score: 0.61

Comparing the model performance on the training set and the test set reveals that the model suffers from overfitting.

To avoid overfitting and control the complexity of the model, let's use ridge regression (L2 regularization) and see how well it does on the dataset:

from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.7).fit(X_train, y_train)
print(f"Ridge Regression-Training set score: {ridge.score(X_train, y_train):.2f}")
print(f"Ridge Regression-Test set score: {ridge.score(X_test, y_test):.2f}")
    Ridge Regression-Training set score: 0.90
    Ridge Regression-Test set score: 0.76

Although the training set score of ridge regression is slightly lower than the linear regression training score, the test set score of ridge is significantly higher than the linear regression test set score. These scores confirm that ridge regression reduces the model's complexity, leading to a model that overfits less and generalizes better.

The alpha parameter specifies a trade-off between the model's performance on the training set and its simplicity. So, increasing the alpha value (its default value is 1.0) simplifies the model by shrinking the coefficients.
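To see this trade-off in action, here's a small sketch that reuses the training and test split from above and fits ridge regression with a few example alpha values (the values themselves are arbitrary):

from sklearn.linear_model import Ridge

# Fit ridge regression with several alpha values and compare the scores
for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha}: training score={ridge.score(X_train, y_train):.2f}, "
          f"test score={ridge.score(X_test, y_test):.2f}")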

Now, let's apply the lasso regression to the dataset and explore the results.

from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print(f"Lasso Regression-Training set score: {lasso.score(X_train, y_train):.2f}")
print(f"Lasso Regression-Test set score: {lasso.score(X_test, y_test):.2f}")
    Lasso Regression-Training set score: 0.29
    Lasso Regression-Test set score: 0.21

As shown, lasso performs quite disappointingly, which is a sign of underfitting. The lasso model doesn't work well because most of the coefficients have become exactly zero. If we want to know the exact number of features that have been used in the model, we can use the following code:

print(f"Number of features: {sum(lasso.coef_ != 0)}")
    Number of features: 4

This means that only 4 of the 104 features in the training set are used in the lasso regression model, while the rest are ignored.

Let's adjust alpha to reduce underfitting by decreasing its value to 0.01:

lasso = Lasso(alpha=0.01).fit(X_train, y_train)
print(f"Lasso Regression-Training set score: {lasso.score(X_train, y_train):.2f}")
print(f"Lasso Regression-Test set score: {lasso.score(X_test, y_test):.2f}")
    Lasso Regression-Training set score: 0.90
    Lasso Regression-Test set score: 0.77
    /Users/mohammadmehdi/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.690e+01, tolerance: 3.233e+00
      model = cd_fast.enet_coordinate_descent(

Rerunning the feature-count code shows that by decreasing alpha, the lasso model now uses 32 of the 104 features:

print(f"Number of features: {sum(lasso.coef_ != 0)}")
    Number of features: 32

Although we can reduce alpha even more, it seems that its optimum value is 0.01.
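Rather than guessing, we can also let cross-validation pick alpha. Here's a hedged sketch using scikit-learn's LassoCV on the same training split; the candidate alpha grid is just an example:

from sklearn.linear_model import LassoCV

# Search a small grid of alpha values with 5-fold cross-validation on the training set
lasso_cv = LassoCV(alphas=[1.0, 0.1, 0.01, 0.001], cv=5, max_iter=100000).fit(X_train, y_train)
print(f"Best alpha: {lasso_cv.alpha_}")
print(f"Lasso CV-Test set score: {lasso_cv.score(X_test, y_test):.2f}")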

The last technique that we're going to use is elastic net. Let's see how well it does.

from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.01, l1_ratio=0.01).fit(X_train, y_train)
print(f"Elastic Net-Training set score: {elastic_net.score(X_train, y_train):.2f}")
print(f"Elastic Net-Test set score: {elastic_net.score(X_test, y_test):.2f}")
    Elastic Net-Training set score: 0.84
    Elastic Net-Test set score: 0.70
    /Users/mohammadmehdi/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.474e+02, tolerance: 3.233e+00
      model = cd_fast.enet_coordinate_descent(
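As mentioned earlier, tuning alpha and l1_ratio together is a little tricky. One common approach is a small grid search; the sketch below uses scikit-learn's GridSearchCV with an example parameter grid over the same training split:

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Try a few combinations of alpha and l1_ratio with 5-fold cross-validation
param_grid = {"alpha": [1.0, 0.1, 0.01], "l1_ratio": [0.01, 0.1, 0.5, 0.9]}
search = GridSearchCV(ElasticNet(max_iter=100000), param_grid, cv=5).fit(X_train, y_train)
print(f"Best parameters: {search.best_params_}")
print(f"Elastic Net-Test set score: {search.score(X_test, y_test):.2f}")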

NOTE

In general, to avoid overfitting, regularized models are preferable to a plain linear regression model. In most scenarios, ridge works well. But if you're not sure whether to use lasso or elastic net, elastic net is often the better choice because lasso can behave erratically when features are strongly correlated, arbitrarily keeping one of them and dropping the others.


Conclusion

This tutorial explored different ways of avoiding overfitting in linear machine learning models. We discussed why overfitting happens and what the ridge, lasso, and elastic net regression methods are. We also applied these techniques to the Boston Housing dataset and compared the results. Other techniques, such as early stopping and dropout, can also be used to regularize complex models; the latter is mainly used for regularizing artificial neural networks.

I hope that you have learned something new today. Feel free to connect with me on LinkedIn or Twitter.

Mehdi Lotfinejad

About the author

Mehdi Lotfinejad

Mehdi is a Senior Data Engineer and Team Lead at ADA. He is a professional trainer who loves writing data analytics tutorials.