October 11, 2022

Regularization in Machine Learning (with Code Examples)

In this tutorial, we'll learn what regularization is and why we use it. We'll also discuss regularization techniques and how to use them.

After you finish this tutorial, you'll understand the following:

  • Regularization in machine learning
  • L1 regularization (lasso regression)
  • L2 regularization (ridge regression)
  • Elastic Net
  • How to use these regularization techniques

In this tutorial, we assume you know the fundamentals of machine learning, including the basic concepts of linear regression. If you're not familiar with machine learning or are eager to refresh your machine learning skills, you might like to try our Data Scientist in Python Career Path.

Introduction

We use regularization techniques to prevent overfitting in our machine learning models. Before discussing regularization in more detail, let's discuss overfitting.

Overfitting happens when a machine learning model fits the training data too tightly and tries to learn every detail in it; as a result, the model cannot generalize well to unseen data. The following illustration, called the generalization curve, shows that the training loss keeps decreasing as the number of training iterations increases:

Although minimizing the training loss is a good thing, the validation loss starts to increase after a specific number of iterations. The increasing trend of the validation loss means that while we are trying to reduce the training loss, we increase the model's complexity, so it cannot generalize to new data points.

In other words, a high-variance machine learning model captures all the details of the training data along with the noise in it. So, as you've seen in the generalization curve, the gap between training loss and validation loss becomes more and more noticeable. In contrast, a high-bias machine learning model fits the training data only loosely, which leads to a small difference between training loss and validation loss.

So far, we've learned that preventing overfitting is crucial to improve the performance of our machine learning model. In the following sections, we'll learn about regularization and its techniques.

What Is Regularization?

Regularization means restricting a model to avoid overfitting by shrinking the coefficient estimates toward zero. When a model suffers from overfitting, we should control the model's complexity. Technically, regularization avoids overfitting by adding a penalty to the model's loss function:

$$\text{Regularized Cost Function} = \text{Loss Function} + \text{Penalty}$$

There are three commonly used regularization techniques to control the complexity of machine learning models, as follows:

  • L2 regularization
  • L1 regularization
  • Elastic Net

Let’s discuss these standard techniques in detail.

L2 Regularization

A linear regression that uses the L2 regularization technique is called ridge regression. In other words, in ridge regression, a regularization term is added to the cost function of linear regression to keep the magnitude of the model's weights (coefficients) as small as possible. The L2 penalty pushes the weights close to zero, but not exactly zero, which means each feature keeps a small influence on the output while the model's accuracy stays as high as possible.

$$\text{Ridge Regression Cost Function} = \text{Loss Function} + \dfrac{1}{2} \lambda\sum_{j=1}^m w_j^2$$

Where $\lambda$ controls the strength of regularization, and $w_j$ are the model's weights (coefficients).

Increasing $\lambda$ shrinks the weights further and can make the model underfit. On the other hand, decreasing $\lambda$ lets the model fit the training data more closely and can lead to overfitting, and with $\lambda = 0$, the regularization term is eliminated entirely.
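To make the formula concrete, here's a minimal NumPy sketch of the ridge cost for a linear model. The ridge_cost function and its arguments are purely illustrative, not part of any library, and the intercept is omitted for simplicity:

import numpy as np

def ridge_cost(X, y, w, lam):
    # Mean squared error loss for predictions X @ w (no intercept for simplicity)
    mse = np.mean((X @ w - y) ** 2)
    # L2 penalty: (1/2) * lambda * sum of squared weights
    l2_penalty = 0.5 * lam * np.sum(w ** 2)
    return mse + l2_penalty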

L1 Regularization

Least Absolute Shrinkage and Selection Operator (lasso) regression is an alternative to ridge for regularizing linear regression. Lasso regression also adds a penalty term to the cost function, but a slightly different one, called L1 regularization. L1 regularization drives some coefficients to exactly zero, meaning the model ignores those features. Ignoring the least important features helps emphasize the model's essential features.

$$\text{Lasso Regression Cost Function} = \text{Loss Function} + \lambda \sum_{j=1}^m |w_j|$$

Where $\lambda$ controls the strength of regularization, and $w_j$ are the model's weights (coefficients).

Lasso regression automatically performs feature selection by eliminating the least important features.
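Following the same illustrative sketch as before (again, the lasso_cost function is just for demonstration, not library code), the lasso cost differs only in the penalty term, which sums absolute values instead of squares:

import numpy as np

def lasso_cost(X, y, w, lam):
    # Mean squared error loss for predictions X @ w (no intercept for simplicity)
    mse = np.mean((X @ w - y) ** 2)
    # L1 penalty: lambda * sum of absolute weights
    l1_penalty = lam * np.sum(np.abs(w))
    return mse + l1_penalty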

Elastic Net

The Elastic Net is a regularized regression technique combining ridge and lasso's regularization terms. The $r$ parameter controls the combination ratio. When $r=1$, the L2 term is eliminated, and when $r=0$, the L1 term is removed.

$$\text{Elastic Net Cost Function} = \text{Loss Function} + r \lambda \sum_{j=1}^m |w_j| + \dfrac{(1-r)}{2} \lambda \sum_{j=1}^m w_j^2$$

Although combining the penalties of lasso and ridge usually works better than only using one of the regularization techniques, adjusting two parameters, $\lambda$ and $r$, is a little tricky.
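Continuing the same illustrative sketch, the elastic net cost simply mixes the two penalties with the ratio r:

import numpy as np

def elastic_net_cost(X, y, w, lam, r):
    # Mean squared error loss for predictions X @ w (no intercept for simplicity)
    mse = np.mean((X @ w - y) ** 2)
    # r=1 gives the pure L1 (lasso) penalty, r=0 gives the pure L2 (ridge) penalty
    penalty = r * lam * np.sum(np.abs(w)) + 0.5 * (1 - r) * lam * np.sum(w ** 2)
    return mse + penalty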

Demonstrating Regularization Techniques with Python

In this section, we're going to apply the L2, L1, and elastic net regularization techniques to the Boston Housing dataset and compare the training set score and test set score before and after using these techniques.

First, let's load the dataset and split it into a training set and a test set, as follows:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from numpy import genfromtxt

# Load the extended Boston Housing dataset; the last column is the target
dataset = genfromtxt('https://raw.githubusercontent.com/m-mehdi/tutorials/main/boston_housing.csv', delimiter=',')
X = dataset[:,:-1]
y = dataset[:,-1]

# Hold out 25% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

Now, we can train the linear regression model and then print the training set score and the test set score:

lr = LinearRegression().fit(X_train, y_train)

print(f"Linear Regression-Training set score: {lr.score(X_train, y_train):.2f}")
print(f"Linear Regression-Test set score: {lr.score(X_test, y_test):.2f}")
    Linear Regression-Training set score: 0.95
    Linear Regression-Test set score: 0.61

Comparing the model performance on the training set and the test set reveals that the model suffers from overfitting.

To avoid overfitting and control the complexity of the model, let's use ridge regression (L2 regularization) and see how well it does on the dataset:

from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.7).fit(X_train, y_train)
print(f"Ridge Regression-Training set score: {ridge.score(X_train, y_train):.2f}")
print(f"Ridge Regression-Test set score: {ridge.score(X_test, y_test):.2f}")
    Ridge Regression-Training set score: 0.90
    Ridge Regression-Test set score: 0.76

Although the training set score of ridge regression is slightly lower than the linear regression training score, the test set score of ridge is significantly higher than the linear regression test set score. These scores confirm that ridge regression reduces the model's complexity, leading to a model that overfits less and generalizes better.

The alpha parameter specifies a trade-off between the model's performance on the training set and its simplicity. So, increasing the alpha value (its default value is 1.0) simplifies the model by shrinking the coefficients.
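To see this trade-off in action, here's a small sketch that reuses the training and test split from above and fits ridge regression with a few example alpha values (the values themselves are arbitrary):

from sklearn.linear_model import Ridge

# Fit ridge regression with several alpha values and compare the scores
for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha}: training score={ridge.score(X_train, y_train):.2f}, "
          f"test score={ridge.score(X_test, y_test):.2f}")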

Now, let's apply the lasso regression to the dataset and explore the results.

from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print(f"Lasso Regression-Training set score: {lasso.score(X_train, y_train):.2f}")
print(f"Lasso Regression-Test set score: {lasso.score(X_test, y_test):.2f}")
    Lasso Regression-Training set score: 0.29
    Lasso Regression-Test set score: 0.21

As shown, lasso performs quite disappointingly, which is a sign of underfitting. The lasso model doesn't work well because most of the coefficients have become exactly zero. If we want to know the exact number of features that have been used in the model, we can use the following code:

print(f"Number of features: {sum(lasso.coef_ != 0)}")
    Number of features: 4

This means that only 4 of the 104 features in the training set are used in the lasso regression model, while the rest are ignored.

Let's adjust alpha to reduce underfitting by decreasing its value to 0.01:

lasso = Lasso(alpha=0.01).fit(X_train, y_train)
print(f"Lasso Regression-Training set score: {lasso.score(X_train, y_train):.2f}")
print(f"Lasso Regression-Test set score: {lasso.score(X_test, y_test):.2f}")
    Lasso Regression-Training set score: 0.90
    Lasso Regression-Test set score: 0.77
    /Users/mohammadmehdi/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.690e+01, tolerance: 3.233e+00
      model = cd_fast.enet_coordinate_descent(

Rerunning the feature-count code shows that by decreasing alpha, the lasso model now uses 32 of the 104 features:

print(f"Number of features: {sum(lasso.coef_ != 0)}")
    Number of features: 32

Although we can reduce alpha even more, it seems that its optimum value is 0.01.
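Rather than guessing, we can also let cross-validation pick alpha. Here's a hedged sketch using scikit-learn's LassoCV on the same training split; the candidate alpha grid is just an example:

from sklearn.linear_model import LassoCV

# Search a small grid of alpha values with 5-fold cross-validation on the training set
lasso_cv = LassoCV(alphas=[1.0, 0.1, 0.01, 0.001], cv=5, max_iter=100000).fit(X_train, y_train)
print(f"Best alpha: {lasso_cv.alpha_}")
print(f"Lasso CV-Test set score: {lasso_cv.score(X_test, y_test):.2f}")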

The last technique that we're going to use is elastic net. Let's see how well it does.

from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.01, l1_ratio=0.01).fit(X_train, y_train)
print(f"Elastic Net-Training set score: {elastic_net.score(X_train, y_train):.2f}")
print(f"Elastic Net-Test set score: {elastic_net.score(X_test, y_test):.2f}")
    Elastic Net-Training set score: 0.84
    Elastic Net-Test set score: 0.70
    /Users/mohammadmehdi/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.474e+02, tolerance: 3.233e+00
      model = cd_fast.enet_coordinate_descent(
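As mentioned earlier, tuning alpha and l1_ratio together is a little tricky. One common approach is a small grid search; the sketch below uses scikit-learn's GridSearchCV with an example parameter grid over the same training split:

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Try a few combinations of alpha and l1_ratio with 5-fold cross-validation
param_grid = {"alpha": [1.0, 0.1, 0.01], "l1_ratio": [0.01, 0.1, 0.5, 0.9]}
search = GridSearchCV(ElasticNet(max_iter=100000), param_grid, cv=5).fit(X_train, y_train)
print(f"Best parameters: {search.best_params_}")
print(f"Elastic Net-Test set score: {search.score(X_test, y_test):.2f}")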

NOTE

In general, to avoid overfitting, regularized models are preferable to a plain linear regression model. In most scenarios, ridge works well. But if you're not sure whether to use lasso or elastic net, elastic net is often the better choice because lasso can behave erratically when features are strongly correlated, arbitrarily keeping one of them and dropping the others.


Conclusion

This tutorial explored different ways of avoiding overfitting in linear machine learning models. We discussed why overfitting happens and what the ridge, lasso, and elastic net regression methods are. We also applied these techniques to the Boston Housing dataset and compared the results. Other techniques, such as early stopping and dropout, can also be used to regularize complex models; the latter is mainly used for regularizing artificial neural networks.

I hope that you have learned something new today. Feel free to connect with me on LinkedIn or Twitter.

Mehdi Lotfinejad

About the author

Mehdi Lotfinejad

Mehdi is a Senior Data Engineer and Team Lead at ADA. He is a professional trainer who loves writing data analytics tutorials.