# An Intro to Logistic Regression in Python (w/ 100+ Code Examples)

The **logistic regression** algorithm is a probabilistic machine learning algorithm used for classification tasks. This is usually the first classification algorithm you'll try a classification task on. Unlike many machine learning algorithms that seem to be a black box, the logisitc regression algorithm is easily understood.

In this tutorial, you'll learn everything you need to know about the logistic regression algorithm. You'll start by creating a custom logistic regresssion algorithm. This will help you understand everything happening under the hood and how to debug problems with your logisitic regression models. Next, you'll learn how to train and optimize Scikit-Learn implementation of the logistic regression algorithm. Finally, you'll learn how to handle multiclass classification tasks with this algorithm.

This tutorial covers `L1`

and `L2`

regularization, hyperparameter tuning using grid search, automating machine learning workflow with pipeline, one vs rest classifier, object-oriented programming, modular programming, and documenting Python modules with docstring.

## Build Your Own Custom Logistic Regression Model

In this section, you'll build your own custom logistic regression model using stochastic gradient descent. This logistic regression algorithm can be trained with batch, mini-batch, or stochastic gradient descent.

In batch gradient descent, the entire train set is used to update the model parameters after one epoch. For a very large train set, it may be best to update the model parameters with a random subset of the data, as training the model with the whole set would be computationally expensive. This is the idea behind mini-batch gradient descent. In stochastic gradient descent, model parameters are updated after training on every single data point in the train set. This method can be used when the train set is small.

To summarize, let assume that we have a train set with `m`

rows. The batch size, `n`

, used to train and update model parameters for:

- batch gradient descent: $n = m$
- mini-batch gradient descent: $1 < n < m$
- stochastic gradient descent: $n = 1$

The number of rows, `m`

, in our train set is small. It is okay to use stochastic gradient descent for our example. Next, let's discuss the methods in our custom logistic regression model.

### The `__init__`

Method

To create an instance of an object, you need to initialize or assign some starting values. These starting values are passed through the `__init__`

method. To create an instance of our Stochastic Logistic Regression (`SLR`

) class, we have to pass the `learning_rate`

, `n_epochs`

, and `cutoff`

parameters.

In addition, we have also initialized our intercept, `b`

, and coefficients `w`

values. These are the values that we want to optimize using gradient descent.

```
class SLR(object):
"""
This is the SLR class
"""
def __init__(self, learning_rate=10e-3, n_epochs=10_000, cutoff=0.5):
"""
The __init__ method
Params:
learning_rate
n_epochs
cutoff
"""
self.learning_rate = learning_rate
self.n_epochs = n_epochs
self.cutoff = cutoff
self.w = None
self.b = 0.0
```

### The `__repr__`

Method

The `__repr__`

method helps us make our own printable representation of an object. To understand how this works, let's print the `SLR`

class without the `__repr__`

method:

#### Create an instance of the SLR class

`slr0 = SLR()`

#### Print the slr0 object

`print(slr0)`

When we print the instance of the `SLR`

class without the `__repr__`

method, the output is the address of the object in memory. With the `__repr__`

method, we define how we want the object to be printed, as shown below:

```
class SLR(object):
...
def __init__(self, learning_rate=10e-3, n_epochs=10_000, cutoff=0.5):
...
self.learning_rate = learning_rate
self.n_epochs = n_epochs
self.cutoff = cutoff
...
def __repr__(self):
params = {
'learning_rate': self.learning_rate,
'n_epochs': self.n_epochs,
'cutoff': self.cutoff
}
return "SLR({0}={3}, {1}={4}, {2}={5})".format(*params.keys(), *params.values())
```

#### Create an instance of the SLR class

`slr1 = SLR()`

#### Print the slr1 object

`print(slr1)`

The new instance of our `SLR`

class prints exactly what we have defined in our `__repr__`

method. We have excluded some things from the SLR class for brevity. We will use ellipsis, `...`

, to represent the parts where items have been excluded.

### The `sigmoid`

Method

The sigmoid method is at the heart of the logistic regression algorithm. It maps the linear function $z = {w^T}x + b$ to the open interval $(0, 1)$. The sigmod method outputs probability values that we will compare to the cutoff for predictions. It is mathematically represented as: ${\large {1 \over {1 + e^{-z}}}}$

The shape of the sigmoid function is shown in the figure below. The sigmoid function has an asymptote that approaches a value of $1$ when $z$ goes to $+\infty$ and an asymptote that approaches the value of $0$ when $z$ goes to $-\infty$.

```
class SLR(object):
...
def __init__(self, learning_rate=10e-3, n_epochs=10_000, cutoff=0.5):
...
def __repr__(self):
...
def sigmoid(self, z):
"""
The sigmoid method:
Param:
z
Return:
1.0 / (1.0 + exp(-z))
"""
return 1.0 / (1.0 + np.exp(-z))
```

### The `predict_proba`

Method

This method predicts the probability value for each row. The value of the linear function, `z`

, is calculated using the intercept, `b`

, and coefficients, `w`

, parameters. The value of `z`

is passed to `sigmoid`

method to get a probabilty value.

```
class SLR(object):
...
def __init__(self, learning_rate=10e-3, n_epochs=10_000, cutoff=0.5):
...
def __repr__(self):
...
def sigmoid(self, z):
...
def predict_proba(self, row):
"""
The predict_proba
Param:
row
Return:
sigmoid(z)
"""
z = np.dot(row, self.w) + self.b
return self.sigmoid(z)
```

### The `fit`

Method

This method implements stochastic gradient descent to update the `w`

and `b`

parameters. One of the first things to do is intialize the parameters of `w`

and `b`

. The value of `b`

is set to `0.0`

and the values in `w`

are set to `zeros`

, as shown below:

```
class SLR(object):
...
def __init__(self, learning_rate=10e-3, n_epochs=10_000, cutoff=0.5):
...
self.w = None
self.b = 0.0
...
def fit(self, X, y):
...
self.w = np.zeros(X.shape[1])
...
```

The next important steps are to use the values of `b`

and `w`

to calculate a probability value and calculate the gradients with respect to `w`

and `b`

. These are shown mathematically below:

- Calcualte a probability value,
`yhat`

: $\hat{y_i}$ = ${\large {1 \over {1 + e^{-z_i}}}}$ - Calculate gradient w.r.t
`b`

,`grad_b`

: $\nabla_b = \hat{y_i} - y_i$ - Calculate gradient w.r.t
`w`

,`grad_w`

: $\nabla_w = x_i(\hat{y_i} - y_i)$

```
class SLR(object):
...
def fit(self, X, y):
...
for n_epoch in range(1, self.n_epochs + 1):
losses = []
for i in range(self.m):
# Calculate the probability value for a row
yhat = self.predict_proba(X[i])
# Calculate the gradient w.r.t b
grad_b = yhat - y[i]
# Calculate the gradient w.r.t w
grad_w = X[i] * (yhat - y[i])
```

The final steps are to update the values of `b`

and `w`

using the learning rate, $\alpha$, and calculate the loss function:

- Update the value of
`b`

: $b = b - \alpha\nabla_b$ - Update the value of
`w`

: $w = w - \alpha\nabla_w$ - Calculate log-loss: $-(y_i\log(\hat{y}_i + (1 - y_i)\log((1 - \hat{y}_i))$

```
class SLR(object):
...
def fit(self, X, y):
...
for n_epoch in range(1, self.n_epochs + 1):
losses = []
for i in range(self.m):
...
# Update the value of w
self.w -= self.learning_rate * grad_w / self.m
# Update the value of b
self.b -= self.learning_rate * grad_b / self.m
# Calculate the log-loss
loss = -1/self.m * (y[i] * np.log(yhat) + (1 - y[i]) * np.log(1 - yhat))
losses.append(loss)
# Calculate the cost fuction
self.cost.append(sum(losses))
```

After training for one epoch, the cost function is calculated for that epoch. The cost function is the average of the loss function. If you'd like to learn more about gradient descent and how the parameters were derived, read this article on DataQuest.

Putting everything in the `fit`

method together, we get:

```
class SLR(object):
...
def __init__(self, learning_rate=10e-3, n_epochs=10_000, cutoff=0.5):
...
def __repr__(self):
...
def sigmoid(self, z):
...
def predict_proba(self, row):
...
def fit(self, X, y):
"""
The fit method implement stochastic gradient descent
Param
X, y
Return
None
"""
if not isinstance(X, np.ndarray):
X = X.to_numpy()
if not isinstance(y, np.ndarray):
y = y.to_numpy()
self.w = np.zeros(X.shape[1])
self.cost = []
self.m = X.shape[0]
self.log_loss = {}
self.cost = []
for n_epoch in range(1, self.n_epochs + 1):
losses = []
for i in range(self.m):
yhat = self.predict_proba(X[i])
grad_b = yhat - y[i]
grad_w = X[i] * (yhat - y[i])
self.w -= self.learning_rate * grad_w / self.m
self.b -= self.learning_rate * grad_b / self.m
loss = -1/self.m * (y[i] * np.log(yhat) + (1 - y[i]) * np.log(1 - yhat))
losses.append(loss)
self.cost.append(sum(losses))
```

### The `predict`

Method

This method is used to make discrete predictions. It uses the `predict_proba`

method to get the probability values, then compares these probability values against a cutoff. Probability values below the cutoff belong to one class, and those greater than or equal to the cutoff belong to another class.

```
class SLR(object):
...
def __init__(self, learning_rate=10e-3, n_epochs=10_000, cutoff=0.5):
...
def __repr__(self):
...
def sigmoid(self, z):
...
def predict_proba(self, row):
...
def fit(self, X, y):
...
def predict(self, X):
if not isinstance(X, np.ndarray):
X = X.to_numpy()
self.predict_probas = []
for i in range(X.shape[0]):
ypred = self.predict_proba(X[i])
self.predict_probas.append(ypred)
return (np.array(self.predict_probas) >= self.cutoff) * 1.0
```

### The `score`

Method

This method is used to calculate the accuracy score, using the trained parameters of the model. It uses the `predict`

method to make discrete predictions and compare these values against the actual.

```
class SLR(object):
...
def __init__(self, learning_rate=10e-3, n_epochs=10_000, cutoff=0.5):
...
def __repr__(self):
...
def sigmoid(self, z):
...
def predict_proba(self, row):
...
def fit(self, X, y):
...
def predict(self, X):
...
def score(self, X, y):
"""
The score method
Param
X, y
Return
accuracy_score(y, ypred)
"""
ypred = self.predict(X)
y = y.to_numpy()
return accuracy_score(y, ypred)
```

### The Logistic Regression Module

Putting everything inside a python script (`.py`

file) and saving (`slr.py`

) gives us a custom logistic regression module. You can reuse the code in your logistic regression module by importing it. You can use your custom logistic regression module in multiple Python scripts and Jupyter notebooks.

It's important that you save the module in the same directory with the script or notebook you'll be working on. Let's assume that our `slr.py`

file is saved inside the `logreg`

directory. In the next section, we shall be working with our custom module.

#### ../logreg/slr.py

```
"""
# slr.py
This is stochastic logistic regression module
To use:
from slr import SLR
# Import the SLR class from the module and use its methods
log_reg = SLR() # Initialization with default params
log_reg.fit(X, y) # Fit with train set
log_reg.predict(X) # Make predictions with test set
log_reg.score(X,y) # Get accuracy score
Method:
__init__
__repr__
sigmoid
predict
predict_proba
fit
score
"""
import numpy as np
from sklearn.metrics import accuracy_score
class SLR(object):
"""
This is the SLR class
"""
def __init__(self, learning_rate=10e-3, n_epochs=10_000, cutoff=0.5):
"""
The __init__ method
Params:
learning_rate
n_epochs
cutoff
"""
self.learning_rate = learning_rate
self.n_epochs = n_epochs
self.cutoff = cutoff
self.w = None
self.b = 0.0
def __repr__(self):
params = {
'learning_rate': self.learning_rate,
'n_epochs': self.n_epochs,
'cutoff': self.cutoff
}
return "SLR({0}={3}, {1}={4}, {2}={5})".format(*params.keys(), *params.values())
def sigmoid(self, z):
"""
The sigmoid method:
Param:
z
Return:
1.0 / (1.0 + exp(-z))
"""
return 1.0 / (1.0 + np.exp(-z))
def predict_proba(self, row):
"""
The predict_proba
Param:
row
Return:
sigmoid(z)
"""
z = np.dot(row, self.w) + self.b
return self.sigmoid(z)
def predict(self, X):
if not isinstance(X, np.ndarray):
X = X.to_numpy()
self.predict_probas = []
for i in range(X.shape[0]):
ypred = self.predict_proba(X[i])
self.predict_probas.append(ypred)
return (np.array(self.predict_probas) >= self.cutoff) * 1.0
def score(self, X, y):
"""
The score method
Param
X, y
Return
accuracy_score(y, ypred)
"""
ypred = self.predict(X)
y = y.to_numpy()
return accuracy_score(y, ypred)
def fit(self, X, y):
"""
The fit method implement stochastic gradient descent
Param
X, y
Return
None
"""
if not isinstance(X, np.ndarray):
X = X.to_numpy()
if not isinstance(y, np.ndarray):
y = y.to_numpy()
self.w = np.zeros(X.shape[1])
self.cost = []
self.m = X.shape[0]
self.log_loss = {}
self.cost = []
for n_epoch in range(1, self.n_epochs + 1):
losses = []
for i in range(self.m):
yhat = self.predict_proba(X[i])
grad_b = yhat - y[i]
grad_w = X[i] * (yhat - y[i])
self.w -= self.learning_rate * grad_w / self.m
self.b -= self.learning_rate * grad_b / self.m
loss = -1/self.m * (y[i] * np.log(yhat) + (1 - y[i]) * np.log(1 - yhat))
losses.append(loss)
self.cost.append(sum(losses))
```

## Working with the `slr`

Module

#### ../logreg/stochastic.ipynb

```
import slr
from slr import SLR
```

We've imported the `slr`

module and the `SLR`

class inside this module. Does this step look familiar? Yes! You must've imported several Python libraries in the past--numpy, pandas, and matplotlib are widely used.

You can learn about a Python library, module, class, method, and function from their documentation. Docstrings are an excellent way to document how things work in Python. To view the documentation of an unfamiliar module or function, you can use the `__doc__`

attribute. You can view the methods and attributes of a Python object with the `dir()`

function.

#### Attributes and methods of the slr module

`dir(slr)`

#### Attributes and methods of the SLR class

`dir(SLR)`

Both the `slr`

module and `SLR`

class have the `__doc__`

attribute. We can view what is written in their docstrings:

#### Docstring in the slr.py module

```
print(slr.__doc__)
print(SLR.__doc__)
```

Let's also view the docstrings in the `sigmoid`

and `fit`

methods:

```
print(SLR.sigmoid.__doc__)
print(SLR.fit.__doc__)
```

Alternatively, you can view all the docstrings with the `help()`

function:

`help(SLR)`

## Understanding Logistic Regression with `slr`

#### ../logreg/stochastic.ipynb

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
```

## Load and Split Data

### ../logreg/stochastic.ipynb

#### Load data

```
bin_data = load_breast_cancer()
X = bin_data.data
y = bin_data.target
```

#### Store data as pandas objects

```
X = pd.DataFrame(X, columns=bin_data.feature_names)
y = pd.Series(y, name='diagnosis', dtype=np.int8)
pd.options.display.max_columns = X.shape[1]
X.head()
```

#### The target has two classes - binary classification task

`y.value_counts()`

#### Split the data into train and test sets

`X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=47)`

### Loss and Cost Functions for Logistic Regression

The **linear regression** algorithm uses the **mean squared error** cost function. This cost function is always convex for a linear function $z = {w^T}x + b$. In other words, the cost function always approaches its global minimum during training.

The logistic regression algorithm uses the sigmoid function, which is exponential. This function isn't always convex with the mean squared error cost function. Thus, our logistic regression can get stuck in a local minimum. We have to randomly initialize our model and train several to be sure that we are getting the best results.

We initialized our weights to zeros while building the `SLR`

because we are using a cost function that is always convex with the sigmoid function. Otherwise we should have randomly intialized our weights if we were using the mean squared error cost function.

The loss function used for training the logistic regression algorithm is called **log-loss**. It is given mathematically as:

**log-loss**= $-y_i \cdot \log\hat{y} - (1 - y_i) \cdot \log(1 - \hat{y})$.

The cost function, $J(w, b)$, is just the average of the log-loss function for an epoch:

- $J(w, b) = -{1 \over m}\sum^m_{i=1}y_i \cdot \log\hat{y} + (1 - y_i) \cdot \log(1 - \hat{y}) $

The figure below shows the log-loss plot for each row in the train set for the 1st, 2nd, 100th 999th, and 1000th epochs. There are many spikes in log-loss values in the 1st epoch. As we continue to train, the spikes reduce, and we have many log-loss values closer to zero in the 1000th epoch.

When we average the log-loss values for each epoch, we get the cost function. We expect the cost function value to decrease as the number of epochs increases. The cost function plot when we average the log-loss values for the above examples is shown below:

The above cost function plot is convex. The cost function value continues to approach the global minimum. If you're interested in learning more about convexity and the log-loss function, this article explains why the log-loss function works for logistic regression, but the mean squared error doesn't. While this exchange tries to prove that the log-loss function is always convex.

Next, let's look at how the learning rate affects the shape of the cost function.

### Cost Function and Learning Rate

The learning rate is an important parameter in our SLR model. Here, we created five instances of the SLR class with different learning rates. When the learning rate was very small, the cost function, $Cost (J)$, was a straight horizontal line.

As we increase the learning rate from ${10^{-4}}$ to $10^{-0}$, we get the first half of a U-shaped curve. The larger the learning rate, the faster the cost function approaches the global minimum.

Do you think we should continue to increase the learning rate? There are issues with the model when the learning rate is greater than 1. When we create an instance of the `SLR`

class with a learning rate of 4, we get the plot on **Part A** in the figure below:

What happened to the cost function after 500 epochs? The `predict_proba`

method started to output probabilities very close to zero. The log-loss (and by extension, the cost function) values become positive infinity. You will get `np.nan`

for `np.inf`

log-loss values. This is what happened after 500 epochs. We weren't able to make a plot of the cost function after this point. If you want to observe these `np.inf`

values, you can use the `np.nan_to_num`

function on your log-loss: `np.nan_to_num(log-loss)`

. This function replaces `np.inf`

with very large values.

With a very large learning rate, the cost function misses the global minimum by jumping from Part A to larger values, like in Part B. Thus, it is important that we choose the appropriate learning rate.

## The Performance of the `slr`

In this section, we'll train and make predictions with our custom logistic regression model. First, let's identify a baseline for our model.

### Baseline Model

```
y_train.value_counts(normalize='True')
y_test.value_counts(normalize='True')
```

If we predict `1`

for all the entries in the train and test set, we get an accuracy of 63% and 61%, respectively. Thus, 63% is the baseline for our train set. It is 61% for the test set. We expect our custom model to outperform the baseline.

### Scale Features

```
X_train.describe()
X_train.info()
```

The `describe()`

and `info()`

methods tell us that the features in our data are continuous variables on different scales. Before training, we'll first scale the features with the `StandardScaler()`

function. We don't want to do this manually, so we'll connect the scaling and fitting using a `Pipeline`

:

### Train the `SLR`

#### Connect StandardScaler() and SLR() with a pipe

```
pipe0 = Pipeline([
('scaler', StandardScaler()),
('lr', SLR(learning_rate=1e-2, n_epochs=1000, cutoff=0.5))
])
```

## Train the model

`pipe0.fit(X_train, y_train)`

### Get the accuracies of the train and test set

#### Check if they beat the baseline model

```
accuracies = {
'train accuracy': pipe0.score(X_train, y_train),
'test accuracy': pipe0.score(X_test, y_test)
}
print(*accuracies.items())
```

#### Check other classification metrics for test set

`print(classification_report(y_test, pipe0.predict(X_test)))`

#### Check other classification metrics for train set

`print(classification_report(y_train, pipe0.predict(X_train)))`

Our custom `slr`

model outperformed the baseline on the train and test sets. This is so cool!

Next, let's investigate whether the cutoff value significantly affects prediction accuracies.

### Predict with the `slr`

We will make predictions with `pipe0`

in this section. Then we'll plot the predicted probabilities using a cutoff of 0.5. Points above the cutoff belong in one class, and points below the cutoff belong in a separate class.

```
import matplotlib
matplotlib.rcParams['font.family'] = 'monospace'
```

#### Predict with pipe0

```
predictions = pipe0.predict(X_test)
predict_probas = pipe0['lr'].predict_probas
cutoff = pipe0['lr'].cutoff
```

#### Plot predicted probabilities

```
fig = plt.figure(figsize=(15, 5), constrained_layout=True)
x = range(1, len(predict_probas) + 1)
plt.scatter(x, predict_probas, c=predictions, label='predicted probabilities')
plt.plot(x, [cutoff] * len(predictions), color='red', label='cutoff=0.5')
plt.ylabel('Predicted probabilites', fontsize=12)
plt.xlabel('test data', fontsize=12)
plt.legend(loc=7);
```

#### Recall the accuracy values using pipe0

`print(*accuracies.items())`

We can see from the above plots that there are values close to the cutoff point. Let's assume that these values were among the mis-classified. Let's reduce the cutoff value to 0.35 and observe for improvement:

#### create a new instance, pipe1, with a cutoff value of 0.35

```
pipe1 = Pipeline([
('scaler', StandardScaler()),
('lr', SLR(learning_rate=1e-2, n_epochs=1000, cutoff=0.35))
])
```

#### fit pipe1 instance

`pipe1.fit(X_train, y_train)`

#### Get predicted probabilities for pipe1

```
predictions = pipe1.predict(X_test)
predict_probas = pipe1['lr'].predict_probas
cutoff = pipe1['lr'].cutoff
```

#### Plot the predicted probabilities

```
fig = plt.figure(figsize=(15, 5), constrained_layout=True)
x = range(1, len(predict_probas) + 1)
plt.scatter(x, predict_probas, c=predictions, label='predicted probabilities')
plt.plot(x, [cutoff] * len(predictions), color='red', label='cutoff=0.35')
plt.ylabel('Predicted probabilites', fontsize=12)
plt.xlabel('test data', fontsize=12)
plt.legend(loc=5);
```

#### Get pipe1 accuracies

```
accuracies1 = {
'train accuracy': pipe1.score(X_train, y_train),
'test accuracy': pipe1.score(X_test, y_test)
}
print(*accuracies1.items())
```

Although our accuracies did not improve, we've learned that we can either increase or decrease our cutoff value from the default, depending on the type of problem we are faced with.

Our custom logistic regression model did not perform badly. But we can do much more with the **Scikit-Learn** implementation of the logistic regression algorithm. This machine learning framework has more features and has been rigorously tested. We'll be using Scikit-Learn going forward.

## Scikit-Learn Logistic Regression Model

In this section, we'll use the features of the Scikit-Learn logistic regression model. First, we'll train a logistic regression model with no regularization. Then we'll train models with `L1`

, `L2`

, and `L1 and L2`

regularization.

Let's import this model from Scikit-Learn:

`from sklearn.linear_model import LogisticRegression`

### Without Regularization

Training without regularization simply means setting the `penalty`

parameter to `none`

:

#### Train sklearn logistic regression model with no regularization: pipe2

```
pipe2 = Pipeline([
('scaler', StandardScaler()),
('lr', LogisticRegression(penalty='none'))
])
```

#### fit the model

`pipe2.fit(X_train, y_train)`

#### Fet model accuracies

```
accuracies2 = {
'train accuracy': pipe2.score(X_train, y_train),
'test accuracy': pipe2.score(X_test, y_test)
}
print(*accuracies2.items())
```

We can see that the model is overfitting the train set. The difference between the train and test accuracies is 12%. The purpose of regularization is to prevent overfitting.

Next, let's use the `L1`

regularization.

### With L1 Regularization

Regularization is the modification we make to machine learning algorithms to reduce their generalization errors. In other words, regularization helps us improve a machine learning model's performance on data it hasn't seen before/wasn't trained on.

Regularization in logistic regression simply means adding a regularization parameter to our cost function:

- $J(w, b) = -{1 \over m}\sum^m_{i=1}y_i \cdot \log\hat{y} + (1 - y_i) \cdot \log(1 - \hat{y}) + {\lambda \over m}R(w_i)$

The parameter $\lambda$ and $R(w_i)$ are the regularization parameter and regularization function, respectively. For logistic regression, the regularization parameter, $C$, is given as: $C = {1 \over \lambda}$. For `L1`

regularization, $R(w*i)$ is given as $\sum*{j=1}^p|w_i|$.

The logistic regression cost function with `L1`

regularization becomes:

- $J(w, b) = -{1 \over m}\sum^m_{i=1}y_i \cdot \log\hat{y} + (1 - y
*i) \cdot \log(1 - \hat{y}) + {1 \over m \cdot C} \sum*{j=1}^p|w_i|$

Recall that $w$ is our model's coefficients. Therefore $|w_i|$ is the absolute value of the coefficients.

Next, we are going to implement the Scikit-Learn logistic regression model with `L1`

regularization. Before then, let's discuss the pros and cons of `L1`

regularization. The major advantage of the `L1`

regularization is that it's used for feature selection. The weights of unimportant features are set to zero. Only the coefficients of important features are left. The disadvantage of the `L1`

regularization is that is does not use statistical techniques to remove features that are collinear. It may remove the most predictive feature from the model out of two collinear features.

We'll use `solver='saga'`

because the `saga`

solver works on all types of penalties or regularization.

#### Logistic regression with L1 regularization with regularization parameter C=1e-1

```
pipe3 = Pipeline([
('scaler', StandardScaler()),
('lr', LogisticRegression(penalty='l1', C=1e-1, solver='saga'))
])
```

#### fit the pipe

`pipe3.fit(X_train, y_train)`

#### Get the accuracies

```
accuracies3 = {
'train accuracy': pipe3.score(X_train, y_train),
'test accuracy': pipe3.score(X_test, y_test)
}
print(*accuracies3.items())
```

Recall that the logistic regression set coefficients to zero. Let's inspect these coefficients from our model below:

```
df_coefficients = pd.DataFrame(
{
'feature': X_train.columns,
'coefficient': pipe3['lr'].coef_[0]
}
)
df_coefficients
```

Let's inspect only the important features below:

```
(
df_coefficients[df_coefficients.coefficient != 0]
.sort_values(by=['coefficient'])
)
```

Out of the 30 features in our data, only a handful were considered important by the L1 regularization. Next, let's see how L2 regularization works with logistic regression.

### With L2 Regularization

For `L2`

regularization, $R(w*i)$ is given as ${1 \over 2}\sum*{j=1}^pw_i^2$.

The logistic regression cost function with `L2`

regularization becomes:

- $J(w, b) = -{1 \over m}\sum^m_{i=1}y_i \cdot \log\hat{y} + (1 - y
*i) \cdot \log(1 - \hat{y}) + {1 \over 2 \cdot m \cdot C} \sum*{j=1}^pw_i^2$

Where $w_i^2$ is the square of the model's coefficients.

The `L2`

regularization can't perform feature selection, but it's used to prevent overfitting. Logistic regression with `L2`

regularization is implemented below:

#### Logistic regression with L2 regularization with regularization parameter C=1e-1

```
pipe4 = Pipeline([
('scaler', StandardScaler()),
('lr', LogisticRegression(penalty='l2', C=1e-1, solver='saga'))
])
```

#### fit the pipe

`pipe4.fit(X_train, y_train)`

#### Get the accuracies

```
accuracies4 = {
'train accuracy': pipe4.score(X_train, y_train),
'test accuracy': pipe4.score(X_test, y_test)
}
print(*accuracies4.items())
```

We can observe from the dataframe below that no features have zero coefficients. If we want a simpler model with fewer features, L2 regularization isn't very helpful.

```
pd.DataFrame(
{
'feature': X_train.columns,
'coefficient': pipe4['lr'].coef_[0]
}
).sort_values(by=['coefficient'])
```

### With L1 and L2 Regularization

The logistic regression with `L1`

and `L2`

regularization uses the `L1`

and `L2`

regularization functions. This regularization is called `elasticnet`

and is given mathematically below:

- $J(w, b) = -{1 \over m}\sum^m_{i=1}y_i \cdot \log\hat{y} + (1 - y
*i) \cdot \log(1 - \hat{y}) + {1 \over m \cdot C} \sum*{j=1}^p(\beta \cdot |w_i| + {1 \over 2} \cdot (1 - \beta) \cdot w_i^2)$

The parameter $\beta$ is known as the **L1_ratio**. It's the mixing effect that `L1`

and `L2`

regularization have on the model's coefficients. When this parameter is set to zero, it's `L2`

regularization. When it's set to 1, it is `L1`

regularization.

This regularization is implemented below:

#### Logistic regression with L1 and L2 regularization with regularization parameter C=1e-1 and L1_ratio=0.5

```
pipe5 = Pipeline([
('scaler', StandardScaler()),
('lr', LogisticRegression(penalty='elasticnet', l1_ratio=0.5, C=1e-1, solver='saga'))
])
```

#### Fit the pipe

`pipe5.fit(X_train, y_train)`

#### Get the accuracies

```
accuracies5 = {
'train accuracy': pipe5.score(X_train, y_train),
'test accuracy': pipe5.score(X_test, y_test)
}
print(*accuracies5.items())
```

#### Get the features coefficients

```
pd.DataFrame(
{
'feature': X_train.columns,
'coefficient': pipe5['lr'].coef_[0]
}
)
```

#### Percentage features for which coefficients are set to zero and those not set to zero

```
(
pd.DataFrame(
{
'feature': X_train.columns,
'coefficient': pipe5['lr'].coef_[0]
})
.coefficient
.apply(lambda x: x == 0)
.value_counts(normalize=True)
)
```

We have used `elasticnet`

regularization with 50 percent L1_ratio. L1 and L2 regularization have an equal effect on the model coefficients. With this regularization, we've got more important features than using L1 regularization alone.

We haven't attempted to tune the hyperparameters of our model. Let's see how we can optimize model hyperparameters in the following section.

### Hyperparameter Tuning with GridSearch

Hyperparameters are the assumptions we explicitly make to control the learning process. If we assume that the learning rate and regularization constant take specific values, we aren't quite sure if these values are actually the optimum.

To find the optimum hyperparameters for our model, we can list a range of values for them. Grid search creates a combination of these parameters and plots them into grids. Our estimator trains on each grid and records the score for the selected metric. If accuracy is the selected metric, grid search select the parameters from the grid that gives the highest accuracy as the best estimator.

Let's assume that we want to optimize a model with `C`

and `LR`

hyperparameters. The listed `C`

values are `[1e-1, 1e-3]`

, and those for `LR`

are `[2e-1, 2e-3]`

. Grid search will train on and evaluate the accuracy of each of the grids shown below. The one with the highest accuracy is selected as the best estimator.

You can use grid search for more than two entries in a hyperparamter and for more than two hyperparameters. If three hyperparameters are used, we get a cubiod shape instead of a plane.

Let's optimize our logistic regression model using grid search.

#### import gridsearchcv

`from sklearn.model_selection import GridSearchCV`

#### create pipe - estimator

```
pipe6 = Pipeline([
('scaler', StandardScaler()),
('lr', LogisticRegression(solver='saga'))
])
```

#### specify hyperparameters and their values

```
params = {
'lr__C': [1.0, 1e-1, 1e-2, 1e-3],
'lr__penalty': ['l1', 'l2', 'elasticnet'],
'lr__l1_ratio': [0.25, 0.5, 0.75]
}
```

#### fit gridsearch with pipe as estimator

```
grid_pipe6 = GridSearchCV(
pipe6,
params,
cv=5
)
grid_pipe6.fit(X_train, y_train)
```

#### get best hyperparameter values

`print(grid_pipe6.best_params_)`

#### get the best estimator / model

`print(grid_pipe6.best_estimator_)`

#### Explore the best estimator parameter dictionary

`grid_pipe6.best_estimator_.get_params()`

#### Get best estimator accuracies

```
accuracies6 = {
'train accuracy': grid_pipe6.score(X_train, y_train),
'test accuracy': grid_pipe6.score(X_test, y_test)
}
print(*accuracies6.items())
```

We tuned the hyperparameters to get the highest accuracies for the train and test sets in this tutorial. Maybe we can get even better accuracies if we search outside our current listed values. For example, if our highest listed parameter came out as the best, we may want list more parameter values with the best parameters at the centre of the list. By doing this, we can explore the edges for better parameters.

Next, we'll see how to handle multiclass classification with the logistic regression model.

## Mutliclass Classification with Logistic Regression

### Load Multiclass Data

```
from sklearn.datasets import load_wine
mult_data = load_wine()
X = mult_data.data
y = mult_data.target
X = pd.DataFrame(X, columns=mult_data.feature_names)
y = pd.Series(y, name='class', dtype=np.int8)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=47)
X_train.head()
```

### Baseline Model

#### get baseline for train set

`y_train.value_counts(normalize=True)`

#### get baseline for test set

`y_test.value_counts(normalize=True)`

### Train and Optimize Your Own OneVsRestClassifier

In this section, we'll train our own multiclass classifier. We'll use the One-Vs-Rest method, in which we split the multiclass classifiation task into multiple binary classification tasks. Then we'll train classifiers that that can tell a particular class apart from the rest.

Let's split the multiclass tasks into multiple binary classification tasks by transforming their labels:

#### The target variable for a class_0 vs the rest classifier

`y0_train = (y_train == 0)`

#### The target variable for a class_1 vs the rest classifier

`y1_train = (y_train == 1) `

#### The target variable for a class_2 vs the rest classifier

`y2_train = (y_train == 2) `

#### Create a class_0 vs rest classifier

```
mult_pipe0 = Pipeline([
('scaler', StandardScaler()),
('mult_lr', LogisticRegression(solver='saga'))
])
```

#### Hyperparameters to optimize

```
params = {
'mult_lr__C': [1.0, 1e-1, 1e-2, 1e-3],
'mult_lr__penalty': ['l1', 'l2', 'elasticnet'],
'mult_lr__l1_ratio': [0.10, 0.25, 0.5, 0.75]
}
```

#### Use gridsearch to perform optimization

```
grid_mult_pipe0 = GridSearchCV(
mult_pipe0,
params,
cv=3
)
grid_mult_pipe0.fit(X_train, y0_train)
```

#### Create a class_1 vs rest classifier

```
mult_pipe1 = Pipeline([
('scaler', StandardScaler()),
('mult_lr', LogisticRegression(solver='saga'))
])
```

#### Hyperparameters to optimize

```
params = {
'mult_lr__C': [1.0, 1e-1, 1e-2, 1e-3],
'mult_lr__penalty': ['l1', 'l2', 'elasticnet'],
'mult_lr__l1_ratio': [0.10, 0.25, 0.5, 0.75]
}
```

#### Use gridsearch to perform optimization

```
grid_mult_pipe1 = GridSearchCV(
mult_pipe1,
params,
cv=3
)
grid_mult_pipe1.fit(X_train, y1_train)
```

#### Create a class_2 vs rest classifier

```
mult_pipe2 = Pipeline([
('scaler', StandardScaler()),
('mult_lr', LogisticRegression(solver='saga'))
])
```

#### Hyperparameters to optimize

```
params = {
'mult_lr__C': [1.0, 1e-1, 1e-2, 1e-3],
'mult_lr__penalty': ['l1', 'l2', 'elasticnet'],
'mult_lr__l1_ratio': [0.10, 0.25, 0.5, 0.75]
}
```

#### Use gridsearch to perform optimization

```
grid_mult_pipe2 = GridSearchCV(
mult_pipe2,
params,
cv=3
)
grid_mult_pipe2.fit(X_train, y2_train)
```

#### Get the predicted probability values for class_0 vs rest classifier on the test set

`predict_proba_pipe0 = grid_mult_pipe0.best_estimator_.predict_proba(X_test)`

#### Get the predicted probability values for class_1 vs rest classifier on the test set

`predict_proba_pipe1 = grid_mult_pipe1.best_estimator_.predict_proba(X_test)`

#### Get the predicted probability values for class_2 vs rest classifier on the test set

`predict_proba_pipe2 = grid_mult_pipe2.best_estimator_.predict_proba(X_test)`

#### Create a DataFrame of their result

```
df_result_proba_test = pd.DataFrame(
{
'class_0': predict_proba_pipe0[:, 1],
'class_1': predict_proba_pipe1[:, 1],
'class_2': predict_proba_pipe2[:, 1]
}
)
df_result_proba_test.head()
```

#### Get the classification result for each test set using their probability values

#### A row is given the class that has the highest probability value

`df_result_proba_test.idxmax(axis=1).head()`

#### Convert string output to integer

```
y_mult_pred_test = (
df_result_proba_test.idxmax(axis=1)
.replace({
'class_0': 0,
'class_1': 1,
'class_2': 2
})
)
y_mult_pred_test.head()
```

#### Get test multiclass classification report

`print(classification_report(y_test, y_mult_pred_test))`

Our multiclass classifier works excellently. Let's see how it performs on the train set. We can do this by running the latter part of our previous code with the train set instead of the test set. We can also do so by creating a module for three label multiclass classification.

Let's copy the code below and save it as `onevsall3.py`

in our working directory.

../logreg/onevsall3.py

```
"""
####onevsall3.py
This is a custom one vs all classifier for making predictions
for multiclass problems with 3 classes
"""
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
def onevsall3(X_train, y_train, X_test, y_test):
y0_train = (y_train == 0)
y1_train = (y_train == 1)
y2_train = (y_train == 2)
mult_pipe0 = Pipeline([
('scaler', StandardScaler()),
('mult_lr', LogisticRegression(solver='saga'))
])
params = {
'mult_lr__C': [1.0, 1e-1, 1e-2, 1e-3],
'mult_lr__penalty': ['l1', 'l2', 'elasticnet'],
'mult_lr__l1_ratio': [0.10, 0.25, 0.5, 0.75]
}
grid_mult_pipe0 = GridSearchCV(
mult_pipe0,
params,
cv=3
)
mult_pipe1 = Pipeline([
('scaler', StandardScaler()),
('mult_lr', LogisticRegression(solver='saga'))
])
grid_mult_pipe1 = GridSearchCV(
mult_pipe1,
params,
cv=3
)
mult_pipe2 = Pipeline([
('scaler', StandardScaler()),
('mult_lr', LogisticRegression(solver='saga'))
])
grid_mult_pipe2 = GridSearchCV(
mult_pipe2,
params,
cv=3
)
grid_mult_pipe0.fit(X_train, y0_train)
grid_mult_pipe1.fit(X_train, y1_train)
grid_mult_pipe2.fit(X_train, y2_train)
predict_proba_pipe0 = grid_mult_pipe0.best_estimator_.predict_proba(X_test)
predict_proba_pipe1 = grid_mult_pipe1.best_estimator_.predict_proba(X_test)
predict_proba_pipe2 = grid_mult_pipe2.best_estimator_.predict_proba(X_test)
df_result_proba_test = pd.DataFrame(
{
'class_0': predict_proba_pipe0[:, 1],
'class_1': predict_proba_pipe1[:, 1],
'class_2': predict_proba_pipe2[:, 1]
})
return (
df_result_proba_test.idxmax(axis=1)
.replace({
'class_0': 0,
'class_1': 1,
'class_2': 2
}))
```

#### Import the onevsall3 function from the onevsall3.py module

from onevsall3 import onevsall3

#### Pass the test set data to valid that it works correctly

`test_predict_onevsall = onevsall3(X_train, y_train, X_test, y_test)`

#### Get the test set classification report

`print(classification_report(y_test, test_predict_onevsall))`

Everything looks good! Our `onevsall3`

function is working correctly. Let's see how it performs on the train set.

#### Train set classification report using onevsall3 module

```
train_predict_onevsall = onevsall3(X_train, y_train, X_train, y_train)
print(classification_report(y_train, train_predict_onevsall))
```

Our multiclass classifier performs well on both the train and test sets. Instead of building our own One-Vs-Rest classifier from scratch, let's learn how to implement this using Scikit-Learn.

### Scikit-Learn OneVsRestClassifier

In this section, we'll use Scikit-Learn's OneVsRestClassifier on our multiclass data. To ensure we get excellent performance, we'll use grid search to optimize the hyperparameters.

#### Import OneVsRestClassifier

`from sklearn.multiclass import OneVsRestClassifier`

#### Create an instance of OneVsRestClassifier

`onevsrest = OneVsRestClassifier(LogisticRegression(solver='saga'))`

#### Create your pipe to connect scaling and training

```
onevsrest_pipe = Pipeline([
('scaler', StandardScaler()),
('mult_lr', onevsrest)
])
```

#### Specify your hyperparameters to optimize

```
params = {
'mult_lr__estimator__C': [1.0, 1e-1, 1e-2, 1e-3],
'mult_lr__estimator__penalty': ['l1', 'l2', 'elasticnet'],
'mult_lr__estimator__l1_ratio': [0.10, 0.25, 0.5, 0.75]
}
```

#### Train onevsrest

```
grid_onevsrest = GridSearchCV(
onevsrest_pipe,
params,
cv=3
)
grid_onevsrest.fit(X_train, y_train)
```

#### Get model accuracies

```
print(*{
'test accuracy': grid_onevsrest.best_estimator_.score(X_test, y_test),
'train accuracy': grid_onevsrest.best_estimator_.score(X_train, y_train)}
.items()
)
```

This is good performance, but not as good as our `onevsrest3`

classifier. There's a `multi_class`

parameter in Scikit-Learn logistic regression. Its default value is set to `auto`

. When a multiclass classification task is parsed, this parameter detects it. The logistic regression then uses the default multiclass classification setting to perform multiclass classification.

Let's see how it handles our multiclass classification task in the following section.

### Logistic Regression `multi_class`

Parameter

#### Create a pipe for Scikit-Learn's logistic regression multiclass classifier

```
mult_class_pipe = Pipeline([
('scaler', StandardScaler()),
('mult_lr', LogisticRegression(solver='saga', multi_class='auto'))
])
```

#### Specify hyperparameters to optimize

```
params = {
'mult_lr__C': [1.0, 1e-1, 1e-2, 1e-3],
'mult_lr__penalty': ['l1', 'l2', 'elasticnet'],
'mult_lr__l1_ratio': [0.10, 0.25, 0.5, 0.75]
}
```

#### Train the pipe

```
grid_mult_class = GridSearchCV(
mult_class_pipe,
params,
cv=3
)
grid_mult_class.fit(X_train, y_train)
```

#### Get model accuracies

```
print(*{
'test accuracy': grid_mult_class.score(X_test, y_test),
'train accuracy': grid_mult_class.score(X_train, y_train)}
.items())
```

This is an improvement over the `OneVsRestClassifier`

. However, its performance falls short of our custom `onevsall3`

classifier.

### Conclusion

In this tutorial, we took a deep dive into how the logistic regression algorithm works. We started by building our own logistic regression model from scratch. We used our custom regression model to learn about convexity and the importance of choosing the approprate learning rate. Finally, we used Scikit-Learn implementation of the logistic regression algorithm to learn about regularization, hyperparameter tuning, and multiclass classification.

For next steps, if you feel comfortable using and debugging logistic regression models, you may want to start learning about other commonly used classifiers, like the Support Vector Machines and Decision Tree classifiers.