Regularization in Machine Learning

Introduction

Regularization is one of the most important concepts in machine learning. Overfitting is a major problem in the field: an overfitted model produces a high error on test data or on new data it has never seen before. We must make sure the model does not overfit while we build it.

We can overcome the problem of overfitting with the following techniques:

  • Regularization
  • Feature selection
  • Cross-validation techniques
  • Ensemble techniques

In this tutorial, we are going to discuss regularization techniques and how we can use them to overcome the problem of overfitting.

Regularization is a technique that shrinks the coefficient estimates towards zero. It adds a penalty to more complex models, discouraging the algorithm from learning an overly complex model and thereby reducing the chance of overfitting.

Now, let’s consider a simple linear regression model, which looks like

Y = b0 + b1*X1 + b2*X2 + … + bn*Xn

Here, b0 represents the intercept, and b1, b2, …, bn represent the slopes of the features X1, X2, …, Xn.

The loss function associated with linear regression is the Residual Sum of Squares (RSS), given below

RSS = Σ (Yi − Ŷi)^2

Here, Yi is the true value and Ŷi (Y predicted) is given by this formula

Ŷi = b0 + b1*Xi1 + b2*Xi2 + … + bn*Xin

The main objective of a linear regression model is to minimize this cost function, i.e. the Residual Sum of Squares (RSS). Regularization adds a penalty to the RSS to reduce overfitting and produce a generalized model.
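As a quick illustration, here is a minimal sketch of how the prediction and the RSS could be computed with NumPy. The data and coefficient values are made up for this example and are not part of the original tutorial:

    import numpy as np

    # Hypothetical data: hours studied (single feature) and marks obtained
    X = np.array([2.0, 4.0, 6.0])
    y = np.array([40.0, 60.0, 80.0])

    # A candidate line: intercept b0 and slope b1
    b0, b1 = 20.0, 10.0

    y_pred = b0 + b1 * X                  # Y_predicted = b0 + b1*X1
    rss = np.sum((y - y_pred) ** 2)       # Residual Sum of Squares

    print("RSS:", rss)                    # 0.0, since this line fits all three points exactly

An RSS of zero looks perfect on the training data, but as discussed below it can simply mean the line has memorized the training points.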

 

Types of regularization

Generally, there are two types of regularization:

Ridge (L2) regularization

In Ridge regularization, the cost function is altered by adding a penalty term that is proportional to the sum of the squares of the coefficient estimates. After adding the penalty term, the cost function becomes

Cost = RSS + λ * (b1^2 + b2^2 + … + bn^2)

Here lambda(λ) is the penalty factor. Its value determines how strong a penalty is added to the cost function, and therefore how much the coefficients of the independent variables shrink.

If lambda(λ) = 0, no coefficients shrink and we are back to ordinary linear regression

If lambda(λ) = infinity, all coefficients shrink towards zero
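To make the penalty concrete, here is a minimal sketch of the ridge cost with made-up slope values (the helper function and numbers are illustrative, not part of the original tutorial):

    import numpy as np

    def ridge_cost(y, y_pred, slopes, lam):
        """Ridge-penalized cost: RSS + lambda * sum of squared slopes."""
        rss = np.sum((y - y_pred) ** 2)
        penalty = lam * np.sum(np.square(slopes))
        return rss + penalty

    y = np.array([40.0, 60.0, 80.0])
    y_pred = y.copy()                       # a line that fits the training points exactly (RSS = 0)
    slopes = np.array([1.2, 1.3, 1.4])      # candidate coefficients b1, b2, b3

    for lam in [0.0, 1.0, 10.0]:
        print(lam, ridge_cost(y, y_pred, slopes, lam))
    # lam = 0  -> cost is just the RSS (0.0)
    # lam = 1  -> cost is 0 + 1.44 + 1.69 + 1.96 = 5.09
    # larger lam -> larger penalty, pushing the search towards smaller slopes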

As the value of lambda(λ) increases, the amount of shrinkage applied to the coefficients increases. Let’s take a simple example to understand how ridge regression works.

Suppose we want to predict the marks obtained by students based on the number of hours per day they studied, and for simplicity suppose we have only three data points. If the best-fit line passes exactly through all the training points, it is an example of overfitting. Let’s see how ridge regression is used to overcome this problem.

 

Working procedure of ridge regularization

Let’s take λ = 1 (it can be any positive number) and suppose that the three slopes b1, b2, and b3 equal 1.2, 1.3, and 1.4 respectively.

From the above equation of RSS, the value of RSS is zero (0) for plain linear regression, since all the true and predicted values overlap. Because the RSS is already zero, the linear regression algorithm stops there and considers that it has found the best-fit line. But we know that this is a result of overfitting: the RSS is zero (0), yet the line found is not the truly best-fit line.

To overcome this effect, we need to look for other candidate lines that can give both low bias and low variance. This is where regularization comes into the picture.

From the modified cost function for ridge regression and the slope values we considered, we can compute the penalized cost: 0 + 1*(1.2^2) + 1*(1.3^2) + 1*(1.4^2) = 5.09. The cost after adding the penalty is no longer zero (the minimum), so the algorithm will look for another best-fit line.

 

For the next candidate line, the algorithm calculates the cost again. Let’s say the slope values are 0.8, 1.05, and 1.08. The new value of the penalized cost is 0 + 1*(0.8^2) + 1*(1.05^2) + 1*(1.08^2) = 2.9089. The algorithm keeps looking for a line that minimizes this cost.

 

For the following candidate line, the algorithm again calculates the cost. Let’s say the slope values are now 0.5, 0.85, and 0.95. The new value of the penalized cost is 0 + 1*(0.5^2) + 1*(0.85^2) + 1*(0.95^2) = 1.875. This process continues until the cost is at its minimum, and that line will be the best-fit line.

Hence, the algorithm arrives at a best-fit line that produces low bias and low variance and removes the problem of overfitting. This is the working mechanism of ridge regression.
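The arithmetic in this walkthrough can be re-checked with a few lines of code. This sketch keeps the same simplification as the example above, treating the residual part of the cost as 0 for every candidate line so that only the penalty term changes:

    # Ridge costs for the candidate slope sets from the walkthrough (lambda = 1)
    lam = 1.0
    candidates = [
        [1.2, 1.3, 1.4],    # original overfitted line
        [0.8, 1.05, 1.08],  # first alternative line
        [0.5, 0.85, 0.95],  # second alternative line
    ]

    for slopes in candidates:
        cost = 0 + lam * sum(b ** 2 for b in slopes)
        print(slopes, "->", round(cost, 4))
    # [1.2, 1.3, 1.4]   -> 5.09
    # [0.8, 1.05, 1.08] -> 2.9089
    # [0.5, 0.85, 0.95] -> 1.875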

 

Lasso (L1) regularization

In Lasso regularization, the cost function is altered by adding a penalty term that is proportional to the sum of the absolute magnitudes of the coefficient estimates. After adding the penalty term, the cost function becomes

Cost = RSS + λ * (|b1| + |b2| + … + |bn|)

Just like in ridge regularization, lambda(λ) is the penalty factor. Its value determines how strong a penalty is added to the cost function, and therefore how much the coefficients of the independent variables shrink.

⇒ If lambda(λ) = 0, no coefficients shrink and we are back to ordinary linear regression

⇒ If lambda(λ) = infinity, all coefficients shrink to zero
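As a minimal sketch, the lasso cost can be written just like the ridge cost above, with the absolute values of the slopes replacing the squares (the helper function and numbers are illustrative, not from the original tutorial):

    import numpy as np

    def lasso_cost(y, y_pred, slopes, lam):
        """Lasso-penalized cost: RSS + lambda * sum of absolute slopes."""
        rss = np.sum((y - y_pred) ** 2)
        penalty = lam * np.sum(np.abs(slopes))
        return rss + penalty

    cost = lasso_cost(np.array([40.0, 60.0, 80.0]),   # true values
                      np.array([40.0, 60.0, 80.0]),   # predicted values (perfect fit, RSS = 0)
                      np.array([1.2, 1.3, 1.4]),      # slopes b1, b2, b3
                      lam=1.0)
    print(cost)                                       # ≈ 3.9, matching the walkthrough below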

The working mechanism is the same as that of ridge regression. Let’s take the same problem as above, predicting the marks of students based on the number of hours they studied.

 

The working mechanism of Lasso regularization

Let’s take λ = 1 (it can be any positive number) and suppose that the three slopes b1, b2, and b3 equal 1.2, 1.3, and 1.4 respectively.

For plain linear regression, the value of RSS is zero (0) as there is no error between the true and predicted values. To overcome the problem of overfitting using lasso regression, the algorithm calculates the penalized cost using the formula mentioned above. Using the supposed slope values, the cost is 0 + 1*1.2 + 1*1.3 + 1*1.4 = 3.9.

Since this cost is not minimal, the algorithm looks for the next candidate line, just as in ridge regression, and calculates the cost again. Let’s say the slope values are 0.8, 1.05, and 1.08; the new cost is 0 + 1*0.8 + 1*1.05 + 1*1.08 = 2.93.

The algorithm keeps searching for the line with the minimal cost. Let’s say the slope values are now 0.5, 0.85, and 0.92; the new cost based on this line is 0 + 1*0.5 + 1*0.85 + 1*0.92 = 2.27.

This process continues until the cost is at its minimum and the algorithm finds the best-fit line for the model. Hence, we get a best-fit line with low bias and low variance, removing the problem of overfitting. This is the working mechanism of lasso regression.
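As with ridge, the numbers from this walkthrough can be re-checked with a short sketch that keeps the same simplification (the residual part of the cost is taken as 0 for every candidate line):

    # Lasso costs for the candidate slope sets from the walkthrough (lambda = 1)
    lam = 1.0
    candidates = [
        [1.2, 1.3, 1.4],    # original overfitted line
        [0.8, 1.05, 1.08],  # first alternative line
        [0.5, 0.85, 0.92],  # second alternative line
    ]

    for slopes in candidates:
        cost = 0 + lam * sum(abs(b) for b in slopes)
        print(slopes, "->", round(cost, 2))
    # [1.2, 1.3, 1.4]   -> 3.9
    # [0.8, 1.05, 1.08] -> 2.93
    # [0.5, 0.85, 0.92] -> 2.27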

 

Difference between ridge and lasso regularization

The basic difference between lasso and ridge regularization is that ridge regularization never shrinks the coefficients completely to zero, whereas lasso regularization can shrink some coefficients exactly to zero. This is why lasso regularization is also used for feature selection.
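One quick way to see this difference in practice is to fit both models with scikit-learn on synthetic data. This is only an illustrative sketch; the data is made up and the exact coefficient values depend on the data and on the penalty strength (scikit-learn’s alpha parameter plays the role of λ):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    # Only the first two features actually matter; the other three are noise
    y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.5).fit(X, y)

    print("Ridge coefficients:", np.round(ridge.coef_, 3))  # all shrunk, but none exactly zero
    print("Lasso coefficients:", np.round(lasso.coef_, 3))  # irrelevant features driven to exactly 0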

 

Conclusion

Regularization is a technique that shrinks some of the coefficient estimates to avoid building an overly complex model, and it is essential for overcoming the overfitting problem. Ridge (L2) regularization only shrinks the magnitude of the coefficients, while lasso (L1) regularization can also drive some coefficients to zero and therefore performs feature selection as well. We need to build a generalized model with low bias and low variance, and whenever a model overfits, regularization is essential for building that generalized model.

Happy Learning 🙂
