Introduction
Regularization is an important concept in machine learning. Overfitting is one of the most common problems in the field: an overfitted model produces a high error on test data or on new data it has not seen before. We must make sure the model does not overfit while we are building it.
We can overcome the problem of overfitting with the following techniques:
- Regularization
- Feature selection
- Cross-validation techniques
- Ensemble techniques
In this tutorial, we are going to discuss regularization techniques and how we can use them to overcome the problem of overfitting.
Regularization is a technique that shrinks the coefficient estimates towards zero. It adds a penalty to more complex models, discouraging the model from learning overly complex patterns and thereby reducing the chance of overfitting.
Now, let’s consider a simple linear regression model, which looks like

Y = b0 + b1X1 + b2X2 + ... + bnXn

Here, b0 represents the intercept, and b1, b2, ..., bn represent the slopes (coefficients) of the independent variables. For a detailed study of linear regression, click this link.
The loss function associated with linear regression is the Residual Sum of Squares (RSS), given below:

RSS = Σ (Yi − Ypredicted)^2

Here Yi is the true value, and Ypredicted is given by the model equation above, i.e. Ypredicted = b0 + b1X1 + b2X2 + ... + bnXn.
The main objective of a linear regression model is to minimize the cost function, i.e. the Residual Sum of Squares (RSS). Regularization adds a penalty to this RSS to reduce overfitting and create a more generalized model.
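To make this concrete, here is a minimal sketch (with made-up data and coefficients, not taken from the example below) of how a penalty term is added to the RSS. The two penalty types shown in the comments are introduced in the next section.

```python
import numpy as np

# Made-up data: hours studied (x) and marks obtained (y).
x = np.array([2.0, 4.0, 6.0])
y = np.array([40.0, 60.0, 80.0])

# Hypothetical fitted line: y_pred = b0 + b1 * x
b0, b1 = 20.0, 10.0
y_pred = b0 + b1 * x

# Residual Sum of Squares (RSS): the quantity plain linear regression minimizes.
rss = np.sum((y - y_pred) ** 2)

# Regularization adds a penalty on the slope(s) to this RSS.
lam = 1.0
ridge_cost = rss + lam * b1 ** 2   # squared-magnitude penalty (ridge, below)
lasso_cost = rss + lam * abs(b1)   # absolute-magnitude penalty (lasso, below)

print(rss, ridge_cost, lasso_cost)  # 0.0 100.0 10.0
```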
Types of regularization
Generally, there are two types of regularization:
- Ridge (L2) regularization
- Lasso (L1) regularization
Ridge (L2) regularization
In ridge regularization, the cost function is altered by adding a penalty term equal to the square of the magnitude of the coefficient estimates. After adding the penalty term, the cost function becomes

RSS + λ * (b1^2 + b2^2 + ... + bn^2)
Here lambda(λ) is the penalty factor: its value determines how much penalty is added to the cost function and, therefore, how strongly the coefficients shrink.
⇒ If lambda(λ) = 0, the penalty term vanishes and no coefficients shrink
⇒ If lambda(λ) = infinity, all coefficients shrink towards zero
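As a hypothetical sketch of this effect, the snippet below fits scikit-learn’s Ridge on made-up data for several values of the penalty factor. Note that scikit-learn names this parameter alpha rather than lambda.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up data with three features whose true slopes are 1.2, 1.3 and 1.4.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 1.2 * X[:, 0] + 1.3 * X[:, 1] + 1.4 * X[:, 2] + rng.normal(scale=0.1, size=20)

# scikit-learn calls the penalty factor `alpha` instead of lambda.
for alpha in [0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))
# As alpha grows, the coefficients shrink towards (but never exactly to) zero.
```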
As the value of lambda(λ) increases, the coefficients shrink more and more. Let’s take a simple example to understand how ridge regression works. For simplicity, suppose the linear regression algorithm is fed data of the following type.
Suppose we want to predict the marks obtained by students based on the number of hours per day they studied, and that we have only three data points for simplicity. Since the best fit line passes through all the training points, this is an example of overfitting. Let’s see how ridge regression is used to overcome this problem.
Working procedure of ridge regularization
Let’s take λ = 1 (it can be any positive number) and suppose that the three slopes b1, b2, and b3 are 1.2, 1.3, and 1.4 respectively.
From the above equation of RSS, the value of RSS is zero (0) for the plain linear regression fit because all the true and predicted values overlap. Because of this, the linear regression algorithm stops there and considers that it has found the best fit line. But we know that the RSS is zero only because of overfitting, so this line is not the true best fit line.
To overcome this effect, we need to look for other best-fit lines that can give both low bias and low variance. This is where regularization comes into the picture.
From the above equation of the modified cost for ridge regression and the slope values we considered, its value is 0 + 1*(1.2^2) + 1*(1.3^2) + 1*(1.4^2) = 5.09. After adding the penalty, the value is no longer zero (the minimum), so the algorithm will look for another best fit line.
From the above updated best fit line, the algorithm calculates the new cost. Let the slope values be 0.8, 1.05, and 1.08; the new value is 0 + 1*(0.8^2) + 1*(1.05^2) + 1*(1.08^2) = 2.9089. (The residual term is kept at zero here purely to keep the illustration simple; in practice it grows slightly as the slopes shrink, and the algorithm balances the two.) The algorithm keeps looking for a line that lowers this value. Suppose the next best-fit line has slopes 0.5, 0.85, and 0.95; its value is 0 + 1*(0.5^2) + 1*(0.85^2) + 1*(0.95^2) = 1.875. This process continues until the penalized RSS reaches its minimum, and that line is the best fit line.
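The arithmetic of this walkthrough can be reproduced directly; the snippet below recomputes the λ = 1 ridge penalty term for each set of slopes used above.

```python
# Recomputing the ridge penalty term from the walkthrough above (lambda = 1).
lam = 1.0
for slopes in [(1.2, 1.3, 1.4), (0.8, 1.05, 1.08), (0.5, 0.85, 0.95)]:
    penalty = lam * sum(b ** 2 for b in slopes)
    print(slopes, round(penalty, 4))
# (1.2, 1.3, 1.4)   -> 5.09
# (0.8, 1.05, 1.08) -> 2.9089
# (0.5, 0.85, 0.95) -> 1.875
```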
Hence, the algorithm arrives at a best fit line that produces low bias and low variance, removing the problem of overfitting. This is the working mechanism of ridge regression.
Lasso (L1) regularization
In lasso regularization, the cost function is altered by adding a penalty term equal to the absolute magnitude of the coefficient estimates. After adding the penalty term, the cost function becomes

RSS + λ * (|b1| + |b2| + ... + |bn|)
Just like in ridge regularization, lambda(λ) is the penalty factor: its value determines how much penalty is added to the cost function and, therefore, how strongly the coefficients shrink.
⇒ If lambda(λ) = 0, the penalty term vanishes and no coefficients shrink
⇒ If lambda(λ) = infinity, all coefficients shrink to exactly zero
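As with ridge, a hypothetical scikit-learn sketch (on the same made-up data as the ridge sketch above) illustrates the effect of the penalty factor, again called alpha in scikit-learn; as it grows, some coefficients are driven exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Same made-up data as in the ridge sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 1.2 * X[:, 0] + 1.3 * X[:, 1] + 1.4 * X[:, 2] + rng.normal(scale=0.1, size=20)

# Again, scikit-learn calls the penalty factor `alpha`.
for alpha in [0.01, 0.5, 1.0, 2.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))
# As alpha grows, more and more coefficients become exactly zero.
```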
The working mechanism is the same as that of ridge regression. Let’s take the same problem as above: predicting the marks of students based on the number of hours they studied.
Working procedure of lasso regularization
Let’s take λ = 1 (it can be any positive number) and suppose that the three slopes b1, b2, and b3 are 1.2, 1.3, and 1.4 respectively.
For the plain linear regression fit, the value of RSS is zero (0) as there is no error between the true values and the predicted values. To overcome the problem of overfitting, lasso regression instead minimizes the penalized cost given above. Using the supposed slope values, its value is 0 + 1*1.2 + 1*1.3 + 1*1.4 = 3.9.
Since this value is not minimal, the algorithm looks for the next line, just as ridge regression does, and calculates the new cost. Let the slope values be 0.8, 1.05, and 1.08; the new value is 0 + 1*0.8 + 1*1.05 + 1*1.08 = 2.93.
The algorithm keeps looking for the best fit line for which this value is minimal. Suppose the next line has slopes 0.5, 0.85, and 0.95; its value is 0 + 1*0.5 + 1*0.85 + 1*0.95 = 2.3.
This process continues until the penalized RSS reaches its minimum and the algorithm finds the best fit line for the model. Hence, we get a best fit line with low bias and low variance, removing the problem of overfitting. This is the working mechanism of lasso regression.
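The penalty values used in this walkthrough can again be recomputed directly:

```python
# Recomputing the lasso penalty term from the walkthrough above (lambda = 1).
lam = 1.0
for slopes in [(1.2, 1.3, 1.4), (0.8, 1.05, 1.08), (0.5, 0.85, 0.95)]:
    penalty = lam * sum(abs(b) for b in slopes)
    print(slopes, round(penalty, 2))
# (1.2, 1.3, 1.4)   -> 3.9
# (0.8, 1.05, 1.08) -> 2.93
# (0.5, 0.85, 0.95) -> 2.3
```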
Difference between ridge and lasso regularization
The basic difference between ridge and lasso regularization is that ridge never shrinks a coefficient all the way to zero, whereas lasso can shrink some coefficients exactly to zero. This is why lasso regularization can also be used for feature selection.
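A small sketch (on made-up data where only the first two of five features actually matter) illustrates this difference: the lasso coefficients for the irrelevant features come out exactly zero, while the ridge coefficients are merely small.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Made-up data: only the first two of five features actually influence y.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))  # small but non-zero weights everywhere
print("lasso:", np.round(lasso.coef_, 3))  # irrelevant features shrunk exactly to zero
```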
Conclusion
Regularization is a technique that shrinks the coefficient estimates to avoid building an overly complex model, and it is essential for overcoming the overfitting problem. Ridge (L2) regularization only shrinks the magnitude of the coefficients, while lasso (L1) regularization can shrink some of them to exactly zero and therefore performs feature selection as well. Our goal is a generalized model with low bias and low variance, and when a model overfits, regularization is essential for building such a model.
Happy Learning 🙂