Ridge and Lasso Regression: A Complete Guide

When we start with machine learning, the first topic we study is regression, and we usually end up with linear and logistic regression because they are the most widely used and, in most courses, the only two you will find. But did you know there are seven different types of regression algorithms?

The Chief Product Officer of DataRobot, at a machine learning workshop held in New York, said, "If you are using Regression without regularization then you need to be very special." After hearing this statement, I explored regularization techniques and found them very useful.


Table of Contents

  • Overview
  • Why We Need Regularization
  • Ridge Regression
  • Lasso Regression
  • Combining Both Regularization Techniques
  • End Note

Before following this tutorial, make sure you know the implementation and the intuition behind how simple linear regression works.

Basic Overview

When dealing with a regression task that has a large number of features, there are two problems we can run into. The first is the model's tendency to overfit on such a large number of features, and the second is the computational cost of training the model on such a large feature set.

Ridge and Lasso regression work towards the same goal of reducing model overfitting, but their practical use cases differ substantially. Both work by penalizing the magnitude of the feature coefficients in addition to reducing the error between predicted and actual values; this is known as regularization.

The key difference between the two techniques is how they penalize the coefficients, and that is what we have to understand; beyond that, their working and intuition are approximately the same.

Why Do We Need to Penalize the Model?

A generalized model always has low bias and low variance. But when we use plain linear regression, the model has a strong tendency to overfit: linear regression draws a best-fit line, and if that line connects and passes through all the training data points, the training error is low, but when a test point sits even a little away from the line, the error is high. This is overfitting. So how does regularization solve this problem, and how do Ridge and Lasso regression work towards the common goal of avoiding overfitting? Let's understand this one by one.


Ridge Regression

Ridge regression is also known as L2 regularization. In linear regression, the cost function is the sum of squared differences between the actual and predicted values. In the regularization technique, we penalize this cost function by adding a penalty term.
In ridge regression, the cost function formula becomes something like this.

                   cost = sum( (y - y_hat)^2 ) + lambda * (slope)^2

Here, lambda is the penalty parameter, the first term is the usual sum of squared differences between the actual and predicted values, and the second term is the penalty on the squared slope. Now let's see why the lambda parameter is important.


Suppose my lambda is 1 and the slope is 1.3; then the penalty term is 1 * 1.3^2 = 1.69. This gets added to the residual error, and the total is what I have to reduce. In plain linear regression we stop when the cost is 0, but here we go for another iteration, compute the same cost for a second, less steep line, and the process continues.
In the first iteration the loss is high because we are calculating the cost of a steep line; from the second iteration onward the steepness reduces and the slope decreases, which indicates that a unit increase along the x-axis produces a smaller change along the y-axis.
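To make this concrete, here is a minimal sketch of that cost calculation in Python. The data points and the helper function ridge_cost are made up for illustration; only the formula itself comes from the discussion above.

    import numpy as np

    # Toy data invented for illustration; any small 1-D dataset would do.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([1.2, 2.6, 3.9, 5.3])

    def ridge_cost(slope, intercept, lam):
        """Residual sum of squares plus the L2 penalty lam * slope**2."""
        y_hat = slope * x + intercept
        rss = np.sum((y - y_hat) ** 2)
        return rss + lam * slope ** 2

    # With lambda = 1 and slope = 1.3, the penalty term alone is 1 * 1.3**2 = 1.69.
    print(ridge_cost(slope=1.3, intercept=0.0, lam=1.0))
    # A flatter line pays a bit more residual error but a smaller penalty.
    print(ridge_cost(slope=1.1, intercept=0.3, lam=1.0))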

Lambda can take any value from 0 up to any positive number.

So instead of using the original best-fit line, we use a less steep line and consider that as the best-fit line, since its overall cost is lower.
Thus, we penalize the features that have a higher slope by shrinking the steeper slope towards a flatter one, and that is Ridge regression. We apply many iterations to find less steep lines, select the iteration that gives the best result, and the tendency to overfit is reduced.

Remember that there will be some bias on the training dataset, but on the test dataset we get the generalized model we want.
Usually we keep the value of lambda small, but as the value of lambda increases, the slope of the line gets very, very close to zero.
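The sketch below shows this shrinkage with scikit-learn's Ridge estimator, where the lambda from the cost formula above is called alpha. The synthetic dataset and the alpha values are assumptions chosen just for illustration.

    import numpy as np
    from sklearn.linear_model import Ridge

    # Synthetic single-feature data with a true slope of 3 (made up for illustration).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 1))
    y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=50)

    # As alpha (the lambda in the cost formula) grows, the fitted slope shrinks towards zero.
    for alpha in [0.01, 1, 10, 100, 1000]:
        model = Ridge(alpha=alpha).fit(X, y)
        print(f"alpha={alpha:>6}: slope={model.coef_[0]:.3f}")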

The intuition can seem a little complex to understand, but go through it two or three times and you will get it. That covers Ridge regression.

Lasso Regression

There is only a slight difference between Ridge regression and Lasso regression. In Lasso regression, instead of squaring the slope we take the magnitude (absolute value) of the slope. Because of this, Lasso regression not only helps in penalizing the features but also helps in feature selection, and that is the advantage of using Lasso over Ridge.

Let's understand how Lasso regression is used for feature selection.
 
    y = mx + c

    y_hat = m1*x1 + m2*x2 + m3*x3 + c

The penalty is based on the magnitude of the slopes, which is nothing but

    lambda * ( |m1| + |m2| + |m3| + ... + |mn| )

Now, why is this magnitude important?
When we squared the slope in Ridge, the coefficients only got somewhere near zero. In the case of Lasso, the coefficients can move all the way to zero, and wherever a coefficient becomes very, very small that feature is removed, which means it is not important for predicting the output. As the steep slopes are pushed towards zero, some of the coefficients shrink and reach zero; we remove those features, and that is how Lasso performs feature selection.
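Here is a small sketch of that behaviour using scikit-learn's Lasso. The five-feature dataset, in which only the first two features actually matter, is an assumption made up for this example.

    import numpy as np
    from sklearn.linear_model import Lasso

    # Five features, but only the first two actually influence y (made up for illustration).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

    lasso = Lasso(alpha=0.5).fit(X, y)
    print(lasso.coef_)

    # Coefficients of the irrelevant features are driven to exactly 0,
    # so keeping only the non-zero columns is the feature selection step.
    selected = np.flatnonzero(lasso.coef_ != 0)
    print("selected feature indices:", selected)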

ElasticNet Regression (Combining Both Techniques)

ElasticNet is a hybrid of Ridge and Lasso regression that enjoys the advantages of both techniques: you can reduce overfitting by applying a mix of both penalties, and it can also be used for feature selection.

We use Ridge regression in problem statements where multicollinearity is present, which means the independent variables are highly correlated with each other. From Lasso regression we take the advantage of feature selection: the features whose coefficients become zero are eliminated.

ElasticNet inherits both of these advantages from Ridge and Lasso and combines them, which makes it flexible enough to work with many kinds of problem statements; hence the name ElasticNet.
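As a rough sketch, scikit-learn's ElasticNet exposes this mix through the l1_ratio parameter (values near 0 behave like Ridge, values near 1 like Lasso). The correlated synthetic data and the chosen alpha and l1_ratio values are assumptions for illustration only.

    import numpy as np
    from sklearn.linear_model import ElasticNet

    # Two highly correlated columns to mimic multicollinearity (made up for illustration).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=100)
    y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

    # l1_ratio mixes the penalties: 0 is pure L2 (Ridge-like), 1 is pure L1 (Lasso-like).
    model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
    print(model.coef_)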


End Note

I hope you got the intuition and the reasons behind using regularization techniques. I know that understanding them can be a little complex because the differences between the techniques are slight, but try to go through the article twice. The machine learning library scikit-learn also provides cross-validated versions of these techniques. In our next article we will implement all of these techniques on a real-world dataset.
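For reference, here is a minimal sketch of those cross-validated estimators (RidgeCV, LassoCV, ElasticNetCV); the dataset and the candidate alpha and l1_ratio values are assumptions made for illustration.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

    # Synthetic regression dataset, only for demonstration.
    X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

    # Each CV estimator picks its own penalty strength by cross-validation.
    ridge = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
    lasso = LassoCV(cv=5).fit(X, y)
    enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)

    print("best ridge alpha:", ridge.alpha_)
    print("best lasso alpha:", lasso.alpha_)
    print("best elasticnet alpha / l1_ratio:", enet.alpha_, enet.l1_ratio_)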

Thank you!
Keep learning, happy learning!


If you have any doubts or suggestions, please let me know.
