Linear Regression in Machine Learning with Examples

Introduction

Linear Regression is a machine learning model based on supervised learning that performs regression tasks. The model fits a linear relationship between the dependent and independent variables, hence the name linear regression.

Regression models a target prediction variable based on independent variables. It is mostly used for finding the relationship between variables and for forecasting. Depending on the number of independent variables, linear regression is of two types:

  1. Simple Linear Regression
  2. Multiple Linear Regression


Simple Linear Regression

In simple linear regression, there is only one independent variable. The formula used in simple linear regression to relate the dependent and independent variables is:

y = θ0 + θ1*x

y = Dependent variable (output variable)

x = Independent variable

θ0 = Intercept

θ1 = Slope

The simple regression model tries to find the ‘best-fit line’ by adjusting the slope (θ1) and the intercept (θ0). The best-fit line is the line drawn such that the sum of the squared differences between the predicted values and the true values is minimal.

In other words, it is the line for which the sum of the squared vertical distances from the points to the line is minimal. Once the best θ0 and θ1 are available, the model is ready to predict the output for any input.
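For a single variable, the best-fit coefficients even have closed-form least-squares formulas. Here is a minimal NumPy sketch; the x and y values below are made up purely for illustration:

import numpy as np

# Toy data, made up for illustration
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

# Closed-form least-squares estimates for y = theta0 + theta1 * x
theta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta0 = y.mean() - theta1 * x.mean()

print("Slope (theta1):", theta1)       # 0.8 for this toy data
print("Intercept (theta0):", theta0)   # 1.8 for this toy data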


Multiple Linear Regression

In practice, there is often more than one independent variable. Because the output variable depends on multiple variables, this setting is called multiple linear regression. It likewise models a linear relationship between the dependent and independent variables, using the formula:

y = θ0 + θ1*x1 + θ2*x2 + . . . + θn*xn

y = Dependent variable

xi = Independent variables (x1, x2, . . . , xn)

θ0 = Intercept

θi = Slope coefficient of the i-th independent variable, i = 1, 2, . . . , n

n = Number of independent variables

The best-fit line is determined by tuning the values of θ0 and θi such that the sum of the squared differences between the predicted and true values is minimal.
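For multiple variables, the same least-squares idea generalizes: stack the inputs into a matrix, prepend a column of ones for the intercept, and solve the resulting least-squares problem. A minimal NumPy sketch with made-up toy data:

import numpy as np

# Toy data: two independent variables per row, made up for illustration
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([6.0, 5.0, 12.0, 11.0])   # generated from y = 1 + x1 + 2*x2

# Prepend a column of ones so theta0 acts as the intercept
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve the least-squares problem (numerically stable alternative
# to explicitly inverting X^T X in the normal equation)
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("Intercept (theta0):", theta[0])              # 1.0
print("Coefficients (theta1, theta2):", theta[1:])  # [1.0, 2.0]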


Cost Function

After we’ve trained our learning algorithm and obtained a hypothesis, we need to examine how good our results are. This is done with the so-called cost function.

The cost function measures the accuracy of the hypothesis outputs. It does this by comparing the predicted values of the hypothesis with the actual true values.

By achieving the best-fit regression line, the model aims to predict y such that the error between the predicted value and the true value is minimal.

So, it is essential to update the values of θ0 and θi (in multiple regression) or θ0 and θ1 (in simple linear regression) to reach the values that minimize the error between the predicted and true values.

The cost function J of linear regression is the Mean Squared Error (MSE) between the predicted y and the true y:

J(θ) = (1/m) * Σ (y_pred(i) − y(i))², where m is the number of training examples.
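As a sketch, the MSE can be computed in a couple of lines; the prediction and target arrays below are made-up examples:

import numpy as np

def mse_cost(y_pred, y_true):
    # Mean Squared Error: average squared difference
    # between predicted and true values
    return np.mean((y_pred - y_true) ** 2)

# Made-up example values
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
print(mse_cost(y_pred, y_true))   # (0.25 + 0.25 + 1.0) / 3 = 0.5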


Gradient Descent

To update θ0 and θ1 in order to reduce the cost function (minimizing the MSE) and achieve the best-fit line, the model uses Gradient Descent. The idea is to start with random θ0 and θ1 values and then iteratively update them, moving toward the minimum cost.
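To make the update rule concrete, here is a minimal gradient-descent sketch for simple linear regression. The learning rate and iteration count are arbitrary choices for this illustration, and the toy data is constructed so that the true line is y = 1 + 2x:

import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=5000):
    # Fit y = theta0 + theta1 * x by gradient descent on the MSE cost
    theta0, theta1 = 0.0, 0.0          # arbitrary starting values
    n = len(x)
    for _ in range(n_iters):
        error = (theta0 + theta1 * x) - y
        # Partial derivatives of MSE with respect to theta0 and theta1
        grad0 = (2.0 / n) * np.sum(error)
        grad1 = (2.0 / n) * np.sum(error * x)
        theta0 -= lr * grad0           # step opposite the gradient
        theta1 -= lr * grad1
    return theta0, theta1

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = 1 + 2 * x                          # exactly y = 1 + 2x
print(gradient_descent(x, y))          # converges toward (1.0, 2.0)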

We’ll work through a small example to see how linear regression works. For this, we’ll create a dummy dataset with ‘age’ and ‘no of hours’ as input features and ‘salary’ as the output. For the demonstration, I’ll be using Jupyter Notebook.

First, we’ll create the dummy dataset:

import pandas as pd

info = {
    'no of hours' : [1, 2, 5, 7, 8, 10, 12, 15, 17],
    'age' : [20, 34, 21, 27, 34, 21, 20, 45, 31],
    'salary' : [1000, 3000, 5000, 8000, 8500, 9000, 12000, 15000, 22000]
}

df = pd.DataFrame(info)
print(df)

Output

   no of hours  age  salary
0            1   20    1000
1            2   34    3000
2            5   21    5000
3            7   27    8000
4            8   34    8500
5           10   21    9000
6           12   20   12000
7           15   45   15000
8           17   31   22000


Let’s visualize the dataset. First, we’ll import matplotlib and seaborn:

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of age against salary
sns.scatterplot(x="age", y="salary", data=df)
plt.xlabel("age")
plt.ylabel("salary")
plt.title("age vs salary")
plt.show()

Output

[Scatter plot: age vs salary]


Next, we’ll visualize the ‘no of hours’ vs salary graph:

sns.scatterplot(x = "no of hours", y= "salary", data = df)
plt.xlabel("no of hours")
plt.ylabel("salary")
plt.title("no of hours vs salary")
plt.show()

Output

[Scatter plot: no of hours vs salary]


We’ll also take a look at the ‘age’ vs ‘no of hours’ graph:

sns.scatterplot(x = "age", y= "no of hours", data = df)
plt.xlabel("age")
plt.ylabel("no of hours")
plt.title("age vs no of hours")
plt.show()

Output

[Scatter plot: age vs no of hours]


Now, we will use a linear regression model to predict the salary based on the hours and the age. The equation will be of the form:

salary = θ0 + θ1 * no of hours + θ2 * age

θ0 = Intercept

θ1 = Coefficient of no of hours

θ2 = Coefficient of age


Now, we will start building the model. Let’s select the features and the target variable:

X = df.iloc[:, :2]   # features: 'no of hours' and 'age'
y = df.iloc[:, -1]   # target: 'salary'


Then, we’ll import the necessary scikit-learn modules:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Now, we split the data into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)  # 25% of 9 rows -> 3 test samples

Note that since no random_state is fixed, the split (and therefore the exact numbers in the outputs below) will vary from run to run.


Build the model as:

lr = LinearRegression()
model = lr.fit(X_train, y_train)   # fit() returns the fitted estimator
pred = model.predict(X_test)       # predictions for the 3 test rows
print(pred)

Output

[ 6454.68201512 12813.11470225 24376.50611935]


Now, let’s see the values of θ0, θ1, and θ2:

print("Intercept :",model.intercept_)

Output

Intercept : -8477.293570728314


Here we can see that the value of the intercept (θ0) = -8477.293570728314

print("Slope :", model.coef_)

Output

Slope : [1059.73878119  376.83817716]

As a result:

θ1 -> coefficient of no of hours = 1059.73878119

θ2 -> coefficient of age = 376.83817716
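As a sanity check (continuing the same notebook session), we can reproduce one of the model’s predictions by hand from the intercept and coefficients; since no random_state was fixed for the split, the exact numbers will differ from run to run:

# Manually apply: salary = theta0 + theta1 * no of hours + theta2 * age
hours, age = X_test.iloc[0]            # first test row
manual = model.intercept_ + model.coef_[0] * hours + model.coef_[1] * age
print(manual)                          # matches pred[0] up to rounding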


Conclusion

Linear regression is a machine learning algorithm used to perform regression analysis. The model fits a linear relationship between the dependent and independent variables by minimizing the Mean Squared Error (MSE) between the predicted and true values.

Price prediction is one classic application of linear regression. It is a simple yet highly useful machine learning algorithm.

If you want to learn more about the types of machine learning algorithms, check the link here.

Happy Learning 🙂
