# Linear Regression in Machine Learning with Examples

## Introduction

Linear regression is a supervised machine learning model used for regression tasks. It models a linear relationship between a dependent variable and one or more independent variables, hence the name linear regression.

Regression predicts a target (dependent) variable from independent variables. It is used to quantify the relationship between variables and for forecasting. Depending on the number of independent variables, linear regression is of two types:

1. Simple Linear Regression
2. Multiple Linear Regression

### Simple Linear Regression

In simple linear regression, there is only one independent variable. The formula used to describe the relationship between the dependent and independent variable is:

`y = Ø1 + Ø2*x`

y = Dependent variable (output variable)

x = Independent variable

Ø1 = Intercept

Ø2 = Slope

The simple regression model tries to find the ‘best-fit line’ (blue-colored line in the figure above) by adjusting the slope (Ø2) and the intercept (Ø1). The best-fit line is the line for which the sum of the squared differences between the predicted values and the true values is minimal.

In other words, the sum of the squared vertical distances from the points to that line is minimal. Once the best Ø1 and Ø2 are available, the model is ready to predict the output for a given input.
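To make the idea concrete, here is a minimal sketch of fitting the intercept and slope with the standard closed-form least-squares formulas. The toy data and the helper name `fit_simple_linear` are assumptions made for illustration; they do not come from the dataset used later in this article.

```python
# Toy data (hypothetical): points that lie exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_simple_linear(x, y):
    """Return (intercept, slope) that minimize the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The best-fit line always passes through the point (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_simple_linear(x, y)
print("Intercept (Ø1):", intercept)
print("Slope (Ø2):", slope)
```

Because the toy points lie exactly on a line, the fitted intercept and slope recover that line; with noisy data they would give the line with the smallest total squared error.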

### Multiple Linear Regression

In practice, the output variable usually depends on more than one independent variable; this case is called multiple linear regression. It also models a linear relationship between the dependent and independent variables. The formula used is:

`y = Øo + Ø1*x1 + Ø2*x2 + . . . + Øn*xn`

y = Dependent variable

x1, x2, . . . , xn = Independent variables

Øo = Intercept

Øi = Slope coefficient for the i-th independent variable, i = 1, 2, 3, . . . , n

n = Number of independent variables

The best fit (a plane or hyperplane when there is more than one independent variable) is determined by tuning the values of Øo and Øi such that the sum of the squared differences between the predicted and real values is minimal.
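As a rough illustration of the multiple-variable case, the sketch below solves the least-squares problem with NumPy's `np.linalg.lstsq`. The toy data is hypothetical and was generated from y = 1 + 2*x1 + 3*x2, so the recovered coefficients can be checked by eye.

```python
import numpy as np

# Toy data (hypothetical): two independent variables per row.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
# Targets generated from y = 1 + 2*x1 + 3*x2.
y = np.array([9, 8, 19, 18, 26], dtype=float)

# Prepend a column of ones so the first coefficient acts as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: minimizes the sum of squared differences.
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)

print("Intercept (Øo):", coef[0])
print("Slopes (Ø1, Ø2):", coef[1:])
```

Adding the column of ones is a common trick: it lets the intercept be estimated together with the slope coefficients in a single solve.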

### Cost Function

After we’ve trained our learning algorithm and got a hypothesis, we need to examine how good our results are. This is done by the so-called cost function.

The cost function measures how far the hypothesis outputs are from the truth. It does this by comparing the values predicted by the hypothesis with the actual values.

In finding the best-fit regression line, the model aims to predict the ‘y’ value such that the error between the predicted value and the real value is as small as possible.

So, the coefficients must be updated (Øo and Øi in multiple linear regression, Ø1 and Ø2 in simple linear regression) until they reach the values that minimize the error between the predicted and true values.

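The cost function (J) of linear regression is commonly the Mean Squared Error (MSE) or its square root, the Root Mean Squared Error (RMSE), between the predicted y and the true value of y. Since the square root is increasing, the coefficients that minimize one also minimize the other.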
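A short sketch of how this cost could be computed by hand is shown below. RMSE is the square root of the mean squared error (MSE), so both are calculated. The numbers and the function names `mse` and `rmse` are made up for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between predictions and true values."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as y."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical true values and predictions.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

cost_mse = mse(y_true, y_pred)
cost_rmse = rmse(y_true, y_pred)
print("MSE :", cost_mse)
print("RMSE:", cost_rmse)
```

Here the errors are -1, 0, and 2, so the squared errors are 1, 0, and 4, and the MSE is their average, 5/3.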

To update the coefficients so that the cost function is reduced and the best-fit line is reached, one common approach is gradient descent. The idea is to start with random coefficient values and then iteratively update them, moving step by step toward the minimum cost.
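The loop below is a minimal sketch of gradient descent for simple linear regression, minimizing the MSE. The toy data, the learning rate, and the iteration count are assumptions chosen so that this example converges, and the coefficients start at zero rather than at random values to keep the run repeatable.

```python
# Toy data (hypothetical): points that lie exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
n = len(x)

intercept, slope = 0.0, 0.0   # starting guess (zero instead of random)
learning_rate = 0.05          # step size, assumed; too large a value diverges
iterations = 5000             # assumed; enough for this toy problem

for _ in range(iterations):
    # Prediction errors with the current coefficients.
    errors = [(intercept + slope * xi) - yi for xi, yi in zip(x, y)]
    # Partial derivatives of the MSE with respect to each coefficient.
    grad_intercept = (2.0 / n) * sum(errors)
    grad_slope = (2.0 / n) * sum(e * xi for e, xi in zip(errors, x))
    # Step against the gradient to reduce the cost.
    intercept -= learning_rate * grad_intercept
    slope -= learning_rate * grad_slope

print("Intercept:", intercept)
print("Slope:", slope)
```

Each pass moves both coefficients a small step in the direction that lowers the cost, so the values drift toward the intercept and slope of the line that generated the data.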

### Example

We’ll take a small example to see how linear regression works. For this, we’ll create a dummy dataset with ‘age’ and ‘no of hours’ as input features and ‘salary’ as the output. For the demonstration, I’ll be using Jupyter Notebook.

First, we’ll create the dummy dataset:

```python
import pandas as pd

# Dummy dataset: two input features and one target column
info = {
    'no of hours': [1, 2, 5, 7, 8, 10, 12, 15, 17],
    'age': [20, 34, 21, 27, 34, 21, 20, 45, 31],
    'salary': [1000, 3000, 5000, 8000, 8500, 9000, 12000, 15000, 22000]
}

df = pd.DataFrame(info)
print(df)
```

Let’s visualize the dataset. First, we’ll import matplotlib and seaborn and plot age against salary:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of age against salary
sns.scatterplot(x="age", y="salary", data=df)
plt.xlabel("age")
plt.ylabel("salary")
plt.title("age vs salary")
plt.show()
```

Next, we’ll visualize the ‘no of hours’ vs ‘salary’ graph:

```python
# Scatter plot of no of hours against salary
sns.scatterplot(x="no of hours", y="salary", data=df)
plt.xlabel("no of hours")
plt.ylabel("salary")
plt.title("no of hours vs salary")
plt.show()
```

We will also take a look at the ‘age’ vs ‘no of hours’ graph:

```python
# Scatter plot of age against no of hours (the two input features)
sns.scatterplot(x="age", y="no of hours", data=df)
plt.xlabel("age")
plt.ylabel("no of hours")
plt.title("age vs no of hours")
plt.show()
```

Now, we will use a linear regression model to predict the salary based on the hours and age. The equation will be of the form:

`salary = Øo + Ø1 * no of hours + Ø2 * age`

Øo = Intercept

Ø1 = Coefficient of no of hours

Ø2 = Coefficient of age

Now, we will start building the model. Let’s select the features and target variables:

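```python
X = df.iloc[:, :2]   # features: the first two columns ('no of hours', 'age')
y = df.iloc[:, -1]   # target: the last column ('salary')
```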

Then, we’ll import the necessary libraries as:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
```

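Now, we split the dataset into training and testing sets. With 9 rows and `test_size = 0.25`, 3 rows go to the test set. Because no `random_state` is set, the split is random, so the numbers shown below will differ from run to run: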

`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)`
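If a repeatable split is wanted, `train_test_split` accepts a `random_state` argument. The sketch below uses the same dummy data as plain lists, with a hypothetical seed of 42; the variable names are chosen here so they do not overwrite the `X` and `y` used in the walkthrough, and this seed is not claimed to reproduce the outputs shown in this article.

```python
from sklearn.model_selection import train_test_split

# Same dummy data as above, as plain lists: [no of hours, age] and salary.
features = [[1, 20], [2, 34], [5, 21], [7, 27], [8, 34],
            [10, 21], [12, 20], [15, 45], [17, 31]]
targets = [1000, 3000, 5000, 8000, 8500, 9000, 12000, 15000, 22000]

SEED = 42  # hypothetical seed; any fixed integer makes the split repeatable

# Each call returns [train features, test features, train targets, test targets].
split_a = train_test_split(features, targets, test_size=0.25, random_state=SEED)
split_b = train_test_split(features, targets, test_size=0.25, random_state=SEED)

print("Train rows:", len(split_a[0]), "Test rows:", len(split_a[1]))
print("Same split both times:", split_a == split_b)
```

Fixing the seed trades randomness for reproducibility, which is usually what a tutorial or a bug report needs.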

Build the model as:

```python
lr = LinearRegression()
model = lr.fit(X_train, y_train)   # learn the intercept and coefficients
pred = model.predict(X_test)       # predict salaries for the test rows
print(pred)
```

Output

`[ 6454.68201512 12813.11470225 24376.50611935]`

Now, let’s see the values of Øo, Ø1, and Ø2

`print("Intercept :",model.intercept_)`

Output

`Intercept : -8477.293570728314`

Here we can see that the value of intercept(Øo) = -8477.293570728314

`print("Slope :", model.coef_)`

Output

`Slope : [1059.73878119  376.83817716]`

As a result:

Ø1 -> coefficient of no of hours = 1059.73878119

Ø2 -> coefficient of age = 376.83817716
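As a sanity check, the fitted equation can be applied by hand. The sketch below plugs the intercept and coefficients printed above into `salary = Øo + Ø1 * no of hours + Ø2 * age`. The three test rows are an inference, not something the article states: they are the dataset rows whose hand-computed predictions match the printed `pred` values. The helper name `predict_salary` is also made up for this sketch.

```python
import math

# Values printed above by model.intercept_ and model.coef_.
intercept = -8477.293570728314
coef_hours = 1059.73878119
coef_age = 376.83817716


def predict_salary(hours, age):
    """Apply the fitted equation: salary = Øo + Ø1 * hours + Ø2 * age."""
    return intercept + coef_hours * hours + coef_age * age


# Inferred test rows (assumption): (no of hours, age, actual salary).
test_rows = [(2, 34, 3000), (8, 34, 8500), (15, 45, 15000)]

preds = [predict_salary(hours, age) for hours, age, _ in test_rows]
print("Predictions:", preds)

# RMSE between the hand-computed predictions and the actual salaries.
squared_errors = [(p - actual) ** 2 for p, (_, _, actual) in zip(preds, test_rows)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print("RMSE on these rows:", rmse)
```

With only six training rows, a large error on the held-out rows is not surprising; the example is meant to show the mechanics rather than a well-tuned model.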

## Conclusion

Linear regression is a machine learning algorithm used for regression analysis. The model finds the linear relationship between the dependent and independent variables that minimizes the Root Mean Squared Error (RMSE) between the predicted and true values.

Price prediction is one example application of linear regression. Linear regression is a simple yet very useful machine learning algorithm.