Concept of Correlation in Statistics

Introduction

Correlation measures how strongly two variables are related. Pearson's correlation, the kind discussed in this article, captures the linear relationship between two variables. The coefficient that measures it is called the correlation coefficient and is denoted by ‘r’. It takes values from -1 to +1: the sign gives the direction of the relationship and the magnitude gives its strength.

 

Types of correlation

Positive correlation / direct correlation

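In a positive correlation, the values of two variables move in the same direction: if one increases the other tends to increase, and if one decreases the other tends to decrease. The value of ‘r’ for a positive correlation is greater than 0 and at most 1. For example, r = 0.85 indicates a strong positive linear relationship. Note that r is not a slope: it does not mean that y increases by 0.85 units when x increases by 1 unit. It describes how closely the points follow a rising straight line.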

For example:
1. There is a positive correlation between the percentage of marks obtained and the number of hours spent studying.

2. There is a positive correlation between how cold the winter season is and the number of sweaters sold per month.

Let ‘Marks obtained’ and ‘Number of hours studied’ be two variables. If they are plotted against each other in two dimensions, we can see a linear relationship between them.

[Figure: scatterplot of marks obtained against number of hours studied, showing a positive correlation]

The above scatterplot shows that there is a strong positive correlation between the marks obtained and the number of hours studied.
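As a rough illustration of this kind of relationship, the sketch below computes r with NumPy for made-up study data. The values in `hours` and `marks` are hypothetical and are not taken from the figure. It also rescales the marks to show that r stays the same while the slope of the fitted line changes, which is why r should be read as a measure of strength rather than as "units of y per unit of x".

```python
import numpy as np

# Hypothetical data: hours studied and marks obtained (illustrative only).
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
marks = np.array([35, 42, 50, 55, 61, 70, 74, 83], dtype=float)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r = np.corrcoef(hours, marks)[0, 1]

# Slope of the least-squares line: marks = slope * hours + intercept.
slope, intercept = np.polyfit(hours, marks, 1)

# Rescale the marks by a factor of 10 and recompute both quantities.
marks_scaled = marks * 10
r_scaled = np.corrcoef(hours, marks_scaled)[0, 1]
slope_scaled, _ = np.polyfit(hours, marks_scaled, 1)

print(f"original: r = {r:.3f}, slope = {slope:.2f}")
print(f"rescaled: r = {r_scaled:.3f}, slope = {slope_scaled:.2f}")
```

The slope grows tenfold after rescaling, while r is unchanged.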

Negative correlation

In a negative correlation, the values of two variables move in opposite directions: if one increases the other tends to decrease, and vice versa. The value of ‘r’ for a negative correlation is less than 0 and at least -1. For example, r = -0.40 indicates a moderate negative linear relationship. It does not mean that y decreases by 0.4 units for each unit increase in x.

For example:
1. The year and the stock return of a business firm whose return declines year after year.

2. How cold it gets in the winter season and the number of ice creams sold.

Suppose a firm's stock return has been decreasing over the years. If the stock return and the year are plotted in two dimensions, we can see a linear relationship with a downward slope.

[Figure: scatterplot of stock return against year, showing a negative correlation]

From the above figure, we can say that there is a strong negative correlation.
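A minimal sketch of the same idea in code, assuming made-up yearly returns (the numbers in `years` and `returns` are hypothetical and not read from the figure):

```python
import numpy as np

# Hypothetical yearly stock returns (in percent) for a firm; illustrative only.
years = np.array([2015, 2016, 2017, 2018, 2019, 2020], dtype=float)
returns = np.array([12.0, 10.5, 9.8, 7.9, 6.1, 4.0])

# The off-diagonal entry of the 2x2 correlation matrix is r.
r = np.corrcoef(years, returns)[0, 1]
print(f"r = {r:.3f}")  # negative, because returns fall as the year increases
```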

No correlation

When there is no correlation, there is no linear relationship between the two variables. This corresponds to r = 0 or, in practice, a value very close to 0.
For example:
1. The temperature of the moon and the number of ice creams sold each day in January have no correlation.

2. The probability of rainfall and the day of the week have no correlation.

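[Figure: scatterplot of the temperature of the moon against the number of ice creams sold, showing no correlation]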

From the above figure, we can see that there is no linear relationship between the temperature of the moon and the number of ice creams sold.
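It may help to note that r = 0 only rules out a linear relationship. The sketch below uses a small, made-up example (the arrays `x` and `y` are assumptions for illustration) in which y is completely determined by x, yet Pearson's r is zero because the relationship is a symmetric curve rather than a line.

```python
import numpy as np

# x is symmetric around 0; y depends on x exactly, but not linearly.
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

# Pearson's r only detects linear association, so it comes out as zero here.
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")
```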

Correlation coefficient

The correlation coefficient describes the degree and direction of the linear relationship between two variables. Its value ranges from -1 to +1. A value greater than 0 (up to +1) indicates a positive correlation, a value less than 0 (down to -1) indicates a negative correlation, and a value of 0 indicates no correlation.

The formula to calculate Karl Pearson’s coefficient of correlation of two variables x and y is

[Figure: formula of the correlation coefficient]
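The formula image is not reproduced in this text. One standard way to write Pearson's coefficient for n paired observations (x_i, y_i), which may differ in notation from the original figure, is:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here \bar{x} and \bar{y} are the sample means of x and y.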

Correlation is an important concept in machine learning, where it is used to examine how the features of a dataset relate to each other. With pandas, the correlation between features (variables) can be calculated using the DataFrame corr() function, which computes Pearson's coefficient by default.
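To make the link between the definition and the library call concrete, the sketch below computes Pearson's r by hand (the sum of products of deviations from the means, divided by the product of the square roots of the summed squared deviations) and compares it with pandas. The helper `pearson_r` and the small DataFrame `df_small` are hypothetical names and data introduced only for this illustration.

```python
import math
import pandas as pd

def pearson_r(xs, ys):
    """Pearson's r computed directly from the definition (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square roots of the sums of squared deviations.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical feature columns, for illustration only.
df_small = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

manual = pearson_r(df_small["x"].tolist(), df_small["y"].tolist())
from_pandas = df_small["x"].corr(df_small["y"])  # Series.corr uses Pearson by default

print(f"manual = {manual:.4f}, pandas = {from_pandas:.4f}")
```

The two values should agree up to floating-point rounding.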

For demonstration purposes, let’s suppose the following data

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data: each key becomes a column (feature) of the DataFrame.
data = {
    'No_of_hours_study' : [1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9],
    'GPA' : [1.02, 1.04, 1.1, 1.15, 1.25, 1.45, 1.75, 1.9, 2.1, 2.5, 3.2, 3.5, 3.6, 3.8, 4, 4],
    'Playing_Hour' : [6, 5.8, 5.5, 5.1, 4.9, 4.7, 4.4, 4.1, 4, 3.8, 3.6, 3.4, 3.1, 2.9, 2.5, 2.3],
    'Age' : [15, 11, 13, 17, 18, 20, 23, 11.5, 21.6, 18.9, 19.4, 18, 14, 10, 9.5, 17]
}
df = pd.DataFrame(data)

# df.corr() computes the pairwise Pearson correlation matrix;
# annot=True writes each coefficient inside its heatmap cell.
sns.heatmap(df.corr(), annot=True, cmap="YlGnBu")
plt.show()

Output

[Figure: correlation heatmap]

This is how the correlation between variables looks in a heatmap. There is a high positive correlation between No_of_hours_study and GPA, a high negative correlation between GPA and Playing_Hour, and almost no correlation between No_of_hours_study and Age.
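If the numbers are needed rather than the picture, individual coefficients can be read from the correlation matrix by row and column label. The sketch below reuses the data from the example above; the variable names `corr`, `r_study_gpa`, `r_gpa_play` and `r_study_age` are introduced here only for illustration.

```python
import pandas as pd

# Same data as in the heatmap example above.
data = {
    'No_of_hours_study': [1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9],
    'GPA': [1.02, 1.04, 1.1, 1.15, 1.25, 1.45, 1.75, 1.9, 2.1, 2.5, 3.2, 3.5, 3.6, 3.8, 4, 4],
    'Playing_Hour': [6, 5.8, 5.5, 5.1, 4.9, 4.7, 4.4, 4.1, 4, 3.8, 3.6, 3.4, 3.1, 2.9, 2.5, 2.3],
    'Age': [15, 11, 13, 17, 18, 20, 23, 11.5, 21.6, 18.9, 19.4, 18, 14, 10, 9.5, 17],
}
df = pd.DataFrame(data)

# Pairwise Pearson correlation matrix (a DataFrame indexed by column names).
corr = df.corr()

# Look up individual coefficients by row and column label.
r_study_gpa = corr.loc['No_of_hours_study', 'GPA']
r_gpa_play = corr.loc['GPA', 'Playing_Hour']
r_study_age = corr.loc['No_of_hours_study', 'Age']

print(corr.round(2))
```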
