How to Remove Outliers in Python Pandas Package

Introduction

Outliers are data points that lie far away from the other observations in a dataset. In other words, outliers are values that do not fit the general pattern of the data.

 

Impacts of outliers

In machine learning projects, it is important to remove outliers during model building because their presence can mislead the model.

Outliers can shift the mean and standard deviation of the whole dataset, which can badly affect the performance of the model. They also increase the error variance and reduce the power of statistical tests.
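As a rough illustration of this effect, the sketch below uses made-up numbers (not the tutorial's dataset) to show how a single extreme value shifts the mean and standard deviation while the median barely moves. The variable names are purely illustrative.

```python
import pandas as pd

# Hypothetical measurements that cluster around 50.
clean = pd.Series([48, 50, 51, 49, 52, 50])

# The same measurements plus one extreme value (assumed for illustration).
with_outlier = pd.concat([clean, pd.Series([500])], ignore_index=True)

# Compare the summary statistics with and without the extreme value.
print("mean:  ", clean.mean(), "->", with_outlier.mean())
print("std:   ", clean.std(), "->", with_outlier.std())
print("median:", clean.median(), "->", with_outlier.median())
```

The mean and standard deviation change sharply, while the median is unchanged, which is why statistics based on the mean are the ones most affected by outliers.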

Some of the reasons for the presence of outliers are as follows:

  • Data entry error (human error)
  • Experimental measurement error
  • Measurement error (instrument error)
  • Sampling error

 

Detection of outliers

Detecting outliers is one of the more challenging jobs in data cleaning. There is no single precise way to detect and remove outliers, because the right approach depends on the specific dataset. Instead, reasonable assumptions and observations are needed to decide which values are unusual compared with the rest of the data. The two ways to detect outliers are:

  1. Visualization method
  2. Statistical method

1. Visualization method

In this method, a visualization technique is used to identify the outliers in the dataset. Boxplots and scatterplots are the two plots used here to identify outliers. A boxplot is used for univariate analysis, while a scatterplot is used for multivariate analysis.

Boxplot

A boxplot is a graphical method of displaying numerical data based on a five-number summary, namely:

i. Minimum (0th percentile)

ii. Maximum (100th percentile)

iii. Median (50th percentile)

iv. First quartile (25th percentile)

v. Third quartile (75th percentile)

A boxplot also has lines, known as whiskers, that extend from the first and third quartiles to show the variability of the data outside the box. Points that lie beyond the whiskers are drawn individually as potential outliers.

This is a boxplot of the age of individuals, and the point that lies near the 200 mark is marked as an outlier. An age of 200 lies far away from the other data and is clearly unusual. This is how a boxplot (a visualization tool) is used for the detection of outliers.
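The numbers behind such a boxplot can also be checked directly. The sketch below is a minimal example with made-up ages (the value 200 stands in for the unusual point described above); it computes the five-number summary and flags points using the common 1.5 * IQR whisker rule that is explained in the IQR section of this tutorial.

```python
import pandas as pd

# Hypothetical ages; 200 plays the role of the unusual point in the plot.
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 200], name="age")

# Five-number summary: minimum, Q1, median, Q3, maximum.
summary = ages.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(summary)

# Points farther than 1.5 * IQR from the box are drawn as outlier markers.
q1, q3 = summary.loc[0.25], summary.loc[0.75]
iqr = q3 - q1
flagged = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(flagged.tolist())
```

With these assumed values, only the age of 200 falls outside the whisker limits.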

 

Scatterplot

A scatterplot is used for multivariate analysis when detecting outliers. Data points that lie far away from the other data points can be spotted visually in a scatterplot.

In the above scatterplot, two points lie very far from the other data points. By visualizing data with a scatterplot, we can detect such outliers.
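A plot like the one described can be produced with a few lines of code. This is only a sketch under stated assumptions: the data is made up, the column names `age` and `chol` are chosen for illustration, and it relies on matplotlib being installed, since pandas uses it for plotting.

```python
import pandas as pd

# Made-up data: most points follow a loose trend, two points sit far away.
points = pd.DataFrame({
    "age":  [29, 35, 41, 47, 52, 58, 63, 45, 38],
    "chol": [180, 195, 210, 230, 240, 250, 260, 600, 40],
})

# pandas delegates to matplotlib and returns the Axes it drew on.
ax = points.plot.scatter(x="age", y="chol")
ax.set_title("Cholesterol vs. age (hypothetical data)")
```

In a Jupyter Notebook the figure is displayed below the cell, and the two rows with `chol` values of 600 and 40 stand out from the rest.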

 

2. Statistical method

Statistical measures such as the standard deviation, interquartile range, and z-score are used for the detection and removal of outliers. In this tutorial, we’ll use the standard deviation method, the interquartile range (IQR) method, and the z-score method for outlier detection and removal.

Interquartile range (IQR)

The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1). In this method, anything lying above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR is considered an outlier. For demonstration purposes, I’ll use Jupyter Notebook and the heart disease dataset from Kaggle. Let’s read the dataset and look at its first few rows.

Make sure you have installed pandas and seaborn using the command:

$ pip install pandas
$ pip install seaborn

 

import pandas as pd
df = pd.read_csv("heart.csv")
df.head()

Output

dataset

This is the data frame, and we’ll be using the ‘chol’ column for further analysis. First of all, we’ll check whether it has any outliers:

import seaborn as sns

sns.boxplot(x=df['chol'])

Output

boxplot

We can see that there are some outliers. Now, we are going to see how these outliers can be detected and removed using the IQR technique.

For the IQR method, let’s first create a function:

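def outliers(df, feature):
    # First and third quartiles of the chosen column
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    # Limits beyond which a value is treated as an outlier
    upper_limit = Q3 + 1.5 * IQR
    lower_limit = Q1 - 1.5 * IQR
    return upper_limit, lower_limit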

upper, lower = outliers(df, "chol")
print("Upper whisker: ", upper)
print("Lower Whisker: ", lower)

Output

Upper whisker:  369.75
Lower Whisker:  115.75

As discussed earlier, anything lying outside the range from 115.75 to 369.75 is an outlier.

Let’s take a look at the outliers:

df[(df['chol'] < lower) | (df['chol'] > upper)]

Output

outliers

These are the outliers lying beyond the upper and lower limits computed with the IQR method.

 

To remove these outliers from the dataset:

new_df = df[(df['chol'] > lower) & (df['chol'] < upper)]

So, this new data frame, new_df, contains the rows that lie between the lower and upper limits computed using the IQR method.

Using this method, we found that there are five (5) outliers in the dataset. This is how outliers can be detected and removed using the IQR method.
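To see the whole IQR workflow end to end without the Kaggle file, here is a minimal self-contained sketch on made-up data. The helper name `iqr_bounds`, its `k` parameter, and the toy values are assumptions for illustration and are not part of the tutorial's code above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits; k=1.5 is the conventional multiplier."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical cholesterol-like values with one extreme reading.
toy = pd.DataFrame({"chol": [150, 200, 210, 220, 230, 240, 250, 260, 270, 560]})
low, high = iqr_bounds(toy["chol"])

# One boolean mask drives both detection and removal.
is_outlier = (toy["chol"] < low) | (toy["chol"] > high)
outliers_df = toy[is_outlier]
clean_df = toy[~is_outlier]

print(len(toy), len(outliers_df), len(clean_df))
```

Building the cleaned frame from the negated mask guarantees that every row ends up either in `outliers_df` or in `clean_df`. Filtering with strict `>` and `<` comparisons instead would silently drop a row whose value equals one of the limits.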

 

Standard deviation method

The standard deviation measures how spread out the data points are around the mean value. It is common practice to treat values that lie more than 3 standard deviations from the mean as outliers.

Using 3 standard deviations is not mandatory; one can use 4 or even 5 standard deviations, depending on the requirement.

def outlier_removal(df, variable):
    upper_limit = df[variable].mean() + 3 * df[variable].std()
    lower_limit = df[variable].mean() - 3 * df[variable].std()
    return upper_limit, lower_limit

upper_limit, lower_limit = outlier_removal(df, "chol")
print("Upper limit: ", upper_limit)
print("Lower Limit: ", lower_limit)

Output

Upper limit:  401.75627936643036
Lower Limit:  90.77177343885015

Anything that doesn’t fall between these upper and lower limits will be considered an outlier.

Now, let’s take a look at these outliers:

df[(df['chol'] < lower_limit) | (df['chol'] > upper_limit)]

Output

outliers

These are the outliers that lie beyond the upper and lower limits computed using the standard deviation method. Using this method, we found that there are 4 outliers in the dataset.

 

To remove these outliers from our dataset:

new_df = df[(df['chol'] > lower_limit) & (df['chol'] < upper_limit)]

This new data frame contains only those data points that lie inside the upper and lower limits. So, this is how we can detect and remove outliers from our dataset.
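Since the number of standard deviations is a choice rather than a fixed rule, it can be convenient to make it a parameter. The sketch below is one possible way to do that; the helper `std_bounds`, its `n_std` parameter, and the toy data are hypothetical and only meant to show how the threshold changes what gets flagged.

```python
import pandas as pd

def std_bounds(series, n_std=3.0):
    """Return (lower, upper) limits at n_std standard deviations from the mean."""
    mean = series.mean()
    std = series.std()  # pandas computes the sample standard deviation (ddof=1)
    return mean - n_std * std, mean + n_std * std

# Made-up values: 19 readings between 191 and 209, plus one extreme reading.
values = pd.Series(list(range(191, 210)) + [600], name="chol")

flagged_by_threshold = {}
for n_std in (3, 5):
    low, high = std_bounds(values, n_std)
    flagged = values[(values < low) | (values > high)]
    flagged_by_threshold[n_std] = flagged.tolist()
    print(n_std, "standard deviations ->", flagged.tolist())
```

With this toy data the extreme reading is flagged at 3 standard deviations but not at 5, so a larger threshold keeps more of the data. Note that in very small samples no point can be more than 3 sample standard deviations from the mean, so this method needs a reasonable number of rows to flag anything.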

 

Z-score method

The z-score measures how many standard deviations a data point lies away from the mean. The formula used to calculate the z-score is:

z = (x - μ) / σ

where,

x = data point

μ = mean

σ = standard deviation

The z-score method is similar to the standard deviation method for outlier detection and removal. Either of these two methods (z-score or standard deviation) can be used for outlier treatment.

Let’s see how a z-score is used to detect and remove the outliers:

df['z_score'] = (df['chol'] - df['chol'].mean()) / df['chol'].std()
df.head()

Output

dataset

Now, using this calculated z-score we’ll mark outliers if the z-score is above 3 or below -3.

df[(df['z_score'] < -3) | (df['z_score'] > 3)]

Output

outliers

These are the rows whose z-score is below -3 or above 3. We can see that the outliers obtained from the z-score method and the standard deviation method are exactly the same.

A data point with a z-score greater than 3 lies more than 3 standard deviations above the mean, which is the same concept applied in the standard deviation method. So, the z-score method is an alternative to the standard deviation method of outlier detection. Using this method, we found that there are 4 outliers in the dataset.

To remove these outliers, we can do:

new_df = df[(df['z_score'] < 3) & (df['z_score'] > -3)]

This new data frame contains only the rows whose z-score lies between -3 and 3, so it is free from the outliers identified above.
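The claim that the z-score and standard deviation methods flag the same rows can be checked directly. The following sketch uses made-up values rather than the heart disease dataset, and the variable names are illustrative only.

```python
import pandas as pd

# Made-up values: twenty readings from 50 to 69, plus one extreme reading.
values = pd.Series(list(range(50, 70)) + [400], name="chol")

mean, std = values.mean(), values.std()

# Standard deviation method: compare raw values with mean +/- 3 * std.
std_mask = (values < mean - 3 * std) | (values > mean + 3 * std)

# Z-score method: standardize first, then compare with +/- 3.
z_score = (values - mean) / std
z_mask = (z_score < -3) | (z_score > 3)

print("Same rows flagged:", std_mask.equals(z_mask))
print("Flagged values:", values[z_mask].tolist())
```

Both masks come from the same inequality, just rearranged, which is why the two methods agree as long as the same mean and standard deviation are used.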

 

Conclusion

Outlier detection and removal is an important task in the data cleaning process. These unusual values may change the standard deviation and mean of the dataset, causing poor performance of the machine learning model.

Hence, outliers must be removed from the dataset for better performance of the model, but this is not always an easy task.

Outliers can be detected using visualization tools such as boxplots and scatterplots. Statistical methods such as the IQR, standard deviation, and z-score methods can be used for the detection and removal of outliers.

As we saw above, the z-score method and the standard deviation method give exactly the same result. In some cases detecting outliers is easy, while in others it is challenging, and one should choose the method that fits the requirement.


Happy Learning 🙂
