Outliers are data points that lie far away from the other observations, or unusual values that do not fit the rest of the data. In other words, outliers are data points that fall outside the mainstream of the dataset.
Impacts of outliers
In machine learning projects, it is important to remove outliers during model building because their presence can mislead the model.
The presence of outliers can shift the mean and standard deviation of the whole dataset, which can badly affect the performance of the model. Outliers also increase the error variance and reduce the power of statistical tests.
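To make the effect on the mean and standard deviation concrete, here is a small sketch using Python's standard `statistics` module. The ages are made-up values chosen only for illustration; they are not taken from any dataset used in this tutorial.

```python
import statistics

# Made-up ages for illustration only.
ages = [22, 25, 27, 30, 31, 35, 38, 40]
ages_with_outlier = ages + [200]  # one implausible value added

mean_clean = statistics.mean(ages)
mean_outlier = statistics.mean(ages_with_outlier)

# statistics.stdev computes the sample standard deviation.
std_clean = statistics.stdev(ages)
std_outlier = statistics.stdev(ages_with_outlier)

median_clean = statistics.median(ages)
median_outlier = statistics.median(ages_with_outlier)

print("mean:  ", mean_clean, "->", round(mean_outlier, 2))
print("std:   ", round(std_clean, 2), "->", round(std_outlier, 2))
print("median:", median_clean, "->", median_outlier)
```

In this toy sample a single extreme value pulls the mean and the standard deviation up sharply, while the median barely moves.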
Some of the reasons for the presence of outliers are as follows:
- Data entry error (human error)
- Experimental measurement error
- Measurement error (instrument error)
- Sampling error
Detection of outliers
Detecting outliers is one of the more challenging jobs in data cleaning. There is no single precise way to detect and remove outliers, because the right approach depends on the specific dataset. Still, reasonable assumptions and observations must be made to remove the data points that look unusual compared with the rest. The two ways to detect outliers are:
- Visualization method
- Statistical method
1. Visualization method
In this method, a visualization technique is used to identify the outliers in the dataset. Boxplots and scatterplots are the two plots used here to identify outliers. A boxplot is used for univariate analysis, while a scatterplot is used for multivariate analysis.
A boxplot is a graphical method of displaying numerical data based on a five-number summary, namely:
i. Minimum (0th percentile)
ii. Maximum (100th percentile)
iii. Median (50th percentile)
iv. First quartile (25th percentile)
v. Third quartile (75th percentile)
A boxplot also has lines, known as whiskers, that extend outward from the first and third quartiles to show the variability of the data beyond those quartiles.
This is a boxplot of the age of individuals, and the point that lies near the 200 mark is marked as an outlier. An age of 200 lies far away from the other data and is clearly unusual. This is how a boxplot (a visualization tool) is used for the detection of outliers.
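The five-number summary behind a boxplot can also be computed directly. The sketch below uses made-up ages, including one value of 200, and the standard library's `statistics.quantiles`. The choice of `method="inclusive"` is an assumption made here because it interpolates linearly between data points, which to my understanding matches the default behaviour of the pandas `quantile` method.

```python
import statistics

# Made-up ages for illustration only, including one implausible value.
ages = sorted([22, 25, 27, 30, 31, 35, 38, 40, 200])

# Cut points that split the sorted data into four equal parts.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

five_number_summary = {
    "minimum": ages[0],
    "first quartile": q1,
    "median": median,
    "third quartile": q3,
    "maximum": ages[-1],
}

for name, value in five_number_summary.items():
    print(f"{name}: {value}")
```

The maximum of 200 sits far above the third quartile of 38, which is the same gap the boxplot makes visible.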
A scatterplot is used for multivariate analysis for the detection of outliers. A data point lying far away from the other data points can be seen clearly in a scatterplot.
In the above scatterplot, two points lie very far from the other data points. By visualizing data using a scatterplot we can detect outliers.
2. Statistical method
Statistical terms such as standard deviation, interquartile range, and z-score are used for the detection and removal of outliers. In this tutorial, we’ll use the standard deviation method, interquartile range(IQR) method, and z-score method for outlier detection and removal.
Interquartile range (IQR) method
The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1).
In this method, anything lying above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR is considered an outlier.
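As a quick worked example of this rule, the sketch below applies it to a handful of made-up ages using only the standard library. The helper name `iqr_fences` and the parameter `k` are hypothetical names introduced for this illustration.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits located k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages for illustration only.
ages = [22, 25, 27, 30, 31, 35, 38, 40, 200]

lower_fence, upper_fence = iqr_fences(ages)
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print("Limits:", lower_fence, upper_fence)  # 10.5 54.5
print("Outliers:", outliers)                # [200]
```

Here Q1 is 27 and Q3 is 38, so the IQR is 11 and the limits fall at 27 - 16.5 = 10.5 and 38 + 16.5 = 54.5. Only the age of 200 lies outside them.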
For demonstration purposes, I’ll use Jupyter Notebook and a heart disease dataset from Kaggle. Let’s read the dataset and look at the first few rows.
Make sure you have installed pandas and seaborn using the command:
$ pip install pandas
$ pip install seaborn
import pandas as pd
df = pd.read_csv("heart.csv")
df.head()
This is the data frame and we’ll be using the ‘chol’ column for further analysis. First of all, we’ll see whether it has an outlier or not:
import seaborn as sns
# Pass the column explicitly as x so the plot is drawn the same way across seaborn versions
sns.boxplot(x=df['chol'])
We can see that there are some outliers. Now, we are going to see how these outliers can be detected and removed using the IQR technique.
For the IQR method, let’s first create a function:
def outliers(df, feature):
    # Quartiles of the chosen column
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    # Limits located 1.5 * IQR beyond the quartiles
    upper_limit = Q3 + 1.5 * IQR
    lower_limit = Q1 - 1.5 * IQR
    return upper_limit, lower_limit

upper, lower = outliers(df, "chol")
print("Upper whisker: ", upper)
print("Lower Whisker: ", lower)
Upper whisker: 369.75
Lower Whisker: 115.75
As discussed earlier, anything lying outside the range from 115.75 to 369.75 is an outlier.
Let’s take a look at the outliers:
df[(df['chol'] < lower) | (df['chol'] > upper)]
These are the outliers lying beyond the upper and lower limit computed with the IQR method.
To remove these outliers from the dataset:
new_df = df[(df['chol'] >= lower) & (df['chol'] <= upper)]
So, this new data frame new_df contains the data between the upper and lower limit as computed using the IQR method.
Using this method, we found that there are five (5) outliers in the dataset. This is how outliers can be easily detected and removed using the IQR method.
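The detection and removal steps above could also be wrapped in a single helper. The following is only a rough sketch: the function name `split_outliers_iqr`, its parameter `k`, and the small DataFrame are all hypothetical and made up for illustration, not taken from heart.csv.

```python
import pandas as pd

def split_outliers_iqr(df, feature, k=1.5):
    """Return (kept_rows, outlier_rows) using the IQR rule on one column."""
    q1 = df[feature].quantile(0.25)
    q3 = df[feature].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default
    inside = df[feature].between(lower, upper)
    return df[inside], df[~inside]

# Made-up cholesterol values for illustration only.
toy = pd.DataFrame({"chol": [180, 195, 210, 225, 240, 255, 270, 600]})

kept, removed = split_outliers_iqr(toy, "chol")
print(len(kept), "rows kept;", len(removed), "rows flagged")
print(removed)
```

Returning the flagged rows alongside the cleaned frame makes it easier to inspect what was dropped before committing to the removal.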
Standard deviation method
Standard deviation is a measure of how spread out the data points are around the mean value. It is common practice to treat values lying more than 3 standard deviations from the mean as outliers.
It is not mandatory to use 3 standard deviations for the removal of outliers. One can use 4 or even 5 standard deviations, depending on the requirement.
def outlier_removal(df, variable):
    # Limits located 3 standard deviations on either side of the mean
    upper_limit = df[variable].mean() + 3 * df[variable].std()
    lower_limit = df[variable].mean() - 3 * df[variable].std()
    return upper_limit, lower_limit

upper_limit, lower_limit = outlier_removal(df, "chol")
print("Upper limit: ", upper_limit)
print("Lower Limit: ", lower_limit)
Upper limit: 401.75627936643036
Lower Limit: 90.77177343885015
Anything that does not fall between the lower limit and the upper limit will be considered an outlier.
Now, take a look at these outliers:
df[(df['chol'] < lower_limit) | (df['chol'] > upper_limit)]
These are the outliers that are lying beyond the upper and lower limit as computed using the standard deviation method. Using this method, we found that there are 4 outliers in the dataset.
To remove these outliers from our dataset:
new_df = df[(df['chol'] > lower_limit) & (df['chol'] < upper_limit)]
This new data frame contains only those data points that are inside the upper and lower limit boundary. So, this is how we can easily detect and remove the outliers from our datasets.
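The number of standard deviations used as the cut-off is a choice, and the result can be sensitive to it. The sketch below uses made-up ages and a hypothetical helper `sigma_limits` to compare two cut-offs on a very small sample. It relies on `statistics.stdev`, which computes the sample standard deviation, as the pandas `std` method does by default.

```python
import statistics

def sigma_limits(values, k=3):
    """Return (lower, upper) limits located k standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return mean - k * std, mean + k * std

# Made-up ages for illustration only.
ages = [22, 25, 27, 30, 31, 35, 38, 40, 200]

flagged_by_k = {}
for k in (2, 3):
    low, high = sigma_limits(ages, k)
    flagged_by_k[k] = [a for a in ages if a < low or a > high]
    print(f"k={k}: limits=({low:.2f}, {high:.2f}) flagged={flagged_by_k[k]}")
```

In this tiny sample the age of 200 is flagged with 2 standard deviations but not with 3, because the extreme value itself inflates the standard deviation. On small datasets it is worth checking the limits before trusting a fixed cut-off.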
Z-score method
Z-score is the measure of how many standard deviations a data point is away from the mean. The formula used to calculate the z-score is:
z = (x - μ) / σ
where:
x = data point
μ = mean
σ = standard deviation
The z-score method is similar to the standard deviation method for outlier detection and removal. One can use either of these two methods (z-score or standard deviation) for outlier treatment.
Let’s see how a z-score is used to detect and remove the outliers:
df['z_score'] = (df['chol'] - df['chol'].mean()) / df['chol'].std()
df.head()
Now, using this calculated z-score we’ll mark outliers if the z-score is above 3 or below -3.
df[(df['z_score'] < -3) | (df['z_score'] > 3)]
We obtained these outliers by selecting the rows with a z-score below -3 or above 3. We can see that the outliers obtained from the z-score method and the standard deviation method are exactly the same.
A data point with a z-score greater than 3 is more than 3 standard deviations above the mean, and one with a z-score below -3 is more than 3 standard deviations below it. This is the same concept applied in the standard deviation method. So, the z-score method is an alternative to the standard deviation method of outlier detection. Using this method, we found that there are 4 outliers in the dataset.
To remove these outliers we can do:
new_df = df[(df['z_score'] < 3) & (df['z_score'] > -3)]
This new data frame is free from outliers: it contains only the rows with a z-score between -3 and 3.
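The equivalence between the z-score method and the standard deviation method can be checked on a small example. This is a minimal sketch with made-up cholesterol-like values (not from heart.csv) and the standard library only. The variable names are hypothetical.

```python
import statistics

# Made-up values for illustration only: 19 typical readings and one extreme one.
values = [204, 211, 212, 219, 222, 226, 229, 233, 234, 236,
          239, 240, 243, 245, 250, 254, 256, 258, 263, 564]

mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation

# Z-score method: flag points whose z-score is below -3 or above 3.
z_scores = [(x - mean) / std for x in values]
z_flagged = [x for x, z in zip(values, z_scores) if z < -3 or z > 3]

# Standard deviation method: flag points outside mean +/- 3 * std.
lower_limit = mean - 3 * std
upper_limit = mean + 3 * std
limit_flagged = [x for x in values if x < lower_limit or x > upper_limit]

print("z-score method:           ", z_flagged)
print("standard deviation method:", limit_flagged)
```

Both lists contain the same points because z > 3 is just a rearrangement of x > mean + 3 * std, and likewise for the lower side.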
Outlier detection and removal is an important task in the data cleaning process. These unusual data points may change the standard deviation and mean of the dataset, causing poor performance of the machine learning model.
Hence, outliers must be removed from the dataset for better performance of the model but it is not always an easy task.
Outliers can be detected using visualization tools such as boxplots and scatterplots. Some of the statistical methods such as IQR, standard deviation, and z-score methods can be implemented for the detection and removal of outliers.
As we saw above, the z-score method and the standard deviation method flag exactly the same outliers. In some cases the detection of outliers is easy, but in others it can be challenging, and one should choose the method that fits the requirement.
Happy Learning 🙂