How to Handle Imbalanced Datasets in Machine Learning

Introduction

An imbalanced dataset is a dataset in which the data is unequally distributed among the classes. Most machine learning algorithms work well with balanced datasets, that is, datasets in which the different classes contain an almost equal number of samples.

Let’s say there is a dataset in which 99% of the data belongs to the majority class and only 1% belongs to the minority class.

This is an example of an imbalanced dataset. A dataset with roughly a 50-50 split between the classes, on the other hand, is an example of a balanced dataset. We need to handle imbalanced datasets so that our model performs better.

Why do we need balanced datasets?

If a machine learning algorithm is trained on an imbalanced dataset, the model will become biased towards the majority class. When a model is fed a huge dataset in which most of the samples belong to the majority class, it will learn to predict the majority class almost every time.

For example

Let’s say that an imbalanced dataset has 95% of its data in one class and only 5% in the other. When this dataset is fed to a machine learning algorithm, the model can score about 95% accuracy simply by always predicting the majority class.

But is that the result we are seeking? Not at all. Whatever data you pass to the model for prediction, it will predict the majority class. This produces a dumb model that reports 95% accuracy yet misclassifies virtually every minority-class sample. Hence we need to handle imbalanced datasets.
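To make this concrete, here is a minimal sketch (using made-up synthetic labels, not the Kaggle dataset used below) of a "dumb" classifier that always predicts the majority class. It reaches 95% accuracy yet never catches a single minority-class sample.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 950 samples of class 0 and 50 samples of class 1 (a 95% / 5% split)
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))   # the features don't matter for this baseline

# a baseline that always predicts the most frequent class
dummy = DummyClassifier(strategy = "most_frequent").fit(X, y)
pred = dummy.predict(X)

print("Accuracy: ", accuracy_score(y, pred))          # 0.95
print("Recall on class 1: ", recall_score(y, pred))   # 0.0 - every minority sample is missed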

Note: For demonstration, I will use a Jupyter Notebook and the Credit Card Fraud Detection dataset from Kaggle

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv("creditcard.csv")
print("Shape of dataset: ", df.shape)

Output

Shape of dataset:  (284807, 31)

The shape of the credit card fraud detection dataset is 284807 x 31, i.e. there are 284807 rows and 31 columns. Let’s see the columns of this data frame

df.columns.values

Output

array(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9',
       'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18',
       'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27',
       'V28', 'Amount', 'Class'], dtype=object)

Here Class is the output (target) variable and the others are input variables. Let’s see how much data belongs to each class.

sns.countplot(x = "Class", data = df)

Output

[Count plot of the Class column: the bar for class 0 dwarfs the bar for class 1, which is barely visible]

We can see that the data is unequally distributed between class 0 and class 1. Let’s see exactly how much data is associated with each class.

df['Class'].value_counts()

Output

0    284315
1       492
Name: Class, dtype: int64

There are 284315 samples belonging to class 0 and only 492 samples belonging to class 1. This is a perfect example of an imbalanced dataset.
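The same imbalance can be expressed as percentages of the whole dataset, which makes the skew even more obvious (roughly 99.8% of the samples are class 0 and less than 0.2% are class 1):

# class distribution as percentages
print(df['Class'].value_counts(normalize = True) * 100)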


Techniques of handling imbalanced datasets

Let’s discuss the techniques that are available for handling imbalanced datasets. We will look at the under-sampling technique, the over-sampling technique, and a combination of both.

Under-sampling technique

In the under-sampling technique, the majority class is reduced until it contains the same amount of data as the minority class.

One of the main disadvantages of this technique is that it results in a loss of information. This can seriously hurt the performance of the model, as the model may no longer have enough information to learn from.

For example
Let’s say the dataset has 900 samples that belong to class A and 100 samples that belong to class B. With under-sampling, the data in class A will be reduced to 100 samples so that it matches class B.
For under-sampling we need the imbalanced-learn (imblearn) library, so make sure to install it first

$ pip install imbalanced-learn
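Before moving to the real dataset, here is a quick, hedged illustration of the 900/100 example above on a synthetic toy dataset, using RandomUnderSampler from imblearn, which simply drops random majority-class samples until both classes are the same size:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# synthetic toy dataset with roughly 900 samples of class 0 and 100 of class 1
X_toy, y_toy = make_classification(n_samples = 1000, weights = [0.9, 0.1], random_state = 42)
print("Before: ", Counter(y_toy))   # roughly {0: 900, 1: 100}

rus = RandomUnderSampler(random_state = 42)
X_res, y_res = rus.fit_resample(X_toy, y_toy)
print("After:  ", Counter(y_res))   # both classes now equal the minority count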


Let’s implement under-sampling on the credit card dataset using NearMiss, a distance-based method that keeps the majority-class samples closest to the minority class

from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import NearMiss

#independent features (X) and dependent/target variable (y)
X = df.drop(columns = ["Class"])
y = df['Class']

#split into train and test
X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size = 0.2)

#undersampling with NearMiss
us = NearMiss()
X_train_resample, y_train_resample = us.fit_resample(X_train, y_train)

print("Original Data: ", Counter(y_train))
print("After Under Sampling: ", Counter(y_train_resample))

Output

Original Data:  Counter({0: 227464, 1: 379})
After Under Sampling:  Counter({0: 379, 1: 379})

We can see that before under-sampling there were 227464 samples in class 0 and 379 samples in class 1 in the training split. After under-sampling, both classes contain the same number of samples.
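To check whether the resampling actually helps, a common follow-up (a rough sketch, not part of the original walkthrough) is to train a model on the resampled training data and evaluate it on the untouched, still-imbalanced test split using metrics other than accuracy:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# train on the under-sampled data, evaluate on the original (imbalanced) test split
model = LogisticRegression(max_iter = 1000)
model.fit(X_train_resample, y_train_resample)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))   # focus on precision and recall for class 1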


Oversampling technique

In this technique, the data of the minority class is duplicated or synthetically generated until it matches the amount of data in the majority class. Duplicating data does not provide any additional information, while randomly generating data can introduce noise into the dataset.

For example
Let’s say the dataset has 900 samples that belong to class A and 100 samples that belong to class B. With the over-sampling technique, the data in class B will be increased to 900 samples so that it matches class A.

from imblearn.over_sampling import RandomOverSampler

#oversampling with RandomOverSampler
ros = RandomOverSampler(random_state = 42)
X_train_resample, y_train_resample = ros.fit_resample(X_train, y_train)

print("Original Data: ", Counter(y_train))
print("After Over Sampling: ", Counter(y_train_resample))

Output

Original Data:  Counter({0: 227464, 1: 379})
After Over Sampling:  Counter({0: 227464, 1: 227464})

We can see that over-sampling has increased the minority class simply by duplicating existing samples. But this duplication does not provide any additional information.

The dataset will just contain redundant information. If we want to generate new, synthetic samples rather than duplicating existing ones, we can use SMOTE

from imblearn.over_sampling import SMOTE

#oversampling with SMOTE (generates new synthetic samples)
smot = SMOTE(random_state = 42)
X_train_resample, y_train_resample = smot.fit_resample(X_train, y_train)

print("Original Data: ", Counter(y_train))
print("After Over Sampling: ", Counter(y_train_resample))

Output

Original Data:  Counter({0: 227464, 1: 379})
After Over Sampling:  Counter({0: 227464, 1: 227464})

SMOTE generates new synthetic samples by interpolating between existing minority-class samples, whereas RandomOverSampler simply duplicates samples of the minority class.
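One simple way to see this difference for yourself (assuming X_train and y_train from the earlier split are still in memory) is to re-run both resamplers and count exact duplicate rows in each result. RandomOverSampler produces a large number of duplicates, while the rows created by SMOTE are new interpolated points and should contain almost no duplicates:

# re-run both resamplers side by side and compare duplicate rows
X_ros, y_ros = RandomOverSampler(random_state = 42).fit_resample(X_train, y_train)
X_smote, y_smote = SMOTE(random_state = 42).fit_resample(X_train, y_train)

print("Duplicated rows after RandomOverSampler: ", pd.DataFrame(X_ros).duplicated().sum())
print("Duplicated rows after SMOTE: ", pd.DataFrame(X_smote).duplicated().sum())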


Combining under-sampling and over-sampling

As mentioned above, both over-sampling and under-sampling have their own disadvantages, so we can use a mixture of the two techniques. SMOTETomek does exactly this: it first over-samples the minority class with SMOTE and then cleans the result by removing Tomek links, i.e. pairs of very close samples that belong to opposite classes. Let’s see how this is implemented.

from imblearn.combine import SMOTETomek

#combined over- and under-sampling with SMOTETomek
combine = SMOTETomek(random_state = 42)
X_train_resample, y_train_resample = combine.fit_resample(X_train, y_train)

print("Original Data: ", Counter(y_train))
print("After Combined Sampling: ", Counter(y_train_resample))

Output

Original Data:  Counter({0: 227466, 1: 379})
After Combined Sampling:  Counter({0: 227466, 1: 227466})


Conclusion

Machine learning algorithms work well with balanced datasets. To prevent a machine learning model from becoming biased towards the majority class, we need to handle imbalanced datasets.

In under-sampling, we reduce the data of the majority class. This causes a loss of information and can lead to poor model performance.

In over-sampling, the data in the minority class is increased until it equals the data in the majority class, either by duplicating existing samples or by generating new ones. This can introduce noise into the dataset and mislead the model during learning. We can also use a combination of over-sampling and under-sampling. Hence, it is essential to handle imbalanced datasets.

Happy Learning 🙂
