How to Handle Null Values in Pandas

Introduction

Null values in a dataset are empty fields, which pandas represents as NaN (Not a Number). A null value does not mean zero; it means that no value was recorded. Datasets used for preparing machine learning models often contain some null values.

Those null values must either be filled with a meaningful value or be dropped from the dataset. Null values may be present in a dataset because of human error during data entry or for other reasons.

In this tutorial, we’ll learn how to fill or drop those null values so that a machine learning model trained on the data can perform well.

For demonstration, I will be using a Jupyter notebook. We are going to create a dataset that has null values in both a numerical column and a categorical column.

Here are some of the ways to handle null values in a dataset using the Python pandas library:

1. Dropping null values

A pandas DataFrame has a dropna() method that drops rows containing null values. This method is best used when the dataset is large and only a small share of its rows contain null values, because every dropped row is lost information.

import pandas as pd
import numpy as np

# np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
info = {
    'Age' : [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender' : ['male', np.nan, 'female', 'male', np.nan, np.nan, 'female', 'male', 'female']
}
df = pd.DataFrame(info)
print(df)

Output

    Age  Gender
0   NaN    male
1  23.0     NaN
2  45.0  female
3   NaN    male
4  34.0     NaN
5   NaN     NaN
6   NaN  female
7  45.0    male
8  54.0  female

This dataset has some null values, shown as NaN. Before treating them, let’s count how many null values each column contains.

df.isnull().sum()

Output

Age       4
Gender    3
dtype: int64

Chaining isnull() and sum() tells us how many null values are present in each column. There are 4 null values in the ‘Age’ column and 3 null values in the ‘Gender’ column.
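Raw counts are easier to judge next to the size of the dataset. The following is a minimal sketch, using the same toy data, of how the share of missing values per column and the number of incomplete rows could be computed. The variable names missing_share and rows_with_missing are only illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as in the tutorial
df = pd.DataFrame({
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan, 'female', 'male', 'female'],
})

# isna() is an alias of isnull(); the mean of a boolean column
# is the fraction of True values, i.e. the share of nulls
missing_share = df.isna().mean()

# Number of rows that have a null value in at least one column
rows_with_missing = int(df.isna().any(axis=1).sum())

print(missing_share)
print(rows_with_missing)  # 6 of the 9 rows are incomplete
```

Here 6 of the 9 rows are incomplete, which already hints that dropping rows would be costly for this dataset.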

 

Let’s take a look at how dropna() is used to drop null values from the dataset.

df2 = df.dropna()
print(df2)

Output

    Age  Gender
2  45.0  female
7  45.0    male
8  54.0  female

Called without arguments, dropna() drops every row that has at least one null value. The dataframe above contains only the three rows that have no null values.
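Dropping every incomplete row removed 6 of our 9 rows. When that is too aggressive, dropna() accepts parameters that narrow what gets dropped. The sketch below shows two of them, subset and how, on the same toy data. The names age_known and not_all_missing are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan, 'female', 'male', 'female'],
})

# Drop a row only when 'Age' is null; a null 'Gender' is tolerated
age_known = df.dropna(subset=['Age'])

# Drop a row only when every column in it is null
not_all_missing = df.dropna(how='all')

print(len(age_known))        # 5 rows keep a known age
print(len(not_all_missing))  # 8 rows, only row 5 is entirely null
```

Which variant fits depends on which columns the model actually needs.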

 

2. Using fillna() function

Using the fillna() function, we can fill the null values with the desired value.

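# Work on a copy so that the original df keeps its null values for the later examples
df_const = df.copy()
df_const['Age'] = df_const['Age'].fillna(0)
df_const['Gender'] = df_const['Gender'].fillna("Not Specified")
print(df_const)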

Output

    Age         Gender
0   0.0           male
1  23.0  Not Specified
2  45.0         female
3   0.0           male
4  34.0  Not Specified
5   0.0  Not Specified
6   0.0         female
7  45.0           male
8  54.0         female

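Null values in the ‘Age’ column are filled with zero (which is not a good practice, because 0 is not a plausible age and it distorts statistics such as the mean), and null values in the ‘Gender’ column are filled with ‘Not Specified’.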
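fillna() also accepts a dictionary that maps column names to fill values, so several columns can be filled in one call. A minimal sketch on the same toy data, where the name filled is only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan, 'female', 'male', 'female'],
})

# Each key is a column name, each value is the fill value for that column.
# fillna() returns a new dataframe, so df itself is left unchanged.
filled = df.fillna({'Age': 0, 'Gender': 'Not Specified'})

print(filled)
```

Because the result is a new dataframe, the original data stays available for trying other strategies.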

Rather than filling the null values with a constant, we can use the ‘ffill’ and ‘bfill’ methods. The former means forward fill, which fills each null value with the last valid value before it, while the latter means backward fill, which fills each null value with the next valid value after it.

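# ffill() replaces fillna(method = 'ffill'), which is deprecated in recent pandas versions
df_ffill = df.copy()
df_ffill['Age'] = df_ffill['Age'].ffill()
df_ffill['Gender'] = df_ffill['Gender'].ffill()
print(df_ffill)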

Output

    Age  Gender
0   NaN    male
1  23.0    male
2  45.0  female
3  45.0    male
4  34.0    male
5  34.0    male
6  34.0  female
7  45.0    male
8  54.0  female

The ‘ffill’ method fills each null value in a column with the previous valid value in that column. We can see that the first null value in the ‘Age’ column is not filled because there is no previous value to fill it with.

This is one of the disadvantages of the ‘ffill’ method. The next null value (row 3) is filled with 45.0 because the previous value is 45.0. The same process is applied to the ‘Gender’ column to fill its null values.

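# bfill() replaces fillna(method = 'bfill'), which is deprecated in recent pandas versions
df_bfill = df.copy()
df_bfill['Age'] = df_bfill['Age'].bfill()
df_bfill['Gender'] = df_bfill['Gender'].bfill()
print(df_bfill)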

Output

    Age  Gender
0  23.0    male
1  23.0  female
2  45.0  female
3  34.0    male
4  34.0  female
5  45.0  female
6  45.0  female
7  45.0    male
8  54.0  female

The ‘bfill’ method fills each null value with the next valid value. We can see that, unlike the ‘ffill’ method, it filled the first null value in the ‘Age’ column with the next valid value, which is 23.0. By the same logic, ‘bfill’ cannot fill a null value in the last row, because no later value exists.
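Since each method leaves one end of a column unfilled, one common workaround is to chain them. The sketch below assumes that filling leading gaps from the following value is acceptable for your data. It also shows the limit parameter, which caps how many consecutive null values are filled.

```python
import numpy as np
import pandas as pd

age = pd.Series([np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54], name='Age')

# Forward fill first, then backward fill whatever is still null at the start
filled = age.ffill().bfill()

# limit=1 fills at most one consecutive null value after each valid value
limited = age.ffill(limit=1)

print(filled.tolist())
print(limited.tolist())
```

With limit=1, the second value of the two-row gap (row 6) stays null, which can be useful when a value should not be carried too far.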

 

3. Using interpolation

Interpolation is another way to fill null values: each missing value is estimated from the valid values around it. A pandas DataFrame has an interpolate() method for this purpose.

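# Interpolate only the numeric column; 'Gender' is text and is left as it is
df.assign(Age = df['Age'].interpolate(method = 'linear', limit_direction = 'forward'))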

Output

         Age  Gender
0        NaN    male
1  23.000000     NaN
2  45.000000  female
3  39.500000    male
4  34.000000     NaN
5  37.666667     NaN
6  41.333333  female
7  45.000000    male
8  54.000000  female

 

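Interpolation works only on numerical data, so the null values in the ‘Gender’ column are left unfilled. With limit_direction = 'forward', the first value in ‘Age’ also stays null because there is no earlier value to interpolate from.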

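# Interpolate only the numeric column, this time filling in the backward direction
df2 = df.assign(Age = df['Age'].interpolate(method = 'linear', limit_direction = 'backward'))
print(df2)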

Output

         Age  Gender
0  23.000000    male
1  23.000000     NaN
2  45.000000  female
3  39.500000    male
4  34.000000     NaN
5  37.666667     NaN
6  41.333333  female
7  45.000000    male
8  54.000000  female

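With limit_direction = 'backward', the leading null value in ‘Age’ is filled as well, using the first valid value (23.0). This is how we can use the interpolation method to fill the null values in the dataset.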
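To see where the interpolated numbers come from, it can help to run the method on the ‘Age’ values alone. This sketch uses limit_direction = 'both', which fills gaps at either end of the column, and recomputes one of the values by hand. The name both is only illustrative.

```python
import numpy as np
import pandas as pd

age = pd.Series([np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54], name='Age')

# 'both' fills nulls in either direction, so the leading null is filled too
both = age.interpolate(method='linear', limit_direction='both')

# Linear interpolation spaces the values evenly across a gap:
# rows 5 and 6 lie one third and two thirds of the way from 34 to 45
by_hand = 34 + (45 - 34) / 3

print(both.tolist())
print(by_hand)
```

Note that the linear method treats the rows as equally spaced, which is an assumption that may not hold for every dataset.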

 

4. Using the mean, median, and mode method

Let’s see how mean, median, and mode are used to fill the null values in the dataset. Mean and median are used to fill the null values of numerical data and mode is used to fill the null values of categorical data.

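# Fill a copy so that the original df is still available for the median example
df_mean = df.copy()
df_mean['Gender'] = df_mean['Gender'].fillna(df_mean['Gender'].mode()[0])
df_mean['Age'] = df_mean['Age'].fillna(df_mean['Age'].mean())
print(df_mean)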

Output

    Age  Gender
0  40.2    male
1  23.0  female
2  45.0  female
3  40.2    male
4  34.0  female
5  40.2  female
6  40.2  female
7  45.0    male
8  54.0  female

 

Null values in the categorical ‘Gender’ column are filled with the mode of that column, and null values in the ‘Age’ column are filled with the mean of that column. The mean is 40.2. In this dataset ‘male’ and ‘female’ each appear three times, so mode() returns both values in sorted order and mode()[0] selects ‘female’. This is one of the most commonly used methods for filling null values in categorical and numerical columns.
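Because mode() can return more than one value, it is worth checking the counts before relying on mode()[0]. A minimal sketch on the ‘Gender’ values, where the names counts and modes are only illustrative:

```python
import numpy as np
import pandas as pd

gender = pd.Series(['male', np.nan, 'female', 'male', np.nan, np.nan,
                    'female', 'male', 'female'], name='Gender')

# value_counts() ignores null values by default
counts = gender.value_counts()

# mode() returns every value that shares the highest count, in sorted order
modes = gender.mode()

print(counts)
print(modes.tolist())
```

When there is a tie like this one, picking the first mode is an arbitrary choice, and a separate category such as ‘Not Specified’ may be a more honest fill value.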

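# Fill a fresh copy of the original df, this time using the median for 'Age'
df_median = df.copy()
df_median['Gender'] = df_median['Gender'].fillna(df_median['Gender'].mode()[0])
df_median['Age'] = df_median['Age'].fillna(df_median['Age'].median())
print(df_median)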

Output

    Age  Gender
0  45.0    male
1  23.0  female
2  45.0  female
3  45.0    male
4  34.0  female
5  45.0  female
6  45.0  female
7  45.0    male
8  54.0  female

Should you choose the mean or the median?

Numerical values can also be filled using the median. Filling null values with the mean can distort the dataset when outliers are present, because a few extreme values shift both the mean and the standard deviation of the data. The median is much less affected by outliers, so filling null values with the median can be a very effective alternative.
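A small experiment makes the difference concrete. The sketch below takes the five known ages from our dataset and adds one hypothetical data-entry error (an age of 400, invented purely for illustration) to show how each statistic reacts.

```python
import pandas as pd

# The five known ages from the tutorial dataset
ages = pd.Series([23, 45, 34, 45, 54])

# Same ages plus one hypothetical outlier, e.g. a typo during data entry
with_outlier = pd.Series([23, 45, 34, 45, 54, 400])

print(ages.mean(), ages.median())
print(with_outlier.mean(), with_outlier.median())
```

The single outlier pulls the mean from 40.2 to roughly 100, while the median stays at 45.0, so a median fill would still look like a typical age.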

 

Conclusion

The raw dataset available for preparing a machine learning model may contain null values. We must fill those null values with suitable and meaningful data, or drop them, so that the model performs well on the data. There are several ways of handling null values.

Using fillna(), interpolation, and the mean, median, and mode are some of the ways of filling null values, while dropna() removes them instead. Dropping null values is not advisable for a small dataset but can be useful if the dataset is large and has few null values in it. Most machine learning models need null values to be filled or removed, so handling them with suitable values is essential.

Happy Learning:-)
