Introduction
Null values in a dataset are empty fields, which pandas represents as NaN (Not a Number). A null value is not zero; it means that no value was recorded. Datasets used to prepare machine learning models often contain some null values.
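To make the difference between NaN and zero concrete, here is a minimal sketch. The variable names `value` and `s` are only for illustration and are not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN is not zero, and it is not even equal to itself
print(value == 0)
print(value == value)

# pd.isna() is the reliable way to test for a null value
print(pd.isna(value))

# pandas skips NaN by default in aggregations, so NaN does not count as 0
s = pd.Series([10, np.nan, 30])
print(s.sum())   # sum of the two known values
print(s.mean())  # mean of the two known values, not of three
```

Because NaN never compares equal to anything, checks such as `df == np.nan` find nothing; isnull() or isna() should be used instead.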
These null values must either be filled with a meaningful value or dropped from the dataset. They can appear because of human error during data entry or because of other factors.
In this tutorial, we’ll learn how to handle those null values so that the machine learning model can perform well.
For the demonstration, I will be using a Jupyter notebook. We are going to create a dataset that has null values in both a categorical column and a numerical column.
Here are some of the ways to handle null values in a dataset using the Python pandas library:
1. Dropping null values
A pandas DataFrame has a dropna() function that drops null values from the dataset. This method should only be used when the dataset is large and contains only a few null values, because every dropped row also discards the valid values it holds.
    import pandas as pd
    import numpy as np

    # np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
    info = {
        'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
        'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan, 'female', 'male', 'female']
    }
    df = pd.DataFrame(info)
    print(df)
Output
        Age  Gender
    0   NaN    male
    1  23.0     NaN
    2  45.0  female
    3   NaN    male
    4  34.0     NaN
    5   NaN     NaN
    6   NaN  female
    7  45.0    male
    8  54.0  female
This dataset has some null values, represented by NaN. Before treating them, let’s see how to count the null values in the dataset.
df.isnull().sum()
Output
    Age       4
    Gender    3
    dtype: int64
Chaining the isnull() and sum() functions tells us how many null values are present in each column. There are 4 null values in the ‘Age’ column and 3 null values in the ‘Gender’ column.
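Counts are easier to judge as a share of the data. The sketch below rebuilds the same dataset and shows the fraction of nulls per column and the number of nulls per row. The names `missing_share` and `nulls_per_row` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan,
               'female', 'male', 'female'],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True
missing_share = df.isnull().mean()

# Summing across the columns (axis=1) counts the nulls in each row
nulls_per_row = df.isnull().sum(axis=1)

print(missing_share)
print(nulls_per_row)
```

Here 4 of 9 ‘Age’ values and 3 of 9 ‘Gender’ values are missing, which is a large share for such a small dataset.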
Let’s take a look at how dropna() is implemented to drop null values from the dataset.
    df2 = df.dropna()
    print(df2)
Output
        Age  Gender
    2  45.0  female
    7  45.0    male
    8  54.0  female
The dropna() function drops every row that has at least one null value. The dataframe above is what remains after dropping those rows: only 3 of the 9 rows survive.
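Dropping every row with any null value can be too aggressive. As a sketch of gentler alternatives, dropna() also accepts the `how` and `subset` parameters. The result names `drop_empty_rows` and `drop_missing_age` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan,
               'female', 'male', 'female'],
})

# how='all' drops a row only when every column in it is null (row 5 here)
drop_empty_rows = df.dropna(how='all')

# subset=['Age'] checks only 'Age'; rows with a missing 'Gender' are kept
drop_missing_age = df.dropna(subset=['Age'])

print(len(df), len(drop_empty_rows), len(drop_missing_age))
```

Both variants keep more rows than a plain dropna(), and the remaining null values can then be filled with one of the methods below.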
2. Using fillna() function
Using the fillna() function, we can fill the null values with the desired value.
    df2 = df.copy()  # work on a copy so the original df keeps its null values
    df2['Age'] = df2['Age'].fillna(0)
    df2['Gender'] = df2['Gender'].fillna("Not Specified")
    print(df2)
Output
        Age         Gender
    0   0.0           male
    1  23.0  Not Specified
    2  45.0         female
    3   0.0           male
    4  34.0  Not Specified
    5   0.0  Not Specified
    6   0.0         female
    7  45.0           male
    8  54.0         female
Null values in the ‘Age’ column are filled with zero (which is not good practice, because 0 is not a realistic age and it skews statistics such as the mean), and those in the ‘Gender’ column with ‘Not Specified’.
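The same fill can be written in one call by passing fillna() a dictionary that maps column names to fill values. This is a sketch of that variant; the name `filled` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan,
               'female', 'male', 'female'],
})

# One fill value per column; fillna() returns a new dataframe
filled = df.fillna({'Age': 0, 'Gender': 'Not Specified'})

print(filled)

# The original dataframe is untouched and still has its null values
print(df.isnull().sum())
```

Because the result is a new dataframe, the original stays available for trying the other methods.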
Rather than filling the null values with a made-up constant, we can use the ‘ffill’ and ‘bfill’ methods. ‘ffill’ means forward fill: it fills each null value with the previous non-null value in the column. ‘bfill’ means backward fill: it fills each null value with the next non-null value in the column.
    df2 = df.copy()  # start again from the original df
    # .ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
    df2['Age'] = df2['Age'].ffill()
    df2['Gender'] = df2['Gender'].ffill()
    print(df2)
Output
        Age  Gender
    0   NaN    male
    1  23.0    male
    2  45.0  female
    3  45.0    male
    4  34.0    male
    5  34.0    male
    6  34.0  female
    7  45.0    male
    8  54.0  female
With the ‘ffill’ method, each null value in a column is filled with the previous non-null value. The first null value in the ‘Age’ column is not filled because there is no previous value to copy.
This is one of the disadvantages of the ‘ffill’ method. The next null value (row 3) is filled with ‘45.0’ because the previous value is ‘45.0’. The same process is applied to the ‘Gender’ column.
    df2 = df.copy()  # start again from the original df
    # .bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas
    df2['Age'] = df2['Age'].bfill()
    df2['Gender'] = df2['Gender'].bfill()
    print(df2)
Output
        Age  Gender
    0  23.0    male
    1  23.0  female
    2  45.0  female
    3  34.0    male
    4  34.0  female
    5  45.0  female
    6  45.0  female
    7  45.0    male
    8  54.0  female
The ‘bfill’ method fills each null value with the next non-null value. Unlike ‘ffill’, it fills the first null value in the ‘Age’ column, using the next real value, 23.0. Its weakness is the mirror image: a null value at the end of a column would stay unfilled.
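Since each direction leaves one end of a column unfilled, the two can be chained. The sketch below uses the ‘Age’ values from the dataset and also shows the `limit` parameter, which caps how many consecutive null values are filled. The names `both_directions` and `limited` are only illustrative.

```python
import numpy as np
import pandas as pd

age = pd.Series([np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54], name='Age')

# Forward fill first, then backward fill whatever is left at the start
both_directions = age.ffill().bfill()

# limit=1 copies a value forward at most one step, so the second of two
# consecutive nulls (row 6) stays NaN
limited = age.ffill(limit=1)

print(both_directions.tolist())
print(limited.tolist())
```

Limiting the fill is useful when copying a value across a long run of missing rows would be misleading.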
3. Using interpolation
Interpolation is another way to fill null values: each missing value is estimated from the known values around it. A pandas DataFrame has an interpolate() function for this purpose.
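    df2 = df.copy()  # start again from the original df
    # interpolate only the numeric column; recent pandas deprecates
    # calling interpolate() on object (text) columns
    df2['Age'] = df2['Age'].interpolate(method='linear', limit_direction='forward')
    print(df2)
Output
             Age  Gender
    0        NaN    male
    1  23.000000     NaN
    2  45.000000  female
    3  39.500000    male
    4  34.000000     NaN
    5  37.666667     NaN
    6  41.333333  female
    7  45.000000    male
    8  54.000000  female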
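Interpolation works only on numerical data, so the null values in the ‘Gender’ column remain unfilled. With limit_direction = 'forward', the first ‘Age’ value also stays NaN because there is no earlier value to interpolate from.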
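    df2 = df.copy()  # start again from the original df
    # limit_direction='backward' also fills the null value at the start of 'Age'
    df2['Age'] = df2['Age'].interpolate(method='linear', limit_direction='backward')
    print(df2)
Output
             Age  Gender
    0  23.000000    male
    1  23.000000     NaN
    2  45.000000  female
    3  39.500000    male
    4  34.000000     NaN
    5  37.666667     NaN
    6  41.333333  female
    7  45.000000    male
    8  54.000000  female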
This is how we can use the interpolation method to fill the null values in the dataset.
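To see where values such as 37.666667 come from, the sketch below reproduces the linear interpolation by hand for rows 5 and 6, which sit between the known values 34 (row 4) and 45 (row 7). It also uses limit_direction='both' so that null values at either end are filled. The names `age_interp`, `manual_row5` and `manual_row6` are only illustrative.

```python
import numpy as np
import pandas as pd

age = pd.Series([np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54], name='Age')

# Fill in both directions so leading and trailing nulls are covered too
age_interp = age.interpolate(method='linear', limit_direction='both')

# Linear interpolation by hand: rows 5 and 6 lie 1/3 and 2/3 of the way
# from row 4 (34) to row 7 (45)
manual_row5 = 34 + (45 - 34) * 1 / 3
manual_row6 = 34 + (45 - 34) * 2 / 3

print(age_interp.tolist())
print(manual_row5, manual_row6)
```

Linear interpolation assumes the rows are evenly spaced and ordered in a meaningful way, which suits time series better than unordered records such as the ages in this dataset.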
4. Using the mean, median, and mode method
Let’s see how mean, median, and mode are used to fill the null values in the dataset. Mean and median are used to fill the null values of numerical data and mode is used to fill the null values of categorical data.
    df2 = df.copy()  # start again from the original df
    df2['Gender'] = df2['Gender'].fillna(df2['Gender'].mode()[0])
    df2['Age'] = df2['Age'].fillna(df2['Age'].mean())
    print(df2)
Output
        Age  Gender
    0  40.2    male
    1  23.0  female
    2  45.0  female
    3  40.2    male
    4  34.0  female
    5  40.2  female
    6  40.2  female
    7  45.0    male
    8  54.0  female
The categorical ‘Gender’ column is filled with its own mode, and the numerical ‘Age’ column is filled with its own mean. The mean is 40.2. For ‘Gender’, ‘male’ and ‘female’ each appear 3 times, so mode() returns both values in sorted order and mode()[0] picks ‘female’. This is one of the most commonly used methods for filling null values in both categorical and numerical columns.
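The tie in the ‘Gender’ column is easy to miss, so here is a small sketch that makes it visible. It uses the same ‘Gender’ values; the names `counts` and `modes` are only illustrative.

```python
import numpy as np
import pandas as pd

gender = pd.Series(['male', np.nan, 'female', 'male', np.nan, np.nan,
                    'female', 'male', 'female'], name='Gender')

# value_counts() ignores NaN by default and shows how often each value occurs
counts = gender.value_counts()

# mode() returns every value that shares the highest count, sorted
modes = gender.mode()

print(counts)
print(modes.tolist())
```

When mode() returns more than one value, taking element [0] is an arbitrary choice, so it is worth checking the counts before relying on it.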
    df2 = df.copy()  # start again from the original df
    df2['Gender'] = df2['Gender'].fillna(df2['Gender'].mode()[0])
    df2['Age'] = df2['Age'].fillna(df2['Age'].median())
    print(df2)
Output
        Age  Gender
    0  45.0    male
    1  23.0  female
    2  45.0  female
    3  45.0    male
    4  34.0  female
    5  45.0  female
    6  45.0  female
    7  45.0    male
    8  54.0  female
Mean or median: which should you choose?
Numerical values can also be filled with the median. Filling null values with the mean can distort the dataset when outliers are present, because outliers pull the mean toward themselves and inflate the standard deviation. The median is much less sensitive to outliers, so filling null values with the median is often the safer choice in that case.
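A quick sketch shows the effect. It takes the five known ‘Age’ values and adds one hypothetical data-entry error of 500, which is not part of the tutorial's dataset. The names `ages` and `with_outlier` are only illustrative.

```python
import pandas as pd

# The five known 'Age' values from the dataset
ages = pd.Series([23, 45, 34, 45, 54])

# Same values plus one hypothetical outlier (e.g. a typo during data entry)
with_outlier = pd.Series([23, 45, 34, 45, 54, 500])

print(ages.mean(), ages.median())
print(with_outlier.mean(), with_outlier.median())
```

A single bad value moves the mean from about 40 to well over 100, while the median stays at 45, so a median fill would still insert a plausible age.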
Conclusion
The raw dataset that is available for preparing a machine learning model may have some null values in it. We must fill those null values with suitable and meaningful data so that the model performs well on the data. There are several ways of handling null values.
Dropping rows with dropna, filling with fillna, interpolating, and filling with the mean, median, or mode are some of them. Dropping null values is not useful in a small dataset but can be useful if the dataset is large and has few null values in it. Most machine learning models need null values to be filled or removed, so handling them with suitable values is essential.
Happy Learning:-)