How to Convert Categorical Data to Numerical Data in Python

Introduction

A dataset used for machine learning usually has numerical as well as categorical features. Categorical data are of three types, namely ordinal, nominal, and boolean.

  • Ordinal data have a priority ordering among their values.
  • Nominal data do not have a priority ordering.
  • Boolean data are labeled as either True or False.
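As a rough illustration of the three types, the sketch below builds a small data frame with pandas. The column names and values are hypothetical, invented only for this example, and the ordering small < medium < large is an assumption.

```python
import pandas as pd

# Hypothetical columns, invented for illustration only
sample = pd.DataFrame({
    'size': ['small', 'large', 'medium'],   # ordinal: small < medium < large
    'city': ['Paris', 'Tokyo', 'Lima'],     # nominal: no natural order
    'is_member': [True, False, True],       # boolean
})

# An ordered categorical stores the assumed ranking explicitly
sample['size'] = pd.Categorical(
    sample['size'], categories=['small', 'medium', 'large'], ordered=True
)

size_codes = sample['size'].cat.codes.tolist()
print(sample.dtypes)
print(size_codes)  # integer codes follow the declared order
```

The integer codes of the ordered column follow the declared ranking, which is the property that separates ordinal data from nominal data.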

Categorical features are usually stored as strings and are easily understood by human beings. Machine learning models, however, cannot interpret categorical data directly. Therefore, categorical data must be translated into numerical data that the models can work with.

There are many ways to convert categorical data into numerical data. In this tutorial, we’ll discuss the three most used methods:

  1. Label Encoding
  2. One Hot Encoding
  3. Dummy Variable Encoding

We are going to discuss each method in detail with some examples. If you are new to machine learning, I’ll try to make the concepts clear and easy to understand. So, without further ado, let’s dive into the topic.

I assume that you have already installed Python on your system. We will be using the Python packages pandas and scikit-learn in the examples below, so make sure you have installed them with pip before running the examples in your editor:

$ pip install pandas
$ pip install scikit-learn
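A quick way to confirm that both packages are importable is to print their versions. This is only a sanity check; it is worth doing because some parameter names differ between scikit-learn and pandas releases.

```python
import pandas as pd
import sklearn  # scikit-learn is imported under the name sklearn

# Print the installed versions to confirm both packages are available
print('pandas:', pd.__version__)
print('scikit-learn:', sklearn.__version__)
```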

 

1. Label Encoding

Label Encoding refers to the conversion of categorical data into numerical data that a computer can understand. In this method, each class is assigned an integer, starting from zero (0). We’ll make a data frame just to see how label encoding works:

import pandas as pd

info = {
    'Gender' : ['male', 'female', 'female', 'male', 'female', 'female'],
    'Position' : ['CEO', 'Cleaner', 'Employee', 'Cleaner', 'CEO', 'Cleaner']
}
df = pd.DataFrame(info)
print(df)

Output

   Gender  Position
0    male       CEO
1  female   Cleaner
2  female  Employee
3    male   Cleaner
4  female       CEO
5  female   Cleaner

This is the data frame that we are going to work with. It contains Gender and Position (say, in a company) as features. Since these features are categorical, they need to be converted into numerical data.

We’ll use LabelEncoder from sklearn to convert these features into numeric values as:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
gender_encoded = le.fit_transform(df['Gender'])

print(gender_encoded)

Output

[1 0 0 1 0 0]

As we can see, female is encoded as zero (0) and male is encoded as one (1), because LabelEncoder numbers the classes in sorted order. To add this encoded data to the original data frame, we can do this as:

df['encoded_gender'] = gender_encoded
print(df)

Output

   Gender  Position  encoded_gender
0    male       CEO               1
1  female   Cleaner               0
2  female  Employee               0
3    male   Cleaner               1
4  female       CEO               0
5  female   Cleaner               0

Once the encoded column has been added, the original categorical column can be dropped from the data frame. Now, we are going to apply label encoding to the ‘Position’ column to convert it into numerical data as:

encoded_position = le.fit_transform(df['Position'])
df['encoded_position'] = encoded_position
print(df)

Output

   Gender  Position  encoded_gender  encoded_position
0    male       CEO               1                 0
1  female   Cleaner               0                 1
2  female  Employee               0                 2
3    male   Cleaner               1                 1
4  female       CEO               0                 0
5  female   Cleaner               0                 1

If you compare the Position and encoded_position columns, you can see that CEO is encoded to 0, Cleaner is encoded to 1, and Employee is encoded to 2, i.e.:

CEO ⇒ 0

Cleaner ⇒ 1

Employee ⇒ 2
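The mapping above can also be read from the fitted encoder instead of by eye. The following is a minimal sketch, assuming one LabelEncoder per column so that each mapping is kept (reusing a single encoder overwrites the earlier fit). The names encoders, mappings, and df_numeric are hypothetical helpers introduced for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Gender': ['male', 'female', 'female', 'male', 'female', 'female'],
    'Position': ['CEO', 'Cleaner', 'Employee', 'Cleaner', 'CEO', 'Cleaner'],
})

encoders = {}  # column name -> fitted encoder
for col in ['Gender', 'Position']:
    enc = LabelEncoder()
    df['encoded_' + col.lower()] = enc.fit_transform(df[col])
    encoders[col] = enc

# classes_ lists the categories in the order of their integer codes
mappings = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mappings)

# Keep only the numeric columns by dropping the original string columns
df_numeric = df.drop(columns=['Gender', 'Position'])
print(df_numeric)
```

Keeping the fitted encoders around also makes it possible to turn the integers back into labels later with inverse_transform.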

One big disadvantage of the label encoder is that it encodes categorical features into integers starting from 0, which can lead to a priority issue. For instance, female is encoded to 0, and male is encoded to 1.

A model may interpret this as males having a higher priority than females, which is meaningless. Label encoding is therefore best suited to ordinal data (a finite set of values with a rank ordering between them).
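Note that LabelEncoder assigns integers in sorted (alphabetical) order, which does not necessarily match the real ranking. When the ranking matters, one option is scikit-learn's OrdinalEncoder with an explicit category order. The sketch below assumes the ranking Cleaner < Employee < CEO; that order is an assumption made for illustration, not something stated by the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'Position': ['CEO', 'Cleaner', 'Employee', 'Cleaner', 'CEO', 'Cleaner'],
})

# Assumed ranking, from lowest to highest (illustrative only)
rank_order = ['Cleaner', 'Employee', 'CEO']

# OrdinalEncoder expects 2-D input, hence the double brackets
oe = OrdinalEncoder(categories=[rank_order])
ranked = oe.fit_transform(df[['Position']]).ravel().astype(int)

df['ranked_position'] = ranked
print(df)
```

With this order, Cleaner maps to 0, Employee to 1, and CEO to 2, so the integers reflect the assumed ranking.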

 

2. One Hot Encoding

For a dataset with nominal categories (no rank ordering between values), integer encoding may not be sufficient. Applying integer encoding to nominal data can mislead the model into assuming an order that does not exist, which can result in poor performance.

One-hot encoding converts the integer encoding into a set of binary variables. Each bit represents a category. If the variable cannot belong to multiple categories at once, then only one bit in the group can be “on”. This is called one-hot encoding.

In this example, the categorical variable is first converted into integers using a label encoder, and one-hot encoding is then applied to those integers.

We are going to use previously encoded data for one-hot encoding as:

from sklearn.preprocessing import OneHotEncoder

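# Label-encode first, then reshape to a single column of shape (n_samples, 1)
gender_encoded = le.fit_transform(df['Gender'])
gender_encoded = gender_encoded.reshape(-1, 1)
# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
one = OneHotEncoder(sparse_output=False)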

print(one.fit_transform(gender_encoded))

Output

[[0. 1.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]]

Since OneHotEncoder expects its input as a two-dimensional (column) array, we need to reshape the array obtained from the label encoder. Here, male is encoded as [0 1] and female is encoded as [1 0].

 

3. Dummy Variable Encoding

The problem with the one-hot encoding technique is that it creates redundancy. For example, with two categories, if a 1 in the male column marks a male, then a 0 in that same column already marks a female, so a separate female column adds no information.

Dropping the redundant column is called dummy variable encoding. One category is represented by all zeros, so n categories are represented with n-1 binary values.
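The redundancy can be checked numerically: every one-hot row sums to 1, so any single column can be recovered from the others. The sketch below uses a small hand-written array with columns (female, male) as an illustration.

```python
import numpy as np

# One-hot rows for [male, female, female, male]; columns are (female, male)
onehot = np.array([
    [0, 1],
    [1, 0],
    [1, 0],
    [0, 1],
])

# Each row has exactly one bit set, so every row sums to 1
row_sums = onehot.sum(axis=1)

# Drop the first column, then rebuild it from what remains
dummy = onehot[:, 1:]
recovered_first = 1 - dummy.sum(axis=1)

print(row_sums)
print(recovered_first)
```

Because the dropped column can always be rebuilt this way, removing it loses no information.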

To see how dummy variable encoding works, we’ll encode the categorical data of the previous dataset as:

import pandas as pd

# dtype=int gives 0/1 columns; pandas 2.0 and later default to True/False
pos = pd.get_dummies(df['Position'], drop_first=True, dtype=int)
print(pos)

Output

   Cleaner  Employee
0        0         0
1        1         0
2        0         1
3        1         0
4        0         0
5        1         0

We can see that Cleaner is represented by [1 0], Employee is represented by [0 1], and CEO is represented by [0 0].

Here, there are 3 categories (n = 3), and dummy variable encoding represents them with 2 (n - 1) binary columns.

pandas has a get_dummies() function that performs dummy variable encoding.

The drop_first argument is set to True to drop the first category's column, so that this category is represented by all zeros.
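get_dummies() can also encode several columns of a data frame in one call through its columns argument. The following is a small sketch on the same example data; the variable name encoded is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['male', 'female', 'female', 'male', 'female', 'female'],
    'Position': ['CEO', 'Cleaner', 'Employee', 'Cleaner', 'CEO', 'Cleaner'],
})

# Encode both columns at once; the first category of each column is dropped.
# dtype=int yields 0/1 values instead of booleans.
encoded = pd.get_dummies(
    df, columns=['Gender', 'Position'], drop_first=True, dtype=int
)
print(encoded)
```

The new columns are prefixed with the original column name, and the listed original columns are replaced by their dummy columns in the result.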

 

Conclusion

Machine learning models work with numerical data, not categorical data. Both the input and output variables of a model need to be in numerical form, so it is very important to handle categorical data properly.

Label encoding is used with ordinal data, while one-hot encoding is used with nominal data. Label encoding gives rise to the priority issue, while one-hot encoding leads to the redundancy problem. Dummy variable encoding removes that redundancy and is used for linear regression (and other regression algorithms).

Happy Learning 🙂
