Introduction
Datasets used in machine learning usually contain numerical as well as categorical features. Categorical data comes in three types: ordinal, nominal, and boolean.
- Ordinal data have an inherent rank ordering among their categories.
- Nominal data have no rank ordering.
- Boolean data are labeled as either True or False.
Categorical features are usually stored as strings and are easy for humans to read, but machines cannot interpret categorical data directly. Therefore, the categorical data must be translated into numerical data that a machine can understand.
There are many ways to convert categorical data into numerical data. In this tutorial, we'll discuss the three most commonly used methods:
- Label Encoding
- One Hot Encoding
- Dummy Variable Encoding
We are going to discuss each method in detail with some examples. If you are new to machine learning, I'll try to make the concepts clear and easy to understand. So, without further ado, let's dive into the topic.
I assume that you already have Python installed on your system. We will be using the Python packages pandas and scikit-learn in the examples below, so make sure you have installed them with pip before running the examples:
$ pip install pandas
$ pip install scikit-learn
1. Label Encoding
Label encoding refers to the conversion of categorical data into numerical data that a computer can understand. In this method, every class is assigned a number starting from zero (0). We'll build a small data frame to see how label encoding works:
import pandas as pd

info = {
    'Gender': ['male', 'female', 'female', 'male', 'female', 'female'],
    'Position': ['CEO', 'Cleaner', 'Employee', 'Cleaner', 'CEO', 'Cleaner']
}
df = pd.DataFrame(info)
print(df)
Output
   Gender  Position
0    male       CEO
1  female   Cleaner
2  female  Employee
3    male   Cleaner
4  female       CEO
5  female   Cleaner
This is the data frame we are going to work with. It contains Gender and Position (say, in a company) as features. Since these features are categorical, they need to be converted into numerical data.
We'll use scikit-learn's LabelEncoder to convert these features into numeric values:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
gender_encoded = le.fit_transform(df['Gender'])
print(gender_encoded)
Output
[1 0 0 1 0 0]
As we can see, female is encoded as zero (0) and male is encoded as one (1) (LabelEncoder assigns codes in alphabetical order). Next, we add this encoded data to the original data frame:
df['encoded_gender'] = gender_encoded
print(df)
Output
   Gender  Position  encoded_gender
0    male       CEO               1
1  female   Cleaner               0
2  female  Employee               0
3    male   Cleaner               1
4  female       CEO               0
5  female   Cleaner               0
The original string columns will eventually need to be dropped from the data frame before training; we'll do that after encoding the second column (see the snippet below). Now, let's apply label encoding to the 'Position' column to convert it into numerical data as well:
encoded_position = le.fit_transform(df['Position'])
df['encoded_position'] = encoded_position
print(df)
Output
   Gender  Position  encoded_gender  encoded_position
0    male       CEO               1                 0
1  female   Cleaner               0                 1
2  female  Employee               0                 2
3    male   Cleaner               1                 1
4  female       CEO               0                 0
5  female   Cleaner               0                 1
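With both columns encoded, the original string columns can now be dropped so the model sees only numbers:

# keep only the numeric columns for modelling
numeric_df = df.drop(columns=['Gender', 'Position'])
print(numeric_df)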
If you compare the Position and encoded_position columns, you can see that CEO is encoded as 0, Cleaner as 1, and Employee as 2, i.e.
CEO ⇒ 0
Cleaner ⇒ 1
Employee ⇒ 2
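The same mapping can be read back from the fitted encoder: LabelEncoder stores the class names, sorted alphabetically, in its classes_ attribute, and each name's position in that list is its numeric code.

# le was last fitted on the 'Position' column
print(le.classes_)                          # ['CEO' 'Cleaner' 'Employee']
print(le.transform(['Employee', 'CEO']))    # [2 0]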
One big disadvantage of label encoding is that it turns categories into integers starting from 0, which can introduce an artificial priority. For instance, in our example female is encoded as 0 and male as 1.
A model could interpret this as males ranking higher than females, which is meaningless. Label encoding is therefore best suited to an ordinal dataset (a finite set of values with a rank ordering between them).
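Note that LabelEncoder always assigns codes in alphabetical order, which may not match the real ranking of an ordinal feature. If you need a specific order, scikit-learn's OrdinalEncoder accepts it explicitly. Here is a small sketch with a made-up 'Size' column (not part of our data frame):

from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'Size': ['small', 'large', 'medium', 'small']})
# spell out the ranking instead of relying on alphabetical order
enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
print(enc.fit_transform(sizes[['Size']]))   # small -> 0, medium -> 1, large -> 2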
2. One Hot Encoding
For datasets with nominal categories (no rank ordering between values), integer encoding may not be sufficient. Applying integer encoding to nominal data can mislead the model and result in poor performance.
One-hot encoding converts each integer code into a group of binary variables, where each bit represents one category. Since a value cannot belong to multiple categories at once, only one bit in the group can be "on"; hence the name one-hot encoding.
In this approach, the categorical variable is first converted into numeric values with a label encoder, and one-hot encoding is then applied to those numeric values.
We are going to use previously encoded data for one-hot encoding as:
from sklearn.preprocessing import OneHotEncoder

gender_encoded = le.fit_transform(df['Gender'])
# OneHotEncoder expects a 2-D array, so reshape the 1-D labels into a column
gender_encoded = gender_encoded.reshape(len(gender_encoded), -1)
one = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
print(one.fit_transform(gender_encoded))
Output
[[0. 1.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]]
Since OneHotEncoder expects its input as a 2-D (column) array, we need to reshape the 1-D array returned by the label encoder. Here, male is encoded as [0 1] and female as [1 0].
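As a side note, recent versions of scikit-learn (0.20 and later) also accept string columns directly, so the label-encoding step can be skipped. A minimal sketch on the same data frame:

# pass the column as a 2-D frame; no LabelEncoder needed
one = OneHotEncoder(sparse_output=False)
encoded = one.fit_transform(df[['Gender']])
print(one.categories_)   # which column corresponds to which category
print(encoded)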
3. Dummy Variable Encoding
The problem with one-hot encoding is that it creates redundancy. For example, if male is represented by [0 1] and female by [1 0], one of the two bits carries no extra information: knowing a row is not male already tells us it is female.
Dropping the redundant bit, male becomes [1] and female becomes [0]. This is called dummy variable encoding; it represents n categories with n-1 binary values.
To see how dummy variable encoding works, we'll encode the categorical data of the previous dataset:
import pandas as pd

pos = pd.get_dummies(df['Position'], drop_first=True)
print(pos)
Output
   Cleaner  Employee
0        0         0
1        1         0
2        0         1
3        1         0
4        0         0
5        1         0
We can see that Cleaner is represented with [1 0], Employee is represented with [0 1] and CEO is represented by [0 0].
Here, there are three categories (n = 3), and dummy variable encoding represents them with two (n-1) binary columns.
Pandas provides the get_dummies() function to perform dummy variable encoding.
The drop_first argument is set to True to drop the first category's column, which is what turns plain one-hot encoding into dummy variable encoding.
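get_dummies() can also be applied to several columns at once; pandas prefixes each new column with the name of the original column. A small sketch using the same data frame:

# encode both categorical columns in one call
dummies = pd.get_dummies(df[['Gender', 'Position']], drop_first=True)
print(dummies)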
Conclusion
Machines understand numerical data only, not categorical data. A machine learning model needs its input and output variables in numerical form, so handling categorical data properly is very important.
Label encoding is suited to ordinal data, while one-hot encoding is suited to nominal data. Label encoding can introduce a false priority ordering, while one-hot encoding introduces redundancy. Dummy variable encoding removes that redundancy and is commonly used with linear regression (and other regression algorithms).
Happy Learning 🙂