Feature Selection Methods in Machine Learning

Introduction

Feature selection is the process of reducing the number of input features when developing a machine learning model. It is done because it reduces the computational cost of training and can improve the model's performance.

Features that have a high correlation with the output variable are selected for training the model. Selecting a subset of the input features is important because it helps build an efficient model from the features that are most relevant to the target variable.

Building a model with redundant features may mislead it and hamper its performance. Hence, feature selection is essential.

 

Categorization of feature selection

Feature selection techniques are divided into two categories, namely:

  1. Supervised technique: used for labeled data
  2. Unsupervised technique: used for unlabeled data

For the demonstration, I am using a Jupyter Notebook and the heart disease prediction dataset from Kaggle to implement the various feature selection techniques. Here are some of the methods for feature selection:

 

1. Filter method

The filter method scores each individual feature by how strongly it relates to the target variable, for example by the amount of correlation between the two. It is a univariate analysis, as it checks how relevant each feature is to the target variable individually. The types of filter methods are as follows:

a) Information gain method

The information gain method computes the reduction in entropy of the target variable obtained from knowing a feature. It is based on information theory and measures how much information a feature provides about the target variable. Let's see how the information gain method is used for feature selection.

First, I am going to load the dataset:

import pandas as pd

df = pd.read_csv("heart.csv")
df.head()

Output

[First five rows of the heart disease dataset]

Now, let's implement the information gain method:

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

from sklearn.feature_selection import mutual_info_classif

scores = mutual_info_classif(X, y)
print(scores)

Output

[0.00113135 0.         0.14604363 0.         0.09081394 0.01610141
 0.03330569 0.08534967 0.10247971 0.0602119  0.11768226 0.10865301
 0.16903598]

 

Let’s plot a bar chart for better visualization:

import matplotlib.pyplot as plt

features = df.columns[0:13]
new_df = pd.Series(scores, index = features)
new_df.plot(kind = 'barh')
plt.ylabel("Features")
plt.xlabel("scores")
plt.title("Features with scores")
plt.show()

Output

[Bar chart of the mutual information score for each feature]

Looking at this bar chart, we can select as many features as required. Feature 'trtbps' has one of the lowest scores, and features such as 'sex' and 'age' can also be dropped from the dataset while training the model. Note that mutual information is estimated with some randomness, so the exact scores can vary slightly between runs.
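
If you would rather have scikit-learn pick the top features programmatically instead of reading them off the chart, SelectKBest can be paired with mutual_info_classif. Here is a minimal sketch, reusing the X and y defined above; the choice of k = 8 is purely illustrative:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# keep the 8 features with the highest mutual information scores (k = 8 is illustrative)
selector = SelectKBest(score_func = mutual_info_classif, k = 8)
X_selected = selector.fit_transform(X, y)

# names of the retained columns
print(X.columns[selector.get_support()])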

 

b) Chi-square method

The chi-square method is used for categorical data and calculates the chi-square statistic between each input feature and the target variable. Under the null hypothesis that a feature and the target are independent, this statistic follows the chi-squared distribution, so a larger value indicates a stronger dependence. The formula used for the calculation of chi-square is:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ is the observed frequency and Eᵢ is the expected frequency.

Now, let’s implement the chi-square method:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

feature = SelectKBest(score_func = chi2, k = 'all')
best_features = feature.fit(X,y)
print(best_features.scores_)

Output

[ 23.28662399,   7.57683451,  62.59809791,  14.8239245 ,
  23.93639448,   0.20293368,   2.97827075, 188.32047169,
  38.91437697,  72.64425301,   9.8040952 ,  66.44076512,
  5.79185297]

 

Using these scores and features, let’s plot the bar chart for better understanding:

import matplotlib.pyplot as plt

features = df.columns[0:13]
new_df = pd.Series(best_features.scores_, features)
new_df.plot(kind = 'barh')
plt.ylabel("Features")
plt.xlabel("scores")
plt.title("Features with scores")
plt.show()

Output

[Bar chart of the chi-square score for each feature]

 

Looking at this bar chart, we can select the top 10 or top 8 features. Alternatively, you can set k = 10 (say) instead of k = 'all' to select the top 10 features directly, as sketched below. Feature 'thalachh' has the highest score and feature 'fbs' has the lowest score.
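
Here is a small sketch of that alternative, reusing X and y from above; setting k = 10 makes SelectKBest keep only the ten highest-scoring features:

from sklearn.feature_selection import SelectKBest, chi2

# k = 10 keeps only the ten features with the highest chi-square scores
selector = SelectKBest(score_func = chi2, k = 10)
X_top10 = selector.fit_transform(X, y)

print(X.columns[selector.get_support()])   # names of the selected features
print(X_top10.shape)                       # (number of rows, 10)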

 

c) Correlation coefficient method

In this method, the correlation coefficient of each input feature with the target variable is calculated. Correlation can be positive or negative.

A positive correlation coefficient means that as the feature increases, the output variable tends to increase as well (and likewise for decreases). A negative correlation means that as the feature increases, the target variable tends to decrease, and vice versa.

The correlation coefficient (r) has values ranging from -1 to 1:

If r = 1, perfect positive correlation

If r = 0, no correlation

If r = -1, perfect negative correlation

Now, let’s see how the correlation coefficient method is used for feature selection:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize = (13, 10))
sns.heatmap(df.corr(), annot = True)
plt.show()

Output

[Correlation heatmap of the heart disease dataset]

We know that the correlation coefficient of a variable with itself is 1. Looking at the correlation matrix above, we find that the features 'cp', 'thalachh' and 'slp' are the most positively correlated with the output variable, while features such as 'thall', 'caa', 'oldpeak', 'exng', 'age' and 'sex' have a negative correlation with it. The remaining features don't have much correlation with the output variable, so we can drop them from the dataset.
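
To turn this visual inspection into code, we can rank the features by the absolute value of their correlation with the target column and keep only those above a threshold. A rough sketch, using the last column of the DataFrame as the target (as in the code above) and an arbitrary threshold of 0.2:

# correlation of every feature with the target column (the last column of the DataFrame)
target_col = df.columns[-1]
target_corr = df.corr()[target_col].drop(target_col)

# keep features whose absolute correlation exceeds the (arbitrary) threshold
threshold = 0.2
selected = target_corr[target_corr.abs() > threshold].index.tolist()
print(selected)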

 

2. Wrapper method

The wrapper method doesn't rely on a statistical test for feature selection. Instead, it takes a subset of features, trains the model on them, and measures the accuracy. It repeats this process until it finds the subset of features that gives the best accuracy. Since it involves training the model several times, it is expensive and time-consuming, so it is only suitable for small datasets.

a) Recursive Feature Elimination

Recursive Feature Elimination (RFE) recursively removes the least important features until the desired number of features is reached, thereby improving the performance and accuracy of the model.

from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
rfe = RFE(SVC(kernel = 'linear'), n_features_to_select = 8)
rfe.fit(X_train, y_train)
pred = rfe.predict(X_test)
print("Accuracy : ", accuracy_score(pred, y_test))

Output

Accuracy :  0.8524590163934426

This is how RFE is implemented to select the features and obtain the accuracy of the model.
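
To see which columns RFE actually kept, you can inspect the fitted selector's support_ mask (or its ranking_, where rank 1 marks a selected feature). A quick sketch using the rfe object fitted above:

# boolean mask of the selected features
print(X_train.columns[rfe.support_])

# ranking of every feature: 1 = selected, larger numbers = eliminated earlier
print(dict(zip(X_train.columns, rfe.ranking_)))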

 

b) Forward selection method

The forward selection method is an iterative process that starts with no features in the model. In each iteration, it adds the feature that is most relevant to the target variable. It continues until adding a new feature no longer improves the model's performance (or the requested number of features is reached). We are going to use the same dataset as in the feature selection methods above. For this, we need to install the mlxtend package:

$ pip install mlxtend

 

from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

ffs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors = 4), 
                                k_features = 10, 
                                forward = True, 
                                n_jobs = -1)
fs  = ffs.fit(X, y)
print(fs.k_feature_names_)
fs.k_score_

Output

('age', 'sex', 'cp', 'fbs', 'restecg', 'exng', 'oldpeak', 'slp', 'caa', 'thall')
0.7625136612021859

These are the top 10 features that are most relevant to the output variable. We can select any number of features by specifying the value of k_features.
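
Once the selector has been fitted, the chosen columns can be pulled out of the original DataFrame and used to train a model on the reduced feature set. A brief sketch using the fs object from above:

from sklearn.neighbors import KNeighborsClassifier

# subset the original DataFrame to the selected columns
X_selected = X[list(fs.k_feature_names_)]

# train the same KNN on the reduced feature set
knn = KNeighborsClassifier(n_neighbors = 4)
knn.fit(X_selected, y)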

 

c) Backward elimination method

The backward elimination method is just the reverse of the forward selection method. It starts by training the model with all the features and then, iteration by iteration, removes the least useful feature, keeping the subset that gives the best performance for the model.

from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

ffs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors = 4), 
                                k_features = 8, 
                                forward = False, 
                                n_jobs = -1)
fs  = ffs.fit(X, y)
print(fs.k_feature_names_)
print(fs.k_score_)

Output

('sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall')
0.8513114754098361

These are the top 8 features that are relevant to the target variable.

 

3. Embedded method

The embedded method performs feature selection while creating the machine learning model.

a) LASSO regularization

In this method, the L1 penalty shrinks some of the coefficients to zero, which means the corresponding features are effectively multiplied by zero when estimating the target. These features can therefore be removed, because they do not contribute to the performance of the model.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(LogisticRegression(C = 1, penalty = 'l1', solver = 'liblinear'))
sfm.fit(X_train, y_train)
important_features = X_train.columns[(sfm.get_support())] 
print(important_features)

Output

Index(['sex', 'cp', 'restecg', 'exng', 'oldpeak', 'caa', 'thall'], dtype='object')
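
To confirm that the L1 penalty really drove some coefficients to (or very close to) zero, you can look at the coefficients of the fitted estimator stored inside the selector. A small sketch using the sfm object fitted above:

# coefficients of the underlying L1-penalised logistic regression
coefs = sfm.estimator_.coef_[0]
for name, coef in zip(X_train.columns, coefs):
    print(name, round(coef, 4))   # features with a (near-)zero coefficient are the ones dropped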

 

Conclusion

Feature selection is important for filtering the redundant features from the dataset. The presence of redundant features can mislead the model which can cause degradation in model performance.

The filter method of feature selection uses a statistical approach to select the features, while the wrapper method does not; it trains the model repeatedly instead. The wrapper method is only suitable for small datasets and can be computationally very expensive on large ones.

The embedded method selects the features at the time of model building, hence the name "embedded". Feature selection is important because not all features are relevant to the output variable, and selecting only a subset of the available features improves the performance of the model.


Happy Learning 🙂
