Feature selection is the process of reducing the number of input features when developing a machine learning model. It is done because it reduces the computational cost of the model and can improve its performance.
Features that have a high correlation with the output variable are selected for training the model. Selecting the subset of the input features is important because it can help build the most efficient model with those features that are most relevant to the target variable.
Building a model with redundant features may mislead the model and hamper its performance. Hence, feature selection is essential.
Categorization of feature selection
Feature selection is subdivided into two categories, namely:
- Supervised technique: It is the technique used for labeled data
- Unsupervised technique: It is the technique used for unlabeled data
For demonstration, I am using Jupyter Notebook and I will use the heart disease prediction dataset from Kaggle for the implementation of various feature selection techniques. Here are some of the methods for feature selection:
1. Filter method
The filter method scores each feature by its statistical relationship (for example, its correlation) with the target variable. It is a univariate analysis, as it checks the relevance of each feature to the target variable individually. The types of filter methods are as follows:
a) Information gain method
The information gain method computes the reduction in entropy. It comes from information theory and measures how much information a feature provides about another variable, here the target. Let’s see how the information gain method is used for feature selection:
First, I am going to load the dataset:
import pandas as pd

df = pd.read_csv("heart.csv")
df.head()
Now, let’s implement the information gain method as:
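from sklearn.feature_selection import mutual_info_classif

# All columns except the last are input features; the last column is the target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Estimate the mutual information between each feature and the target
scores = mutual_info_classif(X, y)
print(scores)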
[0.00113135 0. 0.14604363 0. 0.09081394 0.01610141 0.03330569 0.08534967 0.10247971 0.0602119 0.11768226 0.10865301 0.16903598]
Let’s plot a bar chart for better visualization:
import matplotlib.pyplot as plt

features = df.columns[0:13]
# Pair each score with its feature name (uses the scores computed above)
new_df = pd.Series(scores, features)
new_df.plot(kind = 'barh')
plt.ylabel("Features")
plt.xlabel("scores")
plt.title("Features with scores")
plt.show()
Looking at this bar chart, we can select as many features as required. Features ‘sex’ and ‘trtbps’ have the lowest scores and ‘age’ scores close to zero, so these can be dropped from the dataset while training the model.
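To make the "reduction of entropy" idea concrete, here is a minimal sketch that computes information gain by hand for discrete features. The toy arrays and the helper names `entropy` and `information_gain` are made up for illustration. Note that `mutual_info_classif` uses a different, nearest-neighbor based estimator for continuous features, so its numbers will not match a hand calculation like this one.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(feature, target):
    """Entropy of the target minus its weighted entropy after splitting on the feature."""
    feature = np.asarray(feature)
    target = np.asarray(target)
    conditional = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # Weight each group's entropy by the fraction of rows in that group
        conditional += mask.mean() * entropy(target[mask])
    return entropy(target) - conditional

# Hypothetical toy data: 'informative' mirrors the target, 'noise' does not
target = np.array([1, 1, 1, 1, 0, 0, 0, 0])
informative = np.array([1, 1, 1, 1, 0, 0, 0, 0])
noise = np.array([1, 0, 1, 0, 1, 0, 1, 0])

ig_informative = information_gain(informative, target)
ig_noise = information_gain(noise, target)
print(ig_informative, ig_noise)
```

A feature that splits the target into pure groups gets the full entropy of the target as its gain, while a feature that leaves every group as mixed as before gets a gain of zero.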
b) Chi-square method
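The Chi-square method is used for categorical data and calculates the chi-square statistic between each input feature and the target variable. The statistic is computed under the null hypothesis that the feature and the target are independent, so a higher score suggests a stronger dependence. The formula used for the calculation of chi-square is:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where O_i is the observed frequency and E_i is the frequency expected under the null hypothesis.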
Now, let’s implement the chi-square method:
from sklearn.feature_selection import SelectKBest, chi2

# k = 'all' keeps every feature so that all the scores can be inspected
feature = SelectKBest(score_func = chi2, k = 'all')
best_features = feature.fit(X, y)
print(best_features.scores_)
[ 23.28662399, 7.57683451, 62.59809791, 14.8239245 , 23.93639448, 0.20293368, 2.97827075, 188.32047169, 38.91437697, 72.64425301, 9.8040952 , 66.44076512, 5.79185297]
Using these scores and features, let’s plot the bar chart for better understanding:
import matplotlib.pyplot as plt

features = df.columns[0:13]
new_df = pd.Series(best_features.scores_, features)
new_df.plot(kind = 'barh')
plt.ylabel("Features")
plt.xlabel("scores")
plt.title("Features with scores")
plt.show()
Looking at this bar chart, we can select the top 10 or top 8 features. You can also set k = 10 (for example) instead of k = ‘all’ to select the top 10 features from the dataset. Feature ‘thalachh’ has the highest score and feature ‘fbs’ has the lowest score.
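As a worked illustration of the chi-square statistic, the sketch below computes it by hand from a small contingency table of observed counts and compares the result with `scipy.stats.chi2_contingency`. The table is invented for illustration. scikit-learn's `chi2` scorer builds its observed and expected counts differently (it sums feature values per class), so this is a sketch of the statistic itself rather than a reimplementation of that function.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows are feature categories, columns are target classes
observed = np.array([[30, 10],
                     [10, 30]])

# Expected counts if feature and target were independent:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Sum of (observed - expected)^2 / expected over all cells
chi_square = float(((observed - expected) ** 2 / expected).sum())

# Cross-check with SciPy (correction=False disables Yates' continuity correction)
scipy_stat = chi2_contingency(observed, correction=False)[0]
print(chi_square, scipy_stat)
```

The further the observed counts are from the counts expected under independence, the larger the statistic becomes.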
c) Correlation coefficient method
In this method, the correlation coefficient of the input feature is calculated with the target variable. Correlation can be positive and negative.
A positive correlation coefficient means if there is an increase or decrease in the feature variable then there is a corresponding increase or decrease in the output variable. A negative correlation means that if there is an increase in the feature there is a decrease in the target variable and vice versa.
The correlation coefficient (r) has values ranging from -1 to 1.
If r = 1, perfect positive correlation,
If r = 0, no linear correlation,
If r = -1, perfect negative correlation
Now, let’s see how the correlation coefficient method is used for feature selection:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize = (13, 10))
# Heatmap of the pairwise correlation matrix, annotated with the coefficients
sns.heatmap(df.corr(), annot = True)
plt.show()
We know that the correlation coefficient of a variable with itself is 1. Looking at the correlation matrix above, features ‘cp’, ‘thalachh’ and ‘slp’ are the most positively correlated with the output variable, and features ‘thall’, ‘caa’, ‘oldpeak’, ‘exng’, ‘age’ and ‘sex’ have a negative correlation with the output variable. The remaining features don’t have much correlation with the output variable, so we can drop them from the dataset.
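Reading the heatmap by eye can also be turned into a rule. Here is a minimal sketch that keeps the features whose absolute correlation with the target is above a cutoff. The DataFrame `toy`, its column names, and the threshold of 0.3 are all assumptions made for illustration, not values taken from the heart disease dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: two features track a hidden signal, one is pure noise
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "strong_pos": signal + 0.3 * rng.normal(size=n),
    "strong_neg": -signal + 0.3 * rng.normal(size=n),
    "unrelated": rng.normal(size=n),
})
toy["output"] = (signal > 0).astype(int)

# Correlation of every feature with the target column
corr_with_target = toy.corr()["output"].drop("output")

# Keep features whose absolute correlation reaches the (arbitrary) threshold
threshold = 0.3
selected = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()
print(corr_with_target)
print(selected)
```

Using the absolute value matters here: a strongly negative correlation is just as informative as a strongly positive one.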
2. Wrapper method
The wrapper method doesn’t rely on a statistical test for feature selection. It takes a subset of features, trains the model on it, and calculates the accuracy. It repeats this process until it arrives at the best features and the best accuracy of the model. Since it involves training the model many times, it is expensive and time-consuming, so this method is only suitable for small datasets.
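The train-and-score loop described above can be sketched in its most naive form as an exhaustive search over feature subsets. This is only an illustration of the idea under assumed settings (a synthetic dataset from `make_classification`, subsets of exactly two features, logistic regression, 5-fold cross-validation), not the approach used by any particular library.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic dataset with 5 features, 2 of them informative
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)

best_score, best_subset = -1.0, None
# Try every pair of features and keep the pair with the best cross-validated accuracy
for subset in combinations(range(X_toy.shape[1]), 2):
    model = LogisticRegression(max_iter=1000)
    score = cross_val_score(model, X_toy[:, list(subset)], y_toy, cv=5).mean()
    if score > best_score:
        best_score, best_subset = score, subset

print(best_subset, best_score)
```

Even in this tiny case the model is trained 50 times (10 pairs times 5 folds), which is why wrapper methods become costly as the number of features grows.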
a) Recursive Feature Elimination
Recursive Feature Elimination (RFE) recursively removes the least important features until the desired number of features is reached, which can improve the performance and accuracy of the model.
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Keep the 8 features ranked highest by a linear SVM
rfe = RFE(SVC(kernel = 'linear'), n_features_to_select = 8)
rfe.fit(X_train, y_train)
pred = rfe.predict(X_test)
print("Accuracy : ", accuracy_score(y_test, pred))
Accuracy : 0.8524590163934426
This is how RFE is implemented to select the features and obtain the accuracy of the model.
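To see which features RFE actually kept, the fitted selector exposes `support_` (a boolean mask of kept features) and `ranking_` (1 for kept features, larger numbers for features eliminated earlier). The sketch below shows both on a synthetic dataset. The data and the choice of three features are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Hypothetical synthetic dataset with 6 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Remove one feature per iteration until 3 remain
selector = RFE(SVC(kernel="linear"), n_features_to_select=3, step=1)
selector.fit(X_toy, y_toy)

print(selector.support_)   # True for the features that were kept
print(selector.ranking_)   # 1 = kept; higher = eliminated earlier

# Reduce the data to the selected columns only
X_reduced = selector.transform(X_toy)
print(X_reduced.shape)
```

With a pandas DataFrame such as `X_train`, the same mask can be used to look up names, for example `X_train.columns[rfe.support_]`.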
b) Forward selection method
The forward selection method is an iterative process that starts with no features in the model. In each iteration, it adds the feature that is most relevant to the target variable. It continues until adding new features no longer improves the model performance. We are going to use the same dataset as in the feature selection methods above. For this, we need to install the mlxtend module:
$ pip install mlxtend
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

ffs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors = 4), k_features = 10, forward = True, n_jobs = -1)
fs = ffs.fit(X, y)
print(fs.k_feature_names_)
print(fs.k_score_)
('age', 'sex', 'cp', 'fbs', 'restecg', 'exng', 'oldpeak', 'slp', 'caa', 'thall')
0.7625136612021859
These are the top 10 features that are most relevant to the output variable. We can select any number of features by specifying the value of k_features.
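The greedy loop behind forward selection can be sketched by hand as follows. This is a simplified illustration, not mlxtend's implementation: the helper `forward_select` is hypothetical, it stops at a fixed number of features (like `k_features`) rather than when the score stops improving, and it runs on a synthetic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select(X, y, n_features, estimator, cv=5):
    """Greedily add the feature that gives the best cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {}
        for candidate in remaining:
            columns = selected + [candidate]
            scores[candidate] = cross_val_score(estimator, X[:, columns], y, cv=cv).mean()
        # Keep the candidate whose addition scored highest in this round
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical synthetic dataset with 6 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

chosen = forward_select(X_toy, y_toy, n_features=3, estimator=KNeighborsClassifier(n_neighbors=4))
print(chosen)
```

Each round only evaluates the features not yet selected, so the search is far cheaper than trying every possible subset, at the cost of possibly missing the best combination.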
c) Backward elimination method
The backward elimination method is the reverse of the forward selection method. Initially, it trains the model with all the features, and then removes features iteration by iteration, keeping the subset that works best for the model and thereby aiming to increase its accuracy.
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

# forward = False switches the selector to backward elimination
ffs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors = 4), k_features = 8, forward = False, n_jobs = -1)
fs = ffs.fit(X, y)
print(fs.k_feature_names_)
print(fs.k_score_)
('sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall')
0.8513114754098361
These are the top 8 features that are relevant to the target variable.
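If installing mlxtend is not an option, scikit-learn ships its own `SequentialFeatureSelector` (in `sklearn.feature_selection`, available in recent versions) that supports both directions. It shares its name with the mlxtend class but takes different parameters, such as `n_features_to_select` and `direction`. The following is a sketch on a synthetic dataset, with the data and the number of features chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic dataset with 6 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Start from all 6 features and remove one at a time until 3 remain
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=4),
    n_features_to_select=3,
    direction="backward",
    cv=5,
)
sfs.fit(X_toy, y_toy)

mask = sfs.get_support()          # boolean mask of the kept features
X_reduced = sfs.transform(X_toy)  # data restricted to the kept features
print(mask, X_reduced.shape)
```

Setting `direction="forward"` on the same class gives forward selection instead.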
3. Embedded method
The embedded method performs feature selection while creating the machine learning model.
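In this method, regularization shrinks some of the coefficients toward zero. With an L1 (lasso) penalty they can become exactly zero, which means the corresponding features are multiplied by zero when estimating the target. These features can be removed because they do not contribute to the performance of the model. The example here uses an L2-penalized logistic regression, for which SelectFromModel keeps the features whose coefficient magnitude is at or above the mean.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Uses X_train and y_train from the train/test split created earlier
sfm = SelectFromModel(LogisticRegression(C = 1, penalty = 'l2'))
sfm.fit(X_train, y_train)
important_features = X_train.columns[(sfm.get_support())]
print(important_features)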
Index(['sex', 'cp', 'restecg', 'exng', 'oldpeak', 'caa', 'thall'], dtype='object')
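To illustrate coefficients being driven to exactly zero, here is a sketch using an L1-penalized logistic regression, which is a different setting from the L2 penalty in the snippet above. The synthetic dataset, the `liblinear` solver, and `C=0.05` are assumptions chosen for illustration. A smaller `C` means stronger regularization and usually more zeroed coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic dataset with 10 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Standardize so that the penalty treats all features on the same scale
X_scaled = StandardScaler().fit_transform(X_toy)

# The L1 penalty can set coefficients exactly to zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
l1_model.fit(X_scaled, y_toy)

coefs = l1_model.coef_.ravel()
kept = np.flatnonzero(coefs != 0)  # indices of features with non-zero weight
print(coefs)
print("Features kept:", kept)
```

The features whose coefficient is zero play no part in the prediction, so dropping them leaves the fitted model unchanged.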
Feature selection is important for filtering the redundant features from the dataset. The presence of the redundant features can mislead the model which can cause degradation in model performance.
The filter method of feature selection uses a statistical approach to select the features, while the wrapper method instead trains the model repeatedly on different feature subsets. The wrapper method is only suitable for small datasets and can be very expensive in terms of computation with large datasets.
The embedded method selects the features at the time of model building, which is where the name "embedded" comes from. In short, feature selection is important because not all features are relevant to the output variable, and selecting only a subset of the available features can improve the performance of the model.
Happy Learning 🙂