Hello, and welcome to the Hands-On Feature Engineering series. If you have followed the previous tutorials, you know that we are almost at the end of the series and have covered 5 different feature engineering techniques, which you can check here. In this tutorial, we will discuss different techniques for feature selection. Without further ado, let's look at the main headings we will be studying.
Table of Contents
A brief introduction to Feature Selection
Feature selection is the process of automatically selecting, using statistical measures, the features that contribute the most to predicting the output.
With real-world datasets, we rarely get data in which every feature is important for building a model. Hence we have to examine the relationships between features and select the best ones, because this choice directly affects both the model's training time and its performance.
Why Performing Feature Selection Is Important
It is often said that a simple machine learning model trained on good features will outperform a highly optimized model trained on bad data, because the whole game depends on what kind of data you train the model on. Three important benefits of performing feature selection are discussed below.
1) Reduces Overfitting - Feature selection reduces the complexity of the model: it helps remove multicollinearity from the data, as well as features that have little or no relationship with the output variable. Multicollinearity means that two (or more) features are highly correlated with each other and therefore carry nearly the same information about the output variable. In this case, it is better to find those features and drop one of them, which reduces overfitting.
2) Improves accuracy - With less overfitting and fewer misleading features, the model can learn the real relationships between the features and the target more easily, so accuracy often improves.
3) Reduces training time - With fewer, better features, the model has less data to process, so training time goes down.
Let's get started with the various methods used for feature selection. To demonstrate each technique we will use the Mobile Price Classification dataset, which you can easily find on Kaggle from here.
Techniques For Feature Selection
1) Univariate Selection
Statistical tests can be used to select the features that have the strongest relationship with the target variable. Univariate selection scores each feature individually against the target variable and keeps the top-scoring ones. Analysis of variance (ANOVA) is one such test. Python's scikit-learn library provides the SelectKBest class, which can be combined with different statistical tests to select a fixed number of important features.
For the Mobile Price Range prediction dataset, we use the chi-square statistical test, which requires non-negative features, to select the 10 best features.
The chi-square test is intended for categorical variables in a dataset. It computes a chi-square score between each independent variable and the dependent variable, and the features with the highest scores are selected.
Other statistical tests can be used with this selection method. For example, the ANOVA F-value is appropriate for numerical inputs and a categorical target. It is available through the f_classif() function.
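A minimal sketch of univariate selection with SelectKBest is shown here. It assumes a synthetic dataset from make_classification as a stand-in for the mobile price data (20 features, 4 classes), and the column names f0 to f19 are made up for illustration. The features are shifted to be non-negative because chi2 requires it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Synthetic stand-in for the mobile price data (an assumption, not the real file).
X_raw, y = make_classification(
    n_samples=500, n_features=20, n_informative=6,
    n_classes=4, random_state=0,
)

# chi2 requires non-negative inputs, so shift every column to start at 0.
X = pd.DataFrame(
    X_raw - X_raw.min(axis=0),
    columns=[f"f{i}" for i in range(20)],
)

# Keep the 10 features with the highest chi-square score.
chi2_selector = SelectKBest(score_func=chi2, k=10).fit(X, y)
chi2_features = list(X.columns[chi2_selector.get_support()])

# Same idea with the ANOVA F-value as the scoring function.
f_selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
f_features = list(X.columns[f_selector.get_support()])

print("chi2 picks:   ", chi2_features)
print("ANOVA F picks:", f_features)
```

With the real dataset, you would load the CSV into a DataFrame and pass the feature columns and the target column in place of X and y. The two scoring functions may not pick exactly the same features, which is a useful sanity check in itself.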
2) Feature Importance
This is a very handy feature selection technique. Tree-based models expose the importance of each feature through the feature_importances_ attribute of the fitted model. The Random Forest algorithm (bagging) and Extra Trees can both be used to estimate feature importance.
The higher the score, the more important the feature is in predicting the output. Let's train a model, look at the importances, and pick the top 10 features.
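The idea can be sketched as follows, assuming a synthetic dataset in place of the mobile price data and made-up column names f0 to f19. The model here is ExtraTreesClassifier; a RandomForestClassifier could be swapped in the same way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the mobile price data (an assumption).
X_raw, y = make_classification(
    n_samples=500, n_features=20, n_informative=6,
    n_classes=4, random_state=0,
)
X = pd.DataFrame(X_raw, columns=[f"f{i}" for i in range(20)])

# Fit the tree ensemble; importances are available after fitting.
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each importance score with its column name and keep the 10 largest.
importances = pd.Series(model.feature_importances_, index=X.columns)
top10 = importances.nlargest(10)
print(top10)
```

The scores are normalized to sum to 1, so each value can be read as a feature's share of the total importance.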
3) Correlation Matrix using Heatmap
This is usually the first technique applied to any problem statement, to find the relationships of the variables with each other as well as with the target variable. Using two variables that carry the same information for predicting the target is not recommended, so this technique helps you spot and avoid such features. This property, where features are highly correlated with each other, is known as multicollinearity.
Correlation can be positive (the two variables increase together) or negative (one increases as the other decreases), and its strength ranges from -1 to 1. A heatmap makes it easy to identify the features that are highly correlated with the target feature and to visualize the relationships between variables.
If you observe that some features are highly correlated with each other and have the same relationship with the target feature, you can remove the redundant ones by hand or remove them using the threshold method.
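A small sketch of computing the correlation matrix with pandas is given here. The data is randomly generated and the column names (ram, ram_copy, battery, noise, target) are hypothetical, chosen so that one pair of features is nearly identical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
ram = rng.normal(size=n)
battery = rng.normal(size=n)

# Toy data: ram_copy nearly duplicates ram, noise is unrelated to the target.
df = pd.DataFrame({
    "ram": ram,
    "ram_copy": ram + 0.05 * rng.normal(size=n),
    "battery": battery,
    "noise": rng.normal(size=n),
})
df["target"] = 2.0 * ram + 0.5 * battery + 0.1 * rng.normal(size=n)

# Pearson correlation between every pair of columns.
corr = df.corr()

# Correlation of each feature with the target, strongest first.
target_corr = corr["target"].drop("target").sort_values(key=abs, ascending=False)

print(corr.round(2))
print(target_corr.round(2))

# To draw the heatmap, one option (assuming seaborn and matplotlib are installed):
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="RdYlGn")
```

In the printed matrix, ram and ram_copy show a correlation close to 1 with each other, which is the pattern to look for when hunting for redundant features.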
How to Remove Correlated Features?
We can remove correlated features using the threshold method. In this technique, we set a particular threshold, check which pairs of features have an absolute correlation above it, and keep only one feature out of each such group.
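One way to sketch the threshold method is the helper below. The function name correlated_features, the 0.9 threshold, and the toy columns are all assumptions made for illustration; the rule it applies is to drop a column when it is too strongly correlated with an earlier column that is being kept.

```python
import numpy as np
import pandas as pd

def correlated_features(X: pd.DataFrame, threshold: float) -> set:
    """Return columns whose absolute correlation with an earlier,
    kept column exceeds the threshold."""
    corr = X.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i):
            # Only compare against columns we are still keeping.
            if cols[j] not in to_drop and corr.iloc[i, j] > threshold:
                to_drop.add(cols[i])
                break
    return to_drop

# Toy data: a_dup is almost an exact copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "a_dup": a + 0.01 * rng.normal(size=200),
    "b": rng.normal(size=200),
})

drop = correlated_features(X, threshold=0.9)
X_reduced = X.drop(columns=sorted(drop))
print("dropped:", drop)
print("kept:   ", list(X_reduced.columns))
```

The threshold is a judgment call. Values around 0.8 to 0.9 are common starting points, but the right value depends on the dataset.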
4) Information Gain
Information Gain (IG) measures how much information a particular feature gives about a class. We use the information gain value to pick the features that are best suited for predicting the class.
Information gain is used, along with entropy, in decision tree algorithms such as ID3 to decide the root node and the best splitting criterion. The same concept is used here to select the best features.
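To make the entropy-based definition concrete, here is a hand-rolled sketch for discrete features. The helper names entropy and information_gain and the tiny table (has_4g, dual_sim, price) are hypothetical, constructed so that one feature determines the class completely and the other tells us nothing.

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy (in bits) of a discrete label column."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature: pd.Series, target: pd.Series) -> float:
    """Entropy of the target minus its weighted entropy after
    splitting on each value of the feature."""
    weights = feature.value_counts(normalize=True)
    conditional = sum(
        w * entropy(target[feature == value])
        for value, w in weights.items()
    )
    return entropy(target) - conditional

# Toy table: has_4g matches price exactly, dual_sim is unrelated to it.
df = pd.DataFrame({
    "has_4g":   [1, 1, 1, 1, 0, 0, 0, 0],
    "dual_sim": [1, 0, 1, 0, 1, 0, 1, 0],
    "price":    [1, 1, 1, 1, 0, 0, 0, 0],
})

gains = {
    col: information_gain(df[col], df["price"])
    for col in ["has_4g", "dual_sim"]
}
print(gains)
```

Here has_4g gets a gain of 1 bit (it removes all uncertainty about price) and dual_sim gets 0. For continuous features, scikit-learn's mutual_info_classif estimates the closely related mutual information and can be plugged into SelectKBest as a scoring function.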
5) Recursive Feature Elimination
Recursive Feature Elimination (RFE) works by recursively removing attributes and training a model on the remaining attributes. It uses the fitted model, for example its coefficients or feature importances, to find out which features contribute the most to predicting the target class.
Here we use RFE with logistic regression to select the best 3 features. The choice of algorithm does not matter too much, as long as it is skillful and consistent.
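A minimal sketch of RFE with logistic regression follows. It assumes a synthetic dataset instead of the mobile price data, made-up column names f0 to f19, and standardized inputs so that the coefficients are comparable across features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the mobile price data (an assumption).
X_raw, y = make_classification(
    n_samples=500, n_features=20, n_informative=6,
    n_classes=4, random_state=0,
)

# Scale the features so coefficient magnitudes can be compared.
X = pd.DataFrame(
    StandardScaler().fit_transform(X_raw),
    columns=[f"f{i}" for i in range(20)],
)

# Repeatedly fit the model and drop the weakest feature until 3 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

selected = list(X.columns[rfe.support_])
print("selected:", selected)
print("ranking: ", rfe.ranking_)  # 1 marks a selected feature
```

The ranking_ array shows the order of elimination: selected features get rank 1, and higher numbers were removed earlier.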
6) Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique that uses linear algebra to transform a dataset into a compressed form. A property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result. The components are combinations of the original features rather than a subset of them, but the technique can serve the same purpose as feature selection by reducing the number of inputs.
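A short sketch of PCA with scikit-learn is shown below, assuming a synthetic dataset and a choice of 3 components made purely for illustration. The features are standardized first, since PCA is sensitive to the scale of each column.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the mobile price data (an assumption).
X_raw, _ = make_classification(
    n_samples=500, n_features=20, n_informative=6, random_state=0,
)
X_scaled = StandardScaler().fit_transform(X_raw)

# Project the 20 original features onto 3 principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

print("transformed shape:", X_pca.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

The explained variance ratio tells you how much of the original variation each component retains, which helps when deciding how many components to keep.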
SUMMARY
Feature selection is a core step of the feature engineering lifecycle when working on any problem statement. We have studied six different feature selection techniques in this article, and I hope that all of them were easy to understand. If you have any queries, corrections, or suggestions, please use the comment section below 👇.
👉 Let me know which technique is your favorite or which one works best for you. And if you want to add any technique to this list, please comment it down below. I will be happy to see the list growing and to learn something from your side.