Blogs


Feature Selection methods In data science

Introduction

One of the best ways I use to learn machine learning is by benchmarking myself against the best data scientists in competitions. It gives you a lot of insight into how you perform against the best on a level playing field.

Initially, I used to believe that machine learning is going to be all about algorithms – know which one to apply when and you will come on the top. When I got there, I realized that was not the case – the important point is that the same algorithms which a lot of other people were using are used by most data scientists. Why some models are better than others in Kaggle competitions and any competition i ever saw or participated lies in two things: Feature Creation and Feature Selection.

In other words, it boils down to creating variables which capture hidden business insights and then making the right choices about which variable to choose for your predictive models! Sadly or thankfully, both these skills require a ton of practice. There is also some art involved in creating new features – some people have a knack of finding trends where other people struggle.

In this article, I will focus on one of the 2 critical parts of getting your models right – feature selection. I will discuss in detail why feature selection plays such a vital role in creating an effective predictive model.

Table of Contents

  1. Importance of Feature Selection
  2. Filter Methods
  3. Wrapper Methods
  4. Embedded Methods
  5. Difference between Filter and Wrapper methods

1. Importance of Feature Selection

Machine learning works on a simple rule – if you put garbage in, you will only get garbage to come out. By garbage here, I mean noise in data.

This becomes even more important when the number of features is very large. You need not use every feature at your disposal for creating an algorithm. You can assist your algorithm by feeding in only those features that are really important. I have myself witnessed feature subsets giving better results than the complete set of feature for the same algorithm. Or as puts it – “Sometimes, less is better!”

Not only in the competitions but this can be very useful in industrial applications as well. You not only reduce the training time and the evaluation time, you also have fewer things to worry about!

Top reasons to use feature selection are:

  • It enables the machine learning algorithm to train faster.
  • It reduces the complexity of a model and makes it easier to interpret.
  • It improves the accuracy of a model if the right subset is chosen.
  • It reduces overfitting.

Next, we’ll discuss various methodologies and techniques that you can use to subset your feature space and help your models perform better and efficiently. So, let’s get started.

2. Filter Methods

Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithms. Instead, features are selected on the basis of their scores in various statistical tests for their correlation with the outcome variable. The correlation is a subjective term here. For basic guidance, you can refer to the following table for defining correlation coefficients.

  • Pearson’s Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson’s correlation is given as:
  • LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
  • ANOVA: ANOVA stands for Analysis of variance. It is similar to LDA except for the fact that it is operated using one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal or not.
  • Chi-Square: It is a is a statistical test applied to the groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.

One thing that should be kept in mind is that the filter method does not remove multicollinearity. So, you must deal with multicollinearity of features as well before training models for your data.

3. Wrapper Methods

In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from your subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.

Some common examples of wrapper methods are forward feature selection, backward feature elimination, recursive feature elimination, etc.

  • Forward Selection: Forward selection is an iterative method in which we start with having no feature in the model. In each iteration, we keep adding the feature which best improves our model until an addition of a new variable does not improve the performance of the model.
  • Backward Elimination: In backward elimination, we start with all the features and removes the least significant feature at each iteration which improves the performance of the model. We repeat this until no improvement is observed on removal of features.
  • Recursive Feature elimination: It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each iteration. It constructs the next model with the left features until all the features are exhausted. It then ranks the features based on the order of their elimination.

One of the best ways for implementing feature selection with wrapper methods is to use Boruta package that finds the importance of a feature by creating shadow features.

It works in the following steps:

  1. Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features).
  2. Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature where higher means more important.
  3. At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e. whether the feature has a higher Z-score than the maximum Z-score of its shadow features) and constantly removes features which are deemed highly unimportant.
  4. Finally, the algorithm stops either when all features get confirmed or rejected or it reaches a specified limit of random forest runs.

 

4. Embedded Methods

 

Embedded methods combine the qualities’ of filter and wrapper methods. It’s implemented by algorithms that have their own built-in feature selection methods.

Some of the most popular examples of these methods are LASSO and RIDGE regression which have inbuilt penalization functions to reduce overfitting.

  • Lasso regression performs L1 regularization which adds penalty equivalent to the absolute value of the magnitude of coefficients.
  • Ridge regression performs L2 regularization which adds penalty equivalent to the square of the magnitude of coefficients.

5. Difference between Filter and Wrapper methods

The main differences between the filter and wrapper methods for feature selection are:

  • Filter methods measure the relevance of features by their correlation with dependent variable while wrapper methods measure the usefulness of a subset of feature by actually training a model on it.
  • Filter methods are much faster compared to wrapper methods as they do not involve training the models. On the other hand, wrapper methods are computationally very expensive as well.
  • Filter methods use statistical methods for evaluation of a subset of features while wrapper methods use cross-validation.
  • Filter methods might fail to find the best subset of features in many occasions but wrapper methods can always provide the best subset of features.
  • Using the subset of features from the wrapper methods make the model more prone to overfitting as compared to using a subset of features from the filter methods.

End Notes

I believe that his article has given you a good idea of how you can perform feature selection to get the best out of your models. These are the broad categories that are commonly used for feature selection. I believe you will be convinced about the potential uplift in your model that you can unlock using feature selection and added benefits of feature selection.

Did you enjoy reading this article? Always appreciated for taking time and if you want to read about my other posts please follow my blog for updates here or visit my Github.

 Do share your views in the comment section below.