Master the most important techniques for encoding categorical variables | Feature Engineering Part-II


Hello and welcome to a wonderful series on feature engineering, where we are exploring different techniques to make our data better, extract maximum information from it, and build a generalized model. This is Part-2, where we will explore different techniques for categorical encoding. In our previous blog, we studied different techniques for missing value imputation. Without further ado, let's head over to the agenda of the article to see what we will be studying.


Introduction to Categorical Encoding

We know that machines are capable of understanding only numerical data, but in any real-world scenario we have to deal with categorical data. The problem is how to feed this data to machine learning models. To solve this, we have many techniques, as well as pre-built Python functions and libraries, that help encode the categories as numbers, which we can then feed to the model.

Techniques for categorical encoding | Feature engineering techniques

Categorical data is very important in most problem statements because it carries much of the information that needs to be retrieved and analyzed, and it must be included in feature engineering tasks. So, let's get started and understand each technique in detail.

Overview of Sample Dataset

The dataset we will use to demonstrate each method is the famous Titanic dataset. You can easily find this data on Kaggle or visit this link. If you are a machine learning practitioner, I hope you already know this particular dataset.

Overview of Titanic Dataset
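For readers following along, here is a minimal sketch of how the data might be loaded with pandas; the file name train.csv and the selected columns are assumptions based on the standard Kaggle download.

import pandas as pd

# Load only the columns used in the examples below (assumed local Kaggle file)
df = pd.read_csv("train.csv", usecols=["Survived", "Sex", "Embarked", "Cabin"])
print(df.head())
print(df.dtypes)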

Techniques to Handle Categorical Variables

We know that categorical variables come in different types, namely ordinal and nominal, so there are different techniques for handling ordinal and nominal variables that we will study. It is important to identify a variable as nominal or ordinal because applying the wrong technique to a variable may produce inappropriate outputs.

1) One-Hot Encoding

The technique consists of encoding each category of a variable as a Boolean variable that takes the values 0 and 1. This means that if a particular category is present in an observation (row), it is assigned 1, else 0.

One-hot encoding comes in two variants.

  • The first variant splits a category into k-1 variables
  • The second splits it into k variables

One Hot Encoding in K-1 Variables

It simply means keeping one variable less while splitting. For example, if I have 5 categories in a variable and I apply one-hot encoding with the k-1 split criterion, the result will have only 4 columns. To identify the fifth category, we look at the rows where all 4 columns are 0: that observation belongs to the 5th category. Hence, this technique uses the feature space more efficiently. I hope this is clear to you.

To implement the one-hot encoding technique, pandas has a very handy function, get_dummies(), and to split into k-1 categories we pass the parameter drop_first=True.

One Hot Encoding with K categories

It is similar to the one above; the only difference is that here we get one extra column, i.e. all k categories get split into features.

💻Code:
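The original snippet is not reproduced here, so below is a minimal sketch of both variants using pandas' get_dummies(); the file name train.csv and the choice of the Embarked column are assumptions based on the standard Kaggle download.

import pandas as pd

# Assumed local path to the Kaggle Titanic training file
df = pd.read_csv("train.csv", usecols=["Embarked"])

# One-hot encoding into k variables (one Boolean column per category)
ohe_k = pd.get_dummies(df["Embarked"])

# One-hot encoding into k-1 variables (the first category is dropped)
ohe_k_minus_1 = pd.get_dummies(df["Embarked"], drop_first=True)

print(ohe_k.head())
print(ohe_k_minus_1.head())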

Advantages of One-Hot Encoding
  1. It does not lose any kind of information
  2. It is more suitable for linear models
  3. It is easy to implement and can be used in production

Disadvantages of One-Hot Encoding
  1. It expands the feature space
  2. If a variable has many categories, this technique is not practical; a common workaround is to encode only the most frequent categories (for example, the top 10).
  3. It does not add any extra information while encoding.

2) Integer Label Encoding

The technique consists of assigning each category a number from 0 to n (or 1 to n). We take the unique categories and assign each of them one integer. The assignment can be based on ranks, on the frequency count of a category, or simply on the order of the unique categories; it works fine in all three cases.

Let's take an example of the Embarked column in our dataset. 

💻Code:
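Since the original snippet is not shown here, the following is one possible sketch that reproduces the mapping printed below; it simply enumerates the unique values of the Embarked column in their order of appearance (the file path is an assumption).

import pandas as pd

df = pd.read_csv("train.csv", usecols=["Embarked"])

# Build a mapping from each unique category to an integer label
mapping = {category: index
           for index, category in enumerate(df["Embarked"].dropna().unique())}
print(mapping)

# Replace the categories with their integer labels
df["Embarked_encoded"] = df["Embarked"].map(mapping)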

On printing the mapping, the output dictionary looks as shown below.

                {'S': 0, 'C': 1, 'Q': 2}

Advantages Of Integer Label Encoding
  1. It does not expand the feature space
  2. Straightforward Implementation
  3. It works very well in tree-based algorithms
  4. Allows agile benchmarking of Machine Learning Models.

Disadvantages of Integer Label Encoding
  1. It cannot handle new categories in the data automatically
  2. Not suitable for linear models
  3. If we use this technique to encode according to a ranking, it provides some additional information; otherwise it does not add any extra information to the categories.

3) Count or Frequency Encoding

In this technique, the category in each observation is replaced by the number of times that category appears in the dataset (the count) or by the percentage of observations it accounts for (the frequency).

💻Code:
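As a hedged sketch of this technique (the Embarked column and file path are assumptions), both count and frequency encoding can be done with value_counts() and map():

import pandas as pd

df = pd.read_csv("train.csv", usecols=["Embarked"])

# Count encoding: replace each category by how many times it appears
count_map = df["Embarked"].value_counts().to_dict()
df["Embarked_count"] = df["Embarked"].map(count_map)

# Frequency encoding: replace each category by its share of the dataset
freq_map = (df["Embarked"].value_counts() / len(df)).to_dict()
df["Embarked_freq"] = df["Embarked"].map(freq_map)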

Advantages of Count or Frequency Encoding
  1. Easy and straightforward implementation
  2. It also works well in tree-based algorithms

Disadvantages of Count or Frequency Encoding
  1. We can lose valuable category information: if two categories have the same frequency count, they get the same encoding, which may distort their relationship with the target.
  2. Not capable of handling new categories.

4) Target Guided Encoding

The categories are assigned integer labels according to their relationship with the target variable. In this technique, we order the unique categories by the mean of the target for each category, so the top category (the one with the highest target mean) receives the highest label.

💻Code:
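A minimal sketch of target guided (ordinal) encoding, assuming the Embarked column and the Survived target; the original snippet may have used a different column.

import pandas as pd

df = pd.read_csv("train.csv", usecols=["Survived", "Embarked"])

# Order the categories by the mean of the target (ascending),
# then assign integer labels 0, 1, 2, ... in that order
ordered = df.groupby("Embarked")["Survived"].mean().sort_values().index
ordinal_map = {category: rank for rank, category in enumerate(ordered)}
print(ordinal_map)

df["Embarked_target_ordinal"] = df["Embarked"].map(ordinal_map)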

Advantages of Target Guided Encoding
  1. Simple and easy to implement
  2. It makes direct use of the relationship with the target column
  3. It creates a monotonic relationship between the categories and the target variable.

Disadvantages of Target Guided Encoding
  1. Sometimes it may lead to overfitting during modeling.

5) Mean Encoding

Mean encoding means replacing each category with the mean of the target for that category. It is similar to the previous technique; the only difference is that there we take a rank and assign an integer, whereas here we encode directly with the mean values.

💻Code:
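A minimal sketch, again assuming the Embarked column and the Survived target:

import pandas as pd

df = pd.read_csv("train.csv", usecols=["Survived", "Embarked"])

# Replace each category by the mean of the target for that category
mean_map = df.groupby("Embarked")["Survived"].mean().to_dict()
df["Embarked_mean_encoded"] = df["Embarked"].map(mean_map)
print(mean_map)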

Advantages of Mean Encoding
  1. It does not expand the feature space
  2. It also creates a monotonic relationship between categories and target
  3. Simple, easy, and straightforward implementation.

Disadvantages of Mean Encoding
  1. Loss of information if two categories have the same mean.
  2. It may lead to overfitting.

6) Probability Ratio Encoding

This technique is suitable for binary classification problems only. For each category, we calculate the probability of the target being 1, p(1); conversely, 1 - p(1) is the probability of the target being 0. Each category is then replaced by the ratio p(1) / p(0).

We will perform this technique on our Titanic dataset as follows:

  • Calculate the probability of survival for each Cabin category
  • Probability of not surviving = 1 - P(survived)
  • Probability ratio = P(survived) / P(not survived)
  • We get a dictionary of probability ratios per Cabin category
  • Map this dictionary onto the Cabin column to convert it to numeric.
💻Code:
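A sketch following the steps above; reducing Cabin to its first letter and treating missing values as their own category is an assumption, since the original preprocessing is not shown.

import pandas as pd

df = pd.read_csv("train.csv", usecols=["Survived", "Cabin"])

# Assumed preprocessing: keep only the first letter of Cabin, missing values as their own group
df["Cabin"] = df["Cabin"].fillna("Missing").str[0]

# Probability of survival per Cabin category
prob_survived = df.groupby("Cabin")["Survived"].mean()

# Probability of not surviving, and the ratio between the two
# (note: the ratio is undefined when P(not survived) is 0 for a category)
prob_not_survived = 1 - prob_survived
ratio_map = (prob_survived / prob_not_survived).to_dict()

# Map the dictionary onto the Cabin column to make it numeric
df["Cabin_ratio"] = df["Cabin"].map(ratio_map)
print(ratio_map)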

Advantages of Probability Ratio Encoding
  1. It is suitable for linear models because it creates a monotonic relationship between the variable and the target.
  2. It does not expand the feature space.
  3. Captures information within the category, therefore creating more predictive features.

Disadvantages of Probability Ratio Encoding
  1. Not suitable when the denominator is 0.
  2. It can lead to overfitting.

7) Weight Of Evidence Encoding

The technique is almost identical to probability ratio encoding; the only difference is that instead of mapping each category directly to the ratio, we take the log of the probability ratio and map the category to that log value. The technique is also applicable to binary classification problems only.

  Mathematical Formula:-    WOE = ln (p(1) / p(0))

💻Code:
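A minimal sketch, using the same assumed Cabin preprocessing as in the previous technique:

import numpy as np
import pandas as pd

df = pd.read_csv("train.csv", usecols=["Survived", "Cabin"])
df["Cabin"] = df["Cabin"].fillna("Missing").str[0]

# WOE = ln(p(1) / p(0)) per category; undefined when p(1) or p(0) is 0 for a category
p1 = df.groupby("Cabin")["Survived"].mean()
p0 = 1 - p1
woe_map = np.log(p1 / p0).to_dict()

df["Cabin_woe"] = df["Cabin"].map(woe_map)
print(woe_map)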

Advantages of Weight Of Evidence
  1. It orders the categories on a log scale, which is most appropriate and natural for Logistic Regression.
  2. We can compare the transformed variables because they are on the same (log) scale, which makes it easy to determine which one is more predictive.

Disadvantages of Weight Of Evidence
  1. It cannot be defined when the denominator is 0

SUMMARY

Handling categorical features in data is a very important task because you will not always get clean or already-encoded data. Whenever we work on business problems, much of the data collected from customers is categorical, so it needs to be handled carefully. Categorical variables can be classified as binary (e.g. gender), ordinal (bad, good, excellent), and nominal (color name, city name). You have to identify the type of each variable and use the appropriate technique to encode it as per the problem statement.

👉 I hope that each technique was easy to grasp, and that you are now able to use these techniques for categorical encoding in your projects. With practice and trial and error, you will be able to identify the right technique for your use case. If you have any queries, feel free to post them in the comment section below. You can also send your feedback, suggestions, or queries through the contact form.

Thank You for your time! 😊
