[Feature Engineering & Python] What is Integer Encoding and One-Hot Encoding when doing Feature Engineering?

邱麗安 Annette Chiu
7 min read · Aug 27, 2018


Photo by TanahAirStudio - Portfolio on graphicriver

When a product manager tells you, "We need to be able to predict whether a particular customer will stay with us; here is a dataset of five years of user interactions with our product," you cannot just grab the data, load it into a library, and get a prediction. You need to deal with the issues a data analyst must consider when working on a machine learning problem: feature engineering, overfitting, and hyperparameter tuning. Some algorithms can work with categorical data directly. For example, a decision tree can be learned directly from categorical data with no data transform required, which may be one reason some data analysts prefer decision tree models: the feature engineering process can be simpler. However, most of the time machine learning algorithms require all input variables and output variables to be numeric.
In a machine learning problem, the first thing we need to do is convert the original training data into inputs that the model can accept, while retaining as much information as possible. This process is usually called feature engineering. The extracted features include discrete features (such as province, gender, etc.) and continuous features (such as weight, price, etc.). For simple models, such as the LR (logistic regression) model, the industry usually discretizes continuous features (making them equivalent to discrete features) in order to increase the non-linear expressive ability of LR. For discrete features, the value is in general just a code that carries no notion of relative size, so as an input to the model it usually needs a further transformation. There are two ways to transform it: one is one-hot encoding, and the other is dummy variables.
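As a minimal sketch of the difference between the two (using a toy "province" column invented here; only pandas is assumed), get_dummies can produce either a full one-hot encoding with one column per category, or dummy variables that drop the first level:

# Python program to demonstrate one-hot encoding vs. dummy variables
import pandas as pd
toy = pd.DataFrame({'province': ['Taipei', 'Tainan', 'Taichung', 'Taipei']})
# One-hot encoding: k categories -> k binary columns
onehot = pd.get_dummies(toy['province'], prefix='province')
# Dummy variables: k categories -> k-1 binary columns (first level dropped)
dummies = pd.get_dummies(toy['province'], prefix='province', drop_first=True)
print(onehot.columns.tolist())   # three columns
print(dummies.columns.tolist())  # two columns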

In this article we discuss what integer encoding and one-hot encoding are, and why and when we use them in machine learning tasks. In many machine learning cases, features are not continuous values but categorical ones. For example: [“male”, “female”], [“from Europe”, “from US”, “from Asia”], [“uses Firefox”, “uses Chrome”, “uses Safari”, “uses Internet Explorer”]

Representing those values with numbers is more efficient, for example:

[“male”, “from US”, “uses Internet Explorer”] -> [0, 1, 3]

[“female”, “from Asia”, “uses Chrome”] -> [1, 2, 1]
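A minimal sketch of this integer encoding with scikit-learn's OrdinalEncoder; the category orderings below are assumptions chosen to reproduce the codes shown above:

# Python program to demonstrate integer encoding
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[
    ['male', 'female'],
    ['from Europe', 'from US', 'from Asia'],
    ['uses Firefox', 'uses Chrome', 'uses Safari', 'uses Internet Explorer']])
X = [['male', 'from US', 'uses Internet Explorer'],
     ['female', 'from Asia', 'uses Chrome']]
print(encoder.fit_transform(X))  # [[0. 1. 3.], [1. 2. 1.]]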

However, after converting to such a numerical representation, the data above cannot be used directly by our classifier, because the estimator will treat the classes as ordered when they are in fact unordered. (Some categories, such as clothes' sizes, do have a natural order; we return to those below.) Suppose your example has a categorical feature “colors” with three possible values: “red”, “yellow”, “green”. We can put this feature into a data frame and name the columns first; later we will convert the features into a format that the learning algorithm can read.

# Python program to demonstrate the example data frame
import pandas as pd
df2 = pd.DataFrame(
    [['Green', 'M', 10.1, 1],
     ['Red', 'L', 13.5, 2],
     ['Blue', 'XL', 13.5, 2],
     ['Red', 'L', 10.0, 1]])
df2.columns = ['Color', 'Size', 'Price', 'ClassLabel']
df2  # print the DataFrame

# Python program to conduct the one-hot encoding of the Color column
onehot_encoding = pd.get_dummies(df2['Color'], prefix='Color')
df2 = df2.drop('Color', axis=1)
onehot_encoding

# Use a size mapping to convert the ordinal Size values to integers
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df2['Size'] = df2['Size'].map(size_mapping)

# Combine the one-hot part with the original part
pd.concat([onehot_encoding, df2], axis=1)
Picture 1: the demo data frame. Picture 2: one-hot encoding of the Color column.
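The same one-hot step can also be done with scikit-learn instead of get_dummies; a sketch, assuming the same four color values as in df2 above:

# Python program to demonstrate scikit-learn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
colors = [['Green'], ['Red'], ['Blue'], ['Red']]
ohe = OneHotEncoder(sparse_output=False)  # older scikit-learn versions use sparse=False
print(ohe.fit_transform(colors))  # one row per sample, one binary column per color
print(ohe.categories_)            # the learned category order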

In this case, the order of the feature's values is not important; using ordered codes would confuse the learning algorithm and potentially lead to overfitting. Therefore, we use concat to combine the original part and the one-hot part into a frame the algorithm can read. We should not simply transform red into 1, yellow into 2, and green into 3 just to avoid increasing the dimensionality, because that would imply an order among the values of this category, and that artificial order would change the model's decisions.

Some categorical data need integer encoding rather than one-hot encoding.

We must be careful, because some features in the data frame should not be transformed into one-hot format. For those, we use integer encoding to convert categorical data to numerical data. For example, the Size column implies an order among its values. Assuming a bigger size should carry a larger number, we map size XL to 3, L to 2, and M to 1 ({'XL': 3, 'L': 2, 'M': 1}) and use the map function to turn each size into a meaningful number. Finally, we use concat to combine the original part and the one-hot part, and the data frame is ready to go.

Picture 3: convert the Size categorical data to numerical data
Picture 4: use concat to combine the original part and the one-hot part

Feature Engineering for a Support Vector Machine

With a basic understanding of feature engineering in place, we will use a classic machine learning project to demonstrate the process.

Introduction

For our project, we decided to use an SVM classifier to learn and predict whether a person's income exceeds $50,000 a year based on various features (such as age, race, and education). To build our model, we first discuss our objective, detail our data preprocessing decisions and procedures, and then mathematically solve our optimization problem before implementing the algorithm in full. With regard to features, we can analyze single features and the relationships between different features. There are two types of features: categorical and numerical.

Numerical: all numbers

Categorical: characteristics or strings

dataset: https://archive.ics.uci.edu/ml/datasets/adult
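As a sketch, one way to load this dataset into the dataset_raw frame used below; the column names follow the UCI adult.names file, while the label column name predclass and its 0/1 mapping are assumptions made here to match the plotting code later:

# Python program to load the UCI adult dataset
import pandas as pd
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'predclass']
dataset_raw = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    names=columns, skipinitialspace=True)
# The raw label is a string ('<=50K' / '>50K'); map it to 0/1 for the plots below
dataset_raw['predclass'] = (dataset_raw['predclass'] == '>50K').astype(int)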

# Summarise the numeric features
dataset_raw.describe()

# Summarise the categorical (object-dtype) features
dataset_raw.describe(include=['O'])
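A quick sketch of how to list which columns fall into each of the two groups, using pandas dtypes:

# Numeric columns vs. object (string) columns
numeric_cols = dataset_raw.select_dtypes(include='number').columns.tolist()
categorical_cols = dataset_raw.select_dtypes(include='object').columns.tolist()
print(numeric_cols)
print(categorical_cols)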

The original dataset has 14 features for each record: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country. Six of these are continuous values: age, fnlwgt, education-num, capital-gain, capital-loss, and hours-per-week; the other eight are discrete. We discretize the continuous features and convert each discrete feature with m distinct values into m binary features. In this article, we will focus on how to deal with a numerical feature like age. How to handle a feature sometimes relies on a conversation between the data science team and people with domain knowledge. The most difficult part of feature engineering is not writing programs but judging how to analyze each feature; there is no standard answer, and it relies on the analyst's industry experience. In this case, we want to analyze the impact of different age groups on salary. The values must be converted, because a larger age number does not mean a higher weight in the analysis. Age is a continuous value, which we split into 10 bins:

# Python program to demonstrate binning the continuous age feature
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset_bin = pd.DataFrame()  # starts empty; holds the discretised features
dataset_con = pd.DataFrame()  # starts empty; holds the continuous features

dataset_bin['age'] = pd.cut(dataset_raw['age'], 10)  # split continuous values into 10 bins
dataset_con['age'] = dataset_raw['age']              # non-discretised

# Left plot: counts per age bin; right plot: age distribution by income level
plt.style.use('seaborn-whitegrid')  # 'seaborn-v0_8-whitegrid' on newer matplotlib
fig = plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
sns.countplot(y="age", data=dataset_bin)
plt.subplot(1, 2, 2)
# sns.distplot is deprecated in newer seaborn; use sns.histplot/kdeplot instead
sns.distplot(dataset_con.loc[dataset_con['predclass'] == 1]['age'], kde_kws={"label": ">$50K"})
sns.distplot(dataset_con.loc[dataset_con['predclass'] == 0]['age'], kde_kws={"label": "<$50K"})
The left plot shows the result of the splitting; the right plot shows the age distribution divided by income level.

Conclusion

Before we feed the training dataset to the model, we must convert the categorical data to a numerical form, because some algorithms cannot work with categorical data directly. When working with statistics, it is important to recognize the different types of data. Most data fall into one of two groups: numerical and categorical. Numerical variables can be divided further into numeric continuous and numeric discrete. For example, a person's age in years is continuous, while the number of people in your family is numeric discrete. Categorical variables can be divided further into nominal variables and ordinal variables. For example, movie categories such as comedy and romantic are nominal variables, while survey answers such as “Not very much” and “They are okay” are ordinal. Most of the time we have to deal with integer and continuous numerical values. Encoding features this way, we will not limit the performance of our machine learning algorithms.
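To recap the two encoding choices with a tiny sketch (the example values come from the paragraph above):

# Nominal vs. ordinal encoding in one small example
import pandas as pd
df = pd.DataFrame({
    'genre': ['comedy', 'romantic', 'comedy'],                        # nominal: no order
    'opinion': ['Not very much', 'They are okay', 'They are okay']})  # ordinal: ordered
genre_onehot = pd.get_dummies(df['genre'], prefix='genre')  # no order implied
opinion_map = {'Not very much': 0, 'They are okay': 1}      # order preserved
df['opinion'] = df['opinion'].map(opinion_map)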

