How to convert categorical text data into numerical data using OneHotEncoder
Machine learning algorithms handle numerical data better than text data. A dataset can contain categorical data in text form, such as gender, food_type, taxonomic_class, etc. To make full use of machine learning algorithms, we have to convert categorical data in text form into numerical form. This can be done using encoders. scikit-learn provides a few types of encoders that convert categorical data into either binary or numerical data. Here we will learn about OneHotEncoder in scikit-learn.
OneHotEncoder converts categorical data into binary data: each category in a dataframe column is converted into a separate column whose value is 1 in the rows where that category is present and 0 elsewhere. For example, if the category of gender in row number 12 of a dataset is 'male', then the column corresponding to the 'male' category created by OneHotEncoder will have 1 in row number 12.
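To make the idea concrete, here is a minimal sketch on a made-up four-row "gender" column. The toy dataframe and the variable names are assumptions for illustration only and are not part of the pokemon dataset used later.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, used only to illustrate the idea
toy = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense for printing
encoded = encoder.fit_transform(toy[["gender"]]).toarray()

# The encoder stores the categories it found, one output column per category
categories = list(encoder.categories_[0])
male_col = categories.index("male")

print(categories)
print(encoded)
```

Each row has a 1 only in the column of its own category, so the rows that contain 'male' have a 1 in the 'male' column and a 0 in the 'female' column.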
We will see an example of how to encode a categorical column with OneHotEncoder using a 'pokemon' dataset. The dataset was read from a comma-separated values (CSV) file, and all rows with missing data were deleted.
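The loading step described above might look roughly like the sketch below. The file name and the full column list of the dataset are not given here, so a few rows are inlined through io.StringIO as a stand-in for the real file; the columns "Name" and "Type 2" are assumptions.

```python
import io
import pandas as pd

# Stand-in for the CSV file: the real file name and column list are not
# given in this article, so a few rows are inlined for illustration.
csv_text = io.StringIO(
    "Name,Type 1,Type 2\n"
    "Bulbasaur,Grass,Poison\n"
    "Charmander,Fire,\n"
    "Charizard,Fire,Flying\n"
)

# With a file on disk, a path string would be passed to read_csv instead
df = pd.read_csv(csv_text)

# Delete every row that has at least one missing value
df = df.dropna()

print(df)
print(df.shape)
```

Note that dropna() keeps the original row labels, so the index of the cleaned dataframe can have gaps. This matters later when the encoded columns are concatenated with the original data.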
We will encode the column "Type 1". For that let's first see how many unique values are present in the column.
print(df['Type 1'].unique())
print(len(df['Type 1'].unique()))
So, there are 18 different categories in the column. To encode the column using OneHotEncoder, we use the following commands.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# the encoder expects 2D input, so reshape the column to shape (n_rows, 1)
encoded_col = encoder.fit_transform(df['Type 1'].values.reshape(-1, 1))
OneHotEncoder transforms the data into a sparse matrix. This can be checked using the type() function. Also, the shape of the matrix shows that the encoder has created one column for each of the 18 categories.
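The check described above can be sketched as follows. A hypothetical four-row sample with three categories stands in for the real "Type 1" column, so the matrix here has 3 columns rather than 18.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical four-row sample standing in for the 'Type 1' column
sample = pd.DataFrame({"Type 1": ["Grass", "Fire", "Water", "Fire"]})

encoder = OneHotEncoder()
encoded_col = encoder.fit_transform(sample["Type 1"].values.reshape(-1, 1))

# The result is a SciPy sparse matrix, not a NumPy array
print(type(encoded_col))
print(sparse.issparse(encoded_col))

# Shape is (number of rows, number of unique categories)
print(encoded_col.shape)
```

A sparse matrix stores only the non-zero entries, which saves memory because every row of a one-hot encoding contains a single 1.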
Now, the next step is to delete the original column and replace it with the new encoded columns. Also, it would be useful if we could name the columns such that the column name represents the original column from which the category has come and the category name itself. For that we will use the encoder.get_feature_names_out() function.
encoder.get_feature_names_out()
array(['x0_Bug', 'x0_Dark', 'x0_Dragon', 'x0_Electric', 'x0_Fairy',
'x0_Fighting', 'x0_Fire', 'x0_Flying', 'x0_Ghost', 'x0_Grass',
'x0_Ground', 'x0_Ice', 'x0_Normal', 'x0_Poison', 'x0_Psychic',
'x0_Rock', 'x0_Steel', 'x0_Water'], dtype=object)
Here, we see that the feature names in the one-hot encoded data represent the different categories arranged in alphabetical order, and each category is prefixed with "x0_". However, we want to prefix the categories with the name of the column in which they were present in the original dataframe, i.e. "Type 1". This is done as follows:
encoder.get_feature_names_out(['Type_1'])
array(['Type_1_Bug', 'Type_1_Dark', 'Type_1_Dragon', 'Type_1_Electric',
'Type_1_Fairy', 'Type_1_Fighting', 'Type_1_Fire', 'Type_1_Flying',
'Type_1_Ghost', 'Type_1_Grass', 'Type_1_Ground', 'Type_1_Ice',
'Type_1_Normal', 'Type_1_Poison', 'Type_1_Psychic', 'Type_1_Rock',
'Type_1_Steel', 'Type_1_Water'], dtype=object)
Note that I have used "_" in place of the space in "Type 1". The above function can be used to add the new columns to the dataframe as follows:
# convert sparse matrix to dense array
encoded_col_dense = encoded_col.toarray()
# get column names
new_cols = encoder.get_feature_names_out(['Type_1'])
# convert the encoded matrix to dataframe
enc_mat_df = pd.DataFrame(encoded_col_dense, columns=new_cols)
# reset the row index so that it lines up with enc_mat_df
# (drop=True discards the old index instead of keeping it as an extra column)
df = df.reset_index(drop=True)
# concatenate this dataframe with original data
df_with_encoding = pd.concat([df, enc_mat_df], axis=1)
In the final encoded dataset, each "Type 1" category gets its own column, and wherever a particular category is present in a row the value is 1.0; otherwise the value is 0.0. Finally, the above steps can be combined into a function that transforms a categorical column in a dataframe using OneHotEncoder.