Fill missing values using SimpleImputer

Real-world data often contains missing values. Sometimes it makes sense to fill them with an appropriate value.

For example, we may want to fill the missing values in a column with the mean of the available values. We can do this by calculating the mean of the column and passing it to the fillna() function. However, if several columns have missing values, we might have to repeat this process for each column or write a loop.
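The manual approach described above can be sketched as follows. The small dataframe `toy` and its column names are hypothetical and are not the dataset used later in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustrating the manual approach
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

# Compute each column's mean and fill that column's missing values with it
filled = toy.copy()
for col in filled.columns:
    col_mean = filled[col].mean()  # mean() skips NaN by default
    filled[col] = filled[col].fillna(col_mean)

print(filled)
```

The loop works, but the means it computes are discarded afterwards, so the same fill cannot easily be reapplied to other data.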

Scikit-learn offers a class called SimpleImputer that makes it easy to fill missing values.

Let’s first create an example dataset to see how to use the SimpleImputer.

import numpy as np
import pandas as pd
# create three columns of normally distributed values
# and set one entry in each column to NaN
col1 = np.random.normal(loc=10, scale=2, size=10)
col1[3] = np.nan
col2 = np.random.normal(loc=20, scale=3, size=10)
col2[7] = np.nan
col3 = np.random.normal(loc=40, scale=4, size=10)
col3[6] = np.nan

# assemble the columns into a dataframe
df = pd.DataFrame({'col1': col1, 'col2': col2, 'col3': col3})
df

        col1       col2       col3
0  11.087917  11.001978  46.698114
1   7.356664  24.681053  38.251831
2  10.399026  18.130117  39.443336
3        NaN  18.529337  37.623165
4   9.298997  21.411288  46.382642
5  10.807994  21.458481  41.515546
6   9.883776  20.749925        NaN
7   7.527174        NaN  31.557465
8   8.492642  16.355656  42.161619
9  10.667234  18.704572  43.391324

The data contains three columns and each column has one missing value. We will fill these missing values with the mean of the available values in the column.
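A quick way to confirm how many values are missing in each column is to count them with isna(). The sketch below rebuilds a similar dataset under a hypothetical name, `check_df`, with an arbitrarily chosen seed, so its numbers will differ from the output shown above.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily so the sketch is reproducible
rng = np.random.default_rng(0)
check_df = pd.DataFrame({
    "col1": rng.normal(loc=10, scale=2, size=10),
    "col2": rng.normal(loc=20, scale=3, size=10),
    "col3": rng.normal(loc=40, scale=4, size=10),
})

# Knock out one value per column, at the same positions as in the post
check_df.loc[3, "col1"] = np.nan
check_df.loc[7, "col2"] = np.nan
check_df.loc[6, "col3"] = np.nan

# isna() marks missing cells; sum() counts them per column
missing_per_column = check_df.isna().sum()
print(missing_per_column)
```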

To fill the missing values, we import the SimpleImputer class from scikit-learn and create an instance, specifying that missing values should be filled with the mean of each column. Calling fit() on the dataframe then computes those means.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'mean')

imputer.fit(df)
SimpleImputer()

At this point, the imputer has calculated the mean of every column and stored the values in its statistics_ attribute.

imputer.statistics_
array([ 9.50238043, 19.00248959, 40.78056004])
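As a sanity check, the stored statistics should match the column means that pandas computes, since both ignore NaN. A minimal sketch, using a hypothetical dataframe `demo` rather than the data above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 6.0, np.nan]})

# fit() returns the fitted imputer, so the calls can be chained
demo_imputer = SimpleImputer(strategy="mean").fit(demo)

# statistics_ holds one value per column, in column order
print(demo_imputer.statistics_)

# Compare against the NaN-skipping column means from pandas
print(np.allclose(demo_imputer.statistics_, demo.mean().to_numpy()))
```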

To transform the data, we call the imputer.transform() method with the dataframe as its argument.

This operation returns a NumPy array.

ndata = imputer.transform(df)
print(type(ndata))
ndata
<class 'numpy.ndarray'>

array([[11.08791688, 11.00197836, 46.69811352],
       [ 7.35666354, 24.68105263, 38.25183079],
       [10.39902622, 18.13011651, 39.4433365 ],
       [ 9.50238043, 18.52933739, 37.62316472],
       [ 9.29899722, 21.4112885 , 46.38264182],
       [10.80799406, 21.45848068, 41.51554579],
       [ 9.88377643, 20.74992457, 40.78056004],
       [ 7.52717371, 19.00248959, 31.55746493],
       [ 8.4926418 , 16.35565612, 42.16161851],
       [10.66723403, 18.70457156, 43.39132382]])
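Because fitting and transforming are separate steps, a fitted imputer can also fill missing values in other data that has the same columns, using the means it learned earlier. The sketch below illustrates this with two hypothetical dataframes, `train` and `new`; neither appears in the original walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: the means are learned from `train` only
train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, np.nan, 30.0]})
new = pd.DataFrame({"x": [np.nan, 5.0], "y": [7.0, np.nan]})

imp = SimpleImputer(strategy="mean").fit(train)

# Missing values in `new` are filled with the means of `train`
# (x: 2.0, y: 20.0), not with the means of `new` itself
new_filled = imp.transform(new)
print(new_filled)
```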

We now create a new dataframe from the ndata array, with the same columns and index as our original dataframe.

df_na_filled = pd.DataFrame(ndata,
                           columns=df.columns,
                           index=df.index)
df_na_filled

        col1       col2       col3
0  11.087917  11.001978  46.698114
1   7.356664  24.681053  38.251831
2  10.399026  18.130117  39.443336
3   9.502380  18.529337  37.623165
4   9.298997  21.411288  46.382642
5  10.807994  21.458481  41.515546
6   9.883776  20.749925  40.780560
7   7.527174  19.002490  31.557465
8   8.492642  16.355656  42.161619
9  10.667234  18.704572  43.391324

We see that the missing values have been replaced with the mean of the corresponding column.
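A condensed version of the steps in this post might look like the sketch below. It is a reconstruction rather than the original listing: it assumes a seeded random generator for reproducibility and uses fit_transform(), which performs the fit and transform steps in a single call.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Build the example data (seed chosen arbitrarily)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "col1": rng.normal(loc=10, scale=2, size=10),
    "col2": rng.normal(loc=20, scale=3, size=10),
    "col3": rng.normal(loc=40, scale=4, size=10),
})
df.loc[3, "col1"] = np.nan
df.loc[7, "col2"] = np.nan
df.loc[6, "col3"] = np.nan

# Learn the column means and fill the missing values in one call
imputer = SimpleImputer(strategy="mean")
ndata = imputer.fit_transform(df)

# Wrap the NumPy array back into a dataframe with the original labels
df_na_filled = pd.DataFrame(ndata, columns=df.columns, index=df.index)
print(df_na_filled)
```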
