Fill missing values using SimpleImputer

Data often contains missing values. Sometimes it makes sense to fill them with an appropriate value.

For example, we may want to fill the missing values in a column with the mean of the available values. We can do this by calculating the mean of the column and passing it to the fillna() function. However, if several columns have missing values, we have to repeat this process for each column or write a loop.
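The manual approach described above can be sketched as follows. The frame `toy` and its column names are hypothetical and are used only for illustration; they are not the dataset built later in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

# Fill each column's NaN with that column's mean, one column at a time.
# Series.mean() skips NaN by default, so only available values are averaged.
for col in toy.columns:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```

This works, but the loop has to be written and maintained by hand, which is the repetition that SimpleImputer removes.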

Scikit-learn offers a class called SimpleImputer that makes it easy to fill missing values.

Let’s first create an example dataset to see how to use the SimpleImputer.

import numpy as np
import pandas as pd

# create data
col1 = np.random.normal(loc=10, scale=2, size=10)
col1[3] = np.nan
col2 = np.random.normal(loc=20, scale=3, size=10)
col2[7] = np.nan
col3 = np.random.normal(loc=40, scale=4, size=10)
col3[6] = np.nan

df = pd.DataFrame()
df['col1'] = col1
df['col2'] = col2
df['col3'] = col3
df

	col1	col2	col3
0	11.087917	11.001978	46.698114
1	7.356664	24.681053	38.251831
2	10.399026	18.130117	39.443336
3	NaN	18.529337	37.623165
4	9.298997	21.411288	46.382642
5	10.807994	21.458481	41.515546
6	9.883776	20.749925	NaN
7	7.527174	NaN	31.557465
8	8.492642	16.355656	42.161619
9	10.667234	18.704572	43.391324

The data contains three columns and each column has one missing value. We will fill these missing values with the mean of the available values in the column.

For this, we import the SimpleImputer class from Scikit-learn and create an instance of it, specifying that the missing values in each column should be filled with the mean of that column.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')

imputer.fit(df)

SimpleImputer()

At this point, the imputer has calculated the mean of each column and stored the values in its statistics_ attribute.

imputer.statistics_

array([ 9.50238043, 19.00248959, 40.78056004])
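As a quick sanity check, the stored statistics should agree with the column means that pandas computes, since both ignore NaN values. The sketch below uses a small hypothetical frame `toy` rather than the random data above, so that the expected values are easy to verify by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; the column means are 2.0 and 15.0.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

imputer = SimpleImputer(strategy="mean").fit(toy)

# DataFrame.mean() skips NaN by default, like the imputer does.
matches = np.allclose(imputer.statistics_, toy.mean().to_numpy())
print(imputer.statistics_, matches)
```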

To transform the data, we call the imputer.transform() method with the dataframe as its argument.

This operation returns a NumPy array.

ndata = imputer.transform(df)
print(type(ndata))
ndata

<class 'numpy.ndarray'>

array([[11.08791688, 11.00197836, 46.69811352],
       [ 7.35666354, 24.68105263, 38.25183079],
       [10.39902622, 18.13011651, 39.4433365 ],
       [ 9.50238043, 18.52933739, 37.62316472],
       [ 9.29899722, 21.4112885 , 46.38264182],
       [10.80799406, 21.45848068, 41.51554579],
       [ 9.88377643, 20.74992457, 40.78056004],
       [ 7.52717371, 19.00248959, 31.55746493],
       [ 8.4926418 , 16.35565612, 42.16161851],
       [10.66723403, 18.70457156, 43.39132382]])

We now create a new dataframe with the same columns and index as our original dataframe.

df_na_filled = pd.DataFrame(ndata,
                           columns=df.columns,
                           index=df.index)
df_na_filled

	col1	col2	col3
0	11.087917	11.001978	46.698114
1	7.356664	24.681053	38.251831
2	10.399026	18.130117	39.443336
3	9.502380	18.529337	37.623165
4	9.298997	21.411288	46.382642
5	10.807994	21.458481	41.515546
6	9.883776	20.749925	40.780560
7	7.527174	19.002490	31.557465
8	8.492642	16.355656	42.161619
9	10.667234	18.704572	43.391324

We see that the missing values are replaced with the mean values. The effective code can be written as follows:
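A minimal sketch of the steps above, collected in one place. The helper name `fill_with_column_means` and the frame `toy` are hypothetical, and the sketch uses fit_transform() to combine the separate fit() and transform() calls shown earlier.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def fill_with_column_means(frame):
    """Return a new dataframe with NaN replaced by column means."""
    imputer = SimpleImputer(strategy="mean")
    # fit_transform() learns the column means and returns a NumPy array.
    filled = imputer.fit_transform(frame)
    # Restore the original column names and index.
    return pd.DataFrame(filled, columns=frame.columns, index=frame.index)


# Hypothetical toy frame, used only to demonstrate the helper.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})
toy_filled = fill_with_column_means(toy)
print(toy_filled)
```

The sketch assumes that every column has at least one non-missing value. By default, SimpleImputer drops columns that are entirely missing, in which case the column names would no longer line up with the returned array.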
