Fill missing values using SimpleImputer
Data often contains missing values. Sometimes it makes sense to fill them with an appropriate value, for example the mean of the available values in the column. With pandas, we can do this by calculating the mean of the column and passing it to the `fillna()` function. However, if several columns have missing values, we may have to repeat this process several times or write a loop.
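The pandas-only approach described above can be sketched as follows. This is a minimal example on a small made-up frame; the names `toy`, `toy_filled` and `toy_filled_short` are illustrative and not part of the dataset used in this post. It shows the column-by-column loop, and also that `fillna()` accepts a Series of per-column values, which gives the same result in one call.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

# Column by column: compute each mean and fill that column's gaps
toy_filled = toy.copy()
for col in toy_filled.columns:
    toy_filled[col] = toy_filled[col].fillna(toy_filled[col].mean())

# Same result in one call: fillna() also accepts a Series of per-column values
toy_filled_short = toy.fillna(toy.mean())

print(toy_filled)
```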
Scikit-learn offers a class called `SimpleImputer` that makes it easy to fill missing values.
Let’s first create an example dataset to see how to use `SimpleImputer`.
import numpy as np
import pandas as pd

# create three columns of random data, each with one missing value
col1 = np.random.normal(loc=10, scale=2, size=10)
col1[3] = np.nan
col2 = np.random.normal(loc=20, scale=3, size=10)
col2[7] = np.nan
col3 = np.random.normal(loc=40, scale=4, size=10)
col3[6] = np.nan

# assemble the columns into a dataframe
df = pd.DataFrame()
df['col1'] = col1
df['col2'] = col2
df['col3'] = col3
df
| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 11.087917 | 11.001978 | 46.698114 |
| 1 | 7.356664 | 24.681053 | 38.251831 |
| 2 | 10.399026 | 18.130117 | 39.443336 |
| 3 | NaN | 18.529337 | 37.623165 |
| 4 | 9.298997 | 21.411288 | 46.382642 |
| 5 | 10.807994 | 21.458481 | 41.515546 |
| 6 | 9.883776 | 20.749925 | NaN |
| 7 | 7.527174 | NaN | 31.557465 |
| 8 | 8.492642 | 16.355656 | 42.161619 |
| 9 | 10.667234 | 18.704572 | 43.391324 |
The data contains three columns, each with one missing value. We will fill these missing values with the mean of the available values in the same column.
To do this, we import the `SimpleImputer` class from scikit-learn and create an instance, specifying that missing values should be filled with the column mean.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
imputer.fit(df)
SimpleImputer()
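The `strategy` parameter accepts values other than `'mean'`. The sketch below compares `'median'`, `'most_frequent'` and `'constant'` on a small made-up array; `fill_value` only applies to the `'constant'` strategy. The array `X` and the result names are illustrative and not part of the example dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative single-column data with one missing entry (hypothetical values)
X = np.array([[1.0], [1.0], [3.0], [7.0], [np.nan]])

# Each imputer learns its fill value from the non-missing entries of X
median_filled = SimpleImputer(strategy="median").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)

# 'constant' ignores the data and always uses fill_value
const_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

# The last row held the missing entry; print what each strategy put there
print(median_filled[-1, 0], mode_filled[-1, 0], const_filled[-1, 0])
```

Which strategy is appropriate depends on the data: the median is less sensitive to outliers than the mean, while `'most_frequent'` and `'constant'` also work for categorical columns.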
At this point the imputer has calculated the mean of each column and stored the values in its `statistics_` attribute.
imputer.statistics_
array([ 9.50238043, 19.00248959, 40.78056004])
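Because the means are stored on the fitted object, the same imputer can later fill gaps in data it did not see during fitting. Below is a minimal sketch on made-up frames; `train`, `new`, `imp` and `new_filled` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: `train` is used for fitting, `new` arrives later
train = pd.DataFrame({"a": [1.0, 3.0, np.nan], "b": [10.0, np.nan, 30.0]})
new = pd.DataFrame({"a": [np.nan], "b": [np.nan]})

imp = SimpleImputer(strategy="mean").fit(train)

# statistics_ holds one value per column, here the column means of `train`
assert np.allclose(imp.statistics_, train.mean().to_numpy())

# Missing values in `new` are filled with the means learned from `train`
new_filled = imp.transform(new)
print(new_filled)
```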
To transform the data, we call the `imputer.transform()` method with the dataframe as its argument.
This operation returns a NumPy array.
ndata = imputer.transform(df)
print(type(ndata))
ndata
<class 'numpy.ndarray'>
array([[11.08791688, 11.00197836, 46.69811352],
[ 7.35666354, 24.68105263, 38.25183079],
[10.39902622, 18.13011651, 39.4433365 ],
[ 9.50238043, 18.52933739, 37.62316472],
[ 9.29899722, 21.4112885 , 46.38264182],
[10.80799406, 21.45848068, 41.51554579],
[ 9.88377643, 20.74992457, 40.78056004],
[ 7.52717371, 19.00248959, 31.55746493],
[ 8.4926418 , 16.35565612, 42.16161851],
[10.66723403, 18.70457156, 43.39132382]])
We now create a new dataframe with the same columns and index as our original dataframe.
df_na_filled = pd.DataFrame(ndata,
                            columns=df.columns,
                            index=df.index)
df_na_filled
| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 11.087917 | 11.001978 | 46.698114 |
| 1 | 7.356664 | 24.681053 | 38.251831 |
| 2 | 10.399026 | 18.130117 | 39.443336 |
| 3 | 9.502380 | 18.529337 | 37.623165 |
| 4 | 9.298997 | 21.411288 | 46.382642 |
| 5 | 10.807994 | 21.458481 | 41.515546 |
| 6 | 9.883776 | 20.749925 | 40.780560 |
| 7 | 7.527174 | 19.002490 | 31.557465 |
| 8 | 8.492642 | 16.355656 | 42.161619 |
| 9 | 10.667234 | 18.704572 | 43.391324 |
We see that the missing values have been replaced with the column means. The complete code can be written as follows: