Mean imputation replaces missing values with the mean value of that feature/variable. Mean imputation is one of the most 'naive' imputation methods because unlike more complex methods like k-nearest neighbors imputation, it does not use the information we have about an observation to estimate a value for it.
import pandas as pd import numpy as np from sklearn.preprocessing import Imputer
# Create an empty dataset df = pd.DataFrame() # Create two variables called x0 and x1. Make the first value of x1 a missing value df['x0'] = [0.3051,0.4949,0.6974,0.3769,0.2231,0.341,0.4436,0.5897,0.6308,0.5] df['x1'] = [np.nan,0.2654,0.2615,0.5846,0.4615,0.8308,0.4962,0.3269,0.5346,0.6731] # View the dataset df
# Create an imputer object that looks for 'Nan' values, then replaces them with the mean value of the feature by columns (axis=0) mean_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0) # Train the imputor on the df dataset mean_imputer = mean_imputer.fit(df)
# Apply the imputer to the df dataset imputed_df = mean_imputer.transform(df.values)
# View the data imputed_df
array([[ 0.3051 , 0.49273333], [ 0.4949 , 0.2654 ], [ 0.6974 , 0.2615 ], [ 0.3769 , 0.5846 ], [ 0.2231 , 0.4615 ], [ 0.341 , 0.8308 ], [ 0.4436 , 0.4962 ], [ 0.5897 , 0.3269 ], [ 0.6308 , 0.5346 ], [ 0.5 , 0.6731 ]])
0.49273333 is the imputed value, replacing the