Impute Missing Values Using K-Nearest Neighbors

Nearest neighbor imputations which weights samples using the mean squared difference on features for which two rows both have observed data. In this example, we use 3 nearest rows which have a feature to fill in each row's missing features.

Preliminaries

import pandas as pd
import numpy as np
from fancyimpute import KNN
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  
Using Theano backend.

Create Data

df = pd.DataFrame()

df['x0'] = [0.3051,0.4949,0.6974,0.3769,0.2231,0.341,0.4436,0.5897,0.6308,0.5]
df['x1'] = [np.nan,0.2654,0.2615,0.5846,0.4615,0.8308,0.4962,0.3269,0.5346,0.6731]


df.head()
x0 x1
0 0.3051 NaN
1 0.4949 0.2654
2 0.6974 0.2615
3 0.3769 0.5846
4 0.2231 0.4615

View The Raw Data

# Create data, with the first observation containing a missing value
X = df.as_matrix(columns=['x0', 'x1'])

# View data
X
array([[ 0.3051,     nan],
       [ 0.4949,  0.2654],
       [ 0.6974,  0.2615],
       [ 0.3769,  0.5846],
       [ 0.2231,  0.4615],
       [ 0.341 ,  0.8308],
       [ 0.4436,  0.4962],
       [ 0.5897,  0.3269],
       [ 0.6308,  0.5346],
       [ 0.5   ,  0.6731]])
# Plot data
plt.scatter(X[:,0],X[:,1])
<matplotlib.collections.PathCollection at 0x1191cf780>

png

Impute Using K-Nearest Neighbors

# Imput missing values from three closest observations
X_imputed = KNN(k=3).complete(X)

# View new data
X_imputed
Computing pairwise distances between 10 samples
Computing distances for sample #1/10, elapsed time: 0.000
Imputing row 1/10 with 1 missing columns, elapsed time: 0.002





array([[ 0.3051    ,  0.73900744],
       [ 0.4949    ,  0.2654    ],
       [ 0.6974    ,  0.2615    ],
       [ 0.3769    ,  0.5846    ],
       [ 0.2231    ,  0.4615    ],
       [ 0.341     ,  0.8308    ],
       [ 0.4436    ,  0.4962    ],
       [ 0.5897    ,  0.3269    ],
       [ 0.6308    ,  0.5346    ],
       [ 0.5       ,  0.6731    ]])

Notice that the first observation previously contained a missing value. However, now it contains a new value: 0.73900744

View Imported Value

n=['  Imputed X1','','','','','','','','','']

fig, ax = plt.subplots()
ax.scatter(X_imputed[:,0], X_imputed[:,1])

for i, txt in enumerate(n):
    ax.annotate(txt, (X_imputed[:,0][i],X_imputed[:,1][i]))

png