Find Best Preprocessing Steps During Model Selection

We have to be careful to properly handle preprocessing when conducting model selection. First, GridSearchCV uses cross-validation to determine which model has the highest performance. However, in cross-validation we are in effect pretending that the fold held out as the test set is not seen, and thus not part of fitting any preprocessing steps (e.g. scaling or standardization). For this reason, we cannot preprocess the data then run GridSearchCV.

Second, some preprocessing methods have their own parameter which often have to be supplied by the user. By including candidate component values in the search space, they are treated like any other hyperparameter be to searched over.


# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Set random seed

Load Iris Dataset

# Load data
iris = datasets.load_iris()
X =
y =

Create Proprocessing Object

We are include two different preprocessing steps: principal component analysis and a k-best feature selection.

# Create a combined preprocessing object
preprocess = FeatureUnion([('pca', PCA()), ("kbest", SelectKBest(k=1))])

Create Pipeline

# Create a pipeline
pipe = Pipeline([('preprocess', preprocess), ('classifier', LogisticRegression())])

Create Search Space Of Hyperparameter Values

# Create space of candidate values
search_space = [{'preprocess__pca__n_components': [1, 2, 3],
                 'classifier__penalty': ['l1', 'l2'],
                 'classifier__C': np.logspace(0, 4, 10)}]
# Create grid search 
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)
# Fit grid search
best_model =, y)

View Best Model’s Hyperparamters

# View best hyperparameters
print('Best Number Of Princpal Components:', best_model.best_estimator_.get_params()['preprocess__pca__n_components'])
print('Best Penalty:', best_model.best_estimator_.get_params()['classifier__penalty'])
print('Best C:', best_model.best_estimator_.get_params()['classifier__C'])
Best Number Of Princpal Components: 3
Best Penalty: l1
Best C: 59.9484250319