Setting up a simple GridSearchCV in Scikit-learn
Let's say we have training dataset (train_X)
and test dataset (train_y) and we want to set up a grid search for
training the k-neighborhood search model for optimizing the best value
for number of neighbors (k). We also want to divide the training dataset
into training and validation sets.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
Define the model
model = KNeighborsClassifier()
Define which values of k we want to test
ks = list(range(1,11))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Define grid parameters
grid_params = dict(n_neighbors=ks)
Define GridSearchCV
grid = GridSearchCV(model, grid_params, cv=10, scoring='accuracy',
return_train_score=False,verbose=1)
The parameter cv=10 divides the training datasets into ten parts. For first iteration of cross-validation, nine parts joined are taken as training set and one part is used as validation set and model performance is tested on the validation set.. For each iteration of cross-validation, one part of the divided data is taken as validation set.
Finally we train the model by following command:
grid_search = grid.fit(train_X, train_y)
This will test all the values of k on all cross-validation sets.