
Applying Scikit-learn's kNN Algorithm to Fresh Data

Data Science Asked by J1J1P13 on May 7, 2021

While I was studying Scikit-learn's kNN algorithm, I realized that if I use sklearn.model_selection.train_test_split, the provided data gets automatically split into a training set and a test set, according to the proportions passed as parameters.

Then, based on the training data, the algorithm looks at the k training points nearest to each test point to decide which class that test point belongs to.

I was wondering whether there is a way to predict the class NOT for the test set, which was already part of the provided data, but for brand-new data that was never seen during the whole process.

Is there a way to do that using scikit-learn?

2 Answers

kNN is not fitted to "the k-nearest neighbor points closest to the test data points". You fit the model explicitly on whatever data you choose:

from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)

Usually X, y will be xtrain, ytrain; you then test the model's performance on "new" (unseen) data and compare the true targets to the predictions.

neigh.predict(xtest)

or

neigh.predict_proba(xtest)

See docs: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
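Putting the pieces above together, here is a minimal end-to-end sketch. The iris dataset, the 25% test split, and the new_point values are illustrative choices, not part of the answer; the point is that predict works on any array with the right number of features, whether it came from train_test_split or not.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset and split proportions
X, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.25, random_state=0
)

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(xtrain, ytrain)          # fit only on the training split

# Evaluate on the held-out test split ...
print(neigh.score(xtest, ytest))

# ... and predict on brand-new data that was never part of the original set.
new_point = [[5.1, 3.5, 1.4, 0.2]]  # must have the same number of features as X
print(neigh.predict(new_point))
print(neigh.predict_proba(new_point))
```

The model does not care where the rows in predict come from; they only have to share the feature layout (number and order of columns) of the data used in fit.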

Correct answer by Peter on May 7, 2021

After your initial validation of the model using a train-test split, if you are satisfied with the performance, you can create a final model by training on the entire dataset. That way you put all available labeled data to use when running inference on brand-new data.

You would simply run:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X, y)

where X, y represent your entire labeled dataset.
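A short sketch of this final-model step, again using the iris dataset as an illustrative stand-in for "your entire labeled dataset" (the fresh sample values are likewise made up):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# All available labeled data (iris is just a placeholder here)
X, y = load_iris(return_X_y=True)

model = KNeighborsClassifier()   # default n_neighbors=5
model.fit(X, y)                  # train on everything

# Inference on a sample the model has never seen
fresh = [[6.0, 2.9, 4.5, 1.5]]
print(model.predict(fresh))
```

Note that with this approach there is no held-out data left to estimate performance; that estimate comes from the earlier train-test-split validation.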

Answered by Jayaram Iyer on May 7, 2021
