
Applying Scikit-learn's kNN Algorithm to Fresh Data

Data Science Asked by J1J1P13 on May 7, 2021

While I was studying Scikit-learn's kNN algorithm, I realized that if I use sklearn.model_selection.train_test_split, the provided data gets automatically split into a training set and a test set, according to the proportions passed as parameters.

Then, based on the training data, the algorithm looks at the k training points nearest to each test point to decide which class that test point belongs to.

I was wondering whether there is a way to predict the class NOT for the test set, which was already part of the provided data, but for brand-new data that was never seen during the whole process.

Is there a way to do that using scikit-learn?

2 Answers

kNN is not fitted to "the k-nearest neighbor points closest to the test data points". You fit the model explicitly on whatever data you choose:

from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)

Usually X, y will be xtrain, ytrain; you then test the model's performance on "new" (unseen) data and compare the true targets to the predictions.

neigh.predict(xtest)

or

neigh.predict_proba(xtest)

See docs: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
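Putting the pieces above together, here is a minimal end-to-end sketch. The iris dataset, the 25% test split, and the new_point values are illustrative choices, not part of the answer; the point is that predict works on any array with the right number of features, whether it came from train_test_split or not.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset and split proportions
X, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.25, random_state=0
)

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(xtrain, ytrain)          # fit only on the training split

# Evaluate on the held-out test split ...
print(neigh.score(xtest, ytest))

# ... and predict on brand-new data that was never part of the original set.
new_point = [[5.1, 3.5, 1.4, 0.2]]  # must have the same number of features as X
print(neigh.predict(new_point))
print(neigh.predict_proba(new_point))
```

The model does not care where the rows in predict come from; they only have to share the feature layout (number and order of columns) of the data used in fit.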

Correct answer by Peter on May 7, 2021

After your initial validation of the model using a train-test split, if you are satisfied with the performance, you can create a final model by training on the entire dataset. That way you put all available labeled data to use when running inference on brand-new data.

You would simply run:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X, y)

where X, y represent your entire labeled dataset.
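A short sketch of this final-model step, again using the iris dataset as an illustrative stand-in for "your entire labeled dataset" (the fresh sample values are likewise made up):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# All available labeled data (iris is just a placeholder here)
X, y = load_iris(return_X_y=True)

model = KNeighborsClassifier()   # default n_neighbors=5
model.fit(X, y)                  # train on everything

# Inference on a sample the model has never seen
fresh = [[6.0, 2.9, 4.5, 1.5]]
print(model.predict(fresh))
```

Note that with this approach there is no held-out data left to estimate performance; that estimate comes from the earlier train-test-split validation.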

Answered by Jayaram Iyer on May 7, 2021
