
Best way to remove useless features (non-English words) when there are more than 100,000 features?

Data Science, asked on May 6, 2021

I am in a situation where I have more than 100,000 features, and I need to select the top features to feed to my final neural network model.

So far I have been using RandomForestClassifier in sklearn: I first call fit, then use feature_importances_ to select the top n features. (I also normalize the data with StandardScaler's fit/transform.)
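
For concreteness, a minimal sketch of the pipeline described above, assuming a dense BOW count matrix X and a label vector y (the dummy data is only illustrative):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import StandardScaler

    # X: (n_samples, n_features) BOW count matrix, y: sentence labels
    # (dummy data here; replace with the real corpus features)
    rng = np.random.default_rng(0)
    X = rng.poisson(0.1, size=(500, 1000)).astype(float)
    y = rng.integers(0, 2, size=500)

    # normalize (for a scipy sparse matrix, use StandardScaler(with_mean=False))
    X_scaled = StandardScaler().fit_transform(X)

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    rf.fit(X_scaled, y)

    # indices of the n most important features by impurity-based importance
    n = 200
    top_idx = np.argsort(rf.feature_importances_)[::-1][:n]
    X_top = X_scaled[:, top_idx]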

Now I have two questions:

  1. Am I approaching this right? Is this a correct way to remove useless features?

  2. Is there a better way to select, for example, the top 200 features to give to my final model when there are more than 100,000? My task is sentence classification, and these features are actually BOW features of non-English words, so essentially I want to learn the most important "words" in my corpus for classifying sentences. (One common alternative is sketched below this list.)
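
As an illustration of one standard filter method for BOW features (not necessarily the best one), sklearn's SelectKBest with the chi-squared statistic scores each word against the class labels; a minimal sketch, reusing X and y from the sketch above:

    from sklearn.feature_selection import SelectKBest, chi2

    # chi2 requires non-negative features, so apply it to the raw BOW counts
    # (before any centering or scaling)
    selector = SelectKBest(chi2, k=200)
    X_top = selector.fit_transform(X, y)          # keeps the 200 best-scoring words
    top_idx = selector.get_support(indices=True)  # column indices of the kept words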

My final model is a neural network, but because of performance issues I cannot feed it all the features and let it decide which ones matter; I need to filter the features first and then pass them to the network. Also, my final model is written in PyTorch, so currently I use sklearn to select the top n features and then PyTorch to train the final model. If there is an easier approach for this, please let me know.
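
A minimal sketch of that sklearn-to-PyTorch handoff, assuming the dense X_top and y from the sketches above (the network architecture and training loop here are placeholders, not the actual model):

    import torch
    from torch import nn

    # convert the selected features from the sklearn step to tensors
    X_t = torch.tensor(X_top, dtype=torch.float32)
    y_t = torch.tensor(y, dtype=torch.long)
    n_classes = int(y_t.max()) + 1

    # illustrative architecture; the real model would replace this
    model = nn.Sequential(
        nn.Linear(X_t.shape[1], 64),
        nn.ReLU(),
        nn.Linear(64, n_classes),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(X_t), y_t)
        loss.backward()
        optimizer.step()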
