
Best way to remove useless features (non-English words) when there are more than 100,000 features?

Data Science, asked on May 6, 2021

I am in a situation where I have more than 100,000 features, and I need to select the top features to feed to my final neural network model.

So far I have been using RandomForestClassifier in sklearn: I first call fit, then use feature_importances_ to select the top n features. (I also normalize the data with StandardScaler's fit/transform.)
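
For concreteness, a minimal sketch of the pipeline described above, assuming a dense BOW count matrix X and a label vector y (the dummy data is only illustrative):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import StandardScaler

    # X: (n_samples, n_features) BOW count matrix, y: sentence labels
    # (dummy data here; replace with the real corpus features)
    rng = np.random.default_rng(0)
    X = rng.poisson(0.1, size=(500, 1000)).astype(float)
    y = rng.integers(0, 2, size=500)

    # normalize (for a scipy sparse matrix, use StandardScaler(with_mean=False))
    X_scaled = StandardScaler().fit_transform(X)

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    rf.fit(X_scaled, y)

    # indices of the n most important features by impurity-based importance
    n = 200
    top_idx = np.argsort(rf.feature_importances_)[::-1][:n]
    X_top = X_scaled[:, top_idx]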

Now I have two questions:

  1. Am I approaching this right? Is this a correct way to remove useless features?

  2. Is there a better way to select, for example, the top 200 features to give to my final model when there are more than 100,000? My task is sentence classification, and these features are actually BOW features of non-English words, so essentially I want to learn the most important "words" in my corpus for classifying sentences. (One common alternative is sketched below this list.)
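
As an illustration of one standard filter method for BOW features (not necessarily the best one), sklearn's SelectKBest with the chi-squared statistic scores each word against the class labels; a minimal sketch, reusing X and y from the sketch above:

    from sklearn.feature_selection import SelectKBest, chi2

    # chi2 requires non-negative features, so apply it to the raw BOW counts
    # (before any centering or scaling)
    selector = SelectKBest(chi2, k=200)
    X_top = selector.fit_transform(X, y)          # keeps the 200 best-scoring words
    top_idx = selector.get_support(indices=True)  # column indices of the kept words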

My final model is a neural network, but because of performance issues I cannot feed it all the features and let it decide which ones matter; I need to filter the features first and then pass them to the network. Also, my final model is written in PyTorch, so currently I use sklearn to select the top n features and then PyTorch to train the final model. If there is an easier approach for this, please let me know.
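
A minimal sketch of that sklearn-to-PyTorch handoff, assuming the dense X_top and y from the sketches above (the network architecture and training loop here are placeholders, not the actual model):

    import torch
    from torch import nn

    # convert the selected features from the sklearn step to tensors
    X_t = torch.tensor(X_top, dtype=torch.float32)
    y_t = torch.tensor(y, dtype=torch.long)
    n_classes = int(y_t.max()) + 1

    # illustrative architecture; the real model would replace this
    model = nn.Sequential(
        nn.Linear(X_t.shape[1], 64),
        nn.ReLU(),
        nn.Linear(64, n_classes),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(X_t), y_t)
        loss.backward()
        optimizer.step()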
