Parameter Tuning for Random Forest Text Classifier

Asked on Cross Validated, November 26, 2021

I am training a binary random forest classifier on scikit-learn's 20 newsgroups dataset. I want to tune the parameters and do so via grid search with 3-fold cross-validation on the training data. Is there any problem with that methodology? For the max_depth parameter I get the very high value of 500, which seems like too much. Any advice?
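For concreteness, my setup looks roughly like this (the category pair and the parameter grid below are illustrative placeholders, not my exact values):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Pick two categories to make the problem binary (illustrative choice).
categories = ["rec.autos", "sci.space"]
train = fetch_20newsgroups(subset="train", categories=categories)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Illustrative grid; max_depth=None means trees grow until pure.
param_grid = {
    "rf__max_depth": [10, 50, 100, 500, None],
    "rf__n_estimators": [100, 300],
}

# Grid search with 3-fold cross-validation on the training data.
search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
search.fit(train.data, train.target)
print(search.best_params_)
```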

2 Answers

One benefit of bagging (random forest is essentially a variation of bagging) is that it allows for more complex classifiers in the "bag", which would normally risk overfitting and increased generalization error. So the depth of your trees should not be an issue; bagging helps stabilize the classifier estimates. This is why the random forest approach performs so well in practice. In addition, Breiman showed that a random forest will not overfit as more trees are grown, which is an important result [Breiman2001].
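As an illustrative sketch of that last point (not from Breiman's paper itself, and using synthetic data), you can grow the forest incrementally with scikit-learn's warm_start and watch the out-of-bag error: it should plateau rather than climb as trees are added.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the text data, purely for illustration.
X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

rf = RandomForestClassifier(
    warm_start=True,   # keep existing trees when n_estimators grows
    oob_score=True,    # evaluate each tree on its out-of-bag samples
    random_state=0,
)
for n_trees in (25, 50, 100, 200, 400):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    print(n_trees, 1 - rf.oob_score_)  # OOB error should level off
```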

Answered by Econometrics33 on November 26, 2021

In random forests you could use the out-of-bag predictions for tuning. That would make your tuning procedure faster, since no separate validation split is needed; see the sketch below.
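A minimal sketch, assuming a scikit-learn RandomForestClassifier and an illustrative set of candidate depths:

```python
from sklearn.ensemble import RandomForestClassifier

def tune_max_depth_oob(X, y, depths=(10, 50, 100, 500, None)):
    """Pick max_depth by out-of-bag accuracy instead of cross-validation."""
    scores = {}
    for depth in depths:
        rf = RandomForestClassifier(
            n_estimators=300,
            max_depth=depth,
            oob_score=True,   # score each tree on its out-of-bag samples
            random_state=0,
        )
        rf.fit(X, y)
        scores[depth] = rf.oob_score_
    # Return the depth with the best out-of-bag accuracy, plus all scores.
    return max(scores, key=scores.get), scores
```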

max_depth = 500 does not have to be too much. The default in R's randomForest implementation is to grow trees to their maximum depth, so such a value is fine. You should validate your final parameter settings via cross-validation (you then have a nested cross-validation); that way you can see whether there was some problem in the tuning process.
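A sketch of that nested check in scikit-learn (the parameter grid is illustrative; X and y would be your vectorized training data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

def nested_cv_check(X, y):
    # Inner loop: GridSearchCV tunes max_depth on each training fold.
    inner = GridSearchCV(
        RandomForestClassifier(n_estimators=300, random_state=0),
        param_grid={"max_depth": [10, 100, 500, None]},  # illustrative grid
        cv=3,
    )
    # Outer loop: estimates the performance of the whole tuning procedure,
    # so a problem in the tuning process would show up here.
    return cross_val_score(inner, X, y, cv=3)
```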

Answered by PhilippPro on November 26, 2021
