TransWikia.com

The importance() in randomForest returns different results, how to interpret this?

Cross Validated Asked by Lucy Zhang on December 21, 2021

The importance output has two measures, %IncMSE and IncNodePurity, and my results for these two are totally different… I'm predicting a player's value and want to know which attributes are most important for the prediction. How should I interpret this result?

The code I used:

varImpPlot(fa_rating.rf)

and the resulting plot is shown below:

[variable importance plot]

One Answer

There is a good description of these two measures in Introduction to Statistical Learning with Applications in R, page 330:

Two measures of variable importance are reported. The former is based upon the mean decrease of accuracy in predictions on the out of bag samples when a given variable is excluded from the model. The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees. In the case of regression trees, the node impurity is measured by the training RSS, and for classification trees by the deviance.

In plainer English, these are two different rankings that use different criteria to measure variable importance. In both cases, the higher a variable appears on the chart, the more important it is judged to be. With %IncMSE, the chart tells you that MSE (mean squared error) would increase by about 200% if age were excluded from the model (in practice, this is measured by randomly permuting the variable's values in the out-of-bag samples). With node purity, the scale is arbitrary (it depends on the data), but it measures the decrease in RSS (sum of squared errors) before and after the splits on that variable.
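The permutation idea behind %IncMSE can be sketched in a few lines. This is a simplified, hypothetical illustration (randomForest actually permutes each variable in the out-of-bag samples of every tree and averages the MSE increases); the toy model here uses x1 and ignores x2 entirely:

```python
import random

# Toy data: y depends strongly on x1 and not at all on x2.
random.seed(0)
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [2 * a + random.gauss(0, 0.1) for a in x1]

def predict(a, b):
    # Stand-in for a fitted model; it has learned to use x1 and ignore x2.
    return 2 * a

def mse(preds):
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

baseline = mse([predict(a, b) for a, b in zip(x1, x2)])

def pct_inc_mse(which):
    """Percent increase in MSE after permuting one feature."""
    shuffled_x1, shuffled_x2 = x1[:], x2[:]
    if which == "x1":
        random.shuffle(shuffled_x1)
    else:
        random.shuffle(shuffled_x2)
    permuted = mse([predict(a, b) for a, b in zip(shuffled_x1, shuffled_x2)])
    return 100 * (permuted - baseline) / baseline

print(pct_inc_mse("x1"))  # very large: the model relies on x1
print(pct_inc_mse("x2"))  # 0.0: permuting an unused feature changes nothing
```

Permuting a feature breaks its relationship with the response while leaving its marginal distribution intact, which is why the resulting MSE increase is read as that feature's importance.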

There are certain characteristics associated with impurity-based importances that have likely produced different results in your situation, per this blog article:

Firstly, feature selection based on impurity reduction is biased towards preferring variables with more categories. Secondly, when the dataset has two (or more) correlated features, then from the point of view of the model, any of these correlated features can be used as the predictor, with no concrete preference of one over the others. But once one of them is used, the importance of others is significantly reduced since effectively the impurity they can remove is already removed by the first feature. As a consequence, they will have a lower reported importance. This is not an issue when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. But when interpreting the data, it can lead to the incorrect conclusion that one of the variables is a strong predictor while the others in the same group are unimportant, while actually they are very close in terms of their relationship with the response variable.
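The correlated-features effect described above can be reproduced with a single regression stump. The sketch below is a stdlib-only illustration (not the randomForest algorithm; the data and feature names are made up): at the root, a near-duplicate feature reduces RSS almost as much as the original, but once the tree has split on the original, far less impurity is left for the duplicate to remove in the children:

```python
import random

# Two nearly identical features; y is driven by x1 (and therefore by x2 too).
random.seed(1)
n = 300
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.01) for a in x1]   # near-duplicate of x1
y = [3 * a + random.gauss(0, 0.5) for a in x1]

def rss(vals):
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(feature, target):
    """Best single-threshold split on `feature`: (RSS reduction, left idx, right idx)."""
    order = sorted(range(len(feature)), key=feature.__getitem__)
    total = rss(target)
    best_gain, best_i = 0.0, 1
    for i in range(1, len(order)):
        left = [target[j] for j in order[:i]]
        right = [target[j] for j in order[i:]]
        gain = total - (rss(left) + rss(right))
        if gain > best_gain:
            best_gain, best_i = gain, i
    return best_gain, order[:best_i], order[best_i:]

# At the root, the two correlated features look almost equally important...
gain_x1, left, right = best_split(x1, y)
gain_x2_root, _, _ = best_split(x2, y)

# ...but once the tree has split on x1, most of the impurity x2 could have
# removed is already gone, so its apparent importance in the children drops.
gain_x2_children = (
    best_split([x2[j] for j in left], [y[j] for j in left])[0]
    + best_split([x2[j] for j in right], [y[j] for j in right])[0]
)
```

In a forest, whichever of the two correlated features happens to be chosen first at each node soaks up the shared impurity reduction, so the reported IncNodePurity can differ sharply between features that are nearly interchangeable.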

A couple of additional points:

Some people think that since the goal behind getting feature importances is to understand how much each feature contributes to the model's predictive performance, calculating the decrease in accuracy is a more direct way to measure importance, while the decrease in impurity is more like a proxy.

Also, FYI, Python's sklearn library uses the node-impurity method for its default feature importances (the feature_importances_ attribute), so if you are interested in making your results comparable to those of a Python user, you can use the IncNodePurity measure.

Answered by AlexK on December 21, 2021
