TransWikia.com

Logistic regression vs Random Forest on imbalanced data set

Data Science Asked by Mayank Mittal on March 18, 2021

I have an imbalanced data set in which positives make up just 10% of the whole sample. I am using logistic regression and random forest for classification. Comparing the results of these models, I found that the predicted probabilities from logistic regression span the full range [0, 1], while those from the random forest only span [0, 0.6].
I cannot share the data set, but my question is about how these algorithms work. Why does the random forest never output a probability above 0.6?

One Answer

A random forest averages the outputs of its trees, so to get a probability of 1 every tree has to put the sample in a leaf containing only positive samples. Since your maximum is 0.6, that is not happening, which means that either your features do not separate the classes well or your model is under-fitted.
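To make the averaging concrete, here is a minimal sketch using scikit-learn. The synthetic data set (about 10% positives, some label noise) and the `min_samples_leaf=50` setting are assumptions chosen to mimic an under-fitted forest, not the asker's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the asker's data: roughly 10% positives, noisy labels.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=3,
    weights=[0.9, 0.1], flip_y=0.05, random_state=0,
)

# Large leaves tend to stay impure, which mimics an under-fitted forest.
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=50, random_state=0)
rf.fit(X, y)

# Probability of the positive class as reported by the forest.
forest_proba = rf.predict_proba(X)[:, 1]

# The same quantity rebuilt by hand: the mean of the per-tree leaf class fractions.
tree_proba = np.mean([tree.predict_proba(X)[:, 1] for tree in rf.estimators_], axis=0)

print("max forest probability:", forest_proba.max())
print("matches per-tree average:", np.allclose(forest_proba, tree_proba))
```

If the largest value printed stays well below 1, the trees are ending in mixed leaves for even the most confidently positive samples, which is the situation described above.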
I suggest that you try to optimize the hyper-parameters of your RF using cross-validation, and use some oversampling to reduce the class imbalance in your dataset.
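A rough sketch of the tuning step with scikit-learn's `GridSearchCV` might look like the following. The data, the parameter grid and the `average_precision` scoring are all assumptions for illustration. Note also that this sketch swaps in class weighting (`class_weight="balanced_subsample"`) instead of oversampling: oversampling has to be applied inside each training fold to avoid leaking duplicated rows into the validation fold, which needs extra tooling not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced data set with roughly 10% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3,
    weights=[0.9, 0.1], random_state=0,
)

# Illustrative grid: leaf size controls how pure the leaves can become,
# and class weighting is used here as a substitute for oversampling.
param_grid = {
    "min_samples_leaf": [1, 5, 20],
    "class_weight": [None, "balanced_subsample"],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="average_precision",  # less misleading than accuracy on rare positives
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
print("best parameters:", best_params)
print("max probability after tuning:", search.predict_proba(X)[:, 1].max())
```

Stratified folds keep the 10% positive rate roughly constant across splits, so each validation fold contains enough positives to score.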

Correct answer by mirimo on March 18, 2021
