TransWikia.com

How do decision trees in random forests handle conflicts?

Cross Validated Asked by Fuad on November 2, 2021

Let’s say our input elements (training data) are 6 people with three attributes, Height, Weight, and Gender, and we are predicting if that person will have cancer or not (boolean 0 or 1).
Let’s say we want to create 2 decision trees in our random forest, each tree trained on 3 of the people.
Now let’s look at the 3 people used for one of these trees.

Person ID: 0
Height: 170cm
Weight: 70kg
Gender: Male
Cancer?: Yes

Person ID: 1
Height: 150cm
Weight: 55kg
Gender: Female
Cancer?: No

Person ID: 2
Height: 170cm
Weight: 70kg
Gender: Male
Cancer?: No

We have a conflict: Person 0 and Person 2 have the same Height, Weight, and Gender, but Person 0 has cancer and Person 2 does not.

How does the decision tree creation algorithm (within the context of random forests) handle this?

One Answer

  • First of all, building a decision tree from three samples, or even from all six, is not a good idea. That is too little data to get reasonable results. With six samples you could probably write hand-crafted rules and do rule-based classification instead.
  • To answer your general question: if a node in a decision tree has equal class counts, then it is up to you (or the implementation of the algorithm you are using) how to decide in such cases. A majority vote cannot settle a tie, so a popular solution is to break ties randomly.
  • Notice that you would have the same kind of problem with any other algorithm. For example, logistic regression could predict a probability of 0.5 for such a case, and it is up to you how to make the classification decision. The naive >0.5 rule does not seem to be a good idea here (notice that the >0.5 threshold is arbitrary; there are many ways of obtaining a more reasonable threshold for a specific problem).
  • The wise solution would be for the algorithm to say "I don't know" in such cases, but that is not trivial to implement (e.g. is 0.501 far enough from 0.5?).
  • Finally, this should not happen often with most real-life data. Usually the data would have more samples, so exactly equal counts would be less likely. Moreover, unless your data consists only of categorical variables with small numbers of categories, there would probably be greater variability in the combinations of values (but don't underestimate the problem).
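
To make the tie-breaking and "I don't know" options above concrete, here is a minimal sketch in plain Python. It is not how any particular random forest library works internally. The function `leaf_prediction` and its parameters `rng` and `abstain_margin` are hypothetical names introduced only for this illustration. It assumes that Person 0 and Person 2 end up in the same leaf, which follows from their identical attributes: no split on Height, Weight, or Gender can separate them.

```python
import random
from collections import Counter


def leaf_prediction(labels, rng=None, abstain_margin=None):
    """Predict a class from the training labels that reached one leaf.

    labels: class labels of the training samples in the leaf.
    rng: random.Random instance used to break ties (hypothetical parameter).
    abstain_margin: if given, return None ("I don't know") when the top two
        class proportions differ by less than this margin (hypothetical).
    """
    rng = rng or random.Random()
    counts = Counter(labels)
    total = sum(counts.values())
    ranked = counts.most_common()  # [(label, count), ...], largest first
    top_count = ranked[0][1]

    if abstain_margin is not None:
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if (top_count - runner_up) / total < abstain_margin:
            return None  # classes too close to call

    # Majority vote; if several classes share the top count, pick one at random.
    tied = sorted(label for label, c in counts.items() if c == top_count)
    return rng.choice(tied)


# Leaf from the question: Person 0 (cancer = 1) and Person 2 (cancer = 0).
conflict_leaf = [1, 0]
counts = Counter(conflict_leaf)
proportions = {label: n / len(conflict_leaf) for label, n in counts.items()}
print(proportions)  # {1: 0.5, 0: 0.5}

# Random tie-breaking: returns 0 or 1, depending on the random state.
print(leaf_prediction(conflict_leaf, rng=random.Random(0)))

# Abstaining when the classes are within 10 percentage points of each other.
print(leaf_prediction(conflict_leaf, abstain_margin=0.1))  # None
```

The choice of `abstain_margin` is exactly the hard part mentioned above: any value is a judgment call, and 0.1 here is only a placeholder. In a forest, each tree's vote (or leaf proportion) would then be combined across trees, so a tie in a single leaf matters less than it does for a single tree.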

Answered by Tim on November 2, 2021
