How does splitting occur at a node in a decision tree with non-categorical data?

Data Science Asked by A-ar on December 23, 2020

According to a website (link missing), these values are chosen randomly:
[image not available]

I don’t think this is the case with any optimized way of creating a decision tree. In this image (a different example), the value for the root node, marked in red, is 2.45:
[image not available]

Was this value chosen randomly, as the website explains? If not, how is it calculated?

2 Answers

The value used to split the data at a node is chosen to minimize an impurity measure such as the Gini impurity (entropy is a common alternative), which quantifies how mixed the classes are in the resulting groups. The tree picks the value that separates your classes best.

Take an example:

You have a population of 10 people, and the only variable available is age. You're predicting whether a person has a certain disease; let's assume it's binary classification. After doing some EDA, you notice that between 20 and 40 both classes are similarly present, and that from exactly age 56 upward you get four ones and no zeros. Below 56, the classes are equally distributed across the range.

When considering which value to split the node on, your tree calculates the information gain or the Gini impurity for the candidate values, and then splits the population into two leaves so that the impurity in those leaves is as low as possible. Here you would get one leaf with four ones and no zeros, and the other with, let's say, three zeros and three ones.

In simpler terms, when your tree splits a population at a node, it tries to make the leaves as pure as they can be, and by pure I mean containing only one class. That's how the value is decided.
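The search described above can be sketched in a few lines of plain Python. This is only an illustration, not the code of any particular library: the ages and labels are made-up values chosen to match the example (ten people, everyone aged 56 or older has the disease, the six younger people are split evenly), and `gini` and `best_threshold` are hypothetical helper names.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct values and
    return (threshold, weighted Gini impurity) for the best candidate."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical values
        threshold = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Impurity of the two children, weighted by their sizes.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up data for illustration: 1 = has the disease, 0 = does not.
ages = [22, 25, 31, 34, 38, 45, 56, 60, 67, 71]
disease = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(ages, disease)
print(threshold, impurity)
```

On this toy data the best candidate is the midpoint between 45 and 56, which puts the four people aged 56 and over in a pure leaf. Implementations such as scikit-learn's also place the threshold midway between two adjacent observed values, which is why a non-round number like 2.45 can appear in a tree.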

Hope it makes sense.

Answered by Blenz on December 23, 2020

No, I don't think the values by which the branches are separated are chosen at random. Instead, a weighted average entropy is calculated for each feature, and the feature whose split reduces the entropy the most is chosen as the root node. This reduction is referred to as information gain. Consider this dataset:

[image not available]

In the picture above, outlook is chosen as the root node. How is outlook chosen?

First, we calculate the total entropy of the data. Let's say it's 0.95. In order to pick the right root node, we then find the weighted average entropy of the subsets produced by splitting on each feature. There are four features here, so we obtain four weighted average entropies. Let's say they are 0.3, 0.2, 0.4, and 0.8. Now we subtract each weighted average entropy from the total entropy, which gives (0.95 - 0.3), (0.95 - 0.2), (0.95 - 0.4), and (0.95 - 0.8). These four values are the information gains of the features, and whichever feature has the highest information gain is picked as the root node. In our case, it's the outlook feature. Hope it helps.
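The arithmetic above can be written out as a short Python sketch. The total entropy and the four weighted entropies are the illustrative numbers from this answer; the feature names are placeholders (the answer does not say which number belongs to which feature), and `entropy` and `weighted_entropy` are hypothetical helpers showing how such numbers could be computed from class labels.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(
        (count / n) * math.log2(n / count) for count in Counter(labels).values()
    )


def weighted_entropy(groups):
    """Average entropy of the groups, weighted by group size."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * entropy(group) for group in groups)


# Illustrative numbers from the answer; feature names are placeholders.
total_entropy = 0.95
weighted = {"feature_1": 0.3, "feature_2": 0.2, "feature_3": 0.4, "feature_4": 0.8}

# Information gain = total entropy minus weighted average entropy after the split.
gains = {name: total_entropy - value for name, value in weighted.items()}
best_feature = max(gains, key=gains.get)
print(best_feature, gains[best_feature])
```

With these numbers the feature whose weighted entropy is 0.2 has the largest gain (0.75), so it would become the root node. As a sanity check on the helpers, `entropy([1, 1, 0, 0])` is 1.0 and `weighted_entropy([[1, 1], [0, 0]])` is 0.0, since both groups are pure.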

Check this for more clarity

Answered by karthikeyan mg on December 23, 2020
