Is there a way to get Gini index values for every node in an rpart model?

Data Science Asked by Malyada N on June 20, 2021

library(tibble)
library(rpart)
df <- tibble(x=factor(c("A", "B")), y=factor(c(1, 0)))
model <- rpart(formula=y~., data=df, method="class", control=rpart.control(minsplit=2))

Here the model would have one parent node and two child nodes. How can I get the Gini index values for these nodes from the rpart model object?

One Answer

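For a two-class problem, the Gini impurity of a node is $1-p_{1}^2-p_{2}^2$, where $p_1$ and $p_2$ are the class proportions in that node. For example, if node 1 contains 40% '1' and 60% '0', gini = 1 - 0.4^2 - 0.6^2 = 0.48. The node size n and the value dev are stored in model$frame. For a classification tree, dev is the number of misclassified observations in the node, which with two classes is the count of the minority class. Because the formula is symmetric in $p_1$ and $p_2$, dev / n can stand in for either proportion, so the Gini for each node can be calculated from n and dev in model$frame: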

frame <- model$frame
# Two-class Gini impurity per node, with p = dev / n
p <- frame[['dev']] / frame[['n']]
frame[['gini']] <- 1 - p^2 - (1 - p)^2

frame[,c('var','n','dev','gini')]
>      var  n dev      gini
> 1     x3 10   5 0.5000000
> 2 <leaf>  4   1 0.3750000
> 3 <leaf>  6   2 0.4444444
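
As a rough sanity check, the per-node values above can be reproduced by hand from class counts. The sketch below assumes nothing about rpart itself: `gini_impurity` is a hypothetical helper name, and the counts are simply the (dev, n - dev) pairs read off the printed table.

```r
# Gini impurity from a vector of class counts (any number of classes).
# `gini_impurity` is an illustrative helper, not part of rpart.
gini_impurity <- function(counts) {
  p <- counts / sum(counts)  # class proportions within the node
  1 - sum(p^2)
}

# The 40% / 60% example from the text
g_example <- gini_impurity(c(40, 60))

# The three nodes printed above, written as (dev, n - dev)
g_nodes <- c(
  node1 = gini_impurity(c(5, 5)),
  node2 = gini_impurity(c(1, 3)),
  node3 = gini_impurity(c(2, 4))
)

print(g_example)
print(round(g_nodes, 7))
```

Writing the helper in terms of counts rather than dev / n keeps it usable when there are more than two classes, where the dev shortcut no longer applies.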

The Gini improvement for each split is the parent node's impurity weighted by its size, minus the size-weighted impurities of its two child nodes.
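
Written out for a parent node $P$ with children $L$ and $R$ (symbols introduced here only for illustration), where $n$ is the node size and $G$ the node's Gini impurity, the quantity is:

$$
\text{improve} = n_P\,G_P - n_L\,G_L - n_R\,G_R
$$

Plugging in the three nodes from the table above gives

$$
10 \times 0.5 \;-\; 4 \times 0.375 \;-\; 6 \times 0.4444444 \;=\; 5 - 1.5 - 2.6666667 \;\approx\; 0.8333333
$$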

frame[['improve']] <- NA
for (i in seq_len(nrow(frame))) {
  if (frame[i, 'var'] == '<leaf>') next  # leaves have no split

  # the children of node k are the rows named 2k and 2k + 1
  ind <- which(rownames(frame) %in% (as.numeric(rownames(frame)[i]) * 2 + c(0, 1)))
  frame[i, 'improve'] <- frame[i, 'n'] * frame[i, 'gini'] -
    frame[ind[1], 'n'] * frame[ind[1], 'gini'] -
    frame[ind[2], 'n'] * frame[ind[2], 'gini']
}

frame[,c('var','n','dev','gini','improve')]
>      var  n dev      gini   improve
> 1     x3 10   5 0.5000000 0.8333333
> 2 <leaf>  4   1 0.3750000        NA
> 3 <leaf>  6   2 0.4444444        NA

# compare with the improve value in the first row (the split on x3) of
model$splits
>    count ncat   improve index  adj
> x3    10    2 0.8333333     1 0.00
> x2    10    2 0.2380952     2 0.00
> x2     0    2 0.7000000     3 0.25
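
The same calculation can also be sketched without an explicit loop. The version below is only an illustration: it rebuilds a small stand-in for model$frame from the node table printed in this answer (so it runs without fitting a tree), and it assumes the same child numbering the loop relies on, namely that the children of node k are the rows named 2k and 2k + 1.

```r
# Stand-in for model$frame, rebuilt from the node table printed above
frame <- data.frame(
  var = c("x3", "<leaf>", "<leaf>"),
  n   = c(10, 4, 6),
  dev = c(5, 1, 2),
  row.names = c("1", "2", "3")
)

# Two-class Gini impurity per node
p <- frame$dev / frame$n
frame$gini <- 1 - p^2 - (1 - p)^2

# Locate each node's children by row name; leaves get NA
node_id <- as.numeric(rownames(frame))
left    <- match(2 * node_id, node_id)
right   <- match(2 * node_id + 1, node_id)

# Size-weighted impurity of the parent minus that of both children
weighted <- frame$n * frame$gini
frame$improve <- weighted - weighted[left] - weighted[right]

print(frame)
```

Indexing with NA returns NA, so leaf rows end up with improve = NA without a special case.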

Answered by Ryan SY Kwan on June 20, 2021
