Data Science Asked by rahul on October 29, 2020
I’m working with an imbalanced dataset. I’m using a decision tree (scikit-learn) to build a model.
For explaining my problem I’ve taken iris dataset.
When I set class_weight=None, I understand how the tree assigns the probability scores returned by predict_proba.
When I set class_weight='balanced', I know it uses the target values to calculate class weights, but I can't work out how the tree assigns the probability scores.
import sklearn.datasets as datasets
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
iris=datasets.load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
y=iris.target
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.33, random_state=1)
# class_weight=None
dtree=DecisionTreeClassifier(max_depth=2)
dtree.fit(X_train,y_train)
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data, filled=True, rounded=True, special_characters=True, feature_names=X_train.columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png()) # I use jupyter-notebook for visualizing the image
# printing unique probabilities in each class
probas = dtree.predict_proba(X_train)
print(np.unique(probas[:,0]))
print(np.unique(probas[:,1]))
print(np.unique(probas[:,2]))
# ratio for calculating probabilities
print(0/33, 0/34, 33/33)
print(0/33, 1/34, 30/33)
print(0/33, 3/33, 33/34)
The probabilities assigned by the tree and my ratios (determined by looking at tree image) are matching.
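The same check can be done without reading the tree image. The following is a minimal sketch, assuming the same split as above but with plain NumPy arrays instead of a DataFrame and with random_state=0 added to the tree for repeatability (both are choices made for this illustration, not part of the original code). It groups the training rows by the leaf they fall into and compares the class fractions in each leaf with predict_proba.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# class_weight=None; random_state is an assumption added for repeatability
dtree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

leaf_ids = dtree.apply(X_train)          # index of the leaf each training row lands in
probas = dtree.predict_proba(X_train)

manual_probas = np.empty_like(probas)
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    counts = np.bincount(y_train[in_leaf], minlength=3)  # samples per class in this leaf
    manual_probas[in_leaf] = counts / counts.sum()       # plain class fractions
    print(leaf, counts, manual_probas[in_leaf][0].round(4))

print(np.allclose(manual_probas, probas))
```

With no class weights, each leaf's probabilities are simply the class counts in that leaf divided by the leaf size.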
When I use the option class_weight='balanced', I get the tree below.
# class_weight='balanced'
dtree_balanced=DecisionTreeClassifier(max_depth=2, class_weight='balanced')
dtree_balanced.fit(X_train,y_train)
dot_data = StringIO()
export_graphviz(dtree_balanced, out_file=dot_data,filled=True, rounded=True, special_characters=True, feature_names=X_train.columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
I print the unique probabilities using the code below:
probas = dtree_balanced.predict_proba(X_train)
print(np.unique(probas[:,0]))
print(np.unique(probas[:,1]))
print(np.unique(probas[:,2]))
I can't work out (or come up with a formula for) how the tree assigns these probabilities.
We should consider two points. First, class_weight='balanced' does not change the actual number of samples in a class; only the class weight $w_{c_i}$ changes. Second, the [un-normalized] probability of class $c_i$ in each node is calculated as
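$w_{c_i} \times (\text{number of samples from } c_i \text{ in that node} / \text{size of } c_i)$
Here $w_{c_i}$ is the share of the total weight given to class $c_i$: in balanced mode every class gets the same share ($1/3$ for three classes), while in unbalanced mode it is the fraction of training samples that belong to $c_i$.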
For example, in balanced mode, the [un-normalized] probability of $c_3$ in the green leaf is calculated as
$33.\bar{3}\% \times (3 / 36) \approx 2.778\%$
compared to $36\% \times (3 / 36) = 3\%$ in unbalanced mode.
The probability (normalized) in balanced mode would be:
$100 \times 2.778 / (2.778 + 32.258) \% \approx 7.93\%$
where $32.258\%$ is the un-normalized value of the other class present in that leaf.
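The arithmetic above can be reproduced in a few lines. This is only a sketch: the count of 3 out of 36 for $c_3$ comes from the example, while the 30 out of 31 for the other class is an assumption inferred from the 32.258% figure (33.333% × 30/31), not something read from the tree.

```python
w = 1 / 3                      # balanced mode: equal weight share for each of 3 classes

u_c3 = w * 3 / 36              # un-normalized value of c_3 in the leaf
u_other = w * 30 / 31          # assumed: other class has 30 of its 31 samples in the leaf

p_c3 = u_c3 / (u_c3 + u_other) # normalize over the classes present in the leaf

print(round(100 * u_c3, 3))    # un-normalized c_3, in percent
print(round(100 * u_other, 3)) # un-normalized other class, in percent
print(round(100 * p_c3, 2))    # normalized probability of c_3, in percent
```

Since both terms carry the same factor $w$, it cancels in the normalization; only the per-class ratios (3/36 and 30/31) determine the final probability.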
Remark. Strictly speaking, the word "probability" does not apply to an isolated node other than the root. The quantity above is the un-normalized score used to classify a data point that falls in a leaf, and normalization is not needed just to compare classes. The notion of probability does apply to the aggregate of nodes at the same level together with the leaves from upper levels (i.e. the set of all samples).
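To check this against scikit-learn itself, here is a minimal sketch that recomputes the balanced-mode probabilities by hand. It assumes the documented 'balanced' weights, n_samples / (n_classes * class count), which differ from the weight shares used above only by a constant factor that cancels when normalizing. The NumPy-array input and random_state=0 are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

class_counts = np.bincount(y_train)                       # size of each class
n_classes = len(class_counts)
weights = len(y_train) / (n_classes * class_counts)       # 'balanced' class weights

dtree_balanced = DecisionTreeClassifier(
    max_depth=2, class_weight="balanced", random_state=0  # random_state: assumption
).fit(X_train, y_train)

leaf_ids = dtree_balanced.apply(X_train)                  # leaf index per training row
probas = dtree_balanced.predict_proba(X_train)

manual_probas = np.empty_like(probas)
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    counts = np.bincount(y_train[in_leaf], minlength=n_classes)
    weighted = weights * counts                           # un-normalized class scores
    manual_probas[in_leaf] = weighted / weighted.sum()    # normalize within the leaf

print(np.allclose(manual_probas, probas))
```

The sample counts per leaf are untouched; only the per-class multiplier changes, which is why the tree image still shows integer sample counts next to non-integer class values.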
Correct answer by Esmailian on October 29, 2020