TransWikia.com

In XGBoost, how is a leaf index corresponding to the particular leaf node in actual base learner trees?

Data Science Asked by CyberPlayerOne on July 29, 2021

I’ve trained a XGBoost model for regression, where the max depth is 2.

# Create the ensemble
ensemble_size = 200
ensemble = xgb.XGBRegressor(n_estimators=ensemble_size, n_jobs=4, max_depth=2, learning_rate=0.1,
                            objective='reg:squarederror')

ensemble.fit(train_x, train_y)

I’ve plotted the first tree in the ensemble:

# Plot single tree
plot_tree(ensemble, rankdir='LR')

enter image description here

Now I retrieve the leaf indices of the first training sample in the XGBoost ensemble model:

ensemble.apply(train_x[:1]) # leaf indices in all 200 base learner trees

array([[6, 6, 4, 6, 4, 6, 5, 5, 4, 5, 4, 3, 5, 4, 5, 3, 6, 3, 5, 5, 3,
3,
3, 5, 4, 4, 3, 4, 3, 6, 6, 6, 4, 6, 6, 3, 5, 3, 5, 4, 6, 4, 4, 6,
3, 3, 6, 3, 6, 3, 4, 3, 6, 6, 3, 6, 5, 3, 6, 6, 3, 4, 6, 5, 3, 3,
3, 6, 3, 4, 3, 6, 3, 6, 3, 3, 3, 4, 6, 3, 4, 4, 6, 3, 3, 6, 3, 6,
6, 3, 3, 4, 4, 4, 3, 3, 6, 6, 3, 3, 6, 3, 3, 3, 6, 6, 6, 4, 4, 3,
5, 3, 3, 3, 4, 5, 3, 3, 6, 3, 3, 6, 3, 4, 5, 3, 6, 3, 5, 3, 4, 4,
3, 3, 4, 6, 6, 6, 6, 3, 4, 4, 3, 5, 6, 6, 3, 5, 3, 3, 6, 6, 3, 3,
6, 3, 3, 4, 4, 3, 4, 3, 5, 3, 3, 3, 3, 3, 4, 4, 6, 3, 6, 4, 4, 5,
6, 3, 4, 5, 6, 3, 4, 3, 4, 5, 6, 6, 5, 4, 3, 3, 6, 6, 3, 6, 5, 4,
3, 3]], dtype=int32)

Here is my question:

  • Since there are four leaf nodes in the first tree, how come there is
    index 6 for the first training sample?

  • In the official doc for apply(), it says "Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering." So if max_depth is 2, the leaves are numbered between 0 and 7. Since there are only four leaves in a binary tree of depth 2, shouldn’t the leaves numbered within [0, 4)? What is the reason behind the design $[0; 2^{(self.max_depth+1)})$?

Related question: https://stackoverflow.com/questions/58585537/how-to-interpret-the-leaf-index-in-xgboost-tree

One Answer

I think what you are seeing is the fact that all nodes in the tree are indexed because a priori the model doesn't know where splits will happen (i.e. any node could be a leaf). My guess is that the nodes follow an ordering similar to:

node ordering

In your case all of the leaf nodes are at the max depth of the tree, so nodes 3-6 show up in your list. By contrast, if your data was all the same value I would expect all labels to be node 0 (because the split criteria was not met). And then you could have intermediate situations where after 1 split there is a node which does not meet the split criteria (in this case you could see either node 1 or 2 show up in your list).

Hope this helps!

Correct answer by Brandon Donehoo on July 29, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP