TransWikia.com

List of samples that each tree in a random forest is trained on in Scikit-Learn

Data Science Asked by theonionring0127 on May 29, 2021

In Scikit-learn’s random forest, you can set bootstrap=True and each tree would select a subset of samples to train on. Is there a way to see which samples are used in each tree?

I went through the documentation about the tree estimators and all the attributes of the trees that are made available by Scikit-learn, but none of them seems to provide what I’m looking for.

2 Answers

I don't think it is possible to get the indices directly, but we may be able to reconstruct them from the random seed.

random_state : int, RandomState instance or None, default=None
    Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True)

This is from the random forest source code on GitHub (sklearn/ensemble/_forest.py):

def _generate_sample_indices(random_state, n_samples, n_samples_bootstrap):
    """
    Private function used to _parallel_build_trees function."""

    random_instance = check_random_state(random_state)
    sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)

    return sample_indices
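The same sampling can be reproduced outside the forest with sklearn's public `check_random_state` helper. Here is a minimal sketch that mirrors the private function above (the seed and sizes are arbitrary values for illustration):

```python
import numpy as np
from sklearn.utils import check_random_state

def generate_sample_indices(random_state, n_samples, n_samples_bootstrap):
    # Mirrors sklearn's private helper: draw indices with replacement
    rng = check_random_state(random_state)
    return rng.randint(0, n_samples, n_samples_bootstrap)

idx = generate_sample_indices(42, n_samples=10, n_samples_bootstrap=10)
print(idx)                               # ten indices in [0, 10), duplicates expected
print(np.setdiff1d(np.arange(10), idx))  # the corresponding out-of-bag indices
```

Because a fresh `RandomState` is built from the seed on every call, repeating the call with the same seed reproduces the same bootstrap sample.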

So, if we fix the seed as above, we can reproduce these indices with custom code, e.g. for two trees:

import numpy as np

num = 20           # number of training samples
np.random.seed(0)  # fix the seed

# Each tree's bootstrap sample: num indices drawn with replacement
sample_1 = np.random.randint(0, num, num)
oob_1 = np.setdiff1d(np.arange(num), sample_1)  # out-of-bag indices
sample_2 = np.random.randint(0, num, num)
oob_2 = np.setdiff1d(np.arange(num), sample_2)

Please verify this with your own code; I have not tested it myself.

Answered by 10xAI on May 29, 2021

It is possible, actually. This answer is not too different from the one given by @10xAI, but it does not implicitly rely on the order in which random seeds are consumed, which would break for trees trained in parallel. So the answer above may only work for trees that are not trained in parallel, though I am not sure.

The actual working answer is simple: reuse the random state stored in each estimator to redo the random sampling.

So, for instance, assume rf is your trained random forest. You can then get both sampled and unsampled indices by importing the appropriate private functions and replicating the sampling using the seed stored in each rf.estimators_[i].random_state. For example, to retrieve the lists of sampled and unsampled indices:


import sklearn.ensemble._forest as forest_utils

n_samples = len(Y)  # number of training samples

# Private helper: resolves the max_samples parameter into an actual bootstrap size
n_samples_bootstrap = forest_utils._get_n_samples_bootstrap(
    n_samples, rf.max_samples
)

unsampled_indices_trees = []
sampled_indices_trees = []

for estimator in rf.estimators_:
    # Replay each tree's sampling from its stored random_state
    unsampled_indices = forest_utils._generate_unsampled_indices(
        estimator.random_state, n_samples, n_samples_bootstrap)
    unsampled_indices_trees.append(unsampled_indices)

    sampled_indices = forest_utils._generate_sample_indices(
        estimator.random_state, n_samples, n_samples_bootstrap)
    sampled_indices_trees.append(sampled_indices)

Each estimator is a fitted decision tree here, so one can use all of its methods to compute custom OOB scores and so on.
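As a cross-check, one can use the recovered OOB indices to compute a custom OOB accuracy and compare it against sklearn's built-in oob_score_. The sketch below relies on the same private helpers in sklearn.ensemble._forest, which may change between versions; the dataset and forest parameters are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import sklearn.ensemble._forest as forest_utils

X, y = make_classification(n_samples=100, random_state=0)
rf = RandomForestClassifier(
    n_estimators=30, bootstrap=True, oob_score=True, random_state=0
).fit(X, y)

n_samples = len(y)
n_samples_bootstrap = forest_utils._get_n_samples_bootstrap(
    n_samples, rf.max_samples
)

# Accumulate each tree's class probabilities on the samples it never saw
votes = np.zeros((n_samples, rf.n_classes_))
for estimator in rf.estimators_:
    oob = forest_utils._generate_unsampled_indices(
        estimator.random_state, n_samples, n_samples_bootstrap
    )
    votes[oob] += estimator.predict_proba(X[oob])

seen = votes.sum(axis=1) > 0  # samples that were OOB for at least one tree
custom_oob = (votes[seen].argmax(axis=1) == y[seen]).mean()
print(custom_oob, rf.oob_score_)  # should agree, up to samples never left OOB
```

If the replayed indices were wrong, the two scores would diverge, so this doubles as a sanity check on the whole approach.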

Hope this helps!

Answered by pixelmitch on May 29, 2021
