TransWikia.com

Reproducing randomForest Proximity Matrix from R package in Python

Data Science Asked on August 4, 2021

I am trying to port this little piece of R code to Python:

rf <- randomForest(features, proximity = T, oob.prox = T, ntree = 2000)
dists <- as.dist(1 - rf$proximity)

with parameters
oob.prox: Should proximity be calculated only on “out-of-bag” data?
proximity: if proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes).

I am currently trying to use sklearn.ensemble.RandomTreesEmbedding for this task, but it has no functionality for computing the proximity matrix. I did find the following developer comment, though:

We don’t implement proximity matrix in Scikit-Learn (yet).
However, this could be done by relying on the apply function provided
in our implementation of decision trees. That is, for all pairs of
samples in your dataset, iterate over the decision trees in the forest
(through forest.estimators_) and count the number of times they fall
in the same leaf, i.e., the number of times apply give the same node
id for both samples in the pair.
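
The counting described in the comment above can be done on the whole leaf-index matrix at once instead of pair by pair. The following is only a minimal sketch, not scikit-learn functionality: the helper name proximity_from_leaves is hypothetical, and it assumes its input is an integer array of shape (n_samples, n_trees), which is the shape that apply returns for a fitted forest. The toy leaf ids are made up so the block runs with NumPy alone.

```python
import numpy as np


def proximity_from_leaves(leaves):
    """Fraction of trees in which each pair of samples shares a leaf.

    leaves: integer array of shape (n_samples, n_trees), for example the
    result of forest.apply(X) on a fitted scikit-learn forest (assumption).
    """
    leaves = np.asarray(leaves)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    # Process one tree at a time so memory stays at O(n_samples^2).
    for t in range(n_trees):
        col = leaves[:, t]
        # Broadcasting compares every sample with every other sample.
        prox += col[:, None] == col[None, :]
    # Normalize the same-leaf counts by the number of trees.
    return prox / n_trees


# Toy leaf ids for 3 samples and 4 trees (made up for illustration).
leaves = np.array([[1, 2, 3, 4],
                   [1, 2, 5, 6],
                   [7, 8, 9, 6]])
prox = proximity_from_leaves(leaves)
print(prox)
```

With a fitted forest, the input would be obtained with a single call such as rf.apply(xdata), so the trees are traversed once for all samples rather than once per pair.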

So I tried this, using SciPy's pdist() function with a custom distance (or in this case, proximity) measure. I still have several problems:

  1. The proximity function is extremely slow
  2. How to handle the out-of-bag behaviour
  3. How to recreate the exact behaviour of as.dist(1 - rf$proximity): I think I need to normalize my count matrix, subtract it from 1, and then compute the Euclidean distances between its rows?

My code as of now looks like this:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn import ensemble

# grow a random forest from points
# (xdata is assumed to be an array of shape (n_samples, n_features))
rf = ensemble.RandomTreesEmbedding(
    n_estimators=200,
    random_state=0,
    max_depth=5,
)
rfdata = rf.fit_transform(xdata)


# define an affinity measure function to use with SciPy's pdist
def treeprox(u, v):
    # rf.apply expects 2-D input, so single samples need reshaping
    u = u.reshape(1, -1)
    v = v.reshape(1, -1)
    a = rf.apply(u)
    b = rf.apply(v)
    # count the number of trees in which both samples fall in the same leaf
    return np.sum(a == b)


# pairwise same-leaf counts (a proximity, not yet a distance);
# squareform leaves the diagonal at 0
distm = pdist(xdata, treeprox)
distm = squareform(distm)

I guess there must be a better way, since this functionality is readily provided by the R package randomForest.
Any suggestions?
tia

One Answer

I have written some code for this. It can be found here. In answer to your specific questions:

  1. I have tried to optimize for speed. What I did should be a little faster than the code above.
  2. I do not use out-of-bag records. In fact, the original documentation does not suggest this. I created another post to see if the consequences are understood.
  3. This is handled in my code by normalizing by the total possible leaves that could be matched.
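
For point 3, it may help to note that R's as.dist only repackages a square matrix as a distance object; it does not compute Euclidean distances between rows. A rough Python equivalent of as.dist(1 - rf$proximity), assuming the proximity matrix has already been normalized to [0, 1] with ones on the diagonal, could look like the sketch below. The function name proximity_to_condensed_distance and the example matrix are hypothetical and not taken from the linked code.

```python
import numpy as np


def proximity_to_condensed_distance(prox):
    """Turn a normalized proximity matrix into a condensed distance vector.

    prox: symmetric array of shape (n_samples, n_samples) with values in
    [0, 1] and ones on the diagonal (assumption).
    """
    prox = np.asarray(prox, dtype=float)
    dist = 1.0 - prox
    # Keep each pair once: the upper triangle without the diagonal,
    # in row-major order.
    i, j = np.triu_indices(dist.shape[0], k=1)
    return dist[i, j]


# Made-up proximity matrix for 3 samples.
prox = np.array([[1.0, 0.5, 0.0],
                 [0.5, 1.0, 0.25],
                 [0.0, 0.25, 1.0]])
condensed = proximity_to_condensed_distance(prox)
print(condensed)
```

The condensed vector has the same layout that scipy.spatial.distance.squareform produces from a square distance matrix, so it can be passed to SciPy routines that expect pdist-style input.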

Answered by Keith on August 4, 2021
