Sklearn Random Forest Feature Importances Identical for Predicting Different Response Variables

Data Science Asked by Sebastian Topalian on November 1, 2020

I have created four random forest models. They share the same X data, but their y data are four different response variables. The sklearn random forest feature importances are identical for all four. All four models serve their purpose and make different predictions, yet their feature importances are the same.

Has anyone experienced this before?

I created the models with a series of nested objects, as illustrated below. I have used the same code before without getting identical feature importances. The one difference is that previously I ran a 3-fold CV inside each object to determine max_features, whereas here I just use the default, which is all of the features.

Current code:

class NoCVMethod:
    def __init__(self, X_train, y_train, X_test, y_test, y, Method):
        self.clf = Method.fit(X_train, y_train)
        self.predictions = self.clf.predict(X_test)
        self.rev_preds = rev_pred(y[-(13978+97):].values,self.predictions)
        self.residuals = y_test - self.rev_preds
        self.RMSE = np.mean((self.residuals)**2)**0.5
class Different_variables:
    def __init__(self, X_train, y_train, X_test, y_test, Method):
        self.TSS = NoCVMethod(X_train, y_train[y_train.columns.tolist()[0]], X_test, y_test[y_test.columns.tolist()[0]], y[y.columns.tolist()[0]], Method)
        self.NOx = NoCVMethod(X_train, y_train[y_train.columns.tolist()[1]], X_test, y_test[y_test.columns.tolist()[1]], y[y.columns.tolist()[1]], Method)
        self.NH4 = NoCVMethod(X_train, y_train[y_train.columns.tolist()[2]], X_test, y_test[y_test.columns.tolist()[2]], y[y.columns.tolist()[2]], Method)
        self.PO4 = NoCVMethod(X_train, y_train[y_train.columns.tolist()[3]], X_test, y_test[y_test.columns.tolist()[3]], y[y.columns.tolist()[3]], Method)

Old code:

class CVMethod:
    def __init__(self, X_train, y_train, X_test, y_test, y, param_dict, Method):
        self.pipeline = Pipeline([
            ('scale', StandardScaler()),
            ('clf', Method)
        ])
        self.param_grid = param_dict
        self.grid = GridSearchCV(self.pipeline, param_grid = self.param_grid, cv = 3, verbose = False, n_jobs = -1).fit(X_train, y_train)
        self.predictions = self.grid.predict(X_test).ravel()
        self.rev_preds = rev_pred(y[-(13978+97):].values,self.predictions)
        self.residuals = y_test - self.rev_preds
        self.RMSE = np.mean((self.residuals)**2)**0.5
class CVDifferent_variables:
    def __init__(self, X_train, y_train, X_test, y_test, param_dict, Method):
        self.TSS = CVMethod(X_train, y_train[y_train.columns.tolist()[0]], X_test, y_test[y_test.columns.tolist()[0]], y[y.columns.tolist()[0]], param_dict, Method)
        self.NOx = CVMethod(X_train, y_train[y_train.columns.tolist()[1]], X_test, y_test[y_test.columns.tolist()[1]], y[y.columns.tolist()[1]], param_dict, Method)
        self.NH4 = CVMethod(X_train, y_train[y_train.columns.tolist()[2]], X_test, y_test[y_test.columns.tolist()[2]], y[y.columns.tolist()[2]], param_dict, Method)
        self.PO4 = CVMethod(X_train, y_train[y_train.columns.tolist()[3]], X_test, y_test[y_test.columns.tolist()[3]], y[y.columns.tolist()[3]], param_dict, Method)

One Answer

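It seems that your self.clf points to the Method object itself: in scikit-learn, fit() returns the estimator it was called on, so all four NoCVMethod instances hold a reference to the same estimator. The predictions differ because each set is computed right after its own fit. When you print the feature importances at the end, however, you are probably reading them from that single shared estimator, which by then only reflects the last fit.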
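To make the aliasing concrete, here is a minimal sketch. The synthetic data, the variable names and the choice of RandomForestRegressor are assumptions for illustration only, not taken from the question. It fits one estimator on two different targets, the way the nested objects do:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative synthetic data: two targets driven by different features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_a = X[:, 0] + 0.1 * rng.normal(size=200)
y_b = X[:, 2] + 0.1 * rng.normal(size=200)

method = RandomForestRegressor(n_estimators=20, random_state=0)

clf_a = method.fit(X, y_a)   # fit() returns the estimator itself
preds_a = clf_a.predict(X)   # predictions are taken before the refit
clf_b = method.fit(X, y_b)   # refits the very same object in place
preds_b = clf_b.predict(X)

# Both names refer to one object, so the importances are necessarily equal.
same_object = clf_a is clf_b is method
same_importances = np.array_equal(
    clf_a.feature_importances_, clf_b.feature_importances_
)
print(same_object, same_importances)
```

The stored predictions still differ, which matches what you observe: different predictions, identical importances.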

Maybe you should copy it:

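from copy import deepcopy
from sklearn.base import clone

class NoCVMethod:
    def __init__(self, X_train, y_train, X_test, y_test, y, Method):
        # clone() returns a new, unfitted estimator with the same parameters
        self.clf = clone(Method).fit(X_train, y_train)
        # OR: deepcopy() also copies any fitted state the estimator already has
        # self.clf = deepcopy(Method).fit(X_train, y_train)
        # ... remaining attributes as in the question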

See here (or here as you suggested) for more details about copying an sklearn estimator.
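As a rough check of the fix, the sketch below fits one clone per target. The data and names are again illustrative assumptions, not the asker's setup. It also shows that GridSearchCV works on clones of the estimator it is given, which is probably why the old CV-based code did not show the problem:

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative synthetic data: two targets driven by different features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
targets = {
    "a": X[:, 0] + 0.1 * rng.normal(size=200),
    "b": X[:, 2] + 0.1 * rng.normal(size=200),
}

method = RandomForestRegressor(n_estimators=20, random_state=0)

# One independent, freshly cloned estimator per response variable.
models = {name: clone(method).fit(X, y) for name, y in targets.items()}
top_feature = {
    name: int(np.argmax(m.feature_importances_)) for name, m in models.items()
}

# GridSearchCV fits clones, so the estimator passed in stays unfitted
# and best_estimator_ is a separate object.
search = GridSearchCV(method, {"max_features": [1, 2, 3]}, cv=3)
search.fit(X, targets["a"])
search_is_independent = search.best_estimator_ is not method
template_untouched = not hasattr(method, "estimators_")
print(top_feature, search_is_independent, template_untouched)
```

With separate clones, each model keeps its own trees, so the importances can be compared across response variables after all fits are done.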

Correct answer by etiennedm on November 1, 2020
