TransWikia.com

Python Random Forest Prediction Probabilities Reliability, Overfitting?

Cross Validated Asked by rkhan8 on September 24, 2020

TLDR: RF prediction probabilities are not consistent

I have created a calibrated Random Forest model to predict probabilities of workforce attrition, but I am finding that the probabilities for the same employee change drastically over the span of a day. For example, let's say I developed the model yesterday and employee A had a predicted probability of .007560; when I run the same employee through the model today, it outputs a probability of .684939. The only difference between yesterday's and today's data for this employee is in variables like tenure, and that difference is minimal (i.e., tenure has increased by 1 day, or .003 years). So the model's predicted probabilities are not reliable.
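One way to look at this behavior in isolation is to take a single row and advance only the tenure column one day at a time, then watch the predicted probability. The sketch below is only an illustration on synthetic data: the feature layout (column 0 as tenure in years), the labels, and the smaller forest are all assumptions, not the real workforce data or model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the workforce data (hypothetical layout):
# column 0 plays the role of tenure in years, columns 1-3 are other features.
X = rng.normal(size=(500, 4))
X[:, 0] = rng.uniform(0, 20, size=500)
y = (X[:, 0] + rng.normal(scale=5, size=500) < 5).astype(int)

# Fully grown trees without bootstrapping, as in the question, but fewer trees.
model = RandomForestClassifier(
    n_estimators=50, min_samples_leaf=1, bootstrap=False, random_state=0
).fit(X, y)

# Take one training row and advance its tenure one day at a time for 30 days.
employee = X[0].copy()
days = np.arange(0, 31)
rows = np.tile(employee, (len(days), 1))
rows[:, 0] = employee[0] + days / 365.0

probs = model.predict_proba(rows)[:, 1]
jumps = np.abs(np.diff(probs))
print("probability per day:", np.round(probs, 3))
print("largest one-day jump:", jumps.max())
```

A large one-day jump here would mean that many trees place a split threshold right next to this row's tenure value, which is what one would expect if the row was in the training set and the trees were grown until every leaf is pure.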

I know what is causing this difference: the slight changes in these tenure variables. What I don't know is why they have such a large effect. I suspect overfitting, but I followed CV procedures when developing the model (code is below), and accuracy didn't really change between the train and test sets.
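Accuracy only looks at which side of 0.5 a prediction falls on, so it can hide differences in the probabilities themselves. A rough way to check for overfitting at the probability level is to compare a probability-sensitive score, such as the Brier score, on the train and test sets. The sketch below assumes synthetic data from `make_classification` and a reduced forest; it is not the model from this question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced, noisy data standing in for the attrition data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=20
)

# Fully grown, non-bootstrapped trees, mirroring the settings in the question.
rf = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=1, bootstrap=False, random_state=0
).fit(X_train, y_train)

scores = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    p = rf.predict_proba(X_part)[:, 1]
    scores[name] = {
        "accuracy": accuracy_score(y_part, (p >= 0.5).astype(int)),
        "brier": brier_score_loss(y_part, p),  # lower is better
    }
    print(name, scores[name])
```

If the train Brier score is close to zero while the test score is clearly worse, the forest is reproducing the training labels almost exactly, even when the accuracy gap looks small.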

What makes it more confusing is that if I create a new model today and predict probabilities on this employee's data from yesterday and today, the results are consistent for both days at 0.007764, which is close to the original .007560. I presume it's not exactly the same because there is a slight change in the data for the employee. Why the model can provide consistent probabilities on "past" data but not on "future" data is beyond me.

I am only giving one example here, but this is happening on a larger scale, which is why it is an issue. My solution for right now is to redevelop the model every time I need to update workforce attrition probabilities, keeping all of the parameters the same. I don't like this approach because, for the ML pipeline, I'd prefer to just load the saved model and run the data through it rather than develop the model again each time. I would also like to understand what is happening here. Below is my code. The example given here is a calibrated RF model, but the same thing is happening with a non-calibrated RF model as well. Thanks!
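To rule out the save/load step itself as a source of the discrepancy, a quick round-trip check can confirm that a reloaded model returns the same probabilities as the in-memory one. This is a minimal sketch on synthetic data; the temporary file name `model.joblib` and the smaller forest are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real features and labels (assumption).
X, y = make_classification(n_samples=600, n_features=6, random_state=0)

# Same structure as in the question: a sigmoid-calibrated random forest.
rf = RandomForestClassifier(n_estimators=50, bootstrap=False, random_state=0)
model = CalibratedClassifierCV(rf, method="sigmoid", cv=3).fit(X, y)
probs_before = model.predict_proba(X[:5])[:, 1]

# Save to a temporary location, then load the model back from disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

probs_after = reloaded.predict_proba(X[:5])[:, 1]
same = np.allclose(probs_before, probs_after)
print("identical after reload:", same)
```

If the round trip reproduces the probabilities, the difference between days has to come from the input rows or from how the model was fit, not from persistence.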

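import joblib
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Train model (X_final and y_vars are the prepared features and labels)
X_train, X_test, y_train, y_test = train_test_split(X_final, y_vars, test_size=0.5, random_state=20)
# CALIBRATED
# Note: max_features='auto' was removed in scikit-learn 1.3; use 'sqrt' on newer versions
rf_unfit1 = RandomForestClassifier(n_estimators=1800, max_features='auto', max_depth=100, min_samples_split=2, min_samples_leaf=1, bootstrap=False)
rf_cv = CalibratedClassifierCV(rf_unfit1, method='sigmoid', cv=3)
rf_cv.fit(X_train, y_train)
# Save the fitted model to disk
filename = r'C:\Users\Downloads\model'
joblib.dump(rf_cv, filename)
# Make predictions with the saved model
df = pd.read_csv(r'C:\Users\Downloads\test.csv')
rf_cv = joblib.load(filename)
final_cv_probs = rf_cv.predict_proba(df)[:, 1]
print(final_cv_probs)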
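Because the CSV is passed straight to `predict_proba`, one thing worth ruling out is a mismatch between the columns used for training and the columns in the scoring file (order, extra columns, missing columns). The sketch below shows one way to align them explicitly. The column names, the tiny data set, and the `feature_columns` list are hypothetical and only there to make the example self-contained.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training frame; the real column names are not shown in the question.
train = pd.DataFrame({
    "tenure_years": [0.5, 1.0, 3.0, 7.0, 10.0, 12.0],
    "age": [25, 31, 40, 38, 52, 45],
    "salary_band": [1, 2, 2, 3, 4, 3],
})
labels = [1, 1, 0, 0, 0, 0]

# Record the exact training column order so it can be reused at scoring time.
feature_columns = list(train.columns)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(
    train[feature_columns], labels
)

# Scoring frame with a different column order and an extra identifier column.
scoring = pd.DataFrame({
    "employee_id": [101, 102],
    "salary_band": [2, 4],
    "age": [30, 50],
    "tenure_years": [1.003, 10.003],
})

# Fail loudly on missing columns, then select and reorder to match training.
missing = [c for c in feature_columns if c not in scoring.columns]
if missing:
    raise ValueError(f"scoring data is missing columns: {missing}")
aligned = scoring[feature_columns]

probs = model.predict_proba(aligned)[:, 1]
print(probs)
```

Saving `feature_columns` alongside the model would let the scoring step apply the same selection every day.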

Below are the prediction probabilities for the same employee. Input variables are Columns A – G. Highlighted columns show input variables with slight changes in variable values as days progress.

[Image not available: table of input variables (columns A to G) and prediction probabilities for the same employee by day]

Below is the randomized search I conducted to select my hyperparameters:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_final, y_vars, test_size=0.3, random_state=9)
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
# ('auto' meant the same as 'sqrt' for classifiers and was removed in scikit-learn 1.3)
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rf_dev = RandomForestClassifier(random_state=9)
rf_random = RandomizedSearchCV(estimator=rf_dev, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=9, n_jobs=-1)
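The search object above still has to be fitted before it yields any hyperparameters. As a minimal sketch of that last step, here is a scaled-down version on synthetic data; the reduced grid `small_grid`, the data set, and `n_iter=5` are assumptions chosen so the example runs quickly, not the settings used for the real model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for X_final and y_vars (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=9)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=9
)

# A much smaller grid than in the question, to keep the run short.
small_grid = {
    "n_estimators": [20, 40],
    "max_depth": [3, 10, None],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=9),
    param_distributions=small_grid,
    n_iter=5,
    cv=3,
    random_state=9,
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean CV score of best candidate:", search.best_score_)
test_acc = search.score(X_test, y_test)
print("held-out accuracy:", test_acc)
```

With no `scoring` argument the search ranks candidates by accuracy. Since the goal here is reliable probabilities, passing a probability-based scorer such as `scoring="neg_brier_score"` may select different settings.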
