# Python Random Forest Prediction Probabilities Reliability, Overfitting?

Cross Validated Asked by rkhan8 on September 24, 2020

TLDR: RF prediction probabilities are not consistent

I have built a calibrated Random Forest model to predict workforce attrition probabilities, but the predicted probability for the same employee changes drastically within a single day. For example, say I developed the model yesterday and employee A had a predicted probability of .007560; when I run the same employee through the model today, it outputs .684939. The only difference between yesterday’s and today’s data for this employee is in variables like tenure, where the change is minimal (i.e., tenure has increased by 1 day, or .003 years). So the model’s predicted probabilities are not reliable from one day to the next.

I know what is causing the difference: the slight changes in these tenure variables. What I don’t know is why such small changes have such a large effect. I suspect overfitting, but I followed CV procedures when developing the model (code is below), and accuracy didn’t really change between the train and test sets.
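Accuracy only records which side of the 0.5 threshold each prediction falls on, so matching train and test accuracy says little about how stable the probabilities themselves are. As a rough sketch of a complementary check, the snippet below compares accuracy with the Brier score (the mean squared error of the predicted probability) on both splits. It uses synthetic data from `make_classification` and a smaller forest than the one in the post; both are assumptions made only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the attrition data (assumption, not the real data set).
X, y = make_classification(n_samples=600, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=20
)

# Fully grown trees without bootstrapping, as in the post, but with fewer trees.
rf = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=1, bootstrap=False, random_state=0
)
rf.fit(X_train, y_train)

scores = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    probs = rf.predict_proba(X_part)[:, 1]
    scores[name] = {
        "accuracy": accuracy_score(y_part, rf.predict(X_part)),
        "brier": brier_score_loss(y_part, probs),
    }
    print(name, scores[name])
```

A large gap in the Brier score alongside a small gap in accuracy would be one hint that the probabilities, rather than the class labels, are where the model generalizes poorly.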

What makes it more confusing is that if I create a new model today and predict probabilities on this employee’s data from both yesterday and today, the results are consistent for both days at 0.007764, which is close to the original .007560. I presume it’s not exactly the same because of the slight change in the employee’s data. Why the model can provide consistent probabilities on "past" data but not on "future" data is beyond me.
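The comparison described above (the same row scored before and after a tiny change in tenure) can be sketched as a small probe that shifts one column and reports how far the predicted probability moves. The helper name `probability_shift`, the synthetic data, and the use of column 0 as a stand-in for tenure are all assumptions for illustration; the same function could be pointed at either the saved or the newly fitted model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def probability_shift(model, X, column, delta):
    """Return the change in P(class 1) when one column is shifted by delta."""
    X_shifted = np.array(X, dtype=float, copy=True)
    X_shifted[:, column] += delta
    before = model.predict_proba(X)[:, 1]
    after = model.predict_proba(X_shifted)[:, 1]
    return after - before


# Synthetic stand-in data; column 0 plays the role of "tenure in years".
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# 0.003 years is roughly one extra day of tenure, as in the question.
shift = probability_shift(model, X, column=0, delta=0.003)
print("largest absolute change:", np.abs(shift).max())
print("rows that moved by more than 0.1:", int((np.abs(shift) > 0.1).sum()))
```

Running the probe over every employee, rather than a single example, would show whether the large jumps are isolated cases or widespread.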

I am only giving one example here, but this is happening on a larger scale, which is why it is an issue. My solution for now is to redevelop the model every time I need to update the workforce attrition probabilities, keeping all of the parameters the same. I don’t like this approach because, for the ML pipeline, I’d prefer to just load the saved model and run the data through it rather than develop the model each time. I would also like to understand what is happening here. Below is my code. The example given here is a calibrated RF model, but the same thing happens with a non-calibrated RF model as well. Thanks!

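    import joblib
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Train model (X_final, y_vars, filename and df are defined elsewhere)
    X_train, X_test, y_train, y_test = train_test_split(X_final, y_vars, test_size=0.5, random_state=20)
    # Calibrated model. Note: max_features='auto' meant sqrt(n_features) for classifiers;
    # it was removed in scikit-learn 1.3, where 'sqrt' is the equivalent setting.
    rf_unfit1 = RandomForestClassifier(n_estimators=1800, max_features='auto', max_depth=100, min_samples_split=2, min_samples_leaf=1, bootstrap=False)
    rf_cv = CalibratedClassifierCV(rf_unfit1, method='sigmoid', cv=3)
    rf_cv.fit(X_train, y_train)
    # Save the fitted model to disk
    joblib.dump(rf_cv, filename)
    # Make predictions: probability of the positive class for each row of df
    final_cv_probs = rf_cv.predict_proba(df)[:, 1]
    print(final_cv_probs)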
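The save-then-load workflow mentioned above can be checked in isolation. The sketch below fits a calibrated forest on synthetic data, writes it with `joblib.dump`, reloads it with `joblib.load`, and confirms that the reloaded model returns the same probabilities for the same input. The data, the file name `rf_calibrated.joblib`, the smaller forest, and the fixed `random_state` are assumptions for illustration; the model in the post does not set `random_state`, so each refit there builds a different forest.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the attrition data (assumption).
X, y = make_classification(n_samples=400, n_features=7, random_state=0)

# random_state is fixed here so that refitting reproduces the same forest.
base = RandomForestClassifier(n_estimators=100, bootstrap=False, random_state=0)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_calibrated.joblib")  # hypothetical name
    joblib.dump(calibrated, model_path)
    reloaded = joblib.load(model_path)

# Score the same rows with the in-memory model and the reloaded copy.
probs_before = calibrated.predict_proba(X)[:, 1]
probs_after = reloaded.predict_proba(X)[:, 1]
same = bool(np.allclose(probs_before, probs_after))
print("identical after reload:", same)
```

If this check passes on the real pipeline as well, the differences between days would have to come from the input rows or from refitting, not from serialization.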


Below are the prediction probabilities for the same employee. Input variables are Columns A – G. Highlighted columns show input variables with slight changes in variable values as days progress.

Below is the randomized grid search I conducted to select my hyperparameters:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X_final, y_vars, test_size=0.3, random_state=9)
    # Number of trees in random forest
    n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
    # Number of features to consider at every split
    # ('auto' and 'sqrt' are the same setting for classifiers; 'auto' was removed in scikit-learn 1.3)
    max_features = ['auto', 'sqrt']
    # Maximum number of levels in tree
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)
    # Minimum number of samples required to split a node
    min_samples_split = [2, 5, 10]
    # Minimum number of samples required at each leaf node
    min_samples_leaf = [1, 2, 4]
    # Method of selecting samples for training each tree
    bootstrap = [True, False]
    # Create the random grid
    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}
    rf_dev = RandomForestClassifier(random_state=9)
    rf_random = RandomizedSearchCV(estimator=rf_dev, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=9, n_jobs=-1)
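The listing above stops after constructing the search object. As a minimal sketch of the remaining step, the snippet below fits a `RandomizedSearchCV` and reads back `best_params_`. It is not the search from the post: the synthetic data, the much smaller grid `small_grid`, and `n_iter=5` are assumptions chosen so the example finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the attrition data (assumption).
X, y = make_classification(n_samples=300, n_features=7, random_state=9)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=9
)

# A deliberately small grid; the post searches a far larger one.
small_grid = {
    "n_estimators": [20, 50],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=9),
    param_distributions=small_grid,
    n_iter=5,
    cv=3,
    random_state=9,
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best model
print("held-out accuracy:", test_accuracy)
```

The default scoring for a classifier here is accuracy, so a search like this selects hyperparameters for label accuracy rather than for the quality of the predicted probabilities.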

