TransWikia.com

Logistic Regression optimal threshold is a negative value

Data Science Asked by user872009 on July 15, 2021

I run the code below:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from numpy import sqrt
from numpy import argmax
from sklearn.metrics import roc_curve
from sklearn.preprocessing import StandardScaler


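def standardize(variable):
    return (variable - np.mean(variable)) / np.std(variable)

# As written, this computes x - (x.min() / (x.max() - x.min())), which is not
# min-max scaling, so the result is not generally confined to [0, 1].
def normalize(x):
    return (x-x.min()/(x.max()- x.min()))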

data.columns = np.arange(len(data.columns))

trainX, testX, trainy, testy=train_test_split(X,y,test_size=0.5,random_state=2, stratify=y)

# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)

#yhat = model.predict_proba(testX)
yhat = normalize(testX.values)
yhat = yhat[:, 0]
print(yhat)

# calculate roc curves
fpr, tpr, thresholds = roc_curve(testy, yhat)

#print(thresholds)
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-mean=%.3f' % (thresholds[ix], gmeans[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.', label='Logistic')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()

The optimal threshold score is: Best Threshold= -0.049752, G-mean=0.889

Why is the optimal threshold a negative number? What does it mean? And why am I getting a negative number?

One Answer

I'm not familiar with the way you are obtaining the optimal threshold, but there may be a slightly simpler way.

What you are looking for is the point on the ROC curve that is as far left as possible on the x-axis (low false positive rate) and as high as possible on the y-axis (high true positive rate). Taking the threshold that maximizes the difference between the two, tpr - fpr (known as Youden's J statistic), gives you that point.

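import numpy as np
from sklearn.metrics import roc_curve

# best_model is the fitted classifier; column 1 of predict_proba is P(Y = 1 | X)
yhat = best_model.predict_proba(X_train)[:, 1]

fpr, tpr, thresholds = roc_curve(y_train, yhat)
# index of the threshold that maximizes tpr - fpr
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]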

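This threshold gives the largest gap between the true positive rate and the false positive rate, i.e. the best trade-off between a high true positive rate and a low false positive rate under this criterion.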
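As a rough end-to-end sketch of this approach, the snippet below fits a logistic regression and compares the tpr - fpr criterion with the G-mean criterion from the question. The synthetic data from make_classification is an assumption standing in for the asker's X and y, which are not shown, and the held-out split is used here rather than the training data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption) in place of the asker's X and y.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2, stratify=y
)

model = LogisticRegression(solver="lbfgs").fit(X_train, y_train)

# Column 1 of predict_proba is P(Y = 1 | X), the positive-class score.
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)

# Index maximizing tpr - fpr, and index maximizing the G-mean.
j_idx = np.argmax(tpr - fpr)
g_idx = np.argmax(np.sqrt(tpr * (1 - fpr)))

print("tpr - fpr threshold:", thresholds[j_idx])
print("G-mean threshold:   ", thresholds[g_idx])
```

Because the scores passed to roc_curve are probabilities here, both selected thresholds fall between 0 and 1. The two criteria often pick nearby points on the curve, but they are not guaranteed to agree.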

EDIT

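I just noticed that you are taking column 0, i.e. yhat = yhat[:, 0]. With predict_proba that column is $P(Y = 0 \mid X)$; try passing yhat = yhat[:, 1], which is $P(Y = 1 \mid X)$. Also note that in your posted code the predict_proba line is commented out, so yhat is a column of normalized feature values rather than a probability. roc_curve returns thresholds in the same units as the scores it is given, which is why the threshold can be negative.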
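To illustrate why the sign of the threshold depends on what is passed as the score, here is a minimal pure-Python sketch. The helpers roc_points and best_threshold are hypothetical, hand-rolled stand-ins (not sklearn functions), and the labels and scores are made-up values chosen only for illustration.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr, thresholds), one entry per distinct score, highest first."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    thresholds = sorted(set(scores), reverse=True)
    fpr, tpr = [], []
    for t in thresholds:
        # Predict positive when the score is at or above the threshold.
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr, thresholds


def best_threshold(y_true, scores):
    """Pick the threshold that maximizes tpr - fpr."""
    fpr, tpr, thresholds = roc_points(y_true, scores)
    j = [t - f for t, f in zip(tpr, fpr)]
    return thresholds[j.index(max(j))]


y_true = [0, 0, 0, 1, 1, 1]
# Made-up feature-like scores, which may be negative.
feature_scores = [-0.9, -0.4, -0.1, -0.05, 0.3, 0.8]
# Made-up probability-like scores in [0, 1].
prob_scores = [0.05, 0.20, 0.40, 0.45, 0.70, 0.95]

print(best_threshold(y_true, feature_scores))  # -0.05
print(best_threshold(y_true, prob_scores))     # 0.45
```

The candidate thresholds are simply the score values themselves, so a negative "best" threshold only says that the scores were not probabilities.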

Answered by Julio Jesus on July 15, 2021
