我运行了逻辑回归模型,并对logit值进行了预测。我用它来获得ROC曲线上的点数:
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(Y_test,p)
我知道metrics.roc_auc_score
给出了ROC曲线下的面积。谁能告诉我什么命令会找到最佳截止点(阈值)?
虽然回答很晚,但思想可能会有所帮助。您可以使用R epi
中的(here!)包来完成此操作,但是我在python中找不到类似的包或示例。
最佳截止点将是true positive rate
高且false positive rate
低的地方。基于这个逻辑,我在下面举了一个例子来找到最佳阈值。
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
from sklearn.metrics import roc_curve, auc
# read the data in
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# rename the 'rank' column because there is also a DataFrame method called 'rank'
df.columns = ["admit", "gre", "gpa", "prestige"]
# dummify rank
dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(dummy_ranks.ix[:, 'prestige_2':])
# manually add the intercept
data['intercept'] = 1.0
train_cols = data.columns[1:]
# fit the model
result = sm.Logit(data['admit'], data[train_cols]).fit()
print result.summary()
# Add prediction to dataframe
data['pred'] = result.predict(data[train_cols])
fpr, tpr, thresholds =roc_curve(data['admit'], data['pred'])
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)
####################################
# The optimal cut off would be where tpr is high and fpr is low
# tpr - (1-fpr) is zero or near to zero is the optimal cut off point
####################################
i = np.arange(len(tpr)) # index for df
roc = pd.DataFrame({'fpr' : pd.Series(fpr, index=i),'tpr' : pd.Series(tpr, index = i), '1-fpr' : pd.Series(1-fpr, index = i), 'tf' : pd.Series(tpr - (1-fpr), index = i), 'thresholds' : pd.Series(thresholds, index = i)})
roc.ix[(roc.tf-0).abs().argsort()[:1]]
# Plot tpr vs 1-fpr
fig, ax = pl.subplots()
pl.plot(roc['tpr'])
pl.plot(roc['1-fpr'], color = 'red')
pl.xlabel('1-False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic')
ax.set_xticklabels([])
最佳截止点为0.317628,因此高于此值的任何值都可以标记为1,否则为0.您可以从输出/图表中看到tpr与1-fpr交叉的位置,tpr为63%,fpr为36%,tpr-( 1-fpr)在当前示例中最接近零。
1-fpr fpr tf thresholds tpr
171 0.637363 0.362637 0.000433 0.317628 0.637795
希望这是有帮助的。
为了简化和引入可重用性,我已经找到了找到最佳概率截止点的函数。
def Find_Optimal_Cutoff(target, predicted):
""" Find the optimal probability cutoff point for a classification model related to event rate
Parameters
----------
target : Matrix with dependent or target data, where rows are observations
predicted : Matrix with predicted data, where rows are observations
Returns
-------
list type, with optimal cutoff value
"""
fpr, tpr, threshold = roc_curve(target, predicted)
i = np.arange(len(tpr))
roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})
roc_t = roc.ix[(roc.tf-0).abs().argsort()[:1]]
return list(roc_t['threshold'])
# Add prediction probability to dataframe
data['pred_proba'] = result.predict(data[train_cols])
# Find optimal probability threshold
threshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])
print threshold
# [0.31762762459360921]
# Find prediction to the dataframe applying threshold
data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold else 0)
# Print confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(data['admit'], data['pred'])
# array([[175, 98],
# [ 46, 81]])
给定tpr,fpr,来自问题的阈值,最佳阈值的答案就是:
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
香草Python实现Youden的J-Score
def cutoff_youdens_j(fpr,tpr,thresholds):
j_scores = tpr-fpr
j_ordered = sorted(zip(j_scores,thresholds))
return j_ordered[-1][1]
cgnorthcutt的帖子
给定tpr,fpr,来自问题的阈值,最佳阈值的答案就是:
optimal_idx = np.argmax(tpr - fpr)optimal_threshold = thresholds [optimal_idx]
几乎是正确的。必须采用abs值。
optimal_idx = np.argmin(np.abs(tpr - fpr)) // Edit: Change to argmin!
optimal_threshold = thresholds[optimal_idx]
根据提到的参考 - > http://www.medicalbiostatistics.com/roccurve.pdf p.6我发现了另一种可能性:
opt_idx = np.argmin(np.sqrt(np.square(1-tpr)+ np.square(fpr)))