I have trained 4 different xgboost ML models, tested them, and obtained the prediction scores with this line of code (the first column is the target feature, so I exclude it when predicting):
predict(model, as.matrix(test_set[,-1]), type = 'prob')
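For context, I bind the four score vectors together roughly like this (the model objects and the per-model test matrices are placeholders here, since each model uses its own feature set but the same samples):

# Rough sketch of how the score data frame is put together; model_meta, model_x2,
# model3, model4 and the per-model test matrices are placeholders, because each
# model was trained on a different feature set but scored on the same samples
scores <- data.frame(
  rows       = rownames(test_set),
  model_meta = predict(model_meta, as.matrix(test_meta[, -1]), type = 'prob'),
  model_x2   = predict(model_x2,   as.matrix(test_x2[, -1]),   type = 'prob'),
  model3     = predict(model3,     as.matrix(test_3[, -1]),    type = 'prob'),
  model4     = predict(model4,     as.matrix(test_4[, -1]),    type = 'prob')
)
rownames(scores) <- scores$rows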
So now I have a data frame whose rows are the test-set samples, with 4 columns showing each model's prediction score for each sample. It looks something like this:
structure(list(rows = c("aa78a",
"12200T", "1c2a5ac94492", "1d209304d988", "212T", "PBB",
"35XDS16T", "1234H", "39T", "3ec4d3fc8bd1", "3f78044299b5",
"4260a482", "30T", "43b757f5d8", "49c4c0e12POI"), model_meta= c(0.382992297410965,
0.460950464010239, 0.447804838418961, 0.447804838418961, 0.460950464010239,
0.447804838418961, 0.369836807250977, 0.447804838418961, 0.369836807250977,
0.447804838418961, 0.382992297410965, 0.447804838418961, 0.369836807250977,
0.447804838418961, 0.447804838418961), model_x2= c(0.460011065006256,
0.52004611492157, 0.253930300474167, 0.222006008028984, 0.302200853824615,
0.485153168439865, 0.20485857129097, 0.350892871618271, 0.331338971853256,
0.295754462480545, 0.185829699039459, 0.618589639663696, 0.291316270828247,
0.414723694324493, 0.210018843412399), model3= c(0.277256995439529,
0.425392180681229, 0.182383552193642, 0.253527283668518, 0.329186052083969,
0.305586904287338, 0.188975885510445, 0.238625407218933, 0.497761845588684,
0.342641144990921, 0.156761467456818, 0.306724846363068, 0.152404963970184,
0.428304076194763, 0.22887846827507), model4= c(0.565486133098602,
0.564990341663361, 0.164183273911476, 0.15946152806282, 0.234778091311455,
0.396436214447021, 0.172556579113007, 0.257463246583939, 0.43759897351265,
0.200696632266045, 0.122483171522617, 0.586755096912384, 0.348238885402679,
0.493290543556213, 0.252075374126434)), row.names = c("aa78a",
"12200T", "1c2a5ac94492", "1d209304d988", "212T", "PBB",
"35XDS16T", "1234H", "39T", "3ec4d3fc8bd1", "3f78044299b5",
"4260a482", "30T", "43b757f5d8", "49c4c0e12POI"), class = "data.frame")
All of these predictions are binary classifications, meaning the target feature is binary in every model. I used the same test set (the same samples) for all models and only changed the features.
I have two questions:
1- How do I know which class a prediction score refers to? For example, if the score is 0.460950464010239, does that mean the sample is more likely to be class 0? I don't think I fully understand what the prediction score means.
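My working assumption is that the score is the predicted probability of the positive class (label 1), so a 0.5 cutoff would turn it into hard labels, something like the sketch below (calling the data frame above `scores`), but I would like to confirm this:

# Assumption (not verified): each score is P(class == 1), so values below 0.5
# would be predicted as class 0 and values of 0.5 or above as class 1
predicted_class <- ifelse(scores$model_meta >= 0.5, 1, 0)
table(predicted_class)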
2- How can I use these scores to plot the ROC curves of all 4 models on the same plot, showing each model's name together with its corresponding AUC, maybe with a nice legend in one corner of the plot?
I imagine a single plot like that, with all four curves and their AUCs in the legend, would look great.
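For reference, this is the kind of pROC-based sketch I have in mind, though I'm not sure it is the right approach (it assumes the data frame above is called `scores` and that the true binary labels are the first column of `test_set`):

library(pROC)

# True 0/1 labels from the test set (the target is the first column)
true_labels <- test_set[[1]]

model_cols <- c("model_meta", "model_x2", "model3", "model4")
line_cols  <- c("black", "red", "blue", "darkgreen")

# One ROC object per model, computed from the true labels and that model's scores
roc_list <- lapply(model_cols, function(m) roc(true_labels, scores[[m]]))

# Draw the first curve, then add the remaining three to the same plot
plot(roc_list[[1]], col = line_cols[1], legacy.axes = TRUE)
for (i in 2:length(roc_list)) {
  plot(roc_list[[i]], col = line_cols[i], add = TRUE)
}

# Legend with each model's name and its AUC
legend("bottomright",
       legend = sprintf("%s (AUC = %.3f)", model_cols, sapply(roc_list, auc)),
       col = line_cols, lwd = 2)

Is this on the right track, or is there a cleaner way to get all four curves and their AUCs into one legend?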