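I want to estimate the relative importance of variables in explaining a response variable ("dep_var", a numeric variable based on a 4-point Likert scale). I am mainly interested in the relative importance of six (correlated) factors ("f1" to "f6"). To do this, I want to rely on the R package partykit and run varimp(conditional = T) on a cforest, as suggested by Strebel et al. (2008) and elsewhere. My data contain NA values, but according to the documentation, cforest can handle them. (Accordingly, its default na.action is na.pass, combined with surrogate variable selection.) However, when I run the varimp function, I get an error message instead of a result.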
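Before fitting anything, it can help to see where the missing values sit and how many rows are fully observed. The following is a minimal sketch in base R; the data frame toy and its values are hypothetical stand-ins for the real data d, kept small so the counts are easy to check by hand.

```r
# Hypothetical toy data with the same kinds of columns as the real data.
toy <- data.frame(
  dep_var = c(4, 2, 1, 3, 2, 4),
  f1 = factor(c(NA, "long", "short", NA, "middle", "short")),
  f2 = factor(c("long", NA, "short", "middle", "short", "long")),
  num1 = c(55, 34, NA, 62, 20, 39)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows without any NA, i.e. the rows that na.omit would keep.
n_complete <- sum(complete.cases(toy))
cat("complete rows:", n_complete, "of", nrow(toy), "\n")
```

Running the same two calls on d shows how many observations a complete-case analysis would discard.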
What I have done so far:
Creating the random forest
I first generated a random forest object with the partykit::cforest function.
library(partykit)
library(tidyverse)
formula <- paste0("dep_var ~ ", paste(c("f1", "f2", "f3", "f4", "f5", "f6", "num1", "dummy1", "dummy2", "group"), collapse = " + ")) %>% as.formula()
set.seed(100)
y <- partykit::cforest(formula = formula, data = d, cores = 6, ntree = 1000, control = partykit::ctree_control(maxsurrogate = 4))
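As a side note on the formula construction above, base R's reformulate() builds the same kind of formula without pasting strings or loading the tidyverse. This is only a sketch of an alternative, not a fix for the error; the names predictors and formula_alt are introduced here for illustration.

```r
# Predictor names as used in the question.
predictors <- c("f1", "f2", "f3", "f4", "f5", "f6",
                "num1", "dummy1", "dummy2", "group")

# reformulate() (from the stats package) assembles response ~ term1 + term2 + ...
formula_alt <- reformulate(predictors, response = "dep_var")
print(formula_alt)
```

The result can be passed to cforest in the same way as the pasted formula.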
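Determining variable importance
I then ran the partykit::varimp function on this forest object. It should give me the relative importance of all variables included in the formula supplied to cforest.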
partykit::varimp(y, conditional = T)
Error
However, when I run cforest and varimp on my data, I get this error message:
Error in FUN(model = model, trafo = trafo, data = data, subset = subset, : NROW(Y) == length(ix) is not TRUE
Sample data to reproduce the error
library(tidyverse)
d <- structure(list(dep_var = c(4, 4, 2, 1, 2, 2, 1, 1, 1, 4, 2, 4,
3, 1, 4, 1, 1, 1, 2, 2, 2, 1, 2, 3, 4, 3, 4, 2, 3, 2, 4, 3, 2,
3, 4, 4, 3, 3, 2, 3, 4, 4, 1, 2, 3, 4, 2, 3, 4, 3, 3, 2, 2, 4,
4, 1, 2, 3, 2, 1, 4, 1, 2, 3, 2, 3, 3, 1, 4, 3, 2, 4, 3, 1, 1,
2, 1, 2, 1, 3, 4, 1, 4, 2, 3, 2, 1, 2, 4, 1, 2, 4, 2, 4, 2, 1,
4, 1, 3, 4, 3, 4, 3, 1, 3, 2, 4, 3, 3, 3, 2, 2, 2, 2, 2, 4, 3,
2, 4, 3, 3, 2, 3, 2, 4, 1, 3, 3, 2, 3, 2, 1, 3, 2, 1, 1, 4, 3,
2, 4, 4, 1, 4, 3, 3, 4, 2, 1, 2, 2, 2, 3, 4, 2, 3, 4, 1, 3, 3,
2, 2, 4, 2, 3, 1, 4, 4, 3, 2, 3, 1, 3, 2, 4, 2, 3, 3, 1, 2, 1,
2, 2, 1, 3, 2, 2, 3, 1, 3, 3, 3, 2, 2, 2, 3, 4, 3, 3, 3), f1 = structure(c(NA,
1L, 2L, 3L, 2L, 1L, 3L, NA, NA, NA, 3L, NA, 3L, 3L, NA, 2L, NA,
1L, NA, 3L, 3L, NA, 2L, 3L, 2L, 2L, NA, NA, 2L, NA, 3L, 2L, NA,
2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, NA, 3L, 2L, 3L, 3L, 2L, 3L, NA,
NA, 3L, 3L, 2L, 3L, 2L, 3L, 3L, NA, 1L, NA, 3L, 1L, 3L, 2L, 3L,
3L, 2L, 3L, 3L, 1L, 2L, NA, 2L, NA, NA, 3L, 2L, NA, 3L, 3L, 3L,
2L, 3L, 2L, 2L, 2L, 2L, NA, 2L, 3L, 2L, 1L, 2L, NA, NA, 3L, 3L,
3L, 3L, 3L, 3L, NA, NA, NA, 1L, 3L, 3L, 1L, 3L, 3L, NA, 1L, NA,
2L, 1L, 2L, 3L, 1L, NA, 3L, 3L, NA, 2L, NA, NA, 2L, 3L, 3L, 1L,
3L, 2L, 3L, 3L, 2L, 2L, NA, NA, 2L, 3L, 3L, 3L, 3L, 2L, 2L, NA,
3L, 3L, NA, 3L, 2L, NA, 1L, NA, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 3L,
3L, 2L, 2L, NA, 1L, 3L, 2L, 2L, 3L, 3L, NA, 3L, NA, 3L, 1L, 3L,
2L, 2L, 2L, 2L, 3L, NA, NA, 1L, 3L, 2L, 3L, NA, 2L, 2L, 3L, 3L,
NA, 3L, 2L, 3L, NA, 1L), levels = c("long", "middle", "short"
), class = "factor"), f2 = structure(c(1L, 1L, 2L, 3L, 3L, 1L,
1L, 3L, 1L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 3L,
2L, 2L, 2L, 2L, 3L, NA, 2L, 1L, 2L, 3L, 2L, 2L, 3L, 1L, 3L, 3L,
3L, 3L, 3L, NA, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 3L, 1L, 3L, 3L,
3L, 1L, 2L, 1L, 2L, NA, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 1L,
2L, NA, 2L, 1L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L,
2L, NA, 3L, 3L, 3L, 2L, 3L, NA, NA, 3L, 3L, 2L, 2L, 3L, 3L, 1L,
2L, 3L, 2L, 1L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 3L, 1L, 2L, 3L, 1L,
1L, 3L, 3L, NA, 3L, 2L, 3L, 3L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 3L,
3L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 2L, 3L,
3L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 1L, 2L,
3L, 2L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 3L,
2L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 1L,
2L), levels = c("long", "middle", "short"), class = "factor"),
f3 = structure(c(1L, 3L, 1L, 1L, 1L, 2L, 3L, 3L, 2L, 3L,
3L, 3L, 3L, 1L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 1L, 3L,
3L, NA, 3L, 1L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 1L, 3L,
1L, 3L, 1L, 3L, 2L, 1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 1L, 1L,
2L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 2L, 1L, 3L, 3L,
3L, 1L, 3L, 2L, 3L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, NA, 1L, 3L,
1L, 3L, 3L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 2L, 1L, 1L, 3L, 1L,
2L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 3L,
1L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 1L, 1L, 1L, 3L, 3L,
3L, NA, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L,
3L, 1L, 3L, 3L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 1L, 1L, 1L,
1L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 3L,
1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 3L), levels = c("long", "middle",
"short"), class = "factor"), f4 = structure(c(3L, 3L, 3L,
3L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 2L, 1L,
3L, 1L, 3L, 3L, 1L, 2L, 2L, 3L, NA, 1L, 3L, 2L, 3L, 3L, 1L,
1L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 3L, 2L,
1L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 3L, 1L, 2L,
2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 3L,
3L, 1L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 1L, 2L, 3L,
1L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 3L, 2L,
3L, 3L, 1L, 2L, 1L, 3L, 3L, 1L, 3L, 1L, 2L, 2L, 2L, 2L, 1L,
3L, 3L, 3L, 2L, 3L, 1L, 3L, 1L, 1L, 3L, 2L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 2L, 3L, 3L, 2L, 1L,
3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 1L, 2L, 2L,
1L, 3L, 2L, 1L, 3L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 2L, 3L,
3L, 2L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 2L, 1L,
3L), levels = c("long", "middle", "short"), class = "factor"),
f5 = structure(c(3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 3L,
3L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 2L, 3L, 2L,
3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 1L, 2L, 2L, 3L, 2L,
3L, 3L, 2L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, NA, 1L, 3L, 3L, 3L,
3L, 3L, 3L, 1L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L, 3L, 2L, 1L, 1L, 3L, 3L, 3L, NA, 3L, 3L, 1L, NA, 3L, 2L,
3L, 3L, 2L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
2L, 1L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 2L, 1L, 3L, 2L, 3L,
2L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 2L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, NA, 3L, 3L, 2L,
3L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 1L,
2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("long", "middle",
"short"), class = "factor"), f6 = structure(c(3L, 3L, 3L,
3L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 3L, 3L, 1L, 3L, 1L, 2L,
3L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, NA, 3L, 2L,
2L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 3L, 2L, NA,
2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 1L, 3L, 2L, 2L,
3L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L,
1L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 2L,
2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 3L, 2L, 2L,
3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, NA, 3L, 3L, 2L, 2L,
3L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 2L, NA, 2L,
2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L,
2L, 1L, 3L, 2L, 3L, 3L, 1L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 1L,
2L), levels = c("long", "middle", "short"), class = "factor"),
num1 = c(55, 34, 59, 62, 20, 39, 68, 19, 39, 61, 84, 53,
40, 32, 32, 64, 30, 37, 54, 62, 71, 67, 56, 48, 44, 86, 79,
42, 60, 59, 30, 63, 39, 87, 54, 36, 34, 60, 70, 50, 69, 64,
44, 76, 53, 54, 39, 33, 88, 76, 20, 23, 40, 39, 60, 45, 60,
73, 45, 47, 41, 48, 40, 77, 36, NA, 39, 64, 62, 53, 47, 64,
56, 34, 69, 71, 75, 62, 19, 73, 62, 40, 46, 49, 57, 72, 61,
32, 33, 67, 72, 38, 64, 54, 58, 46, 61, 36, 36, 58, 51, 51,
40, 86, 27, 65, 43, 36, 43, 55, 47, 58, 73, 51, 46, 43, 49,
32, 42, 75, 57, 21, 68, 70, 51, 49, 54, 48, 23, 21, 66, 67,
56, 33, 68, 44, 72, 38, 58, 48, 72, 46, 81, 62, 68, 62, 32,
50, 25, 27, 41, 36, 63, 25, 46, 43, 76, 62, 69, 71, 57, 35,
41, 32, 69, 59, 65, 32, 61, 40, 52, 69, 55, 42, 30, 89, 71,
43, 35, 72, 37, 59, 30, 65, 50, 40, 49, 37, 67, 70, 68, 63,
32, 52, 73, 68, 36, 72, 35), dummy1 = structure(c(1L, 2L,
2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, NA, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L,
1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L,
2L, 1L), levels = c("1", "2"), class = "factor"), dummy2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, NA, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, NA, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 2L, 1L, 2L, 2L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), levels = c("0", "1"), class = "factor"), group = structure(c(8L,
6L, 8L, 8L, 8L, 2L, 8L, 8L, 1L, 8L, 1L, 3L, 8L, 8L, 5L, 7L,
1L, 8L, 8L, 2L, 6L, 1L, 6L, 1L, 5L, 2L, 2L, NA, 3L, 2L, 8L,
8L, 3L, 8L, 3L, 3L, 6L, 4L, 8L, 2L, 5L, NA, 8L, 1L, 2L, 5L,
2L, 6L, 2L, 1L, 8L, 1L, 2L, 3L, 6L, 6L, 3L, 2L, 2L, 8L, NA,
1L, 8L, 6L, 6L, 8L, 1L, 6L, 8L, 1L, 2L, NA, 2L, 4L, 1L, 8L,
6L, 1L, NA, 2L, 2L, 8L, 8L, 8L, 2L, 8L, 8L, 6L, 3L, 8L, 1L,
1L, 2L, 3L, 2L, 8L, 1L, 1L, 3L, 8L, 2L, 8L, 3L, 2L, 4L, 1L,
3L, 4L, 5L, 1L, 8L, 8L, 3L, 6L, 3L, 3L, 8L, 3L, 6L, 1L, 2L,
1L, 8L, 8L, 1L, 1L, 7L, 8L, 3L, 5L, 4L, 3L, 6L, NA, 4L, 1L,
3L, 8L, 6L, 6L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 6L, 3L, 8L, 4L,
4L, 2L, 2L, 7L, 1L, 2L, 8L, 2L, 2L, 8L, 6L, 8L, 6L, 4L, 4L,
2L, 2L, 3L, 1L, 2L, 3L, 8L, 2L, 3L, 2L, 4L, 2L, 5L, 8L, 3L,
7L, 1L, 1L, 3L, NA, 4L, 2L, 8L, 2L, 1L, 8L, 6L, 4L, 1L, 8L,
3L, 1L, 8L), levels = c("1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor")), row.names = c(NA, -199L), class = c("tbl_df",
"tbl", "data.frame"))
d <- d %>% mutate(across(!all_of(c("dep_var", "num1")), ~ .x %>% as.factor()))
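Triangulating the source of the problem
1.) I have restarted R and my computer several times, to no effect.
2.) I tried excluding all NAs from the dataset, both NA values in the response variable and in the factors. I did this by manipulating the dataset and by setting na.action = na.exclude in the cforest function. This mostly solved my problem.
(However, even this does not reliably solve my problem: for some other dependent variables and/or with different ntree values, I get the following error:
Warning: scheduled core 2 encountered error in user code, all values of the job will be affected
Error in colMeans(ret, na.rm = TRUE) : 'x' must be numeric
In addition, for some reason, in the sample code provided, variable importance is reported only for a subset of the variables in the forest.)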
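One possible reading of the colMeans message, offered as an assumption and not confirmed by anything in this post: when a parallel job fails, its result may come back as an error string instead of a numeric vector, and combining that string with the numeric results produces a character matrix that colMeans rejects. The sketch below only simulates that situation in base R; it does not call partykit, and the object names (ok1, ok2, failed, ret) are hypothetical.

```r
# Two simulated numeric per-tree results and one simulated failed job.
ok1 <- c(f1 = 0.10, f2 = 0.05)
ok2 <- c(f1 = 0.12, f2 = 0.03)
failed <- try(stop("NROW(Y) == length(ix) is not TRUE"), silent = TRUE)

# Binding numeric rows with a character row yields a character matrix.
ret <- rbind(ok1, ok2, as.character(failed))
is_numeric <- is.numeric(ret)

# colMeans() refuses non-numeric input; capture the message instead of stopping.
msg <- tryCatch(
  {
    colMeans(ret, na.rm = TRUE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

If this reading applies, running the forest and varimp without the cores argument should surface the underlying error directly instead of the colMeans message.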
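However, excluding observations with NA values is not what I am after, because it unnecessarily removes observations from the analysis; it also suggests that the surrogates are not being identified correctly across the different trees/forests.
I would appreciate any suggestions about what the problem is, so that I can extract conditional variable importance from a random forest that uses na.pass and surrogates.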
The following code with na.action = na.omit works for me:
set.seed(100)
y <- partykit::cforest(formula = formula, data = d, ntree = 1000, na.action = na.omit,
control = partykit::ctree_control(maxsurrogate = 4))
partykit::varimp(y, conditional = T)