partykit: Error when using varimp on a cforest fitted to a dataset containing NA values


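I want to estimate the relative importance of variables in explaining a response variable ("dep_var", a numeric variable based on a 4-point Likert scale). I am mainly interested in the relative importance of six (correlated) factors ("f1" to "f6"). To achieve this, I want to rely on the partykit package in R and run varimp(conditional = TRUE) on a cforest, as suggested by Strobl et al. (2008) and elsewhere. My data contain NA values, but according to the documentation cforest can handle them. (Accordingly, its default na.action is na.pass, with surrogate variable selection.) However, when I run the varimp function, I get an error message instead of a result.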

What I have done so far:

Creating the random forest
I first generated a random forest object using the partykit::cforest function.

library(partykit)
library(tidyverse)

# Build the model formula from the response and the predictor names
formula <- paste0("dep_var ~ ", paste(c("f1", "f2", "f3", "f4", "f5", "f6", "num1", "dummy1", "dummy2", "group"), collapse = " + ")) %>% as.formula()

set.seed(100)

# Fit the conditional inference forest; na.action is left at its default (na.pass)
y <- partykit::cforest(formula = formula, data = d, cores = 6, ntree = 1000, control = partykit::ctree_control(maxsurrogate = 4))
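
A possible base-R alternative to the paste0() / as.formula() construction above is stats::reformulate(), which builds the same kind of formula from a character vector of predictor names. This is only a minimal sketch; the object names predictors and formula_alt are illustrative and do not appear in the original question.

```r
# Predictor names as used in the question's formula
predictors <- c("f1", "f2", "f3", "f4", "f5", "f6",
                "num1", "dummy1", "dummy2", "group")

# reformulate() (base 'stats' package) builds: response ~ term1 + term2 + ...
formula_alt <- reformulate(termlabels = predictors, response = "dep_var")

# List the variables in the resulting formula (response first, then predictors)
all.vars(formula_alt)
```

The result can be passed to cforest in place of formula; it avoids assembling the formula by string concatenation.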

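Determining variable importance
Subsequently, I ran the partykit::varimp function on this forest object. It should give me the relative importance of all variables included in the formula supplied to cforest.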

partykit::varimp(y, conditional = TRUE)

The error
However, when I run cforest and varimp on my data, I get this error message:

Error in FUN(model = model, trafo = trafo, data = data, subset = subset,  : NROW(Y) == length(ix) is not TRUE

Sample data to reproduce the error

library(tidyverse)

d <- structure(list(dep_var = c(4, 4, 2, 1, 2, 2, 1, 1, 1, 4, 2, 4, 
3, 1, 4, 1, 1, 1, 2, 2, 2, 1, 2, 3, 4, 3, 4, 2, 3, 2, 4, 3, 2, 
3, 4, 4, 3, 3, 2, 3, 4, 4, 1, 2, 3, 4, 2, 3, 4, 3, 3, 2, 2, 4, 
4, 1, 2, 3, 2, 1, 4, 1, 2, 3, 2, 3, 3, 1, 4, 3, 2, 4, 3, 1, 1, 
2, 1, 2, 1, 3, 4, 1, 4, 2, 3, 2, 1, 2, 4, 1, 2, 4, 2, 4, 2, 1, 
4, 1, 3, 4, 3, 4, 3, 1, 3, 2, 4, 3, 3, 3, 2, 2, 2, 2, 2, 4, 3, 
2, 4, 3, 3, 2, 3, 2, 4, 1, 3, 3, 2, 3, 2, 1, 3, 2, 1, 1, 4, 3, 
2, 4, 4, 1, 4, 3, 3, 4, 2, 1, 2, 2, 2, 3, 4, 2, 3, 4, 1, 3, 3, 
2, 2, 4, 2, 3, 1, 4, 4, 3, 2, 3, 1, 3, 2, 4, 2, 3, 3, 1, 2, 1, 
2, 2, 1, 3, 2, 2, 3, 1, 3, 3, 3, 2, 2, 2, 3, 4, 3, 3, 3), f1 = structure(c(NA, 
1L, 2L, 3L, 2L, 1L, 3L, NA, NA, NA, 3L, NA, 3L, 3L, NA, 2L, NA, 
1L, NA, 3L, 3L, NA, 2L, 3L, 2L, 2L, NA, NA, 2L, NA, 3L, 2L, NA, 
2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, NA, 3L, 2L, 3L, 3L, 2L, 3L, NA, 
NA, 3L, 3L, 2L, 3L, 2L, 3L, 3L, NA, 1L, NA, 3L, 1L, 3L, 2L, 3L, 
3L, 2L, 3L, 3L, 1L, 2L, NA, 2L, NA, NA, 3L, 2L, NA, 3L, 3L, 3L, 
2L, 3L, 2L, 2L, 2L, 2L, NA, 2L, 3L, 2L, 1L, 2L, NA, NA, 3L, 3L, 
3L, 3L, 3L, 3L, NA, NA, NA, 1L, 3L, 3L, 1L, 3L, 3L, NA, 1L, NA, 
2L, 1L, 2L, 3L, 1L, NA, 3L, 3L, NA, 2L, NA, NA, 2L, 3L, 3L, 1L, 
3L, 2L, 3L, 3L, 2L, 2L, NA, NA, 2L, 3L, 3L, 3L, 3L, 2L, 2L, NA, 
3L, 3L, NA, 3L, 2L, NA, 1L, NA, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, 
3L, 2L, 2L, NA, 1L, 3L, 2L, 2L, 3L, 3L, NA, 3L, NA, 3L, 1L, 3L, 
2L, 2L, 2L, 2L, 3L, NA, NA, 1L, 3L, 2L, 3L, NA, 2L, 2L, 3L, 3L, 
NA, 3L, 2L, 3L, NA, 1L), levels = c("long", "middle", "short"
), class = "factor"), f2 = structure(c(1L, 1L, 2L, 3L, 3L, 1L, 
1L, 3L, 1L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 
2L, 2L, 2L, 2L, 3L, NA, 2L, 1L, 2L, 3L, 2L, 2L, 3L, 1L, 3L, 3L, 
3L, 3L, 3L, NA, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 3L, 1L, 3L, 3L, 
3L, 1L, 2L, 1L, 2L, NA, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 1L, 
2L, NA, 2L, 1L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 
2L, NA, 3L, 3L, 3L, 2L, 3L, NA, NA, 3L, 3L, 2L, 2L, 3L, 3L, 1L, 
2L, 3L, 2L, 1L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 3L, 1L, 2L, 3L, 1L, 
1L, 3L, 3L, NA, 3L, 2L, 3L, 3L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 3L, 
3L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 2L, 3L, 
3L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 1L, 2L, 
3L, 2L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 3L, 
2L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 1L, 
2L), levels = c("long", "middle", "short"), class = "factor"), 
    f3 = structure(c(1L, 3L, 1L, 1L, 1L, 2L, 3L, 3L, 2L, 3L, 
    3L, 3L, 3L, 1L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 1L, 3L, 
    3L, NA, 3L, 1L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 1L, 3L, 
    1L, 3L, 1L, 3L, 2L, 1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 
    2L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 2L, 1L, 3L, 3L, 
    3L, 1L, 3L, 2L, 3L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, NA, 1L, 3L, 
    1L, 3L, 3L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 2L, 1L, 1L, 3L, 1L, 
    2L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 3L, 
    1L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 1L, 1L, 1L, 3L, 3L, 
    3L, NA, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 
    3L, 1L, 3L, 3L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 1L, 1L, 1L, 
    1L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 3L, 
    1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 3L), levels = c("long", "middle", 
    "short"), class = "factor"), f4 = structure(c(3L, 3L, 3L, 
    3L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 2L, 1L, 
    3L, 1L, 3L, 3L, 1L, 2L, 2L, 3L, NA, 1L, 3L, 2L, 3L, 3L, 1L, 
    1L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 3L, 2L, 
    1L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 3L, 1L, 2L, 
    2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 
    3L, 1L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 1L, 2L, 3L, 
    1L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 3L, 2L, 
    3L, 3L, 1L, 2L, 1L, 3L, 3L, 1L, 3L, 1L, 2L, 2L, 2L, 2L, 1L, 
    3L, 3L, 3L, 2L, 3L, 1L, 3L, 1L, 1L, 3L, 2L, 1L, 2L, 2L, 2L, 
    3L, 3L, 3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 2L, 3L, 3L, 2L, 1L, 
    3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 1L, 2L, 2L, 
    1L, 3L, 2L, 1L, 3L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 2L, 3L, 
    3L, 2L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 2L, 1L, 
    3L), levels = c("long", "middle", "short"), class = "factor"), 
    f5 = structure(c(3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 
    3L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 
    3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 1L, 2L, 2L, 3L, 2L, 
    3L, 3L, 2L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, NA, 1L, 3L, 3L, 3L, 
    3L, 3L, 3L, 1L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 
    2L, 3L, 2L, 1L, 1L, 3L, 3L, 3L, NA, 3L, 3L, 1L, NA, 3L, 2L, 
    3L, 3L, 2L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
    2L, 1L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 2L, 1L, 3L, 2L, 3L, 
    2L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 
    3L, 3L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, NA, 3L, 3L, 2L, 
    3L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 3L, 3L, 3L, 
    3L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 
    2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("long", "middle", 
    "short"), class = "factor"), f6 = structure(c(3L, 3L, 3L, 
    3L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 3L, 3L, 1L, 3L, 1L, 2L, 
    3L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, NA, 3L, 2L, 
    2L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 
    2L, 2L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 3L, 2L, NA, 
    2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 1L, 3L, 2L, 2L, 
    3L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L, 
    1L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 
    2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 3L, 2L, 2L, 
    3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 
    3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, NA, 3L, 3L, 2L, 2L, 
    3L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 2L, NA, 2L, 
    2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 
    2L, 1L, 3L, 2L, 3L, 3L, 1L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 1L, 
    2L), levels = c("long", "middle", "short"), class = "factor"), 
    num1 = c(55, 34, 59, 62, 20, 39, 68, 19, 39, 61, 84, 53, 
    40, 32, 32, 64, 30, 37, 54, 62, 71, 67, 56, 48, 44, 86, 79, 
    42, 60, 59, 30, 63, 39, 87, 54, 36, 34, 60, 70, 50, 69, 64, 
    44, 76, 53, 54, 39, 33, 88, 76, 20, 23, 40, 39, 60, 45, 60, 
    73, 45, 47, 41, 48, 40, 77, 36, NA, 39, 64, 62, 53, 47, 64, 
    56, 34, 69, 71, 75, 62, 19, 73, 62, 40, 46, 49, 57, 72, 61, 
    32, 33, 67, 72, 38, 64, 54, 58, 46, 61, 36, 36, 58, 51, 51, 
    40, 86, 27, 65, 43, 36, 43, 55, 47, 58, 73, 51, 46, 43, 49, 
    32, 42, 75, 57, 21, 68, 70, 51, 49, 54, 48, 23, 21, 66, 67, 
    56, 33, 68, 44, 72, 38, 58, 48, 72, 46, 81, 62, 68, 62, 32, 
    50, 25, 27, 41, 36, 63, 25, 46, 43, 76, 62, 69, 71, 57, 35, 
    41, 32, 69, 59, 65, 32, 61, 40, 52, 69, 55, 42, 30, 89, 71, 
    43, 35, 72, 37, 59, 30, 65, 50, 40, 49, 37, 67, 70, 68, 63, 
    32, 52, 73, 68, 36, 72, 35), dummy1 = structure(c(1L, 2L, 
    2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
    1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, NA, 1L, 1L, 
    1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 
    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 
    1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
    2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 
    1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 
    2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 
    2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 
    1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 
    2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 
    2L, 1L), levels = c("1", "2"), class = "factor"), dummy2 = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
    1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 
    2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 
    1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
    2L, 2L, 2L, 1L, 1L, NA, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
    2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
    2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
    1L, NA, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 
    1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 
    1L, 2L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), levels = c("0", "1"), class = "factor"), group = structure(c(8L, 
    6L, 8L, 8L, 8L, 2L, 8L, 8L, 1L, 8L, 1L, 3L, 8L, 8L, 5L, 7L, 
    1L, 8L, 8L, 2L, 6L, 1L, 6L, 1L, 5L, 2L, 2L, NA, 3L, 2L, 8L, 
    8L, 3L, 8L, 3L, 3L, 6L, 4L, 8L, 2L, 5L, NA, 8L, 1L, 2L, 5L, 
    2L, 6L, 2L, 1L, 8L, 1L, 2L, 3L, 6L, 6L, 3L, 2L, 2L, 8L, NA, 
    1L, 8L, 6L, 6L, 8L, 1L, 6L, 8L, 1L, 2L, NA, 2L, 4L, 1L, 8L, 
    6L, 1L, NA, 2L, 2L, 8L, 8L, 8L, 2L, 8L, 8L, 6L, 3L, 8L, 1L, 
    1L, 2L, 3L, 2L, 8L, 1L, 1L, 3L, 8L, 2L, 8L, 3L, 2L, 4L, 1L, 
    3L, 4L, 5L, 1L, 8L, 8L, 3L, 6L, 3L, 3L, 8L, 3L, 6L, 1L, 2L, 
    1L, 8L, 8L, 1L, 1L, 7L, 8L, 3L, 5L, 4L, 3L, 6L, NA, 4L, 1L, 
    3L, 8L, 6L, 6L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 6L, 3L, 8L, 4L, 
    4L, 2L, 2L, 7L, 1L, 2L, 8L, 2L, 2L, 8L, 6L, 8L, 6L, 4L, 4L, 
    2L, 2L, 3L, 1L, 2L, 3L, 8L, 2L, 3L, 2L, 4L, 2L, 5L, 8L, 3L, 
    7L, 1L, 1L, 3L, NA, 4L, 2L, 8L, 2L, 1L, 8L, 6L, 4L, 1L, 8L, 
    3L, 1L, 8L), levels = c("1", "2", "3", "4", "5", "6", "7", 
    "8"), class = "factor")), row.names = c(NA, -199L), class = c("tbl_df", 
"tbl", "data.frame"))

d <- d %>% mutate(across(!all_of(c("dep_var", "num1")), ~ .x %>% as.factor()))
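
Before fitting, it can help to see where the missing values sit. The sketch below counts NA values per column with base R. It uses a small made-up data frame, toy, as a stand-in for d (the values are illustrative, not taken from the sample data); the same call should work on d itself.

```r
# Hypothetical toy data standing in for the question's `d`
toy <- data.frame(
  dep_var = c(4, 2, 1, 3, 2),
  f1      = factor(c(NA, "long", "short", NA, "middle")),
  num1    = c(55, NA, 59, 62, 20)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```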

Triangulating the source of the problem

1.) I have restarted R and my computer several times, to no effect.

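2.) I tried excluding all NAs from the dataset: NA values in the response variable as well as in the factors. I did this both by manipulating the dataset and by setting the na.action = na.exclude option in the cforest function. This mostly solved my problem. (However, even this does not solve it reliably: for some other dependent variables and/or with a different ntree value, I get the following error:
Warning: scheduled core 2 encountered error in user code, all values of the job will be affected
Error in colMeans(ret, na.rm = TRUE) : 'x' must be numeric
Moreover, for some reason, with the example code provided, variable importance is only reported for a subset of the variables in the forest.)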

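However, excluding observations with NA values is not what I am after, because it unnecessarily removes observations from the analysis; it also suggests that surrogates are not being identified correctly in the different trees/forests.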
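
To make the cost of listwise deletion concrete, the following sketch computes how many rows would survive if every row with at least one NA were dropped. It again uses a made-up data frame (toy, with illustrative values) instead of d; the names n_complete and share_dropped are assumptions for this example.

```r
# Hypothetical toy data standing in for the question's `d`
toy <- data.frame(
  dep_var = c(4, 2, 1, 3, 2),
  f1      = factor(c(NA, "long", "short", NA, "middle")),
  num1    = c(55, NA, 59, 62, 20)
)

# Rows without any NA; these are the rows na.omit() would keep
n_complete <- sum(complete.cases(toy))

# Share of observations lost to listwise deletion
share_dropped <- 1 - n_complete / nrow(toy)

n_complete
share_dropped
```

When applying this to d, it may be more informative to restrict the data frame to the variables in the model formula first, since NAs in unused columns do not matter for the fit.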

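I would appreciate any suggestions as to what the problem is, so that I can extract conditional variable importance from a random forest using na.pass and surrogates.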

r random-forest feature-selection partykit
1 Answer

The following code, with na.action = na.omit, worked for me:

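set.seed(100)
# Fit the forest on complete cases only (rows with an NA in any model variable are dropped)
y <- partykit::cforest(formula = formula, data = d, ntree = 1000, na.action = na.omit,
                       control = partykit::ctree_control(maxsurrogate = 4))

# Conditional permutation importance
partykit::varimp(y, conditional = TRUE)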
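
Since the question mentions that importance was reported for only a subset of the variables, one way to check the result is to compare the names of the importance vector with the predictors in the formula. This is a sketch under the assumption that varimp returns a named numeric vector; the helper predictors_without_importance and the example inputs are hypothetical, and the numbers are made up rather than taken from the question's data.

```r
# Hypothetical helper: predictors in `formula` that have no entry in `vi`.
# Assumes a single response variable on the left-hand side of the formula.
predictors_without_importance <- function(formula, vi) {
  # all.vars() returns the response first, then the predictors
  predictors <- all.vars(formula)[-1]
  setdiff(predictors, names(vi))
}

# Illustrative inputs (made-up numbers, not results from the question's data)
f  <- dep_var ~ f1 + f2 + num1 + group
vi <- c(f1 = 0.02, num1 = 0.01)

predictors_without_importance(f, vi)
```

With the real objects, the call would be predictors_without_importance(formula, partykit::varimp(y, conditional = TRUE)).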