I have a very wide data frame and some named subsets of its columns:
df <- data.frame(matrix(ncol = 200, nrow = 200))
df <- as.data.frame(
  apply(df, 1, function(x) sample(c(1, 2, NA), size = 200, replace = TRUE))
)
xHappy <- c("V22", "V39", "V17", "V2")
xSad <- c("V15", "V11", "V79", "V90", "V80", "V101")
xSilly <- c("V20", "V112")
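For each subset, I want to create a new variable that counts, for each row, the number of NAs in that subset's columns.
I could copy and paste the same approach three times: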
library(dplyr)
df.new <- df %>% mutate(
  missingHappy = rowSums(is.na(select(., all_of(xHappy)))),
  missingSad = rowSums(is.na(select(., all_of(xSad)))),
  missingSilly = rowSums(is.na(select(., all_of(xSilly))))
)
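But my real data has thousands of variables and dozens of subsets, so this is both tedious and risky, and I would really prefer something more concise.
Is there a way to write a single reusable mutate() step, so that all I have to do is call something like makeNAColumns(xSad, xHappy, xSilly), or even df %>% mutate [insert magic one liner]?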
With the purrr package, you could do something like this:
library(dplyr)
library(purrr)
# Create list of your subset columns and define output column names
subs <- list(
missingHappy = xHappy,
missingSad = xSad,
missingSilly = xSilly
)
# Apply rowSums across each list in subs, bind_cols() to output
df.new <- df |>
  bind_cols(
    map2_dfc(subs, names(subs),
             ~ rowSums(is.na(select(df, all_of(.x)))))
  )
head(df.new[,c("V22", "V39", "V17", "V2","V15",
"V11", "V79", "V90", "V80", "V101",
"V20", "V112", "missingHappy", "missingSad","missingSilly")])
#   V22 V39 V17 V2 V15 V11 V79 V90 V80 V101 V20 V112 missingHappy missingSad missingSilly
# 1   1   2   1  2   2   1   2   2   2    2   2    1            0          0            0
# 2  NA  NA  NA  2   1   2  NA   2  NA    1   1    2            3          2            0
# 3   2  NA  NA NA   1  NA   2  NA   1    1  NA    1            3          2            1
# 4   2  NA   2  1   2  NA   2   2   2    2   1    1            1          1            0
# 5   2   1  NA NA  NA   1   1   2   1    2  NA   NA            2          1            2
# 6   2   1   1  1   1   2   1   1  NA    1   1    2            0          1            0
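Creating the names in the subs object is still laborious, and I wish there were a more efficient way to handle the column naming, but hopefully this saves enough time.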
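One possible way to reduce the naming work, sketched here in base R only: mget() returns a list named after the variables it looks up, so the output column names can be derived from the subset names with a single sub() call. The helper name makeNAColumns is borrowed from the question. Its signature (a data frame plus a named list of column vectors) and the "x" to "missing" renaming rule are assumptions made for illustration, not part of the original answer.

```r
# Toy data standing in for the wide data frame from the question
# (set.seed() is added here only to make the sketch reproducible).
set.seed(1)
df <- as.data.frame(
  matrix(sample(c(1, 2, NA), size = 200 * 200, replace = TRUE), ncol = 200)
)

xHappy <- c("V22", "V39", "V17", "V2")
xSad   <- c("V15", "V11", "V79", "V90", "V80", "V101")
xSilly <- c("V20", "V112")

# mget() looks the vectors up by name and returns a list named after them,
# so each subset name only has to be typed once.
subs <- mget(c("xHappy", "xSad", "xSilly"))

# Assumed naming rule: replace the leading "x" with "missing".
names(subs) <- sub("^x", "missing", names(subs))

# Hypothetical helper: adds one NA-count column per named subset.
makeNAColumns <- function(data, subsets) {
  counts <- lapply(subsets, function(cols) {
    # drop = FALSE keeps a data frame even for a single-column subset
    rowSums(is.na(data[, cols, drop = FALSE]))
  })
  cbind(data, as.data.frame(counts))
}

df.new <- makeNAColumns(df, subs)
head(df.new[, c(xSilly, "missingSilly")])
```

With dozens of subsets, the character vector passed to mget() could itself be built with ls(pattern = "^x"), provided every object matching that pattern really is a subset vector.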