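I want to create a new column, "X11", that conditionally sums all the 1s based on how many NAs there are across a selected set of columns. In this example I'm looking at 4 variables: X1, X2, X3, and X4.
For example: if there is 1 NA, I want to look at the remaining 3 variables that have values and count how many 1s there are. If there are 2 NAs, I want to look at the remaining 2 variables and count how many 1s there are. If there are 3 NAs, I want to look at the 1 remaining variable and check whether it is a 1. If all 4 are NA, the result should be 0.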
I have this data:
df <- data.frame(replicate(10, sample(0:2, 10, replace = TRUE)))
df <- replace(df, df == 0, NA)
My data frame looks like this:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 1 NA 1 NA NA NA 1 1 2
2 NA 1 1 NA 2 NA 2 2 NA 1
3 1 NA 1 1 NA NA 1 2 NA 1
4 2 2 2 1 1 2 1 NA 2 2
5 NA 2 NA 2 NA 2 1 NA 1 1
6 2 2 1 1 2 NA 1 2 1 1
7 1 2 NA NA 2 1 1 NA NA 1
8 2 2 NA NA 1 NA NA 2 NA 1
9 1 NA 1 2 2 1 2 NA NA 1
10 NA 2 1 NA NA NA NA 2 2 NA
I'd like my output to look like this:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
1 1 1 NA 1 NA NA NA 1 1 2 3
2 NA 1 1 NA 2 NA 2 2 NA 1 2
3 1 NA 1 1 NA NA 1 2 NA 1 3
4 2 2 2 1 1 2 1 NA 2 2 1
5 NA 2 NA 2 NA 2 1 NA 1 1 0
6 2 2 1 1 2 NA 1 2 1 1 2
7 1 2 NA NA 2 1 1 NA NA 1 1
8 2 2 NA NA 1 NA NA 2 NA 1 0
9 1 NA 1 2 2 1 2 NA NA 1 2
10 NA 2 1 NA NA NA NA 2 2 NA 1
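Here is a sample of my current code:
library(dplyr)
vars <- c("X1", "X2", "X3", "X4")
# Count the NAs per row across the selected columns, then the non-missing remainder
df <- df %>%
  mutate(missing_vars = rowSums(across(all_of(vars), ~ is.na(.))),
         nonmissing_vars = length(vars) - missing_vars)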
df <- df %>%
mutate(zero_na = case_when(missing_vars == 0 & (X1 == 2 & X2 == 2 & X3 == 2 & X4 == 2) ~ 1,
(missing_vars == 0 & (X1 == 1 & X2 == 2 & X3 == 2 & X4 == 2) |
(X1 == 2 & X2 == 1 & X3 == 2 & X4 == 2) |
(X1 == 2 & X2 == 2 & X3 == 1 & X4 == 2) |
(X1 == 2 & X2 == 2 & X3 == 2 & X4 == 1)) ~ 2,
(missing_vars == 0 & (X1 == 1 & X2 == 1 & X3 == 2 & X4 == 2) |
(X1 == 1 & X2 == 2 & X3 == 1 & X4 == 2) |
(X1 == 1 & X2 == 2 & X3 == 2 & X4 == 1) |
(X1 == 2 & X2 == 1 & X3 == 1 & X4 == 2) |
(X1 == 2 & X2 == 2 & X3 == 1 & X4 == 1) |
(X1 == 2 & X2 == 1 & X3 == 2 & X4 == 1)) ~ 3,
(missing_vars == 0 & (X1 == 1 & X2 == 1 & X3 == 1 & X4 == 2) |
(X1 == 1 & X2 == 1 & X3 == 2 & X4 == 1) |
(X1 == 1 & X2 == 2 & X3 == 1 & X4 == 1) |
(X1 == 2 & X2 == 1 & X3 == 1 & X4 == 1)) ~ 4,
missing_vars == 0 & (X1 == 1 & X2 == 1 & X3 == 1 & X4 == 1) ~ 5))
df <- df %>%
mutate(one_na = case_when(missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 2 & X4 == 2) ~ 1,
missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 2 & X4 == 2) ~ 1,
missing_vars == 1 & (X1 == 2 & X2 == 2 & is.na(X3) & X4 == 2) ~ 1,
missing_vars == 1 & (X1 == 2 & X2 == 2 & X3 == 2 & is.na(X4)) ~ 1,
missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 2 & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 2 & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 1 & X2 == 2 & is.na(X3) & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 1 & X2 == 2 & X3 == 2 & is.na(X4)) ~ 2,
missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 1 & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 1 & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 2 & X2 == 1 & is.na(X3) & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 2 & X2 == 1 & X3 == 2 & is.na(X4)) ~ 2,
missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 2 & X4 == 1) ~ 2,
missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 2 & X4 == 1) ~ 2,
missing_vars == 1 & (X1 == 2 & X2 == 2 & is.na(X3) & X4 == 1) ~ 2,
missing_vars == 1 & (X1 == 2 & X2 == 2 & X3 == 1 & is.na(X4)) ~ 2,
missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 1 & X4 == 2) ~ 3,
missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 1 & X4 == 2) ~ 3,
missing_vars == 1 & (X1 == 1 & X2 == 1 & is.na(X3) & X4 == 2) ~ 3,
missing_vars == 1 & (X1 == 1 & X2 == 1 & X3 == 2 & is.na(X4)) ~ 3,
missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 1 & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 1 & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 2 & X2 == 1 & is.na(X3) & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 2 & X2 == 1 & X3 == 1 & is.na(X4)) ~ 3,
missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 2 & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 2 & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 1 & X2 == 2 & is.na(X3) & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 1 & X2 == 2 & X3 == 1 & is.na(X4)) ~ 3,
missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 1 & X4 == 1) ~ 4,
missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 1 & X4 == 1) ~ 4,
missing_vars == 1 & (X1 == 1 & X2 == 1 & is.na(X3) & X4 == 1) ~ 4,
missing_vars == 1 & (X1 == 1 & X2 == 1 & X3 == 1 & is.na(X4)) ~ 4))
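I repeat this for every combination with 2 NAs, 3 NAs, and then 4 NAs, and then sum "zero_na", "one_na", and so on to get the final count under X11.
However, I currently have about 300,000 observations and need to do this for 7 different variables with varying numbers of NAs, 1s, and 2s. The number of combinations I would have to write out is ridiculous, so I'm wondering whether there is a more efficient way to write this code?
Thanks so much in advance!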
Try this:
df$X11 <- apply(df[, 1:4], 1, \(s) sum(s == 1, na.rm = TRUE))
Output:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
1 1 1 NA 1 NA NA NA 1 1 2 3
2 NA 1 1 NA 2 NA 2 2 NA 1 2
3 1 NA 1 1 NA NA 1 2 NA 1 3
4 2 2 2 1 1 2 1 NA 2 2 1
5 NA 2 NA 2 NA 2 1 NA 1 1 0
6 2 2 1 1 2 NA 1 2 1 1 2
7 1 2 NA NA 2 1 1 NA NA 1 1
8 2 2 NA NA 1 NA NA 2 NA 1 0
9 1 NA 1 2 2 1 2 NA NA 1 2
10 NA 2 1 NA NA NA NA 2 2 NA 1
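
With around 300,000 rows, a vectorized variant of the same idea may be worth trying. The following is only a sketch and not part of the answer above: it rebuilds just columns X1 to X4 from the example table and uses base R's rowSums() on the logical matrix df[vars] == 1. As in the apply() call, na.rm = TRUE drops the NAs, so no case distinction by number of NAs is needed. The name `vars` is reused from the question.

```r
# Columns X1..X4 copied from the example data frame shown in the question
df <- data.frame(
  X1 = c(1, NA, 1, 2, NA, 2, 1, 2, 1, NA),
  X2 = c(1, 1, NA, 2, 2, 2, 2, 2, NA, 2),
  X3 = c(NA, 1, 1, 2, NA, 1, NA, NA, 1, 1),
  X4 = c(1, NA, 1, 1, 2, 1, NA, NA, 2, NA)
)

# Columns to inspect; extend this vector to cover more variables
vars <- c("X1", "X2", "X3", "X4")

# df[vars] == 1 is a logical matrix (TRUE / FALSE / NA);
# rowSums() counts the TRUEs per row and na.rm = TRUE skips the NAs
df$X11 <- rowSums(df[vars] == 1, na.rm = TRUE)

print(df)
```

For the 7-variable case described in the question, only `vars` should need to change. Because rowSums() works on the whole matrix at once rather than calling a function per row, it is generally faster than apply() on large data, though the actual difference is worth measuring on the real data set.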