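In R, I am trying to implement the following filtering logic by group on a large dataset.
Within each group:
If there is more than one L, keep the row with the lowest L value.
If there is more than one N, keep the row with the highest N value.
If both L and N are present, remove any row where N is higher than L.
If both L and N are present, keep the row with the highest N value that is below the lowest L value (in addition to the row with the lowest L value).
Keep all B values.
Sample data: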
dat <- data.frame(group = c("AB","AB","AB","AB","BC","BC","B","B","AD","AD","AD","G"),
                  type = c("B","L","N","N","N","L","N","N","B","L","L","L"),
                  value = c(2,4,3,2,5,3,8,9,4,3,9,7))
Desired output:
desired_output <- data.frame(group = c("AB","AB","AB","BC","B","AD","AD","G"),
                             type = c("B","L","N","L","N","B","L","L"),
                             value = c(2,4,3,3,9,4,3,7))
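I'm looking for a dplyr/tidyr solution. I have tried filtering after pivot_wider, and using case_when inside filter, but I haven't come very close. I thought this would be simple, but applying logic across columns has me stumped.
This is along the lines of what I have in mind, but it does not produce the desired output (for example, the minimum for L is taken over all types within the group rather than within L only):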
df <- dat %>%
  group_by(group) %>%
  filter(type == "B" | type == "L" & value == min(value) | type == "N" & value == max(value))
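To make that problem concrete, here is a small sketch that puts the group-wide extremes used in the attempt next to the type-specific ones the rules ask for. The column names min_all, min_L, max_all and max_N are my own, chosen only for illustration.

```r
library(dplyr)

dat <- data.frame(
  group = c("AB","AB","AB","AB","BC","BC","B","B","AD","AD","AD","G"),
  type  = c("B","L","N","N","N","L","N","N","B","L","L","L"),
  value = c(2,4,3,2,5,3,8,9,4,3,9,7)
)

# min(value) / max(value) inside a grouped filter() look at every row of the
# group, whatever its type. Subsetting by type gives the intended extremes.
extremes <- dat %>%
  group_by(group) %>%
  summarise(
    min_all = min(value),
    min_L   = if (any(type == "L")) min(value[type == "L"]) else NA_real_,
    max_all = max(value),
    max_N   = if (any(type == "N")) max(value[type == "N"]) else NA_real_,
    .groups = "drop"
  )

print(extremes)
```

For group AB, the group-wide minimum is 2 (the B row) while the lowest L is 4, so the L row never satisfies value == min(value) and is dropped.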
You can try this:
### Packages
library(dplyr)
library(tidyr)
### Data
dat <- data.frame(group = c("AB","AB","AB","AB","BC","BC","B","B","AD","AD","AD","G"),
                  type = c("B","L","N","N","N","L","N","N","B","L","L","L"),
                  value = c(2,4,3,2,5,3,8,9,4,3,9,7))
### We add the number of L and N for each group
dat2 <- dat %>%
  group_by(group) %>%
  mutate(nb_L = sum(type == "L"),
         nb_N = sum(type == "N")) %>%
  ungroup()
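### We create 3 data frames that respect your conditions
# More than one L in the group: keep the row with the lowest L
keep_L <- dat2 %>% group_by(group) %>% filter(nb_L > 1 & type == "L") %>% slice_min(value, n = 1) %>% ungroup()
# More than one N in the group: keep the row with the highest N
keep_N <- dat2 %>% group_by(group) %>% filter(nb_N > 1 & type == "N") %>% slice_max(value, n = 1) %>% ungroup()
# All B rows, plus any L or N row that is the only one of its type in the group
keep_rest <- dat2 %>% group_by(group) %>% filter(type == "B" | (nb_L <= 1 & type == "L") | (nb_N <= 1 & type == "N")) %>% ungroup()
### We stack the data frames
dat2 <- bind_rows(keep_L, keep_N, keep_rest)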
### We add the value of L and N for each group,
### then remove the rows that fail the remaining criteria
dat2 <- dat2 %>%
  group_by(group) %>%
  mutate(val_L = ifelse(type == "L", value, NA_real_),
         val_N = ifelse(type == "N", value, NA_real_)) %>%
  # Copy the group's L and N values onto every row of the group
  fill(c(val_L, val_N), .direction = "downup") %>%
  # Groups without an L or an N get 0; .default needs dplyr >= 1.1.0
  mutate(across(c(val_L, val_N), ~ replace_na(.x, 0)),
         keep = case_when(nb_L > 0 & type == "N" & val_N > val_L ~ "remove",
                          .default = "keep")) %>%
  filter(keep == "keep") %>%
  select(group, type, value) %>%
  arrange(group, type) %>%
  ungroup()
Output:
# A tibble: 8 × 3
  group type  value
  <chr> <chr> <dbl>
1 AB    B         2
2 AB    L         4
3 AB    N         3
4 AD    B         4
5 AD    L         3
6 B     N         9
7 BC    L         3
8 G     L         7
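As a more compact alternative, here is a minimal sketch that expresses the rules in a single grouped filter. It differs from the steps above in one respect: N rows above the lowest L are excluded before the highest N is chosen, which is how I read the fourth rule. The helper columns min_L and best_N and the name result are my own. I also assume that an N equal to the lowest L is kept, mirroring the val_N > val_L test above.

```r
library(dplyr)

dat <- data.frame(
  group = c("AB","AB","AB","AB","BC","BC","B","B","AD","AD","AD","G"),
  type  = c("B","L","N","N","N","L","N","N","B","L","L","L"),
  value = c(2,4,3,2,5,3,8,9,4,3,9,7)
)

result <- dat %>%
  group_by(group) %>%
  mutate(
    # Lowest L in the group; Inf when the group has no L, so no N is excluded
    min_L = min(value[type == "L"], Inf),
    # Highest N not above min_L (assumption: ties with L are kept);
    # -Inf when no N qualifies, so no N row can match it
    best_N = max(value[type == "N" & value <= min_L], -Inf)
  ) %>%
  filter(
    type == "B" |
      (type == "L" & value == min_L) |
      (type == "N" & value == best_N)
  ) %>%
  ungroup() %>%
  select(group, type, value)

print(result)
```

On the sample data this keeps the same eight rows as desired_output, in the original row order. Rows that tie for the lowest L or the highest N are all kept, as with the default with_ties = TRUE in slice_min and slice_max.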