This is quite tricky. Say I have a first dataset, df:
sample id name
1 ID200,ID300,ID299 first
2 ID2,ID123 second
3 ID90 third
And a second dataset, df_1:
ids condition
ID200 y
ID300 n
ID299 n
ID2 y
ID123 y
ID90 n
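I need to filter the first dataset so that it keeps only the rows whose ID values all satisfy a condition in the second table, such as y. So the filter in this example should give: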
sample id name
2 ID2,ID123 second
I was thinking of using something like:
new_df = df %>%
  filter(grepl('ID', id), df_1$condition == 'y')
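But obviously I need something different. Can you give me some pointers?
Edit: as I said in the comments, what happens if the id column of my df is padded with other text?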
sample  id                            name
1       ID = ID200,ID300,ID299,abcd   first
2       ID = ID2,ID123, dfg           second
3       ID = ID90, text               third
Perhaps a bit inelegant, but this gives you the final condition status for each sample.
library(tidyverse)
df <- tibble(sample = c(1, 2, 3),
             id = c("ID200,ID300,ID299", "ID2,ID123", "ID90"),
             name = c("first", "second", "third"))
df_1 <- tibble(ids = c("ID200", "ID300", "ID299", "ID2", "ID123", "ID90"),
               condition = c("y", "n", "n", "y", "y", "n"))
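df2 <- df %>%
  mutate(ids = str_split(id, ",")) %>%   # list column: one vector of IDs per row
  unnest(ids) %>%                        # recent tidyr versions expect the column to be named
  inner_join(df_1, by = "ids") %>%
  group_by(sample) %>%
  # "n" sorts before "y", so min() is "y" only when every ID in the sample is "y"
  summarise(condition = min(condition))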
You can then join this back to the original data frame to filter.
filtered <- inner_join(df, df2, by = "sample") %>%
  filter(condition == "y")
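For the variant in the question's edit, where the id column carries extra text, the same split-and-join idea can be adapted by extracting the IDs with a pattern instead of splitting on commas. This is only a sketch: it assumes a real ID is always "ID" immediately followed by digits, and the names df_text and filtered_text are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Example data from the question's edit: the id column carries extra text
df_text <- tibble(sample = c(1, 2, 3),
                  id = c("ID = ID200,ID300,ID299,abcd",
                         "ID = ID2,ID123, dfg",
                         "ID = ID90, text"),
                  name = c("first", "second", "third"))
df_1 <- tibble(ids = c("ID200", "ID300", "ID299", "ID2", "ID123", "ID90"),
               condition = c("y", "n", "n", "y", "y", "n"))

filtered_text <- df_text %>%
  # Assumption: a real ID is "ID" immediately followed by digits,
  # so the leading "ID =" label and words like "abcd" are ignored
  mutate(ids = str_extract_all(id, "ID[0-9]+")) %>%
  unnest(ids) %>%
  left_join(df_1, by = "ids") %>%
  group_by(sample, id, name) %>%
  # Keep a sample only if every extracted ID has condition "y";
  # %in% treats an unmatched ID (NA condition) as not "y"
  summarise(keep = all(condition %in% "y"), .groups = "drop") %>%
  filter(keep) %>%
  select(-keep)
```

With the example data this should leave only sample 2, with its id text unchanged. If your IDs follow a different shape, the pattern would need adjusting.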
I would first tidy df so that id contains one observation per row:
library(tidyr)
library(dplyr)
df %>%
  separate_rows(id)
sample id name
1 1 ID200 first
2 1 ID300 first
3 1 ID299 first
4 2 ID2 second
5 2 ID123 second
6 3 ID90 third
The same operation, followed by a join with df_1:
df %>%
  separate_rows(id) %>%
  left_join(df_1, by = c("id" = "ids"))
sample id name condition
1 1 ID200 first y
2 1 ID300 first n
3 1 ID299 first n
4 2 ID2 second y
5 2 ID123 second y
6 3 ID90 third n
Now you can group by sample and keep only the groups where the only condition is "y":
new_df <- df %>%
  separate_rows(id) %>%
  left_join(df_1, by = c("id" = "ids")) %>%
  group_by(sample) %>%
  filter(condition == "y",
         n_distinct(condition) == 1) %>%
  ungroup()
Result:
sample id name condition
<int> <chr> <chr> <chr>
1 2 ID2 second y
2 2 ID123 second y
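One edge case worth checking is an ID that has no entry in df_1. After the left join its condition is NA, and since n_distinct() counts NA as a distinct value by default, a sample mixing "y" and NA is dropped by the filter above. The sketch below illustrates this with made-up data; ID999 is a hypothetical ID that does not appear in the lookup table.

```r
library(dplyr)
library(tidyr)

# Made-up data: ID999 is hypothetical and has no row in the lookup table
df_na <- tibble(sample = c(1, 2),
                id = c("ID2,ID999", "ID2,ID123"),
                name = c("first", "second"))
lookup <- tibble(ids = c("ID2", "ID123"),
                 condition = c("y", "y"))

res_na <- df_na %>%
  separate_rows(id) %>%
  left_join(lookup, by = c("id" = "ids")) %>%   # ID999 gets condition NA
  group_by(sample) %>%
  # Sample 1 has conditions "y" and NA: two distinct values, so it is dropped
  filter(condition == "y",
         n_distinct(condition) == 1) %>%
  ungroup()
```

Here only the two rows of sample 2 should remain. If unmatched IDs should be ignored rather than disqualify a sample, an inner join would be one way to get that behaviour.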
If you really want to convert back to the original format, with comma-separated IDs in one column:
library(purrr)
new_df %>%
  nest(data = id) %>%   # recent tidyr versions expect the nested column to be named
  mutate(newid = map_chr(data, ~paste(.$id, collapse = ","))) %>%
  select(sample, id = newid, name)
sample id name
<int> <chr> <chr>
1 2 ID2,ID123 second
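An alternative that avoids rebuilding the comma-separated column is to use the long table only to find the samples to reject, then anti-join them against the untouched df. This is a sketch rather than part of the original answer: the names long, bad_samples and kept are made up, and it assumes an ID missing from df_1 should disqualify its sample.

```r
library(dplyr)
library(tidyr)

df <- tibble(sample = c(1, 2, 3),
             id = c("ID200,ID300,ID299", "ID2,ID123", "ID90"),
             name = c("first", "second", "third"))
df_1 <- tibble(ids = c("ID200", "ID300", "ID299", "ID2", "ID123", "ID90"),
               condition = c("y", "n", "n", "y", "y", "n"))

# One row per (sample, ID) with its condition attached
long <- df %>%
  separate_rows(id) %>%
  left_join(df_1, by = c("id" = "ids"))

# Samples with at least one ID that is not "y" (or is unmatched)
bad_samples <- long %>%
  filter(is.na(condition) | condition != "y") %>%
  distinct(sample)

# Keep the original rows, in their original format, for all other samples
kept <- anti_join(df, bad_samples, by = "sample")
```

Because kept is taken straight from df, the id column stays in its original comma-separated form.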