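I am running some comparisons with UpSetR, and I would like to save the lists of elements that fall into each intersection. Is this possible? I can't find it anywhere...
Doing this by hand would be very tedious (there are many lists), and since the intersections are computed anyway, it is frustrating not to be able to save them.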
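Here is my approach for extracting the different intersections and the list of elements in each of them.
The main idea is to paste together all the 0s and 1s of the binary table to create a unique identifier for each intersection, and then use dplyr::group_by to extract the information.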
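library(UpSetR)  # upset()
library(dplyr)   # %>%, group_by(), summarise(), arrange(), pull()
library(tidyr)   # unite()

# Binary membership table: one row per entry, one 0/1 column per set
data <- data.frame(
  entry = paste0("Entry.", 1:10),
  "A" = c(0,0,0,0,1,0,1,1,0,0),
  "B" = c(1,0,0,0,1,1,1,1,1,0),
  "C" = c(1,1,1,1,0,0,1,0,1,1)
)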
# NOT REQUIRED. Only to confirm that upset works with these data
upset(data)
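Then you can identify the intersections by pasting all the binary columns together. I used the unite convenience function (from tidyr) for this:
Note: you may need to change this depending on whether your data has row names or a column holding the names.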
data_with_intersection <- data %>%
unite(col = "intersection", -c("entry"), sep = "")
From here, you can easily compute the size of each intersection:
# Table of intersections and the number of entries
data_with_intersection %>%
group_by(intersection) %>%
summarise(n = n()) %>%
arrange(desc(n))
Or even extract the list of entries/elements in each intersection:
# List of intersections and their entries
data_with_intersection %>%
group_by(intersection) %>%
summarise(list = list(entry)) %>%
mutate(list = setNames(list, intersection)) %>%
pull(list)
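Since the question is about saving the lists, here is a minimal base R sketch of the same paste-the-columns idea, followed by writing the result to disk. The names set_cols, entries, out and out_file are only illustrative, and the file location (a temporary directory) is an assumption; point it wherever you need.

```r
data <- data.frame(
  entry = paste0("Entry.", 1:10),
  A = c(0,0,0,0,1,0,1,1,0,0),
  B = c(1,0,0,0,1,1,1,1,1,0),
  C = c(1,1,1,1,0,0,1,0,1,1)
)
set_cols <- c("A", "B", "C")  # the 0/1 columns that define the sets

# Paste the 0/1 columns row-wise to build the intersection identifier
id <- do.call(paste0, unname(data[set_cols]))

# One character vector of entries per intersection, named by identifier
entries <- split(data$entry, id)

# Flatten to a two-column table and save it (hypothetical file name)
out <- data.frame(
  intersection = rep(names(entries), lengths(entries)),
  entry = unlist(entries, use.names = FALSE)
)
out_file <- file.path(tempdir(), "intersections.csv")
write.csv(out, out_file, row.names = FALSE)
```

Here entries[["110"]] holds the entries that are in A and B but not in C, and so on for the other identifiers.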
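Here is my solution. It is biology-related, but it should be easy to adapt to other fields. I start from a list of vectors. In my case, these are lists of genes (character vectors) belonging to different signatures (the different sets).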
str(list_filter)
List of 9
$ CellAge_Induces : chr [1:153] "AAK1" "ABI3" "ADCK5" "AGT" ...
$ CellAge_Inhibits : chr [1:121] "ACLY" "AKR1B1" "ASPH" "ATF7IP" ...
$ CLASSICAL_SASP : chr [1:38] "BGN" "CCL2" "CCL20" "COL1A1" ...
$ FRIDMAN_SENESCENCE_UP : chr [1:77] "ALDH1A3" "CCND1" "CD44" "CDKN1A" ...
$ ISM_SCORE : chr [1:128] "HSH2D" "OTOF" "TRIM69" "PSME1" ...
$ MOSERLE_IFNA_RESPONSE : chr [1:31] "CD274" "CMPK2" "CXCL10" "DDX58" ...
$ REACTOME_SENESCENCE_SASP: chr [1:110] "ANAPC1" "ANAPC10" "ANAPC11" "ANAPC15" ...
$ SAEPHIA_CURATED_SASP : chr [1:38] "IL1A" "IL1B" "CXCL10" "CXCL1" ...
$ senmayo : chr [1:125] "ACVR1B" "ANG" "ANGPT1" "ANGPTL4" ...
From this list I generate two tables. The first one holds the unique gene names:
df2 <- data.frame(gene=unique(unlist(list_filter)))
head(df2)
gene
1 AAK1
2 ABI3
3 ADCK5
4 AGT
5 AKT1
6 ALOX15B
dim(df2)
[1] 671 1
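The other one is just a "data frame" version of the list. It contains every gene of every signature together with the name of the signature (set) it comes from.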
df1 <- lapply(list_filter,function(x){
data.frame(gene = x)
}) %>%
bind_rows(.id = "path")
head(df1)
path gene
1 CellAge_Induces AAK1
2 CellAge_Induces ABI3
3 CellAge_Induces ADCK5
4 CellAge_Induces AGT
5 CellAge_Induces AKT1
6 CellAge_Induces ALOX15B
dim(df1)
[1] 821 2
Now I iterate over each unique gene name, look it up, and save the identity of the signatures that contain it in a single column.
df_int <- lapply(df2$gene,function(x){
# pull the name of the intersections
intersection <- df1 %>%
dplyr::filter(gene==x) %>%
arrange(path) %>%
pull("path") %>%
paste0(collapse = "|")
# build the dataframe
data.frame(gene = x,int = intersection)
}) %>%
bind_rows()
head(df_int,n=20)
gene int
1 AAK1 CellAge_Induces
2 ABI3 CellAge_Induces
3 ADCK5 CellAge_Induces
4 AGT CellAge_Induces
5 AKT1 CellAge_Induces
6 ALOX15B CellAge_Induces
7 AR CellAge_Induces
8 ARPC1B CellAge_Induces
9 ASF1A CellAge_Induces
10 AXL CellAge_Induces|senmayo
11 BHLHE40 CellAge_Induces
12 BLK CellAge_Induces
13 BRAF CellAge_Induces
14 BRD7 CellAge_Induces
15 CAV1 CellAge_Induces
16 CCND1 CellAge_Induces|FRIDMAN_SENESCENCE_UP
17 CDK18 CellAge_Induces
18 CDKN1A CellAge_Induces|FRIDMAN_SENESCENCE_UP|REACTOME_SENESCENCE_SASP
19 CDKN1C CellAge_Induces|FRIDMAN_SENESCENCE_UP
20 CDKN1B CellAge_Induces|REACTOME_SENESCENCE_SASP
dim(df_int)
[1] 671 2
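If the per-gene loop becomes slow on larger lists, the same table can probably be built in one grouped pass. This is only a sketch in base R on a made-up toy list (toy_list and its gene names are hypothetical, not the signatures above); it sorts the signature names before pasting, as the loop does with arrange(path), but returns genes in alphabetical order rather than order of first appearance.

```r
# Hypothetical toy input: a named list of character vectors (sets)
toy_list <- list(
  sig_a = c("G1", "G2", "G3"),
  sig_b = c("G2", "G4"),
  sig_c = c("G2", "G3", "G5")
)

# Long format: one row per (gene, signature) pair
long <- stack(toy_list)  # columns: values (gene), ind (signature)

# For each gene, paste the sorted names of the signatures containing it
int <- tapply(as.character(long$ind), long$values,
              function(p) paste(sort(p), collapse = "|"))

df_int_toy <- data.frame(gene = names(int), int = as.vector(int))
df_int_toy

# Size of each intersection
table(df_int_toy$int)
```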
The data frame can then be summarised and compared with the output of the upset call:
df_int %>%
group_by(int) %>%
summarise(n=n()) %>%
arrange(desc(n))
# A tibble: 47 × 2
int n
<chr> <int>
1 CellAge_Induces 126
2 CellAge_Inhibits 110
3 REACTOME_SENESCENCE_SASP 95
4 ISM_SCORE 93
5 senmayo 77
6 FRIDMAN_SENESCENCE_UP 44
7 ISM_SCORE|MOSERLE_IFNA_RESPONSE 27
8 CLASSICAL_SASP|senmayo 12
9 CLASSICAL_SASP 8
10 SAEPHIA_CURATED_SASP 8
# … with 37 more rows
# ℹ Use `print(n = ...)` to see more rows
upset(fromList(list_filter),nsets = 10)
There is no ready-made UpSetR function for this (yet). However, the information can be extracted:
library(UpSetR)
# Example input as list, expected output is 1 and 5:
listInput <- list(one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
two = c(1, 2, 4, 5, 10),
three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
When its result is assigned, upset returns a value that also includes the data:
x <- upset(fromList(listInput))
x$New_data
# one two three
# 1 1 1 1
# 2 1 1 0
# 3 1 0 0
# 4 1 1 1
# 5 1 0 1
# 6 1 0 1
# 7 1 0 0
# 8 1 0 1
# 9 1 0 1
# 10 0 1 0
# 11 0 1 1
# 12 0 0 1
# 13 0 0 1
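From this we can see that rows 1 and 4 are the ones present in all three sets. The order of the items is defined by the order in which they first appear in the list, see: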
x1 <- unlist(listInput, use.names = FALSE)
x1 <- x1[ !duplicated(x1) ]
x1
# [1] 1 2 3 5 7 8 11 12 13 4 10 6 9
Now we know which list items the row numbers of "New_data" refer to. So, since we have 3 columns, we filter for the rows whose sum is 3:
x1[ rowSums(x$New_data) == 3 ]
# [1] 1 5
Or we can simply use Reduce:
Reduce(intersect, listInput)
# [1] 1 5
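Both snippets above only give the elements shared by all sets. To get the elements of every intersection shown in the plot, one option is to label each element by the exact combination of sets that contains it and split on that label. The following is a sketch in base R only; it rebuilds the membership matrix itself rather than relying on x$New_data, and the names elements, membership, labels and intersections are just illustrative.

```r
listInput <- list(one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
                  two = c(1, 2, 4, 5, 10),
                  three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))

# All distinct elements, in order of first appearance
elements <- unique(unlist(listInput, use.names = FALSE))

# Logical membership matrix: one row per element, one column per set
membership <- sapply(listInput, function(s) elements %in% s)

# Label each element with the names of exactly the sets that contain it
labels <- apply(membership, 1,
                function(r) paste(colnames(membership)[r], collapse = "&"))

# One vector of elements per (exclusive) intersection
intersections <- split(elements, labels)

intersections[["one&two&three"]]
# [1] 1 5
```

Each entry of intersections is exclusive: for example, intersections[["one&three"]] contains the elements that are in one and three but not in two.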
I like @Javier's answer (I actually used it in my code, so thanks!). I needed to do some further work on this, and I found a quick and dirty way to do the same thing:
data <- data.frame(
entry = paste0("Entry.", 1:10),
"A" = c(0,0,0,0,1,0,1,1,0,0),
"B" = c(1,0,0,0,1,1,1,1,1,0),
"C" = c(1,1,1,1,0,0,1,0,1,1))
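# Count the entries for each observed combination of A, B and C
# (the column is named "entry", in lower case)
aggregate(data$entry, by = list(A = data$A, B = data$B, C = data$C), FUN = length)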
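Counting with aggregate gives the size of each intersection. If the element names themselves are wanted, the same function can presumably collect them by swapping the counting function for paste. A small sketch, using the formula interface and a ";" separator chosen arbitrarily (res is just an illustrative name):

```r
data <- data.frame(
  entry = paste0("Entry.", 1:10),
  A = c(0,0,0,0,1,0,1,1,0,0),
  B = c(1,0,0,0,1,1,1,1,1,0),
  C = c(1,1,1,1,0,0,1,0,1,1)
)

# One row per observed combination of A, B and C;
# the entry column holds the names of its members, separated by ";"
res <- aggregate(entry ~ A + B + C, data = data, FUN = paste, collapse = ";")
res
```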