My data looks like this:
library(data.table)

example <- data.table(
  person_id = c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"),
  year = c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016),
  group_id = c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
)
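I need to create a column that shows, for each person, the other people who share a group with them in a given year. What I have works, but it takes a very long time; my dataset has billions of rows.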
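library(dplyr)

res <- example %>%
  group_by(year, group_id) %>%
  # store the full membership of each (year, group_id) on every row
  mutate(connections = list(person_id)) %>%
  group_by(year, person_id) %>%
  # pool all groups of a person in a year and drop the person themselves
  summarise(connections = toString(setdiff(unlist(connections), person_id)))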
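I would like to use data.table's grouping because I have read that it is faster, but it does not give me the same result. So my question has two parts. Why don't these accomplish the same thing, and what do I need to change in the code below so that it produces the same result as the code above? Or is there a more efficient way to solve this problem?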
example <- example[, connections := list(person_id), by=list(year, group_id)]
res <- example[, .(connections = toString(setdiff(unlist(connections), person_id))), by=list(year, person_id)]
Note: my R version is 3.4.3, and I do not have tidyverse installed (I cannot install it).
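On the "why don't these do the same thing" part, one likely cause is how `:=` reads its right-hand side. A top-level `list()` there is treated as "one element per column being assigned", so `connections := list(person_id)` copies `person_id` into a plain character column rather than building a list column, and the later `setdiff()` then removes everything. The sketch below illustrates this on the example data and shows one possible fix, wrapping the value in a second `list()`. The names `flat` and `nested` are only illustrative.

```r
library(data.table)

example <- data.table(
  person_id = c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"),
  year = c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016),
  group_id = c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
)

# Original attempt: the outer list() is consumed by `:=`, so the new
# column is just a copy of person_id (a character vector).
flat <- copy(example)[, connections := list(person_id), by = .(year, group_id)]
is.list(flat$connections)    # FALSE

# One extra list() makes every row hold its group's full person_id vector.
nested <- copy(example)[, connections := list(list(person_id)), by = .(year, group_id)]
is.list(nested$connections)  # TRUE

# The second step from the question now works unchanged.
res <- nested[, .(connections = toString(setdiff(unlist(connections), person_id))),
              by = .(year, person_id)]
```

`copy()` is used only so that the two variants do not modify `example` by reference.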
A data.table equivalent is:
library(data.table)
example[, .(person_id, connections = list(person_id)),
        by = .(year, group_id)][
  , .(connections = toString(setdiff(unlist(connections), person_id))),
  by = .(year, person_id)]
     year person_id connections
    <num>    <char>      <char>
 1:  2015        A1  A2, B1, C2
 2:  2015        A2      A1, B1
 3:  2015        B1      A1, A2
 4:  2016        A1      A2, C2
 5:  2016        A2          A1
 6:  2015        C2          A1
 7:  2016        C2          A1
 8:  2015        A3          C1
 9:  2015        C1          A3
10:  2016        A3          C1
11:  2016        C1          A3
12:  2016        B1
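As for a possibly more efficient route, one alternative worth benchmarking is a self-join on `year` and `group_id`, which avoids the intermediate list column. This is only a sketch: `members`, `pairs`, `other_id` and `res2` are hypothetical names, and I have not timed it against the approach above. The join materializes one row per pair of people in a group, so very large groups could make it slower or exhaust memory.

```r
library(data.table)

example <- data.table(
  person_id = c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"),
  year = c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016),
  group_id = c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
)

# Copy of the membership table with person_id renamed, so the join
# result has distinct columns for "this person" and "the other person".
members <- example[, .(year, group_id, other_id = person_id)]

# For every row of example, pull in everyone in the same year and group
# (including the person themselves). allow.cartesian is needed because
# the result has more rows than the two inputs combined.
pairs <- members[example, on = .(year, group_id), allow.cartesian = TRUE]

# Collapse per person and year; setdiff() drops the person themselves
# and any duplicates arising from membership in several groups.
res2 <- pairs[, .(connections = toString(setdiff(other_id, person_id))),
              by = .(year, person_id)]
setorder(res2, year, person_id)
```

Because each person is also matched to themselves, people with no connections (such as B1 in 2016) are kept with an empty string, as in the output above. The row order differs, hence the `setorder()` call.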