When do data.table's := and dplyr's mutate differ?


My data looks like this:

library(data.table)

example <- data.table(
person_id = c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"),
year = c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016),
group_id = c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
)

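I need to create a column that shows, for each person, the other people who share a group with them in a given year. What I have works, but it takes a very long time. My dataset has billions of rows.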

res <- example %>%
  group_by(year, group_id) %>%
  mutate(connections = list(person_id)) %>%
  group_by(year, person_id) %>%
  summarise(connections = toString(setdiff(unlist(connections), person_id)))

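I'd like to use data.table's grouping because I've read it is faster, but it doesn't give me the same result. So my question has two parts. Why don't these do the same thing? And what do I need to change in the code below so that it gives the same result as the code above? Or is there a more efficient way to solve this?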

example <- example[, connections := list(person_id), by=list(year, group_id)]
res <- example[, .(connections = toString(setdiff(unlist(connections), person_id))), by=list(year, person_id)]

Note: my R version is 3.4.3, and I don't have tidyverse installed (I can't install it).

r data.table
1 answer

A data.table equivalent is:

library(data.table)

example[, .(person_id, connections = list(person_id)), 
    by = .(year, group_id)][
  , .(connections = toString(setdiff(unlist(connections), person_id))), 
    by = .(year, person_id)]
     year person_id connections
    <num>    <char>      <char>
 1:  2015        A1  A2, B1, C2
 2:  2015        A2      A1, B1
 3:  2015        B1      A1, A2
 4:  2016        A1      A2, C2
 5:  2016        A2          A1
 6:  2015        C2          A1
 7:  2016        C2          A1
 8:  2015        A3          C1
 9:  2015        C1          A3
10:  2016        A3          C1
11:  2016        C1          A3
12:  2016        B1
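
As for why the `:=` attempt in the question behaves differently, a likely explanation is how data.table reads the right-hand side of `:=`: an outer `list()` is treated as the container of new column values (one element per column being assigned), not as a request for a list column. So `connections := list(person_id)` simply copies `person_id` into a plain character column, and the later `setdiff()` then removes everything. Wrapping the vector once more seems to be enough to get a real list column. Below is a minimal sketch using the question's data; the names `plain` and `nested` are only illustrative and not part of the original code.

```r
library(data.table)

example <- data.table(
  person_id = c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3",
                "B1", "B1", "C1", "C1", "C2", "C2"),
  year = rep(c(2015, 2016), times = 7),
  group_id = c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe",
               "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
)

# Outer list() is read as "the set of columns to assign", so this
# produces an ordinary character column identical to person_id
plain <- copy(example)[, connections := list(person_id),
                       by = list(year, group_id)]
class(plain$connections)

# The extra list() stores each group's full member vector in every row
nested <- copy(example)[, connections := list(list(person_id)),
                        by = list(year, group_id)]
class(nested$connections)

# Second step from the question, unchanged, now applied to a list column
res <- nested[, .(connections = toString(setdiff(unlist(connections), person_id))),
              by = list(year, person_id)]
```

With the nested version, `res` should hold the same year/person/connections combinations as the output above, although the rows may come out in a different order because the grouping runs over the original row order.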