I have a table of people and the groups they belong to. It has the following format:
person_id <- c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2")
year <- c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016)
group_id <- c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
example <- data.frame(person_id, group_id, year)
I want to create a column that shows, for each person, which other people they share a group with in a given year. This is what I have so far:
example <- within(example, {
  # For every group/year combination, paste together all of its members
  connections <- ave(person_id, group_id, year,
                     FUN = function(x) paste(x, collapse = ', '))
})
library(dplyr)  # provides %>%, group_by() and summarise()
res <- example %>%
  group_by(person_id, year) %>%
  summarise(joint = paste(connections, collapse = ', '))
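This is close to what I need, but I don't want each row to include the person's own person_id. For example, my code creates a new column whose first value is "A1, A2, B1, A1, C2". I would like this value to be "A2, B1, C2". In my example, B1 does not share a group with anyone in 2016. My code produces the value "B1" for that row, but I would like this cell to be an empty string. How can I achieve this?
Additionally, the data I am working with is very large, around 1 billion rows. Grouping twice seems inefficient, but I am not sure whether what I want can be done without it. Is there a better way to approach this?
Note: I cannot use tidyr.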
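You could first aggregate on group_id + year, using c or list instead of paste. Then merge with the original data and use setdiff with person_id to exclude the person themselves. We can use replace to put NA in the empty cells.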
> aggregate(cbind(pid=person_id) ~ group_id + year, example, c) |>
+ merge(example) |>
+ within({
+ pid <- Vectorize(setdiff)(pid, person_id)
+ pid <- replace(pid, !lengths(pid), NA_character_)
+ })
group_id year pid person_id
1 abc 2015 A2, B1 A1
2 abc 2015 A1, B1 A2
3 abc 2015 A1, A2 B1
4 abc 2016 A2 A1
5 abc 2016 A1 A2
6 cdz 2015 C2 A1
7 cdz 2015 A1 C2
8 cdz 2016 C2 A1
9 cdz 2016 A1 C2
10 fjx 2016 NA B1
11 ghe 2015 C1 A3
12 ghe 2015 A3 C1
13 ghe 2016 C1 A3
14 ghe 2016 A3 C1
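The result above has one row per person, group and year, whereas the question asks for one row per person and year with an empty string when nobody else shares a group. A minimal base R sketch of that variant follows. It swaps the aggregate/setdiff step for a self-merge, which is a different technique from the one shown above, and the names pairs, shared, person_years and joint are chosen here for illustration only. The self-merge creates one row per ordered pair of people within each group and year, so memory grows with the square of the group size; it has not been tried at the scale mentioned in the question.

```r
# Rebuild the example data from the question.
example <- data.frame(
  person_id = c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3",
                "B1", "B1", "C1", "C1", "C2", "C2"),
  group_id  = c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe",
                "abc", "fjx", "ghe", "ghe", "cdz", "cdz"),
  year      = c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016,
                2015, 2016, 2015, 2016, 2015, 2016),
  stringsAsFactors = FALSE
)

# Self-merge on group and year: every ordered pair of people
# that appear in the same group in the same year.
pairs <- merge(example, example, by = c("group_id", "year"),
               suffixes = c("", "_other"))

# Drop the pairs in which a person is matched with themselves.
pairs <- pairs[pairs$person_id != pairs$person_id_other, ]

# Collapse to one string of distinct co-members per person and year.
shared <- aggregate(pairs["person_id_other"],
                    by = pairs[c("person_id", "year")],
                    FUN = function(x) toString(sort(unique(x))))
names(shared)[names(shared) == "person_id_other"] <- "joint"

# Bring back person/year combinations with no co-members
# and show them as empty strings instead of NA.
person_years <- unique(example[c("person_id", "year")])
res <- merge(person_years, shared, by = c("person_id", "year"), all.x = TRUE)
res$joint[is.na(res$joint)] <- ""
res <- res[order(res$year, res$person_id), ]
res
```

With the example data, the row for A1 in 2015 should read "A2, B1, C2" and the row for B1 in 2016 should be an empty string.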
Data:
> dput(example)
structure(list(person_id = c("A1", "A1", "A1", "A1", "A2", "A2",
"A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"), group_id = c("abc",
"abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx",
"ghe", "ghe", "cdz", "cdz"), year = c(2015, 2016, 2015, 2016,
2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016)), class = "data.frame", row.names = c(NA,
-14L))
Using tidyverse:
example %>%
group_split(year, .keep = FALSE) %>%
map(~igraph::graph_from_data_frame(.,directed = FALSE) %>%
igraph::components()%>%
igraph::membership() %>%
enframe(name = 'person_id') %>%
right_join(example, 'person_id') %>%
mutate(value = list(person_id), .by = value) %>%
summarise(value = toString(setdiff(value[[1]], person_id)),
year = year[1], .by = person_id)) %>%
bind_rows()
# A tibble: 12 × 3
person_id value year
<chr> <chr> <dbl>
1 A1 "A2, B1, C2" 2015
2 A2 "A1, B1, C2" 2015
3 A3 "C1" 2015
4 B1 "A1, A2, C2" 2015
5 C1 "A3" 2015
6 C2 "A1, A2, B1" 2015
7 A1 "A2, C2" 2015
8 A2 "A1, C2" 2015
9 A3 "C1" 2015
10 B1 "" 2015
11 C1 "A3" 2015
12 C2 "A1, A2" 2015