How can I group and concatenate strings but skip one member of the group?


I have a table of people and the groups they belong to. It is formatted as follows:

person_id <- c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2")
year <- c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016)
group_id <- c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
example <- data.frame(person_id, group_id, year)

I want to create a column that shows, for each person, which other people they share a group with in a given year. This is what I have so far:

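example <- within(example, {
  # For each (group_id, year), paste together everyone in that group
  connections <- ave(person_id, group_id, year,
                     FUN = function(x) paste(x, collapse = ', '))
})

library(dplyr)  # provides %>%, group_by() and summarise()
res <- example %>%
  group_by(person_id, year) %>%
  summarise(joint = paste(connections, collapse = ', '))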

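This is close to what I need, but I don't want each row to include the person's own person_id. For example, my code creates a new column whose first value is "A1, A2, B1, A1, C2". I want this value to be "A2, B1, C2". In my example, B1 shares a group with no one in 2016. My code produces the value "B1" for that row, but I want this cell to be an empty string. How can I achieve this?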

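In addition, the data I am working with is very large, about 1 billion rows. Grouping twice seems inefficient, but I am not sure whether what I want can be done without it. Is there a better way to approach this?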

Note: I cannot use tidyr.

r dataframe aggregate
2 Answers
0 votes

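You could first `aggregate` on `group_id + year`, using `c` (or `list`) instead of `paste`. Then `merge` the result with the original data and `setdiff` the `person_id` to exclude it. We can `replace` empty cells with `NA`.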

> aggregate(cbind(pid=person_id) ~ group_id + year, example, c) |> 
+   merge(example) |> 
+   within({
+     pid <- Vectorize(setdiff)(pid, person_id)
+     pid <- replace(pid, !lengths(pid), NA_character_)
+   })
   group_id year    pid person_id
1       abc 2015 A2, B1        A1
2       abc 2015 A1, B1        A2
3       abc 2015 A1, A2        B1
4       abc 2016     A2        A1
5       abc 2016     A1        A2
6       cdz 2015     C2        A1
7       cdz 2015     A1        C2
8       cdz 2016     C2        A1
9       cdz 2016     A1        C2
10      fjx 2016     NA        B1
11      ghe 2015     C1        A3
12      ghe 2015     A3        C1
13      ghe 2016     C1        A3
14      ghe 2016     A3        C1
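
The result above has one row per person, group and year. If one row per person and year is needed, as in the question, the following base R sketch may help. It is not the answerer's code: it swaps in a self-merge on `group_id` and `year`, and the names `pairs`, `conn`, `person_id_other` and `connections` are made up for illustration. The self-merge produces one row per pair of co-members, so its size grows with the square of each group's size, which may matter at the scale mentioned in the question.

```r
# Sample data from the question
person_id <- c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3",
               "B1", "B1", "C1", "C1", "C2", "C2")
year <- c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016,
          2015, 2016, 2015, 2016, 2015, 2016)
group_id <- c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe",
              "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
example <- data.frame(person_id, group_id, year)

# Self-merge on (group_id, year): one row per pair of people sharing a group
# in a year. "person_id_other" is a hypothetical column name.
pairs <- merge(example, example, by = c("group_id", "year"),
               suffixes = c("", "_other"))

# Drop the rows that pair a person with themselves
pairs <- pairs[pairs$person_id != pairs$person_id_other, ]

# Collapse the distinct co-members into one string per person and year
conn <- aggregate(person_id_other ~ person_id + year, data = pairs,
                  FUN = function(x) toString(sort(unique(x))))
names(conn)[names(conn) == "person_id_other"] <- "connections"

# Left-join onto every (person_id, year) so that people without any
# co-members are kept, then turn their NA into an empty string
res <- merge(unique(example[c("person_id", "year")]), conn, all.x = TRUE)
res$connections[is.na(res$connections)] <- ""
res <- res[order(res$year, res$person_id), ]
res
```

With the sample data, A1 in 2015 should come out as "A2, B1, C2" and B1 in 2016 as an empty string.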

Data:

> dput(example)
structure(list(person_id = c("A1", "A1", "A1", "A1", "A2", "A2", 
"A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"), group_id = c("abc", 
"abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", 
"ghe", "ghe", "cdz", "cdz"), year = c(2015, 2016, 2015, 2016, 
2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016)), class = "data.frame", row.names = c(NA, 
-14L))

0 votes

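Using the `tidyverse`, with `igraph` to find the connected components within each year. Because it works on connected components, people who are linked only indirectly (for example A2 and C2 in 2015, both through A1) are listed as well: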

library(tidyverse)

example %>%
  group_split(year) %>%   # keep `year` so each piece carries its own year
  map(~igraph::graph_from_data_frame(., directed = FALSE) %>%
    igraph::components() %>%
    igraph::membership() %>%
    enframe(name = 'person_id') %>%
    right_join(.x, 'person_id') %>%   # join back to this year's rows only
    mutate(value = list(person_id), .by = value) %>%
    summarise(value = toString(setdiff(value[[1]], person_id)),
              year = year[1], .by = person_id)) %>%
  bind_rows()
# A tibble: 12 × 3
   person_id value         year
   <chr>     <chr>        <dbl>
 1 A1        "A2, B1, C2"  2015
 2 A2        "A1, B1, C2"  2015
 3 A3        "C1"          2015
 4 B1        "A1, A2, C2"  2015
 5 C1        "A3"          2015
 6 C2        "A1, A2, B1"  2015
 7 A1        "A2, C2"      2016
 8 A2        "A1, C2"      2016
 9 A3        "C1"          2016
10 B1        ""            2016
11 C1        "A3"          2016
12 C2        "A1, A2"      2016