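I have a data.frame in which each row is a group of people, with between 2 and 8 people per row. I want to create a graph that shows every person as a vertex, with an edge wherever two people appear in the same row of my original data.frame. The graph itself is not the problem, only creating the edge list. My question seems very close to "Creating an edgelist from Patent data in R", but since this is my first day with R, I could not bridge the gap from there.
My data looks like this: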
name_1 name_2 name_3 name_4 name_5
jan tim
tom tim bernd
stefen tom tim jan bernd
marcel bernd
Output of dput(data):
structure(list(name_1 = structure(c(1L, 4L, 3L, 2L), .Label = c("jan",
"marcel", "stefen", "tom"), class = "factor"), name_2 = structure(c(2L,
2L, 3L, 1L), .Label = c("bernd", "tim", "tom"), class = "factor"),
name_3 = structure(c(1L, 2L, 3L, 1L), .Label = c("", "bernd",
"tim"), class = "factor"), name_4 = structure(c(1L, 1L, 2L,
1L), .Label = c("", "jan"), class = "factor"), name_5 = structure(c(1L,
1L, 2L, 1L), .Label = c("", "bernd"), class = "factor")), .Names = c("name_1",
"name_2", "name_3", "name_4", "name_5"), class = "data.frame", row.names = c(NA,
-4L))
As the desired output, I would like something like this:
jan tim
tom tim
tom bernd
tim bernd
stefen tom
stefen tim
stefen jan
stefen bernd
tom tim
tom jan
tom bernd
tim jan
tim bernd
marcel bernd
Duplicates are fine with me.
Here we use the combinations function from the gtools package. df1 is the given data frame.
library(gtools)
# convert factor columns to character
df1 <- sapply(df1, as.character)
# collapse the non-blank names of each row into one comma-separated string
df1 <- apply(df1, 1, function(x) toString(x[x != '']))
# build all pairs of names for each row
answer <- lapply(seq_along(df1), function(x) {
  # toString() joins with ', ', so split on the same separator
  val <- unlist(strsplit(df1[x], split = ', ', fixed = TRUE))
  combinations(n = length(val), r = 2, v = val, repeats.allowed = FALSE)
})
# bind the per-row matrices into a single edge list
df2 <- do.call('rbind', answer)
print(df2)
[1,] "jan" "tim"
[2,] "bernd" "tim"
[3,] "bernd" "tom"
[4,] "tim" "tom"
[5,] "bernd" "jan"
[6,] "bernd" "stefen"
[7,] "bernd" "tim"
[8,] "bernd" "tom"
[9,] "jan" "stefen"
[10,] "jan" "tim"
[11,] "jan" "tom"
[12,] "stefen" "tim"
[13,] "stefen" "tom"
[14,] "tim" "tom"
[15,] "bernd" "marcel"
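The join-then-split step above is sensitive to the separator, because toString() joins elements with a comma followed by a space. A minimal sketch of the difference, using a made-up two-name vector rather than the real data:

```r
# Hypothetical stand-in for the non-blank names of one row
row_names <- c("jan", "tim")

# toString() joins with a comma AND a space
joined <- toString(row_names)

# Splitting on the comma alone leaves the space attached to the second name
split_comma <- strsplit(joined, split = ",", fixed = TRUE)[[1]]

# Splitting on the full separator recovers the original names
split_full <- strsplit(joined, split = ", ", fixed = TRUE)[[1]]

split_comma  # "jan" " tim"
split_full   # "jan" "tim"
```

Calling trimws() on the pieces would be another way to remove the stray space.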
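Here are a few options. First, in base R, you can use apply to iterate over the rows and combn to get the combinations. Because your data is in an awkward layout, it takes a fair amount of code to stitch the pieces together, e.g.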
df <- data.frame(name_1 = c("jan", "tom", "stefen", "marcel"),
name_2 = c("tim", "tim", "tom", "bernd"),
name_3 = c("", "bernd", "tim", ""),
name_4 = c("", "", "jan", ""),
name_5 = c("", "", "bernd", ""))
as.data.frame(
do.call(rbind,
apply(df, 1, function(x){
x <- x[x != ''];
t(combn(x, 2))
})),
stringsAsFactors = FALSE)
#> V1 V2
#> 1 jan tim
#> 2 tom tim
#> 3 tom bernd
#> 4 tim bernd
#> 5 stefen tom
#> 6 stefen tim
#> 7 stefen jan
#> 8 stefen bernd
#> 9 tom tim
#> 10 tom jan
#> 11 tom bernd
#> 12 tim jan
#> 13 tim bernd
#> 14 jan bernd
#> 15 marcel bernd
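To isolate what combn contributes, here is a minimal sketch for a single row. The vector row_names is a made-up stand-in for one row of the data with the blanks already removed.

```r
# Hypothetical stand-in for one row of the data, blanks already dropped
row_names <- c("tom", "tim", "bernd")

# combn() returns one combination per column, so transpose the result
# to get one pair per row, which is the shape an edge list needs
pairs <- t(combn(row_names, 2))
pairs
```

This yields three pairs in the order tom/tim, tom/bernd, tim/bernd. Note that combn needs at least two elements, which holds here because every row has between 2 and 8 names.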
Technically, using Map is better than using apply (which coerces the data frame to a matrix), but it requires even more glue code:
as.data.frame(do.call(rbind,
do.call(Map,
c(function(...){
x <- c(...);
x <- x[x != ''];
t(combn(x, 2))
},
lapply(df, as.character)))),
stringsAsFactors = FALSE)
#> V1 V2
#> 1 jan tim
#> 2 tom tim
#> 3 tom bernd
#> 4 tim bernd
#> 5 stefen tom
#> 6 stefen tim
#> 7 stefen jan
#> 8 stefen bernd
#> 9 tom tim
#> 10 tom jan
#> 11 tom bernd
#> 12 tim jan
#> 13 tim bernd
#> 14 jan bernd
#> 15 marcel bernd
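The coercion mentioned above is harmless when every column holds names, but it is easy to see with mixed column types. A small sketch, using a hypothetical data frame named mixed that is not part of the question:

```r
# Hypothetical data frame with mixed column types (not from the question)
mixed <- data.frame(id = 1:2, name = c("jan", "tim"),
                    stringsAsFactors = FALSE)

# apply() converts the data frame with as.matrix() first, so every
# value in a row arrives as character
apply_classes <- apply(mixed, 1, function(x) class(x))

# Map() walks the columns in parallel and keeps each column's own type
map_classes <- Map(function(id, name) class(id), mixed$id, mixed$name)

unname(apply_classes)                    # "character" "character"
unlist(map_classes, use.names = FALSE)   # "integer" "integer"
```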
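Both of these approaches alternate between matrices and data frames, which is hard to keep track of. You could write an approach that avoids data frames altogether, but it would be longer.
A prettier option is to use the tidyverse. First, tidy the data into long form: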
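As a rough sketch of what a version without data frames might look like, the following assumes the names are already held in a character matrix, with "" marking an empty slot. The object names m, groups and edges are illustrative only.

```r
# Same names as in the question, held as a character matrix instead of a
# data frame (one row per group, "" marks an empty slot)
m <- rbind(c("jan",    "tim",   "",      "",    ""),
           c("tom",    "tim",   "bernd", "",    ""),
           c("stefen", "tom",   "tim",   "jan", "bernd"),
           c("marcel", "bernd", "",      "",    ""))

# Turn the matrix into a list of per-row vectors without blanks
groups <- lapply(seq_len(nrow(m)), function(i) m[i, m[i, ] != ""])

# Skip rows with fewer than two names: combn() cannot form a pair from them
groups <- Filter(function(g) length(g) >= 2, groups)

# One two-column matrix of pairs per row, stacked into a single edge list
edges <- do.call(rbind, lapply(groups, function(g) t(combn(g, 2))))
nrow(edges)  # 15
```

The result stays a character matrix throughout, so no conversion back and forth is needed until the edge list is handed to a plotting or graph function.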
library(tidyverse)
df_tidy <- df %>%
mutate_all(as.character) %>%
mutate_all(na_if, '') %>%
rowid_to_column() %>%
gather(col, name, -rowid) %>%
drop_na(name)
df_tidy
#> rowid col name
#> 1 1 name_1 jan
#> 2 2 name_1 tom
#> 3 3 name_1 stefen
#> 4 4 name_1 marcel
#> 5 1 name_2 tim
#> 6 2 name_2 tim
#> 7 3 name_2 tom
#> 8 4 name_2 bernd
#> 10 2 name_3 bernd
#> 11 3 name_3 tim
#> 15 3 name_4 jan
#> 19 3 name_5 bernd
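For readers who prefer not to load the tidyverse, roughly the same long shape can be sketched in base R. This is an illustrative equivalent, not the code used above; the name long is an assumption, and the rows come out with different row labels.

```r
df <- data.frame(name_1 = c("jan", "tom", "stefen", "marcel"),
                 name_2 = c("tim", "tim", "tom", "bernd"),
                 name_3 = c("", "bernd", "tim", ""),
                 name_4 = c("", "", "jan", ""),
                 name_5 = c("", "", "bernd", ""),
                 stringsAsFactors = FALSE)

# Stack the columns on top of each other, remembering the source row
# and the source column of every value
long <- data.frame(rowid = rep(seq_len(nrow(df)), times = ncol(df)),
                   col   = rep(names(df), each = nrow(df)),
                   name  = unlist(df, use.names = FALSE),
                   stringsAsFactors = FALSE)

# Drop the empty slots
long <- long[long$name != "", ]
nrow(long)  # 12
```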
Much better. From this point, even the base R approach is easier, e.g.
as.data.frame(do.call(rbind,
aggregate(name ~ rowid, df_tidy,
function(x){list(t(combn(x, 2)))})$name))
Or, continuing in the tidyverse, using combn:
df_tidy %>%
group_by(rowid) %>%
summarise(name = list(combn(name, 2, compose(as_data_frame, t), simplify = FALSE)),
name = map(name, bind_rows)) %>%
unnest(name)
#> # A tibble: 15 x 3
#> rowid V1 V2
#> <int> <chr> <chr>
#> 1 1 jan tim
#> 2 2 tom tim
#> 3 2 tom bernd
#> 4 2 tim bernd
#> 5 3 stefen tom
#> 6 3 stefen tim
#> 7 3 stefen jan
#> 8 3 stefen bernd
#> 9 3 tom tim
#> 10 3 tom jan
#> 11 3 tom bernd
#> 12 3 tim jan
#> 13 3 tim bernd
#> 14 3 jan bernd
#> 15 4 marcel bernd
...or a less efficient but less complicated approach:
df_tidy %>%
group_by(rowid) %>%
mutate(name2 = list(name)) %>%
unnest(name2) %>%
filter(name < name2)
#> # A tibble: 15 x 4
#> # Groups: rowid [4]
#> rowid col name name2
#> <int> <chr> <chr> <chr>
#> 1 1 name_1 jan tim
#> 2 3 name_1 stefen tom
#> 3 3 name_1 stefen tim
#> 4 2 name_2 tim tom
#> 5 4 name_2 bernd marcel
#> 6 2 name_3 bernd tom
#> 7 2 name_3 bernd tim
#> 8 3 name_3 tim tom
#> 9 3 name_4 jan stefen
#> 10 3 name_4 jan tom
#> 11 3 name_4 jan tim
#> 12 3 name_5 bernd stefen
#> 13 3 name_5 bernd tom
#> 14 3 name_5 bernd tim
#> 15 3 name_5 bernd jan
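The same pair-everything-then-filter idea can be sketched in base R with a self-join via merge. The long data frame below is typed in by hand to keep the example self-contained, and its name is an assumption.

```r
# Long form of the question's data: one row per (rowid, name)
long <- data.frame(
  rowid = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4),
  name  = c("jan", "tim",
            "tom", "tim", "bernd",
            "stefen", "tom", "tim", "jan", "bernd",
            "marcel", "bernd"),
  stringsAsFactors = FALSE)

# Self-join on rowid: every name is paired with every name in the same
# row, including itself and both orderings of each pair
pairs <- merge(long, long, by = "rowid")

# Keep each unordered pair once and drop the self-pairs
pairs <- pairs[pairs$name.x < pairs$name.y, ]
nrow(pairs)  # 15
```

As with the filter(name < name2) version, this builds all ordered pairs first and discards more than half of them, so it trades efficiency for simplicity.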
tidyr::complete can be used in a similar way.