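I have a data frame in which two columns together act as the key for lookups. I want to extract the row that contains a given pair of identifiers and get the associated value. For example: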
df <- t(combn(LETTERS, 2))
df <- data.frame(term1 = df[,1], term2 = df[,2], value = sample(10, nrow(df), replace = TRUE))
If I want the value for the pair "C" and "Z", the only approach I have come up with is:
cz <- intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z")))
df[cz,]
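Is there a more efficient way? My data frame has about 50,000 rows and I need to do this lookup at least a few million times, so I want it to be as efficient as possible.
Thanks.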
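If speed is your concern, data.table should be faster. If, like me, you are not familiar with data.table syntax, dtplyr makes it easy. In the benchmark below, dtplyr comes out roughly 2 to 3 times faster than the base R approach above on typical timings. And, at least to me, it is easier to read.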
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(microbenchmark)
# Creating our test table
df <- tibble(
  term1 = sample(LETTERS, 50000, replace = TRUE),
  term2 = sample(LETTERS, 50000, replace = TRUE),
  value = sample(10, 50000, replace = TRUE)
)
# lazy version of the test table, for use with dtplyr
df_lazy <- lazy_dt(df)
# approach from the question
cz <- intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z")))
df[cz,]
# a dtplyr answer
cz_dtplyr <- df_lazy %>%
  filter((term1 == "C" & term2 == "Z") | (term1 == "Z" & term2 == "C"))
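One thing worth keeping in mind here: as far as I understand dtplyr, a `lazy_dt()` pipeline only records the steps and does not run them until the result is collected, for example with `as_tibble()`. The sketch below is not part of the original answer; `demo`, `demo_query`, `demo_rows` and the seed are hypothetical names and values chosen for illustration. It shows how the filtered rows could actually be materialised.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

set.seed(42)  # hypothetical seed, only to make the sketch reproducible

# Small stand-in for the test table (assumption: same column layout)
demo <- tibble(
  term1 = sample(LETTERS, 1000, replace = TRUE),
  term2 = sample(LETTERS, 1000, replace = TRUE),
  value = sample(10, 1000, replace = TRUE)
)

# This only builds a lazy query; no rows have been filtered yet
demo_query <- lazy_dt(demo) %>%
  filter((term1 == "C" & term2 == "Z") | (term1 == "Z" & term2 == "C"))

# as_tibble() triggers the underlying data.table computation
demo_rows <- as_tibble(demo_query)
```

If that reading is right, timing the pipeline without a collecting call may mostly measure query construction, so it could be worth benchmarking the collected version as well before drawing conclusions.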
# benchmarking the two options
benchmarks <- microbenchmark(
  "base_union" = intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z"))),
  "dtplyr" = df_lazy %>%
    filter((term1 == "C" & term2 == "Z") | (term1 == "Z" & term2 == "C"))
)
benchmarks
Unit: microseconds
       expr    min      lq     mean  median      uq    max neval
 base_union 1669.9 1703.15 2127.677 1755.45 2046.40 6121.8   100
     dtplyr  666.8  692.70  744.486  722.10  779.65 1042.2   100
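For anyone who does want to try native data.table, here is a minimal sketch of a keyed lookup. It is not from the original answer and has not been benchmarked here. It assumes the table from the question (built with `combn`), and the helper columns `a` and `b`, the function `lookup_pairs` and the seed are hypothetical names and values introduced for illustration. The idea is to store each pair in a canonical order so that ("C", "Z") and ("Z", "C") map to the same key, then let data.table search on that key instead of scanning both columns.

```r
library(data.table)

set.seed(1)  # hypothetical seed, only to make the sketch reproducible

# Same shape as the data in the question: every unordered pair of letters once
pairs <- t(combn(LETTERS, 2))
dt <- data.table(
  term1 = pairs[, 1],
  term2 = pairs[, 2],
  value = sample(10, nrow(pairs), replace = TRUE)
)

# Canonical ordering of each pair, then key the table on it
dt[, `:=`(a = pmin(term1, term2), b = pmax(term1, term2))]
setkey(dt, a, b)

# Look up one or many pairs at once; argument order does not matter
lookup_pairs <- function(dt, first, second) {
  dt[.(pmin(first, second), pmax(first, second)), nomatch = NULL]
}

one  <- lookup_pairs(dt, "Z", "C")
many <- lookup_pairs(dt, c("Z", "A"), c("C", "B"))
```

Since the lookup has to run millions of times, passing whole vectors of identifiers to one call, as in `many`, is likely to be cheaper than calling the function once per pair in a loop.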