R - Querying a pair of columns for a given pair


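I have a data frame with two columns that together serve as the primary key for queries. I want to extract the row that contains a given pair of identifiers, in either order, and get the associated value. For example: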

# Build every unordered pair of letters and attach a random value to each
df <- t(combn(LETTERS, 2))
df <- data.frame(term1 = df[,1], term2 = df[,2], value = sample(10, nrow(df), replace = TRUE))

If I want the value for the pair "C" and "Z", the only way I can think of is:

cz <- intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z")))
df[cz,]

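Is there a more efficient way? My data frame has about 50,000 rows, and I need to run this lookup at least a few million times, so I want it to be as efficient as possible.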

Thanks

r search
1 Answer

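If speed is your concern, data.table should be faster. If you are like me and not familiar with data.table syntax, dtplyr makes it easy. In the benchmark below, dtplyr comes out roughly 2 to 3 times faster than the base R approach from the question. And, at least to me, it is easier to read.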
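For readers who do want the native data.table form, here is a minimal sketch of the same pair lookup. It assumes the data has the shape shown in the question (each unordered pair stored once). The names `dt` and `get_pair` are illustrative and not part of the original post.

```r
library(data.table)

# Same shape as the question's data: every unordered pair of letters once
set.seed(1)  # only so the sketch is reproducible
pairs <- t(combn(LETTERS, 2))
dt <- data.table(term1 = pairs[, 1], term2 = pairs[, 2],
                 value = sample(10, nrow(pairs), replace = TRUE))

# Key on both columns so lookups can use binary search instead of a full scan
setkey(dt, term1, term2)

# Hypothetical helper: look the pair up in both orders and drop non-matches
get_pair <- function(dt, a, b) {
  dt[.(c(a, b), c(b, a)), nomatch = NULL]
}

cz <- get_pair(dt, "C", "Z")
cz
```

Setting the key once and then reusing it is the point of this design: the sort is paid for up front rather than on each of the repeated lookups. Whether it beats the alternatives on your data is something to benchmark rather than assume.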

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(microbenchmark)

# Creating our test table
df <- tibble(
  term1 = sample(LETTERS, 50000, replace = TRUE),
  term2 = sample(LETTERS, 50000, replace = TRUE),
  value = sample(10, 50000, replace = TRUE)
)

# lazy version of the test table is for dtplyr
df_lazy <- lazy_dt(df)

# approach from the question
cz <- intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z")))
df[cz,]

# a dtplyr answer; lazy_dt() pipelines are lazy, so as_tibble() is needed to actually run the query
cz_dtplyr <- df_lazy %>%
  filter((term1 == "C" & term2 == "Z") | (term1 == "Z" & term2 == "C")) %>%
  as_tibble()

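# benchmarking the 2 options
# Note: base_union only computes row indices, and the dtplyr expression only
# builds the lazy query, so neither timing includes materialising the rows.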
benchmarks <- microbenchmark(
  "base_union" = intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z"))),
  "dtplyr" = df_lazy %>%
    filter((term1 == "C" & term2 == "Z") | (term1 == "Z" & term2 == "C"))
)

benchmarks

Unit: microseconds
       expr    min      lq     mean  median      uq    max neval
 base_union 1669.9 1703.15 2127.677 1755.45 2046.40 6121.8   100
     dtplyr  666.8  692.70  744.486  722.10  779.65 1042.2   100
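Since the question mentions millions of lookups, it may also be worth building an index once and reusing it, rather than filtering the table on every query. The sketch below uses base R only. It assumes each unordered pair appears at most once (as in the question's `combn` data) and that the separator `"|"` never occurs inside a term. The names `pair_key`, `row_index` and `lookup_value` are hypothetical helpers, not part of the original post.

```r
# Question's example data: each unordered pair of letters appears once
set.seed(1)  # only so the sketch is reproducible
pairs <- t(combn(LETTERS, 2))
df <- data.frame(term1 = pairs[, 1], term2 = pairs[, 2],
                 value = sample(10, nrow(pairs), replace = TRUE),
                 stringsAsFactors = FALSE)

# Hypothetical helper: order-independent key, e.g. ("Z", "C") becomes "C|Z".
# Assumes "|" never occurs inside a term.
pair_key <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "|")

# Build the index once: a named vector mapping each key to its row number
row_index <- setNames(seq_len(nrow(df)), pair_key(df$term1, df$term2))

# Each query is then a name lookup (NA if the pair is absent)
lookup_value <- function(a, b) df$value[row_index[pair_key(a, b)]]

lookup_value("C", "Z")
lookup_value(c("Z", "A"), c("C", "B"))  # vectorised over many pairs at once
```

Because `lookup_value` is vectorised, the millions of queries can be passed in as two character vectors in a single call instead of looping. This has not been timed against the benchmark above, so treat it as something to measure on your own data.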