R - Querying a pair of columns for a given pair

Question

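I have a data frame with two columns that together act as the key for a query. I want to extract the row that contains a given pair of identifiers and get the associated value. For example: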

df <- t(combn(LETTERS, 2))
df <- data.frame(term1 = df[,1], term2 = df[,2], value = sample(10, nrow(df), T))

If I want the value for the pair "C" and "Z", the only approach I have come up with is:

cz <- intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z")))
df[cz,]

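Is there a more efficient way? My data frame has about 50,000 rows, and I need to perform this operation at least a few million times, so I would like it to be as efficient as possible.

Thanks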

Answer 1

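If you are concerned about speed, data.table should be faster. If, like me, you are not familiar with data.table syntax, dtplyr makes it easy. In the benchmark below, dtplyr looks about 3-5 times faster than the base R option above (the run shown here gives roughly 2.4x by median). And, at least to me, it is easier to read.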

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(microbenchmark)

# Creating our test table
df <- tibble(
  term1 = sample(LETTERS, 50000, replace = T),
  term2 = sample(LETTERS, 50000, replace = T),
  value = sample(10, 50000, T)    
)

# lazy version of the test table is for dtplyr
df_lazy <- lazy_dt(df)

# answer proposed above
cz <- intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z")))
df[cz,]

# a dtplyr answer; as_tibble() forces the lazy query to actually run
cz_dtplyr <- df_lazy %>%
  filter((term1 == "C" & term2 == "Z") | (term1 == "Z" & term2 == "C")) %>%
  as_tibble()

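# benchmarking the 2 options
# note: base_union only computes the row indices, and the dtplyr expression
# is lazy and not collected here, so it times building the query rather
# than running the filter
benchmarks <- microbenchmark(
  "base_union" = intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z"))),
  "dtplyr" = df_lazy %>%
    filter((term1 == "C" & term2 == "Z") | (term1 == "Z" & term2 == "C"))
)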

benchmarks

Unit: microseconds
       expr    min      lq     mean  median      uq    max neval
 base_union 1669.9 1703.15 2127.677 1755.45 2046.40 6121.8   100
     dtplyr  666.8  692.70  744.486  722.10  779.65 1042.2   100
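
Since the question mentions running the lookup millions of times, one alternative to filtering the table on every query is to build a lookup once and index into it. The following is only a minimal base R sketch, not the method from the answer above. It assumes each unordered pair occurs at most once (true for the combn() table in the question, not for the randomly sampled test table in the answer), and the names pair_key, lookup and get_value are hypothetical helpers introduced here for illustration.

```r
# Build the example table from the question (all unordered letter pairs).
set.seed(1)
pairs <- t(combn(LETTERS, 2))
df <- data.frame(term1 = pairs[, 1], term2 = pairs[, 2],
                 value = sample(10, nrow(pairs), TRUE),
                 stringsAsFactors = FALSE)

# Hypothetical helper: an order-independent key such as "C|Z",
# so that ("C", "Z") and ("Z", "C") map to the same entry.
pair_key <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "|")

# Build the lookup once; assumes each unordered pair occurs at most once.
lookup <- setNames(df$value, pair_key(df$term1, df$term2))

# Hypothetical helper: look up values by name; returns NA for unknown pairs.
get_value <- function(a, b) unname(lookup[pair_key(a, b)])

get_value("Z", "C")                  # single pair, order does not matter
get_value(c("C", "A"), c("Z", "B"))  # vectorised over many pairs
```

If the queries are known in advance, passing them to get_value() as two vectors in a single call is likely to help more than calling it once per pair, because the per-call overhead is paid only once. Whether this beats the dtplyr or data.table route on real data would need to be benchmarked.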
