Base R grep-family is much faster than `stringr` variants when working with factors

Question

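I have been using `stringr` because it is supposed to be faster, but today I found that it is much slower when working with factors. I have not seen any warning that this would be the case.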

For example:

library(stringr)
library(forcats)  # provides as_factor()

string_options = c("OneWord", "TwoWords", "ThreeWords")

sample_chars = sample(string_options, 1e6, replace = TRUE)
sample_facts = as_factor(sample_chars)

With `character` data, base R is slower than `stringr`, as expected. With `factor` data, however, base R is roughly 25 to 30 times faster.

bench::mark(
    base_chars = grepl("Two", sample_chars),
    stringr_chars = str_detect(sample_chars, "Two"),
    base_facts = grepl("Two", sample_facts),
    stringr_facts = str_detect(sample_facts, "Two")
)

# A tibble: 4 × 13
#  expression         min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result            memory             time             gc      
#  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>            <list>             <list>           <list>  
#1 base_chars     116.1ms 116.38ms      8.58    3.81MB        0     5     0      583ms <lgl [1,000,000]> <Rprofmem [1 × 3]> <bench_tm [5]>   <tibble>
#2 stringr_chars  86.04ms   88.2ms     11.3     3.81MB        0     6     0      532ms <lgl [1,000,000]> <Rprofmem [2 × 3]> <bench_tm [6]>   <tibble>
#3 base_facts      3.59ms   3.65ms    271.     11.44MB        0   136     0      501ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [136]> <tibble>
#4 stringr_facts  90.71ms  91.29ms     10.9    11.44MB        0     6     0      549ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [6]>   <tibble>

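It looks like `stringr` does nothing different for `factor` data, while base R optimizes that case significantly. Is this expected behavior? Should I report it as a `stringr` issue? Is there some `stringr` setting I am completely missing? I would rather not have to think about how my data is stored in order to decide whether to use `stringr` or base R.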

Answer 1

You have a very long vector with very few unique values

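As MrFlick pointed out in the comments, this happens because `grepl()` has a special case for factors: it matches the pattern against the levels of the factor rather than against every element. You have a vector of one million values but only three unique values, so this saves a lot of time.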
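The shortcut can be sketched in a few lines of base R. This is only an illustration of the idea, not base R's actual source, and the variable names are made up for the example: match the pattern once per level, then expand the result to full length through the factor's integer codes.

```r
# Illustrative only: mimics the idea behind the factor shortcut.
f <- factor(c("OneWord", "TwoWords", NA, "TwoWords", "ThreeWords"))

# Match the pattern against the distinct levels only (3 strings here).
level_hits <- grepl("Two", levels(f))

# Expand to full length using the factor's integer codes.
hits <- level_hits[as.integer(f)]

# grepl() reports FALSE for missing input, so mirror that here.
hits[is.na(hits)] <- FALSE

# Compare against calling grepl() on the factor directly.
identical(hits, grepl("Two", f))
```

The regex work scales with the number of levels, while the expansion step is a cheap integer index over the full vector.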

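I don't think `stringr` has an argument to tell it that you are working with data that has many repeated values. However, you can write a function that applies the same logic to factors with `stringr`: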

str_detect_factor <- function(fct, pattern, negate = FALSE) {
    out <- stringr::str_detect(
        c(levels(fct), NA_character_), pattern, negate
    )
    outna <- out[length(out)]
    out <- out[fct]
    out[is.na(fct)] <- outna
    out
}

Unfortunately, `stringr::str_detect()` is not a generic, but we can write a small wrapper that applies this function when given a factor:

str_detect2 <- function(string, pattern, negate = FALSE) {
    if (is.factor(string)) {
        str_detect_factor(string, pattern, negate)
    } else {
        stringr::str_detect(string, pattern, negate)
    }
}

Then benchmark:

bench::mark(
    base_facts = grepl("Two", sample_facts),
    stringr_facts = str_detect(sample_facts, "Two"),
    stringr_facts2 = str_detect2(sample_facts, "Two")
)

#   expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 base_facts       3.23ms   3.81ms    247.      11.4MB    51.7     86    18      348ms
# 2 stringr_facts  179.92ms 185.12ms      5.40    11.4MB     2.70     2     1      370ms
# 3 stringr_facts2   3.25ms   3.93ms    238.      11.4MB    39.7     96    16      403ms

So it is now about as fast as base R (median 3.93ms versus 3.81ms).

Applying the same logic to character vectors

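Your character vector also has many repeated values, so you can apply the same logic. This will not be as fast as with factors, which do not need `unique()` to find the distinct values because they are already stored in the object. It should still be noticeably faster: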

str_detect_rep <- function(string, pattern, negate = FALSE) {
    unique_string <- c(unique(string), NA_character_)
    out <- stringr::str_detect(
        unique_string, pattern, negate
    ) |> setNames(unique_string)
    outna <- out[length(out)]
    out <- out[string]
    out[is.na(string)] <- outna
    unname(out)
}

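We can see that it is much faster than either the base R or the `stringr` version, although it allocates considerably more memory (42.33MB versus 3.81MB in the output below):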

bench::mark(
    base_chars = grepl("Two", sample_chars),
    stringr_chars = str_detect(sample_chars, "Two"),
    stringr_chars2 = str_detect_rep(sample_chars, "Two")
)

# A tibble: 3 × 9
#   expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 base_chars      131.2ms  173.1ms      5.80    3.81MB     0        4     0    689.6ms
# 2 stringr_chars     213ms  217.5ms      4.60    3.81MB     2.30     2     1    434.9ms
# 3 stringr_chars2   37.3ms   37.3ms     26.8    42.33MB   295.       1    11     37.3ms
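As a variation on the same idea, the results can be mapped back by position with `match()` instead of by name. This is a minimal sketch, not something from `stringr` itself: `str_detect_unique` is a hypothetical helper name, and it has not been benchmarked here, so treat any speed or memory benefit as something to measure on your own data.

```r
library(stringr)

# Hypothetical helper (not part of stringr): run str_detect() on the
# unique values only, then map the results back by position.
str_detect_unique <- function(string, pattern, negate = FALSE) {
    u <- unique(string)                        # distinct values, NA included
    hits <- stringr::str_detect(u, pattern, negate = negate)
    hits[match(string, u)]                     # expand back to full length
}

x <- c("OneWord", "TwoWords", NA, "TwoWords")
identical(str_detect_unique(x, "Two"), str_detect(x, "Two"))
```

Because `unique()` keeps `NA` as one of the distinct values and `match()` finds it again, missing values come back as `NA`, the same as calling `str_detect()` directly.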
