一种更有效的方法来检查多个测试字符串中存在哪些子字符串？

Question

我想编写一个 R 函数，它接受一个字符串作为输入，检查该字符串中是否存在数千个子字符串，并返回在测试字符串中找到的子字符串向量。

我编写了代码来执行此操作，但如果我必须为 10,000-20,000 个不同的“tested_string”值多次调用该函数，速度会慢得令人无法接受：

# The function
# This function would actually do many other things to the tested string too, but for now I'm just testing the slow step
check_substrings <- function(tested_string,substrings) {
  result <- sapply(substrings,function(substring,tested_string) { return(any(grepl(substring,tested_string,fixed=T)))},tested_string=tested_string)
  return(names(result)[result])
}

# Testing speed
library('stringi') # Used for generating random strings for testing

# Setting up to test it
# Make random substrings of varying length
set.seed(5)
substrings <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),
                stri_rand_strings(30000,7,pattern='[A-Z]'),
                stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings so they won't be part of the timing below
set.seed(5)
teststrings <- unique(stri_rand_strings(100,20,pattern='[A-Z]'))
teststrings_1k <- stri_rand_strings(1000,20,pattern='[A-Z]')

# Time how long it takes to check 100 tested strings this way
# (I'll actually need to do 10,000 to 20,000 tested strings)
system.time(
  for(tstring in teststrings) {
    x <- check_substrings(tstring,substrings) 
  }
)
# user  system elapsed 
# 12.457   0.046  12.499
# At a rate of 12.4 seconds per 100 test strings, it would take 41.3 minutes to do 20k test strings

我知道有几个函数，例如

grepl()

和

stri_detect()

可以解决相反的问题（检查单个模式的多个字符串向量）。把问题转向一边，检查每种模式的所有可能的测试字符串，一次一种模式可以大大加快速度：

system.time(
  {
    # Initial check for which substrings are in which test strings
    res_matrix <- sapply(substrings,grepl,x=teststrings_1k,fixed=T)
    # Turn result into a list of which substrings are in which test strings
    rownames(res_matrix) <- teststrings_1k
    res_list <- apply(res_matrix,1,function(x) { return(names(x)[x])})
  }
)
# user  system elapsed 
# 3.641   0.227   3.904
# This is much better - at a rate of 3.9 seconds per 1000 test strings it will take
# take 1.3 minutes to do 20,000 test strings

如果必须的话，我可以使用第二种方法，但这将是一个有点混乱的解决方案，因为我必须提前对所有子字符串和测试字符串进行预先执行此操作（而不是让函数执行一堆操作）每个特定测试字符串的内容都会动态检查该字符串）。

是否有更好/更有效的方法来执行第一个示例中的

check_substrings()

函数（检查一个字符串中的多个子字符串），或者我最好坚持使用第二个示例，尽管它会导致其他复杂情况？

Answer 1

/../
stri_detect()
可以解决相反的问题（检查单个模式的多个字符串向量）/../

stringi

的回收是双向的：

# many strings - one pattern
stringi::stri_detect_fixed(str = c("ABC", "BCD", "CDE"), pattern = "A")
#> [1]  TRUE FALSE FALSE

# one string - many patterns
stringi::stri_detect_fixed(str = "ABC", pattern = c("A", "B", "D"))
#> [1]  TRUE  TRUE FALSE

因此，您可以使用

stri_extract_first_fixed()

或

stri_detect_fixed()

+ 子集重写该检查（比从 stri_extract 结果中省略

NA

值快一点），看起来像这样：

library(stringi)
check_substrings_stringi <- \(tested_string, substrings) substrings[stri_detect_fixed(tested_string, substrings)]

单个

tested_string

和绝对时间值：

bench::mark(
  check_substrings(teststrings[1], substrings),
  check_substrings_stringi(teststrings[1], substrings),
  iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 check_substrings(teststrings[1… 226.98ms 236.64ms      4.15    3.75MB     26.1
#> 2 check_substrings_stringi(tests…   4.36ms   4.44ms    210.    709.75KB      0

全面扫描

teststrings

和相对执行时间：

bench::mark(
  check_substrings         = lapply(teststrings, check_substrings, substrings = substrings),
  check_substrings_stringi = lapply(teststrings, check_substrings_stringi, substrings = substrings),
  relative = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                 min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 check_substrings          40.8   40.8       1        5.46     5.51
#> 2 check_substrings_stringi   1      1        40.8      1        1

一种更有效的方法来检查多个测试字符串中存在哪些子字符串？

问题描述投票：0回答：1

1个回答

最新问题

一种更有效的方法来检查多个测试字符串中存在哪些子字符串？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1