我想编写一个 R 函数,它接受一个字符串作为输入,检查该字符串中是否存在数千个子字符串,并返回在测试字符串中找到的子字符串向量。
我编写了代码来执行此操作,但如果我必须为 10,000-20,000 个不同的“tested_string”值多次调用该函数,速度会慢得令人无法接受:
# The function
# This function would actually do many other things to the tested string too, but for now I'm just testing the slow step
check_substrings <- function(tested_string,substrings) {
result <- sapply(substrings,function(substring,tested_string) { return(any(grepl(substring,tested_string,fixed=T)))},tested_string=tested_string)
return(names(result)[result])
}
# Testing speed
library('stringi') # Used for generating random strings for testing
# Setting up to test it
# Make random substrings of varying length
set.seed(5)
substrings <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),
stri_rand_strings(30000,7,pattern='[A-Z]'),
stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings so they won't be part of the timing below
set.seed(5)
teststrings <- unique(stri_rand_strings(100,20,pattern='[A-Z]'))
teststrings_1k <- stri_rand_strings(1000,20,pattern='[A-Z]')
# Time how long it takes to check 100 tested strings this way
# (I'll actually need to do 10,000 to 20,000 tested strings)
system.time(
for(tstring in teststrings) {
x <- check_substrings(tstring,substrings)
}
)
# user system elapsed
# 12.457 0.046 12.499
# At a rate of 12.4 seconds per 100 test strings, it would take 41.3 minutes to do 20k test strings
我知道有几个函数,例如
grepl()
和 stri_detect()
可以解决相反的问题(检查单个模式的多个字符串向量)。把问题转向一边,检查每种模式的所有可能的测试字符串,一次一种模式可以大大加快速度:
system.time(
{
# Initial check for which substrings are in which test strings
res_matrix <- sapply(substrings,grepl,x=teststrings_1k,fixed=T)
# Turn result into a list of which substrings are in which test strings
rownames(res_matrix) <- teststrings_1k
res_list <- apply(res_matrix,1,function(x) { return(names(x)[x])})
}
)
# user system elapsed
# 3.641 0.227 3.904
# This is much better - at a rate of 3.9 seconds per 1000 test strings it will take
# take 1.3 minutes to do 20,000 test strings
如果必须的话,我可以使用第二种方法,但这将是一个有点混乱的解决方案,因为我必须提前对所有子字符串和测试字符串进行预先执行此操作(而不是让函数执行一堆操作)每个特定测试字符串的内容都会动态检查该字符串)。
是否有更好/更有效的方法来执行第一个示例中的
check_substrings()
函数(检查一个字符串中的多个子字符串),或者我最好坚持使用第二个示例,尽管它会导致其他复杂情况?
/../
可以解决相反的问题(检查单个模式的多个字符串向量)/../stri_detect()
stringi
的回收是双向的:
# many strings - one pattern
stringi::stri_detect_fixed(str = c("ABC", "BCD", "CDE"), pattern = "A")
#> [1] TRUE FALSE FALSE
# one string - many patterns
stringi::stri_detect_fixed(str = "ABC", pattern = c("A", "B", "D"))
#> [1] TRUE TRUE FALSE
因此,您可以使用
stri_extract_first_fixed()
或 stri_detect_fixed()
+ 子集重写该检查(比从 stri_extract
结果中省略 NA值快一点),看起来像这样:
library(stringi)
check_substrings_stringi <- \(tested_string, substrings) substrings[stri_detect_fixed(tested_string, substrings)]
单个
tested_string
和绝对时间值:
bench::mark(
check_substrings(teststrings[1], substrings),
check_substrings_stringi(teststrings[1], substrings),
iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 check_substrings(teststrings[1… 226.98ms 236.64ms 4.15 3.75MB 26.1
#> 2 check_substrings_stringi(tests… 4.36ms 4.44ms 210. 709.75KB 0
全面扫描
teststrings
和相对执行时间:
bench::mark(
check_substrings = lapply(teststrings, check_substrings, substrings = substrings),
check_substrings_stringi = lapply(teststrings, check_substrings_stringi, substrings = substrings),
relative = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 check_substrings 40.8 40.8 1 5.46 5.51
#> 2 check_substrings_stringi 1 1 40.8 1 1