假设有很多二进制字符串
x <- c("0100100010101010", "0100110010101010","0111001000","010111")
我正在 R 中寻找一种快速方法来输出一个矩阵,该矩阵包含从一开始的成对(不包括自匹配)最长公共子串。例如,解决方案可能如下所示
> mySolution(x)
[,1] [,2] [,3] [,4]
[1,] "" "01001" "01" "010"
[2,] "01001" "" "01" "010"
[3,] "01" "01" "" "01"
[4,] "010" "010" "01" ""
例如,位置 [1,2] 处的矩阵是从
x[1]
和 x[2]
开始的最长公共子串。请注意,所得矩阵是对称的。所以,我们只需要计算一半即可。
我知道这可以与像
substr,sub,grepl
这样的函数一起使用,但由于我有很多字符串,我正在寻找一种非常有效的解决方案,需要很少的计算时间。也许可以选择转换为二进制数以提高性能?
f <- function(x) {
helper <- function(a, b) {
l <- min(nchar(a), nchar(b))
aa <- utf8ToInt(a)[1:l]
bb <- utf8ToInt(b)[1:l]
intToUtf8(aa[cumsum(aa != bb) == 0])
}
`diag<-`(outer(x, x, Vectorize(helper)), "")
}
这样
> f(x)
[,1] [,2] [,3] [,4]
[1,] "" "01001" "01" "010"
[2,] "01001" "" "01" "010"
[3,] "01" "01" "" "01"
[4,] "010" "010" "01" ""