Suppose we have a score table:
score_matrix = t(data.frame('A' = c('A' = 1,'T' = 1,'G' = 2,'C' = 2),
'T' = c('A' = 1,'T' = 1,'G' = 2,'C' = 2),
'G' = c('A' = 2,'T' = 2,'G' = 1,'C' = 1),
'C' = c('A' = 2,'T' = 2,'G' = 1,'C' = 1)))
> score_matrix
  A T G C
A 1 1 2 2
T 1 1 2 2
G 2 2 1 1
C 2 2 1 1
Now I want to compare multiple strings of equal length...
Query = rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T)))
Subject = rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T)))
> Query
[1] "TTATACCAGTGTATGATGAGCCTCG"
> Subject
[1] "GTAGCTCACGAATATATGAACCTCA"
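...matching the Subject string against the Query position by position and converting the result into a series of scores according to the matrix above, i.e.: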
2 1 1 2 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2
The code for the comparison above:
q = unlist(strsplit(Query,""))    # split each string once, not on every iteration
s = unlist(strsplit(Subject,""))
for (i in seq_along(q)) { cat(score_matrix[q[i], s[i]], "") }
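For a single Query/Subject pair, the same score vector can probably be obtained without a loop. The sketch below relies on base R's indexing of a matrix by a two-column character matrix, where the first column is matched against the row names and the second against the column names. The names `q`, `s` and `scores` are illustrative, and the score matrix is rebuilt with `matrix()` only so that the snippet is self-contained.

```r
# Score table from above, rebuilt with explicit dimnames
score_matrix <- matrix(c(1, 1, 2, 2,
                         1, 1, 2, 2,
                         2, 2, 1, 1,
                         2, 2, 1, 1),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(c("A", "T", "G", "C"),
                                       c("A", "T", "G", "C")))

Query   <- "TTATACCAGTGTATGATGAGCCTCG"
Subject <- "GTAGCTCACGAATATATGAACCTCA"

# Split each string into single characters once
q <- strsplit(Query, "")[[1]]
s <- strsplit(Subject, "")[[1]]

# Each row of cbind(q, s) is a (row name, column name) pair,
# so all 25 lookups happen in one vectorised call
scores <- score_matrix[cbind(q, s)]
cat(scores, "\n")
# 2 1 1 2 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2
```

This gives the same series of scores as the loop, with the letters looked up by name rather than one element at a time.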
My actual data set will be much larger, for example in matrix form:
data_matrix = matrix(unlist(strsplit(Query,"")),nrow = 1)
data_matrix = rbind(data_matrix,matrix(unlist(strsplit(Subject,"")),nrow = 1))
for(i in 1:23) {
data_matrix = rbind(data_matrix,
matrix(unlist(strsplit(rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T))),"")),
nrow = 1)) }
> dim(data_matrix)
[1] 25 25
I can compare the letters one by one in a nested loop, but that is inefficient. I tried something like this:
for (i in 2:nrow(data_matrix)) {
  for (j in 1:ncol(data_matrix)) {
    data_matrix[i,j] = score_matrix[data_matrix[i,j], data_matrix[1,j]] } }
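But for my real data, a matrix of about 5000 x 5000, this is very slow. For reference, here is a benchmark on the 25 x 25 matrix. My data set has roughly 40,000 times as many cells, so it will take far longer.
library(microbenchmark)
# temp2 was not defined in the original snippet; a numeric matrix of the same shape is assumed
temp2 = matrix(0, nrow(data_matrix), ncol(data_matrix))
microbenchmark(for (i in 2:nrow(data_matrix)) {
  for (j in 1:ncol(data_matrix)) {
    temp2[i,j] = score_matrix[data_matrix[i,j], data_matrix[1,j]] }})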
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval
 <the command above> 133.0899 159.8918 189.5858 173.7305 208.5522 348.1761   100
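As a rough back-of-the-envelope estimate, the median above can be scaled up to the real data. This assumes the loop time grows linearly with the number of lookups, which is an assumption and not something that was measured; the variable names are illustrative.

```r
# Median time for the 25 x 25 case, taken from the benchmark above (in seconds)
median_small_s <- 173.7305 / 1000

# Number of single-cell lookups: every row except the query row, times every column
lookups_small <- (25 - 1) * 25
lookups_large <- (5000 - 1) * 5000

# Assumption: run time is proportional to the number of lookups
scale_factor <- lookups_large / lookups_small
est_hours    <- median_small_s * scale_factor / 3600

cat("scale factor:", round(scale_factor), "\n")
cat("estimated hours:", round(est_hours, 2), "\n")
```

Under that assumption the nested loop would need on the order of two hours for a 5000 x 5000 matrix.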
What would be a more efficient way to solve this?
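One possible direction, sketched under assumptions and not benchmarked on the real data, is to replace the nested loop with a single vectorised lookup: pair every subject letter with the query letter from the same column and index the score matrix with the resulting two-column character matrix. The names `query_row`, `subjects`, `idx` and `score_fast` are illustrative, and the random `data_matrix` below only stands in for the real one.

```r
score_matrix <- matrix(c(1, 1, 2, 2,
                         1, 1, 2, 2,
                         2, 2, 1, 1,
                         2, 2, 1, 1),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(c("A", "T", "G", "C"),
                                       c("A", "T", "G", "C")))

# Stand-in data: 25 random sequences of length 25, one per row
set.seed(1)
data_matrix <- matrix(sample(c("A", "C", "G", "T"), 25 * 25, replace = TRUE),
                      nrow = 25)

query_row <- data_matrix[1, ]                  # row 1 is the query
subjects  <- data_matrix[-1, , drop = FALSE]   # all remaining rows

# as.vector() walks the subjects column by column, so the query letter of
# each column has to be repeated once per subject row to line up with it
idx <- cbind(as.vector(subjects),
             rep(query_row, each = nrow(subjects)))

# One lookup for all cells, then restore the subjects' shape
score_fast <- matrix(score_matrix[idx], nrow = nrow(subjects))

# Check against the nested loop on this small example
score_slow <- matrix(0, nrow(subjects), ncol(subjects))
for (i in seq_len(nrow(subjects))) {
  for (j in seq_len(ncol(subjects))) {
    score_slow[i, j] <- score_matrix[subjects[i, j], query_row[j]]
  }
}
stopifnot(all(score_fast == score_slow))
```

The result is a numeric matrix with one row per subject sequence, so the original character matrix is left untouched rather than overwritten in place.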
> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_UK.utf8 LC_CTYPE=English_UK.utf8 LC_MONETARY=English_UK.utf8 LC_NUMERIC=C LC_TIME=English_UK.utf8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4.9 stringi_1.7.8 doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2 fs_1.5.2
[7] S4Vectors_0.34.0 data.table_1.14.6 forcats_0.5.2 stringr_1.5.0 dplyr_1.0.10 purrr_0.3.5
[13] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.4.0 tidyverse_1.3.2