I have a large data.frame with roughly 4 million rows and 2 columns. Both columns contain long strings representing recipe texts. For each row, I use textSimilarity() from the R text package to compare the similarity of the recipes in column A and column B. Performance is very slow. Is there a way to speed this up, or am I coding it wrong?

Example data, with shorter texts:
df <- data.frame(
  columnA = c("tomato sauce is very tasty to use", "without garlic, this dish is not chinese", "British food is as tasteless as it can get"),
  columnB = c("pizza is the source of life", "a nice xiaolongbao is steamed until it is soft", "braised pork can be very healthy if prepared well")
)
> df
columnA columnB
1 tomato sauce is very tasty to use pizza is the source of life
2 without garlic, this dish is not chinese a nice xiaolongbao is steamed until it is soft
3 British food is as tasteless as it can get braised pork can be very healthy if prepared well
To get the similarity, I use:
df$sim <- textSimilarity(textEmbed(df$columnA)$texts$texts, textEmbed(df$columnB)$texts$texts)
With the current setup this process takes days. How can I speed it up? Or is there an alternative?
The stringdist functions are very fast. To get the similarity between two vectors a and b, you can run 1 - stringdist(a, b, method = 'jaccard') (since stringdist with the 'jaccard' method gives you the distance between each pair of strings on a scale from 0 to 1, the similarity is 1 minus the distance). I created a dummy dataset with 4,000,000 records and two string columns:
library(stringdist)
library(microbenchmark)
# Generate random strings for vectors 'a' and 'b'
set.seed(123) # Setting seed for reproducibility
num_rows <- 4*10^6
a <- replicate(num_rows, paste(sample(LETTERS, 10, replace = TRUE), collapse = ""))
b <- replicate(num_rows, paste(sample(LETTERS, 10, replace = TRUE), collapse = ""))
# Create the dataframe
df <- data.frame(a, b)
# Add similarity scores
df$similarity <- 1 - stringdist::stringdist(df$a, df$b, method = "jaccard")
# Test the time by running the similarity test 10 times.
microbenchmark(1 - stringdist::stringdist(df$a, df$b, method = "jaccard"), times = 10)
On my machine, running the similarity test takes only 694 milliseconds on average!
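One caveat worth noting: stringdist measures lexical (character-level) similarity, not the semantic similarity that textEmbed's embeddings capture, so check whether that trade-off is acceptable for your recipes. As a minimal sketch, the same call can be applied directly to the question's example data; the `q` argument (q-gram size, an optional tweak I am assuming here) makes the Jaccard comparison use character bigrams instead of single characters, which is usually more discriminative for longer strings:

```r
library(stringdist)

# The question's example data
df <- data.frame(
  columnA = c("tomato sauce is very tasty to use",
              "without garlic, this dish is not chinese",
              "British food is as tasteless as it can get"),
  columnB = c("pizza is the source of life",
              "a nice xiaolongbao is steamed until it is soft",
              "braised pork can be very healthy if prepared well")
)

# Jaccard distance over character bigrams (q = 2); similarity = 1 - distance
df$sim <- 1 - stringdist(df$columnA, df$columnB, method = "jaccard", q = 2)
df$sim
```

Each value in df$sim falls between 0 (no shared bigrams) and 1 (identical bigram sets), and the whole column is computed in a single vectorized call, which is what makes this approach scale to millions of rows.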