I have a large data.frame with roughly 4 million rows and 2 columns. Both columns contain long strings representing recipe texts. For each row, I use textSimilarity() from the R text package to compare the similarity of the recipes in column A and column B. Performance is very slow. Is there a way to speed this up, or am I coding it wrong?

Example data, with shorter texts:
df <- data.frame(
  columnA = c("tomato sauce is very tasty to use", "without garlic, this dish is not chinese", "British food is as tasteless as it can get"),
  columnB = c("pizza is the source of life", "a nice xiaolongbao is steamed until it is soft", "braised pork can be very healthy if prepared well")
)
> df
columnA columnB
1 tomato sauce is very tasty to use pizza is the source of life
2 without garlic, this dish is not chinese a nice xiaolongbao is steamed until it is soft
3 British food is as tasteless as it can get braised pork can be very healthy if prepared well
To get the similarity, I use:
df$sim <- textSimilarity(textEmbed(df$columnA)$texts$texts, textEmbed(df$columnB)$texts$texts)
With the current setup this process takes days. How can I speed it up? Or is there an alternative?
The stringdist functions are very fast. To get the similarity between two vectors a and b, you can run 1 - stringdist(a, b, method = 'jaccard') (since stringdist with the 'jaccard' method gives you the distance between each pair of strings on a scale from 0 to 1, the similarity is 1 minus the distance). I created a dummy dataset with 4,000,000 records and two string columns:
library(stringdist)
library(microbenchmark)
# Generate random strings for vectors 'a' and 'b'
set.seed(123) # Setting seed for reproducibility
num_rows <- 4*10^6
a <- replicate(num_rows, paste(sample(LETTERS, 10, replace = TRUE), collapse = ""))
b <- replicate(num_rows, paste(sample(LETTERS, 10, replace = TRUE), collapse = ""))
# Create the dataframe
df <- data.frame(a, b)
# Add similarity scores
df$similarity <- 1 - stringdist::stringdist(df$a, df$b, method = "jaccard")
# Test the time by running the similarity test 10 times.
microbenchmark(1 - stringdist::stringdist(df$a, df$b, method = "jaccard"), times = 10)
On my machine, running the similarity test takes only 694 milliseconds on average!
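One caveat worth noting: stringdist measures lexical (character-level) similarity, not the semantic similarity that textEmbed's embeddings capture, so check whether that trade-off is acceptable for your recipes. As a minimal sketch, the same call can be applied directly to the question's example data; the `q` argument (q-gram size, an optional tweak I am assuming here) makes the Jaccard comparison use character bigrams instead of single characters, which is usually more discriminative for longer strings:

```r
library(stringdist)

# The question's example data
df <- data.frame(
  columnA = c("tomato sauce is very tasty to use",
              "without garlic, this dish is not chinese",
              "British food is as tasteless as it can get"),
  columnB = c("pizza is the source of life",
              "a nice xiaolongbao is steamed until it is soft",
              "braised pork can be very healthy if prepared well")
)

# Jaccard distance over character bigrams (q = 2); similarity = 1 - distance
df$sim <- 1 - stringdist(df$columnA, df$columnB, method = "jaccard", q = 2)
df$sim
```

Each value in df$sim falls between 0 (no shared bigrams) and 1 (identical bigram sets), and the whole column is computed in a single vectorized call, which is what makes this approach scale to millions of rows.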