Is there an R function for finding keywords within a certain "word distance"?


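What I need is a function that finds words within a certain "word distance" of each other. For example, the words "bag" and "tools" are the ones of interest in the sentence "He had a bag of tools in his car."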

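With the quanteda kwic() function I can find "bag" and "tools" individually, but that usually returns far too many results. I need, for example, "bag" and "tools" within five words of each other.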

r quanteda
1 Answer

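You can use the fcm() function to count co-occurrences within a fixed window, for instance 5 words. This creates a "feature co-occurrence matrix", and the context can be defined as a token window of any size or as the entire document.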

For your example, or at least for an example based on my interpretation of your question, this would look like:

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

txt <- c(
  d1 = "He had a bag of tools in his car",
  d2 = "bag other other other other tools other"
)
# tokenize first: newer quanteda releases expect a tokens (or dfm) object here
fcm(tokens(txt), context = "window", window = 5)
## Feature co-occurrence matrix of: 10 by 10 features.
## 10 x 10 sparse Matrix of class "fcm"
##         features
## features He had a bag of tools in his car other
##    He     0   1 1   1  1     1  0   0   0     0
##    had    0   0 1   1  1     1  1   0   0     0
##    a      0   0 0   1  1     1  1   1   0     0
##    bag    0   0 0   0  1     2  1   1   1     4
##    of     0   0 0   0  0     1  1   1   1     0
##    tools  0   0 0   0  0     0  1   1   1     5
##    in     0   0 0   0  0     0  0   1   1     0
##    his    0   0 0   0  0     0  0   0   1     0
##    car    0   0 0   0  0     0  0   0   0     0
##    other  0   0 0   0  0     0  0   0   0    10

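Here, bag occurs within 5 tokens of tools once in the first document. In the second document the two terms are exactly 5 tokens apart (four intervening tokens), which still falls inside the window, so that co-occurrence is counted as well and the bag/tools cell is 2. Had they been more than 5 tokens apart, the pair would not have been counted.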
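The matrix above pools counts across all documents, so it does not show which document a given co-occurrence came from. As a rough per-document check, here is a minimal sketch in base R. The helper `within_distance()` is hypothetical (it is not part of quanteda), and it assumes naive tokenization: lowercase the text, strip punctuation, and split on whitespace. The extra document `d3` is made up for illustration.

```r
# Hypothetical helper, not a quanteda function: for each document, report
# whether w1 and w2 occur within n tokens of each other.
# Assumption: naive tokenization (lowercase, drop punctuation, split on spaces).
within_distance <- function(txt, w1, w2, n = 5) {
  vapply(txt, function(doc) {
    clean <- gsub("[[:punct:]]+", "", tolower(doc))
    toks <- strsplit(clean, "\\s+")[[1]]
    pos1 <- which(toks == w1)
    pos2 <- which(toks == w2)
    if (length(pos1) == 0 || length(pos2) == 0) return(FALSE)
    # smallest gap between any occurrence of w1 and any occurrence of w2
    min(abs(outer(pos1, pos2, "-"))) <= n
  }, logical(1))
}

txt <- c(
  d1 = "He had a bag of tools in his car",
  d2 = "bag other other other other tools other",
  d3 = "bag other other other other other tools"  # made-up: 6 tokens apart
)

# d1 and d2 are TRUE (2 and 5 tokens apart); d3 is FALSE (6 tokens apart)
within_distance(txt, "bag", "tools", n = 5)
```

For real text, the tokenization step would probably be better handled by quanteda's own tokens() so that the positions agree with what fcm() and kwic() see.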
