Finding exact and similar phrases in a string containing millions of characters


I have a list of phrases and a corpus, which is one text string containing millions of words. For each phrase in the phrase list, I want to find and record the most similar phrases that occur in the corpus string. For my purposes I need SBERT similarity, and I have found the sentence-transformers library to be the best fit.

My problem is that while documentation exists for finding the similarity between two lists, I cannot find anything on finding a list of phrases within one large string. I tried splitting the string into a list of sentences, but the computation takes very long, because for each phrase in the phrase list I have to loop through every sentence (and there are many of them) and then append all matches to the dictionary I am building.

phrase_list = ['Gregor Samse', 'in his bed into', 'horrible creature'...] # mine has 156 phrases

very_long_string = 'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. The bedding was hardly able to cover it and seemed ready to slide off any moment. His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked...'
# mine has around 11 million words 

# convert text corpus string to list of sentences
string_to_sent_list = very_long_string.split(".")
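# (a more robust alternative, not what this post uses: NLTK's sentence
#  tokenizer handles abbreviations that a plain "." split breaks on)
# from nltk import sent_tokenize
# string_to_sent_list = sent_tokenize(very_long_string)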

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-mpnet-base-v2')


phrase_embeddings = model.encode(phrase_list, convert_to_tensor=True)
sent_embeddings = model.encode(string_to_sent_list, convert_to_tensor=True)

similarity_dict = {}
for i, phraselist_phrase in enumerate(phrase_list):
    similarities = util.cos_sim(phrase_embeddings[i], sent_embeddings)
    matches = [string_to_sent_list[j] for j, sim in enumerate(similarities[0]) if sim > 0.65] 
    similarity_dict[phraselist_phrase] = matches

print(similarity_dict)

# a plain dict has no to_csv; write it out via pandas instead
import pandas as pd
pd.Series(similarity_dict).to_csv('similarity dict.csv')
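For reference, a sketch (not in the original post) of the same comparison done as one matrix operation: util.cos_sim accepts whole tensors, so the per-phrase loop above collapses into a single call.

# Sketch only: reuses phrase_embeddings, sent_embeddings and the 0.65
# threshold from above. cos_sim returns a (num_phrases, num_sentences)
# similarity matrix in one call, with no Python-level loop over phrases.
all_sims = util.cos_sim(phrase_embeddings, sent_embeddings)
similarity_dict = {
    phrase: [string_to_sent_list[j] for j in (all_sims[i] > 0.65).nonzero().flatten().tolist()]
    for i, phrase in enumerate(phrase_list)
}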

Is there another way to achieve this, whether by looping differently or by using a different library? Open to any ideas. My goal is quite simple: for each phrase in my phrase list, save all the corresponding phrase matches found in the corpus.

P.S. Does anyone know whether this method saves matches for multiple versions of the same phrase? For example, for the phrase 'must finish this', would it count 'we must finish this project' along with 'we must', 'must finish', and 'finish this' as matches, even though they come from the same sentence?
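A quick way to check that (a sketch, assuming the similarity_dict built above): invert the dictionary and look for sentences that more than one phrase matched.

# Sketch: group phrases by the sentence they matched, to spot cases where
# several (sub-)phrases hit the same sentence.
sentence_to_phrases = {}
for phrase, sentences in similarity_dict.items():
    for sent in sentences:
        sentence_to_phrases.setdefault(sent, []).append(phrase)
overlaps = {sent: ph for sent, ph in sentence_to_phrases.items() if len(ph) > 1}
print(overlaps)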

python algorithm nlp sentence-similarity sentence-transformers
1 Answer

You can solve this with a set of algorithms. The problem is complicated, of course, so the solution is too.

  • First, break both the string and the phrases into words and record their indices. Now you have both of them as lists of words.

  • Loop through the words of the phrase and the words of the string and check for exact word matches. If you get one, then you "might" have a possible hit within a nearby window of the string. Here you can build a window and check whether your phrase is in the window, "exactly" or "partially". For "partial" you can use Levenshtein; for "exact" you can use KMP. (A sketch of the remaining Levenshtein step follows after the code below.)

I haven't written code for the whole algorithm; it would take some time to finish. I've done part of it, though, and you can do the rest.


import re, collections


def _matches(s, p):
    # Return (word, start, end) for every match of pattern p in string s.
    return [(match.group(0), match.start(), match.end()) for match in re.finditer(p, s)]


def get_lps(p):
    # Build the KMP table: lps[i] is the length of the longest proper
    # prefix of p[:i + 1] that is also a suffix of it.
    lps, j = [0] * len(p), 0
    for i in range(1, len(p)):
        while j and p[i] != p[j]:
            j = lps[j - 1]
        if p[i] == p[j]:
            j += 1
        lps[i] = j
    return lps


def kmp(s, p):
    # Knuth–Morris–Pratt search: return every index where pattern p starts in s.
    lps, indices, j = get_lps(p), [], 0
    for i in range(len(s)):
        while j and s[i] != p[j]:
            j = lps[j - 1]
        if s[i] == p[j]:
            j += 1
        if j == len(p):          # full pattern matched
            indices.append(i - len(p) + 1)
            j = lps[j - 1]       # keep going to find overlapping matches
    return indices


def levenshtein_edit_distance(A, B):
    # Classic dynamic-programming edit distance between strings A and B.
    dp = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(len(A) + 1):
        dp[i][0] = i   # delete all i characters of A
    for j in range(len(B) + 1):
        dp[0][j] = j   # insert all j characters of B

    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # substitution, deletion, insertion
                dp[i][j] = 1 + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])

    return dp[-1][-1]
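A couple of quick sanity checks for these helpers (hypothetical examples, not from the original answer):

print(kmp('abxabcabcaby', 'abcaby'))                   # [6]
print(levenshtein_edit_distance('kitten', 'sitting'))  # 3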


phrase_list = ['Gregor Samse', 'in his bed into', 'horrible creature']

phs = list(set(phrase_list))  # de-duplicate the phrases

# Pad the corpus on both sides so the +/-10 character window used below
# can never run past the ends of the string.
s = ' ' * 50 + \
    """
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. The bedding was hardly able to cover it and seemed ready to slide off any moment. 
His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked...""" \
+ ' ' * 50

# word pattern: runs of characters between word boundaries, excluding
# whitespace and common punctuation
p = r'(?:\b[^\r\n\s\t,?:;\'"!_-]+?\b)'

s_matches = _matches(s, p)   # every word in the corpus, with its span
print(s_matches)

d = {ph: _matches(ph, p) for ph in phs}   # every word of every phrase, with its span
print(d)


possibles = collections.defaultdict(list)
for k, val in d.items():                          # for each phrase ...
    for (word_a, st_a, end_a) in val:             # ... and each of its words
        for (word_s, st_s, end_s) in s_matches:   # scan all corpus words
            if word_a == word_s:
                # exact word hit: cut a small window around the corpus
                # occurrence and locate the word in it with KMP
                search_window = s[st_s - 10:end_s + 10]
                found = kmp(search_window, word_a)[0]
                print(st_s, found, search_window)
                possibles[k] += [(st_s, found, search_window)]
print(possibles)
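The "partial" step is the part left for the reader; here is a minimal sketch of it, assuming the possibles dict built above (the distance cutoff of one third of the phrase length is an arbitrary assumption, not part of the answer):

# Score each candidate window against the full phrase with the edit
# distance defined above and keep the close ones.
fuzzy = collections.defaultdict(list)
for phrase, candidates in possibles.items():
    for st_s, found, window in candidates:
        dist = levenshtein_edit_distance(phrase, window.strip())
        if dist <= max(3, len(phrase) // 3):   # assumed tolerance
            fuzzy[phrase].append((dist, window.strip()))
print(fuzzy)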

Edit distance

Knuth–Morris–Pratt algorithm
