我有一个短语列表和一个语料库,该语料库是一串包含数百万单词的文本。对于短语列表中的每个短语,我想查找并记录在语料库字符串中找到的最相似的短语。 出于我的目的,我需要使用 SBERT 相似性,并发现句子转换器库是最好的。 我的问题是,虽然存在用于查找两个列表之间相似性的文档,但我找不到任何用于查找大字符串中的短语列表的文档。我尝试将字符串拆分为句子列表,但计算时间非常长,因为对于短语列表中的每个短语,我需要循环遍历每个句子(而且有很多),然后将所有匹配项附加到我正在创建的字典中.
phrase_list = ['Gregor Samse', 'in his bed into', 'horrible creature'...] # mine has 156 phrases
very_long_string= 'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. The bedding was hardly able to cover it and seemed ready to slide off any moment. His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked...'
# mine has around 11 million words
# convert text corpus string to list of sentences
string_to_sent_list = very_long_string.split(".")
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-mpnet-base-v2')
phrase_embeddings = model.encode(phrase_list, convert_to_tensor=True)
sent_embeddings = model.encode(string_to_sent_list, convert_to_tensor=True)
similarity_dict = {}
for i, phraselist_phrase in enumerate(phrase_list):
similarities = util.cos_sim(phrase_embeddings[i], sent_embeddings)
matches = [string_to_sent_list[j] for j, sim in enumerate(similarities[0]) if sim > 0.65]
similarity_dict[phraselist_phrase] = matches
print(similarity_dict)
similarity_dict.to_csv('similarity dict.csv')
是否有其他方法可以实现此目的,无论是通过不同的方式循环还是使用不同的库?对任何想法持开放态度。我的目标非常简单,为我的短语列表中的每个短语保存语料库中所有相应的短语匹配。
附注任何人都知道此方法是否保存为同一短语的多个版本的匹配,例如,对于短语“必须完成这个”,它会考虑“我们必须完成这个项目”和“我们必须”、“必须完成”、“完成这个” ' 即使匹配,但它们来自同一个句子?
你可以使用一组算法来解决这个问题。当然,问题复杂,解决方案也复杂。
我没有编写整个算法的代码。我做了部分,剩下的你可以做。
import re, string, collections
def _matches(s, p):
return [(match.group(0), match.start(), match.end()) for match in re.finditer(p, s)]
def get_lps(p):
lps, j = [0] * len(p), 0
for i in range(1, len(p)):
while j and p[i] != p[j]:
j = lps[j - 1]
if p[i] == p[j]:
j += 1
lps[i] = j
return lps
def kmp(s, p):
lps, indices, j = get_lps(p), [], 0
for i in range(len(s)):
while j and s[i] != p[j]:
j = lps[j - 1]
if s[i] == p[j]:
j += 1
if j == len(p):
indices.append(i - len(p) + 1)
j = lps[j - 1]
return indices
def levenshtein_edit_distance(A, B):
dp = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
for i in range(len(A) + 1):
dp[i][0] = i
for j in range(len(B) + 1):
dp[0][j] = j
for i in range(1, len(A) + 1):
for j in range(1, len(B) + 1):
if A[i - 1] == B[j - 1]:
dp[i][j] = dp[i - 1][j - 1]
else:
dp[i][j] = 1 + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])
return dp[-1][-1]
phrase_list = ['Gregor Samse', 'in his bed into', 'horrible creature']
phs = list(set(phrase_list))
s = ' ' * 50 + \
"""
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. The bedding was hardly able to cover it and seemed ready to slide off any moment.
His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked...""" \
+ ' ' * 50
p = r'(?:\b[^\r\n\s\t,?:;\'"!_-]+?\b)'
s_matches = _matches(s, p)
print(s_matches)
d = {ph: _matches(ph, p) for ph in phs}
print(d)
res = []
possibles = collections.defaultdict(list)
for k, val in d.items():
for (word_a, st_a, end_a) in val:
for (word_s, st_s, end_s) in s_matches:
if word_a == word_s:
search_window = s[st_s - 10:end_s + 10]
found = kmp(search_window, word_a)[0]
print(st_s, found, search_window)
possibles[k] += [(st_s, found, search_window)]
# break
print(possibles)