I have a large corpus of documents that I want to merge if they have a notable n-gram overlap (bigrams, in my case). Consider the list of sets:
corpus = [{'example', 'bigram'}, {'another', 'example'}, {'some', 'outlier'}]
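For illustration, a minimal sketch of how such sets might be built, assuming simple whitespace tokenization (the actual preprocessing is not shown here):

def bigram_sets(text):
    # each entry is the set of the two words forming one bigram
    tokens = text.lower().split()
    return [{w1, w2} for w1, w2 in zip(tokens, tokens[1:])]

print(bigram_sets('another example bigram'))
>> [{'another', 'example'}, {'example', 'bigram'}]  # element order may vary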
I have the following code to compute the similarity matrix:
import numpy as np

sim_matrix = np.zeros((len(corpus), len(corpus)))
for i in range(len(corpus)):
    for j in range(i+1, len(corpus)):
        sim_matrix[i][j] = get_ngram_overlap(corpus[i], corpus[j])
np.fill_diagonal(sim_matrix, 1)
where the get_ngram_overlap function is defined as follows:
def get_ngram_overlap(ngrams_s1, ngrams_s2):
    # overlap coefficient: number of shared n-grams / size of the smaller set
    if min(len(ngrams_s1), len(ngrams_s2)) == 0:
        return 0
    common_ngrams = ngrams_s1 & ngrams_s2
    return len(common_ngrams) / min(len(ngrams_s1), len(ngrams_s2))
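For example, the first two sets share only 'example' and the smaller set has two elements, so their overlap is 1/2:

print(get_ngram_overlap({'example', 'bigram'}, {'another', 'example'}))
>> 0.5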
The result is an MxM matrix, where M is the number of n-gram sets in my corpus:
print(sim_matrix)
>> [[1. 0.5 0.]
[0. 1. 0.]
[0. 0. 1.]]
The problem is that this code gets really slow as M grows, and it becomes computationally very expensive because all unique pairs have to be compared. Is there a more efficient way to compute this, e.g. an optimized distance-matrix computation with a custom similarity measure that works on sets of strings? Any help is appreciated!
You can use itertools.combinations (to generate the unique pairs of sets) together with triangular indexing of the array (np.triu_indices):
import numpy as np
from itertools import combinations

def get_ngram_overlap(ngrams_s1, ngrams_s2):
    min_len = min(len(ngrams_s1), len(ngrams_s2))
    if min_len == 0:
        return 0
    return len(ngrams_s1 & ngrams_s2) / min_len

corpus = [{'example', 'bigram'}, {'another', 'example'}, {'some', 'outlier'}]

sim_matrix = np.zeros((len(corpus), len(corpus)))
# fill the upper triangle with the calculated overlaps;
# combinations() yields the pairs in the same row-major order as np.triu_indices()
sim_matrix[np.triu_indices(len(sim_matrix), 1)] = \
    list(get_ngram_overlap(*c) for c in combinations(corpus, 2))
np.fill_diagonal(sim_matrix, 1)
print(sim_matrix)
[[1. 0.5 0. ]
[0. 1. 0. ]
[0. 0. 1. ]]
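Note that, as in your original code, only the upper triangle is filled. If a downstream step needs the full symmetric matrix, it can be mirrored afterwards, e.g.:

# optional: mirror the upper triangle into the lower triangle
sim_matrix = np.maximum(sim_matrix, sim_matrix.T)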
You can speed the computation up by first building an inverted index and running get_ngram_overlap() only on combinations of sets that share at least one word, so pairs with zero overlap are never compared:
import numpy as np
from itertools import combinations

# make the items in corpus frozensets (to make them hashable)
corpus = [frozenset({'example', 'bigram'}), frozenset({'another', 'example'}), frozenset({'some', 'outlier'})]

def get_ngram_overlap(ngrams_s1, ngrams_s2):
    mn = min(len(ngrams_s1), len(ngrams_s2))
    if mn == 0:
        return 0
    common_ngrams = ngrams_s1 & ngrams_s2
    return len(common_ngrams) / mn

# index: word -> list of sets containing that word
# nums:  set  -> list of positions of that set in the corpus
index, nums = {}, {}
for i, b in enumerate(corpus):
    for word in b:
        index.setdefault(word, []).append(b)
    nums.setdefault(b, []).append(i)

# compute the overlap only for pairs of sets that share a word
index2 = {}
for k, v in index.items():
    for c in combinations(v, 2):
        index2[c] = get_ngram_overlap(*c)

# write each computed overlap into the upper triangle of the matrix
sim_matrix = np.zeros((len(corpus), len(corpus)))
for a, b in index2:
    for x in nums[a]:
        for y in nums[b]:
            sim_matrix[(x, y) if x < y else (y, x)] = index2[(a, b)]

np.fill_diagonal(sim_matrix, 1)
print(sim_matrix)
Prints:
[[1. 0.5 0. ]
[0. 1. 0. ]
[0. 0. 1. ]]
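Since the goal is to merge documents with significant overlap, here is a minimal sketch of how the resulting matrix could be used afterwards; the 0.5 threshold and the union-find grouping are just one possible choice, not part of the question:

threshold = 0.5  # assumed cut-off for "significant" overlap
pairs = np.argwhere(sim_matrix >= threshold)

# union-find to collect groups of mutually overlapping sets
parent = list(range(len(corpus)))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

for x, y in pairs:
    if x != y:
        parent[find(x)] = find(y)

groups = {}
for i in range(len(corpus)):
    groups.setdefault(find(i), []).append(i)
print(list(groups.values()))
>> [[0, 1], [2]]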
A benchmark with 10k bigram sets drawn from 500 distinct words:
import random
import numpy as np
from timeit import timeit
from itertools import combinations

random.seed(123)
corpus = [frozenset({f'word{random.randint(1, 500)}', f'word{random.randint(1, 500)}'}) for _ in range(10_000)]

def get_ngram_overlap(ngrams_s1, ngrams_s2):
    mn = min(len(ngrams_s1), len(ngrams_s2))
    if mn == 0:
        return 0
    common_ngrams = ngrams_s1 & ngrams_s2
    return len(common_ngrams) / mn

def fn1():
    # naive version: compare every unique pair
    sim_matrix = np.zeros((len(corpus), len(corpus)))
    for i in range(len(corpus)):
        for j in range(i+1, len(corpus)):
            sim_matrix[i][j] = get_ngram_overlap(corpus[i], corpus[j])
    np.fill_diagonal(sim_matrix, 1)
    return sim_matrix

def fn2():
    # indexed version: compute overlaps only for pairs of sets that share a word
    index, nums = {}, {}
    for i, b in enumerate(corpus):
        for word in b:
            index.setdefault(word, []).append(b)
        nums.setdefault(b, []).append(i)

    index2 = {}
    for k, v in index.items():
        for c in combinations(v, 2):
            index2[c] = get_ngram_overlap(*c)

    sim_matrix = np.zeros((len(corpus), len(corpus)))
    for a, b in index2:
        for x in nums[a]:
            for y in nums[b]:
                sim_matrix[(x, y) if x < y else (y, x)] = index2[(a, b)]

    np.fill_diagonal(sim_matrix, 1)
    return sim_matrix

assert np.array_equal(fn1(), fn2())

t1 = timeit(fn1, number=1)
t2 = timeit(fn2, number=1)

print(t1)
print(t2)
On my machine (Python 3.10 / AMD Ryzen 5700X) this prints:
19.253960204077885
0.37714757514186203
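The speedup comes from the inverted index: fn1 computes the overlap for all ~50 million unique pairs, while fn2 only looks at pairs of sets that share at least one of the 500 words, which is roughly a few hundred thousand overlap computations in this benchmark.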