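I need a function that builds an edit-distance matrix for a list of strings and that is insensitive to case and word order. For example, the edit distance between "Hello World" and "world hello" must be 0. My function currently uses the FuzzyWuzzy library, but the resulting matrix needs an extra transformation, because FuzzyWuzzy functions return a similarity score rather than a true edit distance. Can you suggest another library I could use for this?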
import numpy as np
from fuzzywuzzy import fuzz
from sklearn.cluster import AffinityPropagation


def affinity_propagation_clustering_algorithm_1(lst_words):
    words = np.asarray(lst_words)
    # token_sort_ratio returns a 0-100 similarity score; subtracting 100 maps
    # identical strings to 0 and less similar strings to more negative values.
    lev_similarity = np.array([[fuzz.token_sort_ratio(w1, w2) - 100 for w1 in words] for w2 in words])
    print(lev_similarity)
    lst_transformed_numbers = []
    obj_ind = []
    # Convert each column to float32, recording the index of any column that fails.
    for ind in range(lev_similarity.shape[1]):
        try:
            x = lev_similarity[:, ind].astype(np.float32)
            lst_transformed_numbers.append(x)
        except (TypeError, ValueError):
            obj_ind.append(ind)
    affprop = AffinityPropagation(affinity="precomputed")
    affprop.fit(lst_transformed_numbers)
    #print('affprop.labels_', affprop.labels_)
    for cluster_id in np.unique(affprop.labels_):
        print('cluster_id', cluster_id)
        cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
        print(cluster)
    return


if __name__ == "__main__":
    lst_words = ['Hello word', 'Hello word', 'all hello', 'peace word', 'Word hello', 'thin paper', 'paper thin']
    affinity_propagation_clustering_algorithm_1(lst_words)
You need to normalize the strings:
def normalize_string(s):
    return ' '.join(sorted(s.lower().split()))
Then, in the distance-matrix computation, replace fuzz.token_sort_ratio with Levenshtein.distance(), like this:
import Levenshtein as lev
...
lev_similarity = np.array([[(lev.distance(normalize_string(w1), normalize_string(w2))) for w1 in words] for w2 in words])
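To check the idea without installing anything, here is a minimal, dependency-free sketch of the same approach. The helpers `levenshtein`, `distance_matrix` and `to_similarity` are hypothetical names introduced for illustration; `levenshtein` is a plain dynamic-programming stand-in for `Levenshtein.distance`, not the library's own implementation. The negation step is an assumption about the clustering stage: with `affinity="precomputed"`, `AffinityPropagation` treats larger values as more similar, so a raw distance matrix would likely need its sign flipped first.

```python
from typing import List


def normalize_string(s: str) -> str:
    """Lowercase the string and sort its words so case and word order are ignored."""
    return ' '.join(sorted(s.lower().split()))


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (illustrative stand-in for Levenshtein.distance)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def distance_matrix(strings: List[str]) -> List[List[int]]:
    """Pairwise edit distances between the normalized strings."""
    normalized = [normalize_string(s) for s in strings]
    return [[levenshtein(x, y) for y in normalized] for x in normalized]


def to_similarity(matrix: List[List[int]]) -> List[List[int]]:
    """Negate distances so that identical strings get the highest value (0)."""
    return [[-d for d in row] for row in matrix]


if __name__ == "__main__":
    words = ['Hello word', 'Word hello', 'thin paper', 'paper thin']
    for row in distance_matrix(words):
        print(row)
```

Wrapping the result of `to_similarity` in `np.array(..., dtype=np.float32)` should give a matrix that can be passed to `affprop.fit` in place of the FuzzyWuzzy-based one.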