Function to create an edit distance matrix for a list of strings, insensitive to case and word order


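I need a function that creates an edit distance matrix for a list of strings and is insensitive to case and word order. For example, the edit distance between "Hello World" and "world hello" must be 0. My function uses the FuzzyWuzzy library, but the resulting matrix needs an extra transformation, because FuzzyWuzzy's functions return a 0-100 similarity score rather than a true edit distance. Can you suggest another library I could use to implement this?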


import numpy as np
from fuzzywuzzy import fuzz
from sklearn.cluster import AffinityPropagation

lst_words = ['Hello word', 'Hello word', 'all hello', 'peace word', 'Word hello', 'thin paper', 'paper thin']

def affinity_propagation_clustering_algorithm_1(lst_words):
    words = np.asarray(lst_words)

    lev_similarity = np.array([[(fuzz.token_sort_ratio(w1, w2)) - 100 for w1 in words] for w2 in words])
    print(lev_similarity)
    lst_transformed_numbers = []
    obj_ind = []
    for ind in range(lev_similarity.shape[1]):
        try:
            x = lev_similarity[:, ind].astype(np.float32)
            lst_transformed_numbers.append(x)
        except (TypeError, ValueError):
            obj_ind.append(ind)

    affprop = AffinityPropagation(affinity="precomputed")
    affprop.fit(lst_transformed_numbers)
    #print('affprop.labels_', affprop.labels_)
    for cluster_id in np.unique(affprop.labels_):
        print('cluster_id', cluster_id)
        cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
        print(cluster)
    return

if __name__ == "__main__":
    lst_words = ['Hello word', 'Hello word', 'all hello', 'peace word', 'Word hello', 'thin paper', 'paper thin']
    affinity_propagation_clustering_algorithm_1(lst_words)

Tags: python, levenshtein-distance
1 Answer

You need to normalize the strings first, so that case and word order no longer matter:

def normalize_string(s):
    return ' '.join(sorted(s.lower().split()))
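
As a quick check of what this normalization does, here is a minimal sketch using the example strings from the question. Both inputs should collapse to the same lowercase, alphabetically ordered form, which is what makes their edit distance 0.

```python
def normalize_string(s):
    # Lowercase, split on whitespace, sort the tokens, and rejoin with single spaces.
    return ' '.join(sorted(s.lower().split()))

print(normalize_string("Hello World"))  # hello world
print(normalize_string("world hello"))  # hello world
```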

Then, in the distance matrix computation, replace fuzz.token_sort_ratio with Levenshtein.distance(), like this:

import Levenshtein as lev

...

lev_similarity = np.array([[(lev.distance(normalize_string(w1), normalize_string(w2))) for w1 in words] for w2 in words])
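
To see the whole idea end to end without installing anything, the following is a rough, dependency-free sketch. The edit_distance helper is a hypothetical pure-Python stand-in for Levenshtein.distance (a textbook dynamic-programming version), and distance_matrix is an illustrative name, not part of any library. The result is a plain list of lists, which could be wrapped in np.array if needed.

```python
def normalize_string(s):
    # Lowercase and sort the words so case and word order are ignored.
    return ' '.join(sorted(s.lower().split()))

def edit_distance(a, b):
    # Textbook Levenshtein distance, keeping only two rows of the DP table.
    # Hypothetical stand-in for Levenshtein.distance from the answer above.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def distance_matrix(strings):
    # Normalize once, then compare every pair of normalized strings.
    normalized = [normalize_string(s) for s in strings]
    return [[edit_distance(a, b) for b in normalized] for a in normalized]

lst_words = ['Hello word', 'Word hello', 'thin paper', 'paper thin']
matrix = distance_matrix(lst_words)
for row in matrix:
    print(row)
```

Pairs such as 'Hello word' / 'Word hello' and 'thin paper' / 'paper thin' come out with distance 0. One caveat if the matrix is passed on to AffinityPropagation(affinity="precomputed") as in the question: that estimator expects similarities, so the distances would presumably need to be negated first, which matches the sign convention of the original token_sort_ratio - 100 values.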
