Reducing the runtime of cosine similarity computation between 2 lists in Python

Question

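I'm using Python to assemble a dictionary of Twitter hashtags. The keys are the hashtags themselves, and each entry is a large collection of tweets containing that hashtag, appended end to end. I have a separate list of all the tweets without hashtags, and I'm adding them to the dictionary entries based on cosine similarity. Everything works, but it is very slow (a few hours for 4000 tweets). The nested for loops give me O(N^2) runtime. Does anyone have ideas on how I could improve the runtime? Any suggestions would be greatly appreciated!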

    taglessVects = normalize(vectorizer.transform(needTags))
    dictVects = normalize(vectorizer.transform(newDict))

    # newDict contains: newDict[hashtag]: "tweets that used that hashtag"
    # needTags is a list of all the tweets that didn't use a hashtag
    for dVect, entry in zip(dictVects, newDict):
        for taglessVect, tweet in zip(taglessVects, needTags):
            if cosine_similarity(taglessVect, dVect) > .9:
                newDict[entry] = newDict[entry] + ' ' + tweet

    return newDict

Answer 1

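You have created a brute-force nearest neighbors algorithm that uses cosine distance as its metric. The sklearn docs on this topic are good.

Sklearn implements more optimized versions, which should be faster. You can use them, but you will need to change your dictionary: you need some way to map each vector back to its corresponding tweet.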

from sklearn.neighbors import NearestNeighbors

neigh = NearestNeighbors(n_neighbors=1, metric='cosine')
neigh.fit(dictVects)

# nearest[i] holds the row index in dictVects of the vector closest to taglessVects[i]
nearest = neigh.kneighbors(taglessVects, return_distance=False)
for x, y in zip(taglessVects, nearest):
    z = dictVects[y[0]]
    # z is the nearest dictionary vector to the tagless vector x
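
As a complement, here is a minimal sketch of one way to keep the question's original rule (append a tweet to every hashtag whose similarity exceeds 0.9) while also handling the vector-to-tweet mapping. It is not part of the answer above. It swaps the nested Python loops for a single matrix product, relying on the fact that the dot product of unit-length rows equals their cosine similarity. The helper names (`l2_normalize`, `assign_tweets`) and the tiny hand-written count vectors are hypothetical stand-ins for the output of the question's `vectorizer`. The work is still proportional to tweets times hashtags, but it runs inside NumPy rather than in a Python loop.

```python
import numpy as np


def l2_normalize(rows):
    """Scale each row to unit length; all-zero rows are left as zeros."""
    rows = np.asarray(rows, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty vectors
    return rows / norms


def assign_tweets(new_dict, dict_vects, need_tags, tagless_vects, threshold=0.9):
    """Append each tagless tweet to every hashtag entry above the threshold."""
    hashtags = list(new_dict)  # row i of dict_vects belongs to hashtags[i]
    # One matrix product yields every tweet-vs-hashtag cosine similarity.
    sims = l2_normalize(tagless_vects) @ l2_normalize(dict_vects).T
    tweet_idx, tag_idx = np.nonzero(sims > threshold)
    result = dict(new_dict)  # leave the caller's dictionary untouched
    for t, h in zip(tweet_idx, tag_idx):
        result[hashtags[h]] += ' ' + need_tags[t]
    return result


# Toy data (assumed for illustration). Vocabulary: python, code, cat, photo.
new_dict = {'#python': 'python code', '#cats': 'cat photo'}
dict_vects = [[1, 1, 0, 0], [0, 0, 1, 1]]
need_tags = ['code python', 'photo cat photo', 'weather']
tagless_vects = [[1, 1, 0, 0], [0, 0, 1, 2], [0, 0, 0, 0]]

result = assign_tweets(new_dict, dict_vects, need_tags, tagless_vects)
```

With the real data, the sparse matrices returned by the vectorizer should also support the `@` product, so the dense arrays here are only for readability. Treat that as an assumption to verify against your SciPy version.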
