Optimizing the process of finding word association strengths in input text

Question (1 vote, 1 answer)

I have written the following (rough) code to find the association strength between the words in a given piece of text.

import re
import numpy as np     ## Used below for the association matrix (np.zeros, np.triu_indices_from)
import pandas as pd    ## Used below for the DataFrames

## The first paragraph of Wikipedia's article on itself - you can try other pieces of text, preferably with more words (to produce more meaningful word pairs)
text = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger.[10] Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'. Initially an English-language encyclopedia, versions in other languages were quickly developed. With 5,748,461 articles,[notes 4] the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages[14] and by February 2014 it had reached 18 billion page views and nearly 500 million unique visitors per month.[15] In 2005, Nature published a peer review comparing 42 science articles from Encyclopadia Britannica and Wikipedia and found that Wikipedia's level of accuracy approached that of Britannica.[16] Time magazine stated that the open-door policy of allowing anyone to edit had made Wikipedia the biggest and possibly the best encyclopedia in the world and it was testament to the vision of Jimmy Wales.[17] Wikipedia has been criticized for exhibiting systemic bias, for presenting a mixture of 'truths, half truths, and some falsehoods',[18] and for being subject to manipulation and spin in controversial topics.[19] In 2017, Facebook announced that it would help readers detect fake news by suitable links to Wikipedia articles. YouTube announced a similar plan in 2018."
text = re.sub(r"\[.*?\]", "", text)         ## Remove square brackets and anything inside them (e.g. citation markers)
text = re.sub(r"[^a-zA-Z0-9.]+", ' ', text) ## Replace special characters (everything except letters, digits and dots) with spaces
text = text.lower()                         ## Convert everything to lowercase
## Can add other preprocessing steps, depending on the input text, if needed.







from nltk.corpus import stopwords
import nltk

stop_words = stopwords.words('english')

desirable_tags = ['NN'] # We want only nouns - can also add 'NNP', 'NNS', 'NNPS' if needed, depending on the results

word_list = []

for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.append(word)





'''
Construct the association matrix, where we count 2 words as being associated 
if they appear in the same sentence.

Later, I'm going to define associations more properly by introducing a 
window size (say, if 2 words are separated by at most 5 words in a sentence, 
then we consider them to be associated) - a rough sketch of this idea is 
included after the matrix code below.
'''

table = np.zeros((len(word_list),len(word_list)), dtype=int)

for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i,j]+=1

df = pd.DataFrame(table, columns=word_list, index=word_list)
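
As a rough illustration of the window-based definition mentioned in the docstring above (this is only a sketch of mine, not part of the original code; the window size of 5 is an assumption), the same kind of table could be filled by pairing only words that appear within 5 tokens of each other:

word_pos = {word: idx for idx, word in enumerate(word_list)}    # word -> row/column index in the matrix
window = 5                                                      # assumed window size

windowed_table = np.zeros((len(word_list), len(word_list)), dtype=int)

for sent in text.split('.'):
    tokens = sent.split()
    for k, w1 in enumerate(tokens):
        if w1 not in word_pos:
            continue
        # Pair w1 only with words at most `window` tokens to its right
        for w2 in tokens[k + 1 : k + 1 + window]:
            if w2 in word_pos:
                windowed_table[word_pos[w1], word_pos[w2]] += 1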







# Count the number of occurrences of each word from word_list in the text

all_words = pd.DataFrame(np.zeros((len(df), 2)), columns=['Word', 'Count'])
all_words.Word = df.index

for sent in text.split('.'):
    for word in sent.split():
        if word in word_list:
            all_words.loc[all_words.Word==word,'Count'] += 1







# Sort the word pairs in decreasing order of their association strengths

df.values[np.triu_indices_from(df, 0)] = 0 # Zero out the upper triangle (including the diagonal), so each pair is kept only once

assoc_df = pd.DataFrame(columns=['Word 1', 'Word 2', 'Association Strength (Word 1 -> Word 2)'])
for row_word in df:
    for col_word in df:
        '''
        If Word1 occurs 10 times in the text, and Word1 & Word2 occur in the same sentence 3 times,
        the association strength of Word1 and Word2 is 3/10 - Please correct me if this is wrong.
        '''
        assoc_df = assoc_df.append({'Word 1': row_word, 'Word 2': col_word, 
                                        'Association Strength (Word 1 -> Word 2)': df[row_word][col_word]/all_words[all_words.Word==row_word]['Count'].values[0]}, ignore_index=True)

assoc_df.sort_values(by='Association Strength (Word 1 -> Word 2)', ascending=False)

This produces word associations like the following:

        Word 1          Word 2          Association Strength (Word 1 -> Word 2)
330     wiki            encyclopedia    3.0
895     encyclopadia    found           1.0
1317    anyone          edit            1.0
754     peer            science         1.0
755     peer            encyclopadia    1.0
756     peer            britannica      1.0
...
...
...

However, the code contains a lot of for loops that hamper its runtime. In particular, the last part (sorting the word pairs in decreasing order of their association strengths) consumes a lot of time, because it computes the association strengths of n^2 word pairs/combinations, where n is the number of words we are interested in (those in word_list in my code above).

So, here is the kind of help I am looking for:

  1. How can I vectorize the code, or otherwise make it more efficient?
  2. Instead of producing n^2 combinations/pairs of words in the last step, is there some way to prune some of them before generating them? I am going to prune some of the useless/meaningless pairs by inspection later anyway.
  3. Also, I know this does not fall within the scope of a coding question, but I would love to know whether there is any mistake in my logic, particularly when calculating the word association strengths.
python performance loops nlp analytics
1 Answer (1 vote)

Since you asked about your specific code, I will not go into alternative libraries here. I'll mostly focus on points 1) and 2) of your question:

Instead of iterating over the whole word list twice (i and j), you can reduce the processing time by iterating j only between i + 1 and the end of the list. This removes duplicate pairs (indexes 24 and 42 as well as indexes 42 and 24) and identical pairs (index 42 and 42).

for sent in text.split('.'):
    for i in range(len(word_list)):
        for j in range(i+1, len(word_list)):
            if word_list[i] in sent and word_list[j] in sent:
                table[i,j]+=1

Be careful, though: the in operator will also match parts of words (like and inside hand) - a short check demonstrating this follows the code below. Of course, you can also remove the j iteration over the full word list entirely by first filtering the sentence for the words that are in your word list and only then pairing them:

word_list = set()    # Using set instead of list makes lookups faster since this is a hashed structure

for sent in text.split('.'):
    for word in sent.split():
        '''
        Extract the unique, non-stopword nouns only
        '''
        if word not in word_list and word not in stop_words and nltk.pos_tag([word])[0][1] in desirable_tags:
            word_list.add(word)

(...)
word_index = {word: idx for idx, word in enumerate(word_list)}    # map each word to its row/column in table (word_list is now a set, so we fix an order here)

for sent in text.split('.'):
    found_words = [word for word in sent.split() if word in word_list]    # list comprehensions are usually faster than pure for loops
    # If you want to count duplicate words, then leave the whole line below out.
    found_words = tuple(frozenset(found_words))    # make every word unique using a set and then indexable again by converting it into a tuple
    for i in range(len(found_words)):
        for j in range(i + 1, len(found_words)):
            table[word_index[found_words[i]], word_index[found_words[j]]] += 1
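
As mentioned above, here is a quick check of the partial-match behaviour of the in operator (the example strings are made up for illustration):

sent = "he raised his hand"
print("and" in sent)             # True  - matches the substring inside "hand"
print("and" in sent.split())     # False - whole-word comparison against the token list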

In general, though, you should consider using external libraries for most of this. As some of the comments on your question have already pointed out, splitting on '.' can give you wrong results, and the same goes for splitting on spaces, for example with words separated by a '-' or words followed by a ','.
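
For example, NLTK (which the question already uses) provides sentence and word tokenizers that handle most of that punctuation. A minimal sketch, assuming the 'punkt' tokenizer data is available (it can be fetched with nltk.download):

import nltk
# nltk.download('punkt')    # uncomment on first use if the tokenizer data has not been downloaded yet

for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)               # splits off punctuation such as ',' '-' and '.' as separate tokens
    words = [t for t in tokens if t.isalnum()]      # keep only word-like tokens
    # ... proceed with the filtering/pairing logic from above on `words`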
