How to lemmatize with NLTK or pywsd

Question · 0 votes · 2 answers

I know my explanation is long, but I think it's necessary. Hopefully some patient and helpful soul will bear with me :) I'm working on a sentiment analysis project atm and I'm stuck on the preprocessing part. I imported a csv file, turned it into a dataframe, and converted the variables/columns to the correct data types. Then I tokenized like this, selecting the variable I want to tokenize (the tweet content) in the dataframe (df_tweet1):

# Tokenization
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]
for i in tokenized_sents:
    print(i)

The output is a list of lists containing the words (tokens).

Then I perform stop word removal:

# Stop word removal
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
#add words that aren't in the NLTK stopwords list
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

clean_sents = []
for m in tokenized_sents:
    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]
    clean_sents.append(stop_m)

The output is the same, but without the stop words.

The next two steps, POS tagging and lemmatization, are the ones that confuse me. I tried two things:

1) Converting the previous output into a list of strings:

new_test = [' '.join(x) for x in clean_sents]

because I thought that would let me do both steps at once with this code:

from pywsd.utils import lemmatize_sentence

text = new_test
lemm_text = lemmatize_sentence(text, keepWordPOS=True)

I get this error: TypeError: expected string or bytes-like object

2) Doing POS tagging and lemmatization separately. First POS tagging, using clean_sents as input:

# PART-OF-SPEECH
import nltk

def process_content(clean_sents):
    try:
        tagged_list = []  
        for lst in clean_sents[:500]: 
            for item in lst:
                words = nltk.word_tokenize(item)
                tagged = nltk.pos_tag(words)
                tagged_list.append(tagged)
        return tagged_list

    except Exception as e:
        print(str(e))

output_POS_clean_sents = process_content(clean_sents)

The output is a list of lists of words with their tags. I then want to lemmatize that output, but how? I tried two modules, and both gave me errors:

from pywsd.utils import lemmatize_sentence

lemmatized= [[lemmatize_sentence(output_POS_clean_sents) for word in s]
              for s in output_POS_clean_sents]

# AND

from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in s]
              for s in output_POS_clean_sents]
print(lemmatized)

The errors are, respectively:

TypeError: expected string or bytes-like object

AttributeError: 'tuple' object has no attribute 'endswith'

python nltk sentiment-analysis lemmatization part-of-speech
2 Answers

Answer 1 · 0 votes

In the first part, new_test is a list of strings. lemmatize_sentence expects a single string, so passing new_test raises an error like the one you got. You have to pass each string separately and then build a list from the lemmatized results. So:

text = new_test
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]

should create a list of lemmatized sentences.
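If you just want to sanity-check the call first, here is a minimal sketch with made-up example sentences (by default lemmatize_sentence should return a list of lemmas; check the pywsd docs for the exact return format of keepWordPOS=True):

from pywsd.utils import lemmatize_sentence

# made-up sample sentences, just to inspect what the function returns
samples = ['The dogs are running in the park', 'She was eating cakes']
for s in samples:
    print(lemmatize_sentence(s))  # prints a list of lemmas for each sentence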

I actually once did a project that looks similar to the one you're working on. I wrote the following function to lemmatize strings:

import lemmy, re

def remove_stopwords(lst):
    with open('stopwords.txt', 'r') as sw:
        #read the stopwords file 
        stopwords = sw.read().split('\n')
        return [word for word in lst if not word in stopwords]

def lemmatize_strings(body_text, language = 'da', remove_stopwords_ = True):
    """Function to lemmatize a string or a list of strings, i.e. remove prefixes. Also removes punctuations.

    -- body_text: string or list of strings
    -- language: language of the passed string(s), e.g. 'en', 'da' etc.
    """

    if isinstance(body_text, str):
        body_text = [body_text] #Convert whatever passed to a list to support passing of single string

    if not hasattr(body_text, '__iter__'):
        raise TypeError('Passed argument should be a sequence.')

    lemmatizer = lemmy.load(language) #load lemmatizing dictionary

    lemma_list = [] #list to store each lemmatized string 

    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+') #all characters and digits, i.e. all possible words

    for string in body_text:
        #remove punctuation and split words
        matches = word_regex.findall(string)

        #lowercase the words unless they are all caps
        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]

        #remove words that are in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        #lemmatize each word and choose the shortest word of suggested lemmatizations
        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]

        #remove stopwords again, in case lemmatization turned a word into one
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        lemma_list.append(' '.join(lemmatized_string))

    return lemma_list if len(lemma_list) > 1 else lemma_list[0] #return list if list was passed, else return string
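Just as an illustration, a hypothetical call with made-up Danish input could look like this (remove_stopwords_=False so that no stopwords.txt file is needed):

# hypothetical usage of the lemmatize_strings function defined above
sample_texts = ['Hundene løber i parken', 'Katten sover på bordet']
print(lemmatize_strings(sample_texts, language='da', remove_stopwords_=False))
# expected: a list with one lemmatized, lowercased string per input string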

You're welcome to take a look at it, but don't feel obliged to. I'd be very happy if it helps you get some ideas; I spent a lot of time figuring it out myself!

Let me know :-)


Answer 2 · 0 votes

If you're working with a dataframe, I suggest you store the result of each preprocessing step in a new column. That way you can always inspect the output, and you can always build the list of lists used as model input in one line of code right before feeding the model.

As for your code, it can be optimized (for example, you can do stop word removal and tokenization at the same time), and I'm a bit confused by the steps you perform. For instance, you lemmatize several times with different libraries, which doesn't make sense.

# I won't write all the imports, you get them from your code
# define new column to store the processed tweets
df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index)

tknzr = TweetTokenizer()
lmtzr = WordNetLemmatizer()

stop_words = set(stopwords.words("english"))
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

# iterate through each tweet
for ind, row in df_tweet1.iterrows():

    # get initial tweet: ['This is the initial tweet']
    tweet = row['Tweet Content']

    # tokenisation, stopwords removal and lemmatisation all at once
    # out: ['initial', 'tweet']
    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]

    # pos tag, no need to lemmatise again after.
    # out: [('initial', 'JJ'), ('tweet', 'NN')]
    tweet = nltk.pos_tag(tweet)

    # save processed tweet into the new column
    df_tweet1.loc[ind, 'Tweet Content Clean'] = tweet
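Because everything is stored in the dataframe, you can inspect the intermediate output at any point, for example (a minimal sketch, assuming pandas is imported as pd as in your code):

# compare the original tweets with the processed ones
print(df_tweet1[['Tweet Content', 'Tweet Content Clean']].head())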

If you need it, you can then create a list of lists containing all the tweets in the dataframe:

# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
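And if your model only needs the tokens themselves, a small comprehension (a sketch, assuming the (word, tag) structure produced above) strips the tags back off:

# drop the POS tags and keep only the lemmatized tokens
# out: [['initial', 'tweet'], [second tweet tokens], ...]
token_lists = [[word for word, tag in tweet] for tweet in all_tweets]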