Stopword problem in Python text data preprocessing

Problem description

I want to do topic modeling in Python. To clean out stopwords, I use my own stopword list, a stopword list I found on GitHub, and NLTK's stopword list. However, when I check the results, I see that words I had explicitly listed as stopwords were not removed and instead show up inside the topics. The code I use for data preprocessing is given below; it runs without any errors. I have also added one of the topics I got as an example. I don't understand why the stopwords are not fully removed. I would appreciate any help.


import pandas as pd
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import requests


nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm')



# GitHub stopwords
url = "https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt"
github_stopwords = set(requests.get(url).text.splitlines())

# my stopwords 
with open('/content/s.txt') as f:
    my_stopwords = set(line.strip() for line in f)

# NLTK stopwords 
nltk_stopwords = set(stopwords.words('english'))

# All stopwords 
all_stopwords = nltk_stopwords.union(github_stopwords, my_stopwords)

# clean function
def preprocess_text(text):
    # 1. Lower
    text = text.lower()
    
    # 2. Tokenization 
    words = word_tokenize(text)
    
    # 3. Keep alphabetic tokens and drop stopwords
    words = [word for word in words if word.isalpha() and word not in all_stopwords]

    print(f"Remaining words: {words}")

    # 4. Stopword filter (redundant here: stopwords were already removed in step 3)
    filtered_words = [word for word in words if word not in all_stopwords]

    # 5. Lemmatization (using spaCy)
    doc = nlp(" ".join(filtered_words))
    lemmatized_words = [token.lemma_ for token in doc]

    print(f"Lemmatized text: {' '.join(lemmatized_words)}\n")
    
    # Return sanitized text
    return " ".join(lemmatized_words)

Output:

For example, this topic contains words (challenge, theoretical) that are in my stopword lists:

Topic 25: challenge, framework, theoretical, provide, offer
1 Answer

Suggestion: check your stopword list to make sure it contains only lowercase words. The problem may also come from the parameters you pass to the topic modeling model.
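For the first point, a quick sanity check on the merged set could look like the sketch below (it assumes the all_stopwords variable built in your code above; "challenge" and "theoretical" are just the example words that leaked into your topics):

# Sanity checks on the merged stopword set from the question's code.

# 1) Entries that are not fully lowercase would never match the lowercased tokens
not_lower = {w for w in all_stopwords if w != w.lower()}
print(f"{len(not_lower)} non-lowercase stopwords, e.g. {sorted(not_lower)[:10]}")

# 2) Entries with stray whitespace or punctuation (e.g. '\r' from a Windows-encoded file)
odd = {w for w in all_stopwords if w != w.strip() or not w.isalpha()}
print(f"{len(odd)} suspicious entries, e.g. {sorted(odd)[:10]}")

# 3) Confirm that the words leaking into the topics really are in the set
for w in ("challenge", "theoretical"):
    print(w, "in stopword set:", w in all_stopwords)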

The preprocessing code itself works fine for me (even with the two identical stopword-removal steps):

import pandas as pd
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import requests


nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm')



# GitHub stopwords
url = "https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt"
github_stopwords = set(requests.get(url).text.splitlines())

# my stopwords
my_stopwords = {"challenge", "theoretical"}
# NLTK stopwords 
nltk_stopwords = set(stopwords.words('english'))

# All stopwords 
all_stopwords = nltk_stopwords.union(github_stopwords, my_stopwords)

# clean function
def preprocess_text(text):
    # 1. Lower
    text = text.lower()
    
    # 2. Tokenization 
    words = word_tokenize(text)
    
    # 3. Keep alphabetic tokens and drop stopwords
    words = [word for word in words if word.isalpha() and word not in all_stopwords]

    print(f"Remaining words: {words}")

    # 4. Stopword filter (redundant here: stopwords were already removed in step 3)
    filtered_words = [word for word in words if word not in all_stopwords]

    # 5. Lemmatization (using spaCy)
    doc = nlp(" ".join(filtered_words))
    lemmatized_words = [token.lemma_ for token in doc]

    print(f"Lemmatized text: {' '.join(lemmatized_words)}\n")
    
    # Return sanitized text
    return " ".join(lemmatized_words)

test_sentence = "This is a theoretical challenge to offer some insights and provide a useful framework."

preprocessed_test = preprocess_text(test_sentence)
print(preprocessed_test)

Output:

offer insight provide framework
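If the list itself checks out, the leak is more likely on the modeling side. Note that in your function, lemmatization runs after the stopword filter, so a lemma such as "challenge" can reappear from a surviving token like "challenges". A minimal sketch of one way to guard against that (assuming a scikit-learn pipeline, which the question does not show; raw_texts is a placeholder for your corpus) is to pass the same stopword list to the vectorizer so the model never counts those terms:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# `preprocess_text` and `all_stopwords` come from the code above;
# `raw_texts` is a placeholder for your list of documents.
docs = [preprocess_text(t) for t in raw_texts]

# Enforce the same stopword list again at the vectorization stage
vectorizer = CountVectorizer(lowercase=True, stop_words=list(all_stopwords))
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=30, random_state=42)
lda.fit(X)

# Inspect the top words per topic to confirm no stopwords slipped through
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"Topic {i}: {', '.join(top)}")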