Removing bigrams after tokenization with TfidfVectorizer

Problem description

I am trying to remove certain bigrams created by TfidfVectorizer. I am using text.TfidfVectorizer so that I can pass in my own preprocessor function.

Test strings and preprocessor function:

doc2 = ['this is a test past performance here is another that has aa aa adding builing cat dog horse hurricane', 
        'another that has aa aa and start date and hurricane hitting south carolina']

def remove_bigrams(doc):
    gram_2 = ['past performance', 'start date', 'aa aa']
    res = []
    for record in doc:
        the_string = record
        for phrase in gram_2:
            the_string = the_string.replace(phrase, "")
        res.append(the_string)
    return res

remove_bigrams(doc2)

My TfidfVectorizer instantiation and fit_transform:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stop_words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text

custom_stop_words = [i for i in stop_words]

vec = text.TfidfVectorizer(stop_words=custom_stop_words,
                           analyzer='word',
                           ngram_range=(2, 2),
                           preprocessor=remove_bigrams,
                          )

features = vec.fit_transform(doc2)

Here is the error I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [49], in <cell line: 5>()
      3 #t3_cv = CountVectorizer(t2, stop_words = stop_words)
      4 vec = text.TfidfVectorizer(stop_words=custom_stop_words, analyzer='word', ngram_range = (2,2), preprocessor = remove_bigrams)
----> 5 features = vec.fit_transform(doc2)

File c:\Development_Solutions\Sandbox\SBVE\lib\site-packages\sklearn\feature_extraction\text.py:2079, in TfidfVectorizer.fit_transform(self, raw_documents, y)
   2072 self._check_params()
   2073 self._tfidf = TfidfTransformer(
   2074     norm=self.norm,
   2075     use_idf=self.use_idf,
   2076     smooth_idf=self.smooth_idf,
   2077     sublinear_tf=self.sublinear_tf,
   2078 )
-> 2079 X = super().fit_transform(raw_documents)
   2080 self._tfidf.fit(X)
   2081 # X is already a transformed view of raw_documents so
   2082 # we set copy to False

File c:\Development_Solutions\Sandbox\SBVE\lib\site-packages\sklearn\feature_extraction\text.py:1338, in CountVectorizer.fit_transform(self, raw_documents, y)
   1330             warnings.warn(
   1331                 "Upper case characters found in"
   1332                 " vocabulary while 'lowercase'"
   1333                 " is True. These entries will not"
   1334                 " be matched with any documents"
   1335             )
   1336             break
-> 1338 vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
   1340 if self.binary:
   1341     X.data.fill(1)

File c:\Development_Solutions\Sandbox\SBVE\lib\site-packages\sklearn\feature_extraction\text.py:1209, in CountVectorizer._count_vocab(self, raw_documents, fixed_vocab)
   1207 for doc in raw_documents:
   1208     feature_counter = {}
-> 1209     for feature in analyze(doc):
   1210         try:
   1211             feature_idx = vocabulary[feature]

File c:\Development_Solutions\Sandbox\SBVE\lib\site-packages\sklearn\feature_extraction\text.py:113, in _analyze(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)
    111     doc = preprocessor(doc)
    112 if tokenizer is not None:
--> 113     doc = tokenizer(doc)
    114 if ngrams is not None:
    115     if stop_words is not None:

TypeError: expected string or bytes-like object

How can I fix this?

python scikit-learn nlp preprocessor tfidfvectorizer
1 Answer

The preprocessor is called once per document, not once on the whole corpus. Your remove_bigrams receives a single string, iterates over its characters, and returns a list, which is why the tokenizer then raises "expected string or bytes-like object".

Rewriting it to take and return a single string should fix it:

def remove_bigrams(doc: str) -> str:
    """Remove certain bi-grams from a document."""
    gram_2 = ['past performance', 'start date', 'aa aa']
    for phrase in gram_2:
        doc = doc.replace(phrase, "")
    return doc