Naive Bayes classifier for movie reviews has low accuracy despite several attempts at feature selection

Question

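I am new to machine learning and have been tasked with building a model that predicts whether a review is good (1) or bad (0). I first tried a RandomForestClassifier, which gave 50% accuracy. I then switched to a Naive Bayes classifier, but there is still no improvement, even after running a grid search.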

My data looks like this (I am happy to share the data with anyone):

                                                 Reviews  Labels
0      For fans of Chris Farley, this is probably his...       1
1      Fantastic, Madonna at her finest, the film is ...       1
2      From a perspective that it is possible to make...       1
3      What is often neglected about Harold Lloyd is ...       1
4      You'll either love or hate movies such as this...       1
                                              ...     ...
14995  This is perhaps the worst movie I have ever se...       0
14996  I was so looking forward to seeing this film t...       0
14997  It pains me to see an awesome movie turn into ...       0
14998  "Grande Ecole" is not an artful exploration of...       0
14999  I felt like I was watching an example of how n...       0
[15000 rows x 2 columns]

My code preprocesses the text and then applies TfidfVectorizer before training the classifier:

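from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# stopwords, all_train_set and all_test_set are defined earlier (not shown)
vect = TfidfVectorizer(stop_words=stopwords, max_features=5000)
X_train = vect.fit_transform(all_train_set['Reviews'])
y_train = all_train_set['Labels']

clf = MultinomialNB()
clf.fit(X_train, y_train)

# Reuse the vocabulary fitted on the training set
X_test = vect.transform(all_test_set['Reviews'])
y_test = all_test_set['Labels']

print(classification_report(y_test, clf.predict(X_test), digits=4))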

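The classification report seems to show that one label is predicted fairly well while the other is predicted very poorly, which drags the overall result down.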

              precision    recall  f1-score   support
           0     0.5000    0.8546    0.6309      2482
           1     0.5000    0.1454    0.2253      2482
    accuracy                         0.5000      4964
   macro avg     0.5000    0.5000    0.4281      4964
weighted avg     0.5000    0.5000    0.4281      4964

I have now followed 8 different tutorials and tried each of their approaches, but I cannot get above 50%, which makes me think the problem may be with my features.

If anyone has any ideas or suggestions, I would greatly appreciate them.

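OK, I have added several preprocessing steps, including removing HTML tags, removing punctuation and single letters, and collapsing multiple spaces, using the code below: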

import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    # Remove HTML tags
    sentence = remove_tags(sen)
    # Replace punctuation and numbers with spaces
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Remove single characters surrounded by whitespace
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse multiple spaces into one
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence

I believe TfidfVectorizer automatically lowercases everything and lemmatizes it. The end result is still only 0.5.

Answer 1

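Text preprocessing is very important here. Removing stop words alone is not enough; I think you should consider the following:

Converting the text to lowercase
Removing punctuation
Apostrophe lookup ("'ll" -> " will", "'ve" -> " have")
Removing numbers
Lemmatizing and/or stemming the words in the reviews
etc.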

Take a look at text preprocessing methods.
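
As a rough illustration of the steps listed above, here is a minimal sketch using only the standard library. The name `normalize_review` and the small `CONTRACTIONS` table are hypothetical and only for illustration; a real table would need many more entries, and typographic (curly) apostrophes are not handled. Stemming and lemmatization are left out because they need a third-party library (NLTK's `PorterStemmer` is one option).

```python
import re

# Hypothetical, deliberately small contraction table; extend it for real data.
CONTRACTIONS = {
    "'ll": " will",
    "'ve": " have",
    "'re": " are",
    "'m": " am",
}

TAG_RE = re.compile(r"<[^>]+>")


def normalize_review(text):
    """Drop HTML tags, lowercase, expand contractions, keep letters only."""
    # Replace tags with a space so neighbouring words do not get glued together.
    text = TAG_RE.sub(" ", text).lower()
    # Apostrophe lookup: must run before punctuation is stripped.
    for suffix, expansion in CONTRACTIONS.items():
        text = text.replace(suffix, expansion)
    # Punctuation, digits and repeated whitespace all collapse to one space.
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()


print(normalize_review("You'll <br />LOVE it, I've seen it 3 times!"))
# prints: you will love it i have seen it times
```

A function like this could be passed to TfidfVectorizer through its `preprocessor` argument. Note that TfidfVectorizer lowercases text by default but does not stem or lemmatize on its own, so that step has to be added explicitly if it is wanted.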
