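I'm new to machine learning and have been tasked with building a model that predicts whether a review is good (1) or bad (0). I first tried a RandomForestClassifier, which gave 50% accuracy. I then switched to a Naive Bayes classifier, but even after a grid search there was no improvement.
My data looks like this (I'm happy to share it with anyone):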
Reviews Labels
0 For fans of Chris Farley, this is probably his... 1
1 Fantastic, Madonna at her finest, the film is ... 1
2 From a perspective that it is possible to make... 1
3 What is often neglected about Harold Lloyd is ... 1
4 You'll either love or hate movies such as this... 1
... ...
14995 This is perhaps the worst movie I have ever se... 0
14996 I was so looking forward to seeing this film t... 0
14997 It pains me to see an awesome movie turn into ... 0
14998 "Grande Ecole" is not an artful exploration of... 0
14999 I felt like I was watching an example of how n... 0
[15000 rows x 2 columns]
My code preprocesses the text and then vectorizes it with TfidfVectorizer before training the classifier:
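from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Learn the TF-IDF vocabulary from the training reviews only
vect = TfidfVectorizer(stop_words=stopwords, max_features=5000)
X_train = vect.fit_transform(all_train_set['Reviews'])
y_train = all_train_set['Labels']
clf = MultinomialNB()
clf.fit(X_train, y_train)
# Reuse the fitted vocabulary on the test reviews
X_test = vect.transform(all_test_set['Reviews'])
y_test = all_test_set['Labels']
print(classification_report(y_test, clf.predict(X_test), digits=4))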
The classification report suggests that one label is recalled well while the other is recalled very poorly, which drags the overall result down:
              precision    recall  f1-score   support

           0     0.5000    0.8546    0.6309      2482
           1     0.5000    0.1454    0.2253      2482

    accuracy                         0.5000      4964
   macro avg     0.5000    0.5000    0.4281      4964
weighted avg     0.5000    0.5000    0.4281      4964
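I have now followed about 8 different tutorials and tried every way of coding this that I could find, but I can't seem to get above 50%, which makes me think the problem may lie in my features.
If anyone has any ideas or suggestions, I'd greatly appreciate it.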
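An accuracy of exactly 0.5 on a balanced test set is what chance would give, so before changing the features it may be worth checking whether the same vectorizer and classifier can learn anything at all when cross-validated on the training set alone. The sketch below is only an illustration: the `reviews` and `labels` lists are made-up stand-ins, and in practice they would be replaced with `all_train_set['Reviews']` and `all_train_set['Labels']`. If cross-validated accuracy on the real training data is well above 0.5 while the test score stays at 0.5, that would point towards the test split (for example, reviews and labels no longer lining up) rather than the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data (not from the original dataset).
reviews = [
    "a wonderful film with great acting",
    "great story and wonderful characters",
    "I loved this film, truly great",
    "wonderful, funny and moving",
    "great fun, I loved every minute",
    "a moving story, wonderful acting",
    "the worst film I have ever seen",
    "terrible acting and a boring story",
    "boring, awful and far too long",
    "I hated this awful film",
    "a terrible, boring waste of time",
    "awful story, the worst acting",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is refit on each
# training fold, so no vocabulary leaks in from the held-out fold.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# For a classifier, cv=3 uses stratified 3-fold cross-validation.
scores = cross_val_score(pipeline, reviews, labels, cv=3, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```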
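OK, I have added several preprocessing steps: removing HTML tags, removing punctuation and single letters, and collapsing multiple spaces, using the code below: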
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    # Remove HTML tags
    sentence = remove_tags(sen)
    # Replace punctuation and digits (anything that is not a letter) with a space
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Remove single characters surrounded by whitespace
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse runs of whitespace into a single space
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence
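To see what these steps actually do to a review, it can help to run the function on a single string. The sketch below repeats the function from above so that it runs on its own; the sample sentence is made up for illustration. Note that the original casing is kept, the contraction "isn't" is reduced to "isn" (so the negation is lost), and a trailing space is left behind.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    # Same steps as in the post: tags, non-letters, single letters, spaces
    sentence = remove_tags(sen)
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence

# Hypothetical sample review, not taken from the dataset
sample = "<br />This is a GREAT movie, isn't it? 10/10!"
cleaned = preprocess_text(sample)
print(repr(cleaned))  # 'This is GREAT movie isn it '
```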
I believe TfidfVectorizer automatically lowercases everything and lemmatizes it. The end result is still only 0.5.
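On the lowercasing and lemmatization point: TfidfVectorizer lowercases by default (`lowercase=True`), but it does not stem or lemmatize, so "movie" and "movies" end up as separate features unless a custom tokenizer or analyzer is supplied. A minimal sketch on two made-up sentences illustrates this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative only, not taken from the review data)
docs = ["The Movies were GREAT", "the movie was great"]

# Default settings: lowercase=True, no stemming or lemmatization
vect = TfidfVectorizer()
vect.fit(docs)

# vocabulary_ maps each learned token to its column index
vocab = sorted(vect.vocabulary_)
print(vocab)  # ['great', 'movie', 'movies', 'the', 'was', 'were']
```

"GREAT" and "great" are merged by the lowercasing step, while "movie" and "movies" stay distinct.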
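Text preprocessing matters a great deal here. Removing stop words alone is not enough; I think you should consider the following:
Take a look at text preprocessing methods.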