How can I get the bag of words and word frequencies in text format with Sklearn?


I want to use Sklearn's CountVectorizer to print out the list of words (i.e., the bag of words) for each document in the corpus, together with their respective term frequencies, in text format. How can I do this?

Here is my code:

from sklearn.feature_extraction.text import CountVectorizer

# Instantiate vectorizer
vectorizer = CountVectorizer()

# Document creation
document1 = 'this is a sunny day'
document2 = 'today is a very very very pleasant day and we have fun fun fun'
document3 = 'this is an amazin experience'

# List of documents
list_of_words = [document1, document2, document3]

# Bag of words
bag_of_words = vectorizer.fit(list_of_words)

# Verify the vocabulary indices of repeated words
print(vectorizer.vocabulary_.get('very'))
print(vectorizer.vocabulary_.get('fun'))

# Transform
bag_of_words = vectorizer.transform(list_of_words)

print(bag_of_words)

This prints:

  (0, 3)    1
  (0, 7)    1
  (0, 9)    1
  (0, 10)   1
  (1, 2)    1
  (1, 3)    1
  (1, 5)    3
  (1, 6)    1
  (1, 7)    1
  (1, 8)    1
  (1, 11)   1
  (1, 12)   3
  (1, 13)   1
  (2, 0)    1
  (2, 1)    1
  (2, 4)    1
  (2, 7)    1
  (2, 10)   1
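(For reference: each entry of that sparse output has the form (document_index, term_index) followed by the count, and the term index can be mapped back to a word via vocabulary_. A minimal self-contained sketch of that decoding:)

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ['this is a sunny day',
             'today is a very very very pleasant day and we have fun fun fun',
             'this is an amazin experience']

vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(documents)

# vocabulary_ maps word -> column index; invert it to decode the sparse output
index_to_word = {i: w for w, i in vectorizer.vocabulary_.items()}

# Each stored entry of the sparse matrix is (document_index, term_index): count
for doc_idx, term_idx in zip(*bag_of_words.nonzero()):
    print(doc_idx, index_to_word[term_idx], bag_of_words[doc_idx, term_idx])
```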
Tags: python, scikit-learn, nlp, corpus, countvectorizer
1 Answer

You can use the get_feature_names_out() and toarray() methods to obtain the list of words and the frequency of each term, respectively (in scikit-learn versions before 1.0 the former is called get_feature_names()). Using a Pandas DataFrame, you can then export both to a .csv file or print them to the console. The stopwords list provided by nltk can optionally be used to remove stopwords from the documents (to extend that list with additional stopwords, see this answer).

Example

from sklearn.feature_extraction.text import CountVectorizer  
import pandas as pd
import nltk
from nltk.corpus import stopwords

# You need to run this only once, in order to download the stopwords list
nltk.download('stopwords') 

# Load the stopwords list
stop_words_list = stopwords.words('english')

# The documents 
document1='Hope you have a pleasant day. Have fun.'
document2= 'Today is a very pleasant day and we will have fun fun fun'
document3= 'This event has been amazing. We had a lot of fun the whole day'

# List of documents
list_of_documents= [document1, document2, document3]

# Instantiate CountVectorizer
cv = CountVectorizer(stop_words=stop_words_list)

# Fit and transform
cv_fit = cv.fit_transform(list_of_documents)
word_list = cv.get_feature_names_out()  # use get_feature_names() in scikit-learn < 1.0
count_list = cv_fit.toarray()

# Create a dataframe with words and their respective frequency 
# Each row represents a document starting from document1
df = pd.DataFrame(data=count_list, columns=word_list)

# Print out the df
print(df)

# Optionally, save the df to a csv file
df.to_csv("bag_of_words.csv") 

Output:

(screenshot: the bag-of-words DataFrame)

To output the term frequencies for the whole corpus (i.e., aggregating the results over all documents), you can use the following (in addition to the example above):

import numpy as np

d = dict(zip(word_list, np.asarray(cv_fit.sum(axis=0))[0]))
sorted_d = dict(sorted(d.items(), key=lambda item: item[1], reverse=True))
print(sorted_d)

# Optionally, create a DataFrame
df = pd.DataFrame.from_dict(data=sorted_d, orient='index')
print(df)
df.to_csv("total_freq.csv")

Output:

(screenshot: the total frequency results)
