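I want to use Sklearn's CountVectorizer to print, in text format, the list of words (i.e. the bag of words) for each document in a corpus, together with their respective term frequencies. How can I do this?
Here is my code: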
from sklearn.feature_extraction.text import CountVectorizer
#instantiate vectorizer
vectorizer=CountVectorizer()
#Document creation
document1 = 'this is a sunny day'
document2 = 'today is a very very very pleasant day and we have fun fun fun'
document3 = 'this is an amazin experience'
#list
list_of_words= [document1,document2,document3]
#bag of words
bag_of_words = vectorizer.fit(list_of_words)
#verify vocabulary of repeated word
print (vectorizer.vocabulary_.get('very'))
print (vectorizer.vocabulary_.get('fun'))
#transform
bag_of_words=vectorizer.transform(list_of_words)
print(bag_of_words)
# Output:
# (0, 3) 1
# (0, 7) 1
# (0, 9) 1
# (0, 10) 1
# (1, 2) 1
# (1, 3) 1
# (1, 5) 3
# (1, 6) 1
# (1, 7) 1
# (1, 8) 1
# (1, 11) 1
# (1, 12) 3
# (1, 13) 1
# (2, 0) 1
# (2, 1) 1
# (2, 4) 1
# (2, 7) 1
# (2, 10) 1
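You can use the get_feature_names_out() method (named get_feature_names() in scikit-learn versions before 1.0) and the toarray() method to get the list of words and the frequency of each term, respectively. Using a Pandas DataFrame, you can then export both to a .csv file or print them to the console. The stopwords list provided by nltk can optionally be used to remove stop words from the documents (to extend the current list with more stop words, see this answer).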
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import nltk
from nltk.corpus import stopwords
# You need to run this only once, in order to download the stopwords list
nltk.download('stopwords')
# Load the stopwords list
stop_words_list = stopwords.words('english')
# The documents
document1 = 'Hope you have a pleasant day. Have fun.'
document2 = 'Today is a very pleasant day and we will have fun fun fun'
document3 = 'This event has been amazing. We had a lot of fun the whole day'
# List of documents
list_of_documents = [document1, document2, document3]
# Instantiate CountVectorizer
cv = CountVectorizer(stop_words=stop_words_list)
# Fit and transform
cv_fit = cv.fit_transform(list_of_documents)
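# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
word_list = cv.get_feature_names_out()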
count_list = cv_fit.toarray()
# Create a dataframe with words and their respective frequency
# Each row represents a document starting from document1
df = pd.DataFrame(data=count_list, columns=word_list)
# Print out the df
print(df)
# Optionally, save the df to a csv file
df.to_csv("bag_of_words.csv")
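To output the term frequencies for the whole corpus (i.e. the counts summed over all documents), you can use the following, in addition to the example above: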
import numpy as np
d = dict(zip(word_list, np.asarray(cv_fit.sum(axis=0))[0]))
sorted_d = dict(sorted(d.items(), key=lambda item: item[1], reverse=True))
print(sorted_d)
# Optionally, create a DataFrame
df = pd.DataFrame.from_dict(data=sorted_d, orient='index')
print(df)
df.to_csv("total_freq.csv")