How can I get the bag of words and term frequencies in text format using Sklearn?

Question

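I would like to use Sklearn's CountVectorizer to print, in text format, the list of words (i.e. the bag of words) for each document in a corpus together with their respective term frequencies. How can I do this?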

Here is my code:

from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the vectorizer
vectorizer = CountVectorizer()

# Create the documents
document1 = 'this is a sunny day'
document2 = 'today is a very very very pleasant day and we have fun fun fun'
document3 = 'this is an amazing experience'

# List of documents
list_of_words = [document1, document2, document3]

# Learn the vocabulary
vectorizer.fit(list_of_words)

# Check the vocabulary index of the repeated words
print(vectorizer.vocabulary_.get('very'))
print(vectorizer.vocabulary_.get('fun'))

# Transform the documents into a sparse document-term matrix
bag_of_words = vectorizer.transform(list_of_words)

print(bag_of_words)

Output:

(0, 3) 1
(0, 7) 1
(0, 9) 1
(0, 10) 1
(1, 2) 1
(1, 3) 1
(1, 5) 3
(1, 6) 1
(1, 7) 1
(1, 8) 1
(1, 11) 1
(1, 12) 3
(1, 13) 1
(2, 0) 1
(2, 1) 1
(2, 4) 1
(2, 7) 1
(2, 10) 1

Answer 1

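You can use the get_feature_names_out() and toarray() methods to get the list of words and the frequency of each term, respectively. (In scikit-learn versions before 1.0 the first method is called get_feature_names(); that older name was removed in 1.2.) Using a Pandas DataFrame, you can then export both to a .csv file or print them to the console.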

The stopwords list provided by nltk can optionally be used to remove any stop words from the documents (to extend the current list with more stop words, see this answer).

Example

from sklearn.feature_extraction.text import CountVectorizer  
import pandas as pd
import nltk
from nltk.corpus import stopwords

# You need to run this only once, in order to download the stopwords list
nltk.download('stopwords') 

# Load the stopwords list
stop_words_list = stopwords.words('english')

# The documents 
document1='Hope you have a pleasant day. Have fun.'
document2= 'Today is a very pleasant day and we will have fun fun fun'
document3= 'This event has been amazing. We had a lot of fun the whole day'

# List of documents
list_of_documents= [document1, document2, document3]

# Instantiate CountVectorizer
cv = CountVectorizer(stop_words=stop_words_list)

# Fit and transform
cv_fit = cv.fit_transform(list_of_documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
word_list = cv.get_feature_names_out()
count_list = cv_fit.toarray()

# Create a dataframe with words and their respective frequency 
# Each row represents a document starting from document1
df = pd.DataFrame(data=count_list, columns=word_list)

# Print out the df
print(df)

# Optionally, save the df to a csv file
df.to_csv("bag_of_words.csv")


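To output the term frequencies for the whole corpus (i.e. the counts summed over all documents), you can use the following in addition to the example above: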

import numpy as np

# Sum the counts over all documents; tolist() yields plain Python ints
d = dict(zip(word_list, np.asarray(cv_fit.sum(axis=0))[0].tolist()))
sorted_d = dict(sorted(d.items(), key=lambda item: item[1], reverse=True))
print(sorted_d)

# Optionally, create a DataFrame
df = pd.DataFrame.from_dict(data=sorted_d, orient='index')
print(df)
df.to_csv("total_freq.csv")
