I have a table like the following:
URN Firm_Name
0 104472 R.X. Yah & Co
1 104873 Big Building Society
2 109986 St James's Society
3 114058 The Kensington Society Ltd
4 113438 MMV Oil Associates Ltd
I want to count the frequency of every word in the Firm_Name column (one row per word, with its count).
I have tried the following code:
import pandas as pd
import nltk

data = pd.read_csv(r"X:\Firm_Data.csv")  # raw string, so "\F" is not treated as an escape

top_N = 20
word_dist = nltk.FreqDist(data['Firm_Name'])
print('All frequencies')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)
However, this code does not produce per-word counts: FreqDist iterates over the Series, so each full firm name is counted once instead of its individual words.
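A minimal sketch of the difference, using collections.Counter (which nltk.FreqDist subclasses) and inline sample data standing in for the CSV:

```python
from collections import Counter

import pandas as pd

# Inline sample standing in for X:\Firm_Data.csv
firm_names = pd.Series(["R.X. Yah & Co", "Big Building Society",
                        "St James's Society", "The Kensington Society Ltd",
                        "MMV Oil Associates Ltd"])

# Counting the Series itself counts whole firm names -- every count is 1
whole = Counter(firm_names)
print(max(whole.values()))   # 1

# Counting after splitting counts individual words
words = Counter(w for name in firm_names for w in name.split())
print(words.most_common(2))  # [('Society', 3), ('Ltd', 2)]
```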
IIUIC, use value_counts()
In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society 3
Ltd 2
James's 1
R.X. 1
Yah 1
Associates 1
St 1
Kensington 1
MMV 1
Big 1
& 1
The 1
Co 1
Oil 1
Building 1
dtype: int64
Or (requires import numpy as np),
pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()
Or,
pd.Series(' '.join(df.Firm_Name).split()).value_counts()
For the top N, say 3:
In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society 3
Ltd 2
James's 1
dtype: int64
Details:
In [3380]: df
Out[3380]:
URN Firm_Name
0 104472 R.X. Yah & Co
1 104873 Big Building Society
2 109986 St James's Society
3 114058 The Kensington Society Ltd
4 113438 MMV Oil Associates Ltd
Use str.cat with lower to concatenate all values into one string, then word_tokenize it, and finally use your solution:
top_N = 4
# lowercasing is optional; skip .str.lower() if case should be kept
a = data['Firm_Name'].str.lower().str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
print(word_dist)
<FreqDist with 17 samples and 20 outcomes>
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
Word Frequency
0 society 3
1 ltd 2
2 the 1
3 co 1
If necessary, lower can also be omitted:
top_N = 4
a = data['Firm_Name'].str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
Word Frequency
0 Society 3
1 Ltd 2
2 MMV 1
3 Kensington 1
This is faster:
df.Firm_Name.str.split().explode().value_counts()
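For reference, a self-contained sketch of the explode approach (available since pandas 0.25), using the same sample table:

```python
import pandas as pd

df = pd.DataFrame({
    'URN': [104472, 104873, 109986, 114058, 113438],
    'Firm_Name': ["R.X. Yah & Co", "Big Building Society",
                  "St James's Society", "The Kensington Society Ltd",
                  "MMV Oil Associates Ltd"],
})

# str.split() yields a Series of word lists; explode() flattens it to
# one word per row, so value_counts() then counts words directly
counts = df.Firm_Name.str.split().explode().value_counts()
print(counts['Society'], counts['Ltd'])  # 3 2
```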
This answer also works for counting distinct words from a Pandas DataFrame. It uses the Counter method and applies it to each row.
from collections import Counter

import pandas as pd

c = Counter()
df = pd.DataFrame(
    [[104472, "R.X. Yah & Co"],
     [104873, "Big Building Society"],
     [109986, "St James's Society"],
     [114058, "The Kensington Society Ltd"],
     [113438, "MMV Oil Associates Ltd"]],
    columns=["URN", "Firm_Name"])

# apply is used only for its side effect: each row's word list updates c
df.Firm_Name.str.split().apply(c.update)
Counter({'R.X.': 1,
'Yah': 1,
'&': 1,
'Co': 1,
'Big': 1,
'Building': 1,
'Society': 3,
'St': 1,
"James's": 1,
'The': 1,
'Kensington': 1,
'Ltd': 2,
'MMV': 1,
'Oil': 1,
'Associates': 1})
Concatenate the whole Series into one string, split it into words, and then use Counter from collections:
from collections import Counter
Counter(df['Firm_Name'].str.cat(sep='\n').split())
Output:
Counter({'Society': 3,
'Ltd': 2,
'R.X.': 1,
'Yah': 1,
'&': 1,
'Co': 1,
'Big': 1,
'Building': 1,
'St': 1,
"James's": 1,
'The': 1,
'Kensington': 1,
'MMV': 1,
'Oil': 1,
'Associates': 1})
Optionally, call .most_common() to get a sorted list of (word, count) tuples [(w1, c1), (w2, c2), ...].
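For instance, a small self-contained example of most_common (standard library Counter; the word string here is made up):

```python
from collections import Counter

counts = Counter("Society Ltd Society Co Society Ltd".split())

# most_common() returns (word, count) tuples sorted by count, descending;
# with an argument it returns only the top N
print(counts.most_common())   # [('Society', 3), ('Ltd', 2), ('Co', 1)]
print(counts.most_common(2))  # [('Society', 3), ('Ltd', 2)]
```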