Word frequency with lists in pandas

Question

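I have a column of tokenized, lemmatized text in a pandas df. I'm trying to create a word frequency matrix so that I can go on to dimensionality reduction.

I keep running into an error where Python expects a string but gets a list: TypeError: sequence item 0: expected str instance, list found

I've tried a few approaches, but I hit an error every time. I can't figure out how to count the tokens inside the lists.

Here are some of the approaches I've tried:

Option 1: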

from collections import Counter
df['new_col'] = Counter()
for token in df['col']:
    counts[token.orth_] += 1

This produced ValueError: Length of values does not match length of index

Option 2:

Counter(' '.join(df['col']).split()).most_common()

This produces: TypeError: sequence item 0: expected str instance, list found

Option 3:

pd.Series(values = ','.join([(i) for i in df['col']]).lower().split()).value_counts()[:]

This again produces: TypeError: sequence item 0: expected str instance, list found

Edit: sample data:

col
[indicate, after, each, action, step, .]
[during, september, and, october, please, refrain]
[the, work, will, be, ongoing, throughout, the]
[professional, development, session, will, be]

Answer 1

Easy Answer

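Given what you've told us, the best solution, as 9 mentioned, is to use scikit-learn's CountVectorizer. I'm making some assumptions here about the format you want the data in, but the following gives you a doc x token DataFrame whose values are the counts of each token in each document. It assumes df['col'] is a pandas Series whose values are lists.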

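>>> import pandas as pd
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> # identity analyzer: each document is already a list of tokens
>>> cv = CountVectorizer(analyzer=lambda x: x)
>>> counted_values = cv.fit_transform(df['col']).toarray()
>>> # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
>>> df = pd.DataFrame(counted_values, columns=cv.get_feature_names_out())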
>>> df.iloc[0:5, 0:5]
   .  action  after  and  be
0  1       1      1    0   0
1  0       0      0    1   0
2  0       0      0    0   1
3  0       0      0    0   1

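CountVectorizer can tokenize for you, and does so by default, so we pass an identity lambda function to the analyzer argument to tell it that our documents are already tokenized.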
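To make it clearer what that doc x token matrix contains, here is a minimal plain-Python sketch of the same idea. It is not how CountVectorizer is implemented; the names docs, vocabulary, and matrix are made up for illustration, and the data is copied from the question's sample.

```python
from collections import Counter

# Sample documents copied from the question; each one is already a token list.
docs = [
    ["indicate", "after", "each", "action", "step", "."],
    ["during", "september", "and", "october", "please", "refrain"],
    ["the", "work", "will", "be", "ongoing", "throughout", "the"],
    ["professional", "development", "session", "will", "be"],
]

# Vocabulary: every distinct token, sorted so the column order is deterministic.
vocabulary = sorted({token for doc in docs for token in doc})

# One row per document, one column per vocabulary entry.
matrix = []
for doc in docs:
    counts = Counter(doc)  # a Counter returns 0 for tokens it has not seen
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary[:5])
print([row[:5] for row in matrix])
```

The first five columns line up with the slice shown above. For real data CountVectorizer is still the better choice, since it returns a sparse matrix instead of nested lists.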

Suboptimal Answer

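I don't recommend this, but I think it's helpful if you want to understand how Counter works. Since your values are lists, you can use .apply on each row of the Series.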

>>> from collections import Counter
>>> counted_values = df['col'].apply(lambda x: Counter(x))
>>> counted_values
0    {'.': 1, 'after': 1, 'indicate': 1, 'action': ...
1    {'during': 1, 'and': 1, 'october': 1, 'please'...
2    {'will': 1, 'ongoing': 1, 'work': 1, 'the': 2,...
3    {'development': 1, 'professional': 1, 'session...
dtype: object

So now you have a Series of dicts, which isn't very helpful. You can convert this into a DataFrame similar to the one above with the following:

>>> suboptimal_df = pd.DataFrame(counted_values.tolist())
>>> suboptimal_df.iloc[0:5, 0:5]
     .  action  after  and   be
0  1.0     1.0    1.0  NaN  NaN
1  NaN     NaN    NaN  1.0  NaN
2  NaN     NaN    NaN  NaN  1.0
3  NaN     NaN    NaN  NaN  1.0
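The NaN entries mark tokens that never occur in a document, which also forces the columns to float. If zeros are what you want (an assumption on my part, to match the CountVectorizer result), a sketch like this fills them in and restores integer counts. The two-row df is rebuilt from the question's sample so the snippet runs on its own.

```python
from collections import Counter

import pandas as pd

# Two of the question's sample rows, rebuilt here so the snippet is self-contained.
df = pd.DataFrame({"col": [
    ["the", "work", "will", "be", "ongoing", "throughout", "the"],
    ["professional", "development", "session", "will", "be"],
]})

# One Counter per row, expanded into columns; absent tokens become NaN.
counts = pd.DataFrame(df["col"].apply(Counter).tolist())

# Assumption: a token missing from a document should count as 0.
dense = counts.fillna(0).astype(int)
print(dense)
```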

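I don't recommend this because apply is slow, and on top of that it's already a bit silly that we're storing lists as Series values; dicts are just as silly. DataFrames work best as structured containers of numeric or string values (think spreadsheets), not of different container types.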
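As a side note on options 2 and 3 in the question: they fail because str.join expects every item to be a string, while each item of the column is a list. If what you were after there is one overall count across all rows rather than a per-document matrix, a minimal sketch is to flatten the lists first. The rows variable is a stand-in for df['col'].

```python
from collections import Counter
from itertools import chain

# Two rows shaped like df['col'] in the question: a sequence of token lists.
rows = [
    ["the", "work", "will", "be", "ongoing", "throughout", "the"],
    ["professional", "development", "session", "will", "be"],
]

# ' '.join(rows) raises TypeError because each item is a list, not a str.
# Flattening the nested lists into one token stream avoids the join entirely.
total_counts = Counter(chain.from_iterable(rows))

print(total_counts.most_common(3))
```

Passing df['col'] in place of rows should behave the same, since iterating over a Series yields its values. On pandas 0.25 or later, df['col'].explode().value_counts() is another way to get the same totals.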
