Python: Combine low-frequency factors/category counts

Question

There is a great solution for this in R.

My df.column looks like:

Windows
Windows
Mac
Mac
Mac
Linux
Windows
...

I want to replace the low-frequency categories in this df.column vector with "Other". For example, I need my df.column to look like:

Windows
Windows
Mac
Mac
Mac
Linux -> Other
Windows
...

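I want to rename these rare categories to reduce the number of factors in my regression, which is why I need the original vector rather than just a summary table. In Python, after running the command below to get the frequency table, I get: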

pd.value_counts(df.column)


Windows          26083
iOS              19711
Android          13077
Macintosh         5799
Chrome OS          347
Linux              285
Windows Phone      167
(not set)           22
BlackBerry          11

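I wonder whether there is an efficient way to rename the low-frequency values such as "Chrome OS" and "Linux" to another category (for example, a category called "Other").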

Answer 1

Build a mask from each category's percentage share, i.e.:

import numpy as np
import pandas as pd

series = pd.value_counts(df.column)
mask = (series/series.sum() * 100).lt(1)  # True for categories below 1 %
# To replace the values in df['column'], use np.where, i.e.
df['column'] = np.where(df['column'].isin(series[mask].index),'Other',df['column'])

To collapse the rare categories of the frequency table into a single summed entry:

new = series[~mask]
new['Other'] = series[mask].sum()

Windows      26083
iOS          19711
Android      13077
Macintosh     5799
Other          832
Name: 1, dtype: int64

If you want to replace the index labels instead:

series.index = np.where(series.index.isin(series[mask].index),'Other',series.index)

Windows      26083
iOS          19711
Android      13077
Macintosh     5799
Other          347
Other          285
Other          167
Other           22
Other           11
Name: 1, dtype: int64
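Replacing the labels this way leaves several rows that are all named Other. If a single row is wanted afterwards, one option (an addition here, not part of the original answer) is to group by the index label and sum. The sketch below rebuilds the frequency table from the counts quoted in the question; the variable name `collapsed` is only illustrative.

```python
import numpy as np
import pandas as pd

# Counts copied from the frequency table in the question
series = pd.Series({
    "Windows": 26083, "iOS": 19711, "Android": 13077, "Macintosh": 5799,
    "Chrome OS": 347, "Linux": 285, "Windows Phone": 167,
    "(not set)": 22, "BlackBerry": 11,
})

# Same mask as in the answer: categories with a share below 1 %
mask = (series / series.sum() * 100).lt(1)

# Relabel the rare entries; this produces duplicate "Other" labels
series.index = np.where(mask.to_numpy(), "Other", series.index)

# Merge the duplicates: group by label, keep first-appearance order, sum counts
collapsed = series.groupby(level=0, sort=False).sum()
print(collapsed)
```

The summed Other entry should agree with the 832 shown in the earlier output, since both add up the same five rare counts.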

Explanation

(series/series.sum() * 100) # This will give you the percentage i.e 

Windows          39.820158
iOS              30.092211
Android          19.964276
Macintosh         8.853165
Chrome OS         0.529755
Linux             0.435101
Windows Phone     0.254954
(not set)         0.033587
BlackBerry        0.016793
Name: 1, dtype: float64

.lt(1) is the method form of "less than 1". It gives you a boolean mask, which you then use to index the counts and assign the new label.
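To make the masking step concrete, here is a minimal end-to-end sketch on made-up data. The column name `os`, the toy counts, and the 2 % threshold are assumptions chosen for illustration; they are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 60 + 38 + 1 + 1 = 100 rows
df = pd.DataFrame({"os": ["Windows"] * 60 + ["Mac"] * 38 + ["Linux", "BSD"]})

counts = df["os"].value_counts()
share = counts / counts.sum() * 100   # percentage share of each category
mask = share.lt(2)                    # True where a category is below 2 %
rare = counts[mask].index             # labels of the rare categories

# Keep frequent values, replace rare ones with "Other"
df["os"] = np.where(df["os"].isin(rare), "Other", df["os"])

result = df["os"].value_counts().to_dict()
print(result)
```

Here Linux and BSD each make up 1 % of the rows, so both fall under the threshold and end up in Other.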

Answer 2

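This is a (belated) extension of your question: it applies the idea of combining low-frequency categories (those whose proportion is below min_freq) to every column of a DataFrame. It is based on @Bharath's answer.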

def condense_category(col, min_freq=0.01, new_name='other'):
    series = pd.value_counts(col)
    mask = (series/series.sum()).lt(min_freq)
    return pd.Series(np.where(col.isin(series[mask].index), new_name, col))

A simple usage example:

df_toy = pd.DataFrame({'x': [1, 2, 3, 4] + [5]*100, 'y': [5, 6, 7, 8] + [0]*100})
df_toy = df_toy.apply(condense_category, axis=0)
print(df_toy)

#          x      y
# 0    other  other
# 1    other  other
# 2    other  other
# 3    other  other
# 4        5      0
# ..     ...    ...
# 99       5      0
# 100      5      0
# 101      5      0
# 102      5      0
# 103      5      0
# 
# [104 rows x 2 columns]
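One caveat, as far as I can tell: because the function builds a fresh pd.Series from the np.where result, the returned column carries a default 0..n-1 index rather than the caller's index. That is harmless for df_toy, but it matters when the DataFrame has a custom index. A sketch of a variant that keeps the index by using Series.where follows; the name `condense_category_keep_index` and the sample frame are hypothetical.

```python
import pandas as pd

def condense_category_keep_index(col, min_freq=0.01, new_name="other"):
    """Replace values whose relative frequency is below min_freq."""
    freq = col.value_counts(normalize=True)   # fractions that sum to 1
    rare = freq.index[freq.lt(min_freq)]      # labels of rare values
    # Series.where keeps values where the condition is True, so the
    # original index of `col` is preserved
    return col.where(~col.isin(rare), new_name)

# Hypothetical frame with a non-default index (1000..1199)
df_idx = pd.DataFrame({"x": ["a", "b"] + ["c"] * 198}, index=range(1000, 1200))
out = df_idx.apply(condense_category_keep_index)
print(out.head())
```

In this example "a" and "b" each occur once in 200 rows (0.5 %), so they are replaced, and the result still uses the labels 1000 to 1199.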

Answer 3

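Bharath's answer uses np.where(). On my local machine, with a 5-million-row dataset, Series.where() took about 27% less time (328 ms vs 449 ms for np.where()). An API change has also deprecated one of the functions used. So I modernized/improved Bharath's code with efficiency in mind (as you requested):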

col = 'column' # column to combine
threshold = 1  # percentage threshold for rare categories
val_freq = df[col].value_counts(normalize=True).mul(100)
mask = val_freq.lt(threshold)
rare_cats = val_freq.index[mask]
df[col] = df[col].where(~df[col].isin(rare_cats), 'Other')

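Details of the changes:

Series.value_counts() replaces the deprecated pd.value_counts().

normalize=True returns relative frequencies directly (fractions that sum to 1), so the share no longer has to be computed by hand; .mul(100) then turns them into percentages.

.mul(100) is used instead of * 100, for consistency with the method-style .lt(threshold).

val_freq.index[mask] is used instead of val_freq[mask].index; it was about 40% faster (91.5 μs vs 129 μs for an index of size 110).

Series.where() only replaces values where the condition is False; the ~ inverts the condition, so non-rare categories are left as they are.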
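The "replace only where the condition is False" behaviour is easy to get backwards, so here is a small sketch of it. The sample values and the `rare_cats` list are made up for illustration. It also shows Series.mask, which is the mirror image and replaces where the condition is True.

```python
import pandas as pd

s = pd.Series(["Windows", "Linux", "Mac", "BlackBerry"])
rare_cats = ["Linux", "BlackBerry"]  # assumed rare, for illustration only

is_rare = s.isin(rare_cats)

# where: keep values where the condition is True, replace the rest
kept = s.where(~is_rare, "Other")

# mask: replace values where the condition is True (no inversion needed)
same = s.mask(is_rare, "Other")

print(kept.tolist())
```

Both calls give the same result; which one reads better is a matter of taste.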
