How do I split a CSV column with repeated text values into separate 0/1 columns, one per distinct value?

Question  Votes: 1  Answers: 3

I have a CSV with a column like this:

LABEL
a
b
a
a
c
n o
ye s

I want to split it into the following:

LABEL_a LABEL_b LABEL_c LABEL_n_o LABEL_ye_s
   1       0       0         0        0
   0       1       0         0        0
   1       0       0         0        0
   1       0       0         0        0
   0       0       1         0        0
   0       0       0         1        0
   0       0       0         0        1

How can I do this with pandas?

python pandas csv
3 Answers
3
votes

Let's use pd.get_dummies with the prefix parameter:

# Using @Lambda's setup
import pandas as pd

label = ["a", "b", "a", "a", "c", "n o", "ye s"]
s = pd.Series(label)

pd.get_dummies(s, prefix='label')

Output:

   label_a  label_b  label_c  label_n o  label_ye s
0        1        0        0          0           0
1        0        1        0          0           0
2        1        0        0          0           0
3        1        0        0          0           0
4        0        0        1          0           0
5        0        0        0          1           0
6        0        0        0          0           1
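
The output above keeps the spaces in "n o" and "ye s", while the question asks for LABEL_n_o and LABEL_ye_s. A minimal sketch of the full path from CSV to the requested layout follows, assuming the file has a single LABEL column (the in-memory `csv_text` stands in for the real file, which is not shown in the question). `dtype=int` is passed because recent pandas versions return boolean columns from get_dummies by default rather than 0/1.

```python
import io
import pandas as pd

# Stand-in for the real CSV file (assumption: one column named LABEL).
csv_text = "LABEL\na\nb\na\na\nc\nn o\nye s\n"
df = pd.read_csv(io.StringIO(csv_text))

# One 0/1 column per distinct value, prefixed with the column name.
dummies = pd.get_dummies(df["LABEL"], prefix="LABEL", dtype=int)

# Replace spaces in the generated column names with underscores.
dummies.columns = dummies.columns.str.replace(" ", "_")

print(dummies)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file path.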

Timings:

for-loop over keys method

> %%timeit
> for key in keys:
>     df[("label_%s" % key).replace(" ", "_")] = (s == key).astype(int)

100 loops, best of 3: 6.7 ms per loop

String accessor get_dummies method

> %timeit s.str.get_dummies().add_prefix('label_')

100 loops, best of 3: 6.03 ms per loop

pd.get_dummies with prefix parameter:

> %timeit pd.get_dummies(s, prefix='label')

1000 loops, best of 3: 1.77 ms per loop


3
votes

Use get_dummies:

s.str.get_dummies().add_prefix('label_')
Out[19]: 
   label_a  label_b  label_c  label_n o  label_ye s
0        1        0        0          0           0
1        0        1        0          0           0
2        1        0        0          0           0
3        1        0        0          0           0
4        0        0        1          0           0
5        0        0        0          1           0
6        0        0        0          0           1
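
One caveat with the string accessor: Series.str.get_dummies splits each value on a separator ("|" by default) before encoding, whereas pd.get_dummies treats each value as a single category. For the labels in the question the two give the same columns, but they diverge if a label contains the separator. The sketch below uses made-up values ("a|b") purely to illustrate the difference.

```python
import pandas as pd

# Hypothetical labels, one of which contains the default separator "|".
s = pd.Series(["a|b", "a"])

# The string accessor splits "a|b" into two separate indicators.
split = s.str.get_dummies()

# The top-level function keeps "a|b" as one category.
whole = pd.get_dummies(s, dtype=int)

print(list(split.columns))
print(list(whole.columns))
```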

1
votes

import pandas as pd

label = ["a", "b", "a", "a", "c", "n o", "ye s"]
s = pd.Series(label)
keys = s.unique()

df = pd.DataFrame()
for key in keys:
    df[("label_%s" % key).replace(" ", "_")] = (s == key).astype(int)
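
As a rough sanity check, the loop above can be compared against pd.get_dummies. The sketch assumes the same `label` list; the names `loop_df`, `dummies`, and `same` are only illustrative. Note that the loop creates columns in order of first appearance, while get_dummies sorts them, so the columns are aligned before comparing.

```python
import pandas as pd

label = ["a", "b", "a", "a", "c", "n o", "ye s"]
s = pd.Series(label)

# Loop method: one comparison per distinct value.
loop_df = pd.DataFrame()
for key in s.unique():
    loop_df[("label_%s" % key).replace(" ", "_")] = (s == key).astype(int)

# get_dummies method, with spaces in column names replaced afterwards.
dummies = pd.get_dummies(s, prefix="label", dtype=int)
dummies.columns = dummies.columns.str.replace(" ", "_")

# Align column order, then compare the underlying values.
same = bool((loop_df[dummies.columns].to_numpy() == dummies.to_numpy()).all())
print(same)
```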