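I have a DataFrame named df_base that looks like this. As you can see, there is a column named Sex whose values are either male or female. I want to map these values to 0 and 1, respectively.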
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
I've seen a few approaches on StackOverflow, but I'd like to know the most efficient way to perform the following mapping:
+---------+---------+
| Old Sex | New Sex |
+---------+---------+
| male | 0 |
| female | 1 |
| female | 1 |
| female | 1 |
| male | 0 |
+---------+---------+
I'm currently using this:
df_base['Sex'].replace(['male','female'],[0,1],inplace=True)
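...but I can't help feeling this is a bit shoddy. Is there a better way? There is also .loc, but it loops over the rows of the DataFrame, so it is less efficient, right?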
My instinct would have been to suggest .map(), but I compared your solution against map on a DataFrame with 1500 random male/female values.
%timeit df_base['Sex_new'] = df_base['Sex'].map({'male': 0,'female': 1})
1000 loops, best of 3: 653 µs per loop
Edit, based on coldspeed's comment, and because reassigning the result makes for a fairer comparison with the other approaches:
%timeit df_base['Sex_new'] = df_base['Sex'].replace(['male','female'],[0,1])
1000 loops, best of 3: 968 µs per loop
So, measured this way, your 'shoddy' .replace() solution is actually slower than .map() in this example!
Edit:
Pogo's solution:
%timeit df_base['Sex_new'] = np.where(df_base['Sex'] == 'male', 0, 1)
1000 loops, best of 3: 331 µs per loop
So, faster!
Jezrael's solution with .astype(int):
%timeit df_base['Sex_new'] = (df_base['Sex'] == 'female').astype(int)
1000 loops, best of 3: 388 µs per loop
So this is also faster than both .map() and .replace().
I think it is better/faster here to use map with a dictionary, provided only male and female exist in the Sex column:
df_base['Sex'] = df_base['Sex'].map(dict(zip(['male','female'],[0,1])))
which is the same as:
df_base['Sex'] = df_base['Sex'].map({'male': 0,'female': 1})
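The "only male and female" condition matters because map turns any value missing from the dictionary into NaN. A minimal sketch of that behaviour follows; the sample Series, including the 'unknown' and None entries, is a hypothetical illustration and not part of the original data:

```python
import pandas as pd

# Hypothetical sample: 'unknown' and None are NOT keys of the mapping below
s = pd.Series(['male', 'female', 'unknown', None], name='Sex')

mapping = {'male': 0, 'female': 1}
mapped = s.map(mapping)

# Values without a matching key become NaN, which forces a float dtype
print(mapped.tolist())  # [0.0, 1.0, nan, nan]
print(mapped.dtype)     # float64
```

If the column may contain other values, it is worth checking `mapped.isna().any()` before casting the result to an integer type.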
Another solution, if only female and male values exist, is to cast the boolean mask to integers, turning True/False into 1/0:
df_base['Sex'] = (df_base['Sex'] == 'female').astype(int)
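Unlike map, the mask approach never produces NaN: every value that is not exactly 'female' ends up as 0, including unexpected or missing entries. A small sketch under the same hypothetical sample data (the 'unknown' and None entries are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample with one unexpected value and one missing value
s = pd.Series(['male', 'female', 'unknown', None], name='Sex')

# The comparison yields a boolean mask; astype(int) turns True/False into 1/0
as_int = (s == 'female').astype(int)

print(as_int.tolist())  # [0, 1, 0, 0]
```

Here 'unknown' and the missing value are silently encoded as 0, the same code as male, so this form is only safe when the column is known to hold just the two expected values.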
Performance:
import numpy as np
import pandas as pd
import perfplot

np.random.seed(2019)

# Each kernel works on a copy so the input frame is not modified between runs
def ma(df):
    df = df.copy()
    df['Sex_new'] = df['Sex'].map({'male': 0,'female': 1})
    return df

def rep1(df):
    df = df.copy()
    df['Sex'] = df['Sex'].replace(['male','female'],[0,1])
    return df

def nwhere(df):
    df = df.copy()
    df['Sex_new'] = np.where(df['Sex'] == 'male', 0, 1)
    return df

def mask1(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'] == 'female').astype(int)
    return df

def mask2(df):
    df = df.copy()
    # compare the underlying numpy array instead of the Series
    df['Sex_new'] = (df['Sex'].values == 'female').astype(int)
    return df

def make_df(n):
    df = pd.DataFrame({'Sex': np.random.choice(['male','female'], size=n)})
    return df

perfplot.show(
    setup=make_df,
    kernels=[ma, rep1, nwhere, mask1, mask2],
    n_range=[2**k for k in range(2, 18)],
    logx=True,
    logy=True,
    equality_check=False,  # skip comparing outputs (rep1 overwrites 'Sex', the others add 'Sex_new')
    xlabel='len(df)')
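Conclusion:
When replacing only 2 values, replace is the slowest; numpy.where, map and the boolean mask perform similarly. To improve performance further, compare against the underlying numpy array via .values.
All of this depends on the data, so it is best to test with your real data.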
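For a quick check without perfplot, a rough sketch using the standard-library timeit module might look like the following. The random 1500-row frame is only a stand-in (an assumption for illustration); swap in your own DataFrame to get meaningful numbers:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for real data; replace with your own DataFrame
rng = np.random.default_rng(0)
df = pd.DataFrame({'Sex': rng.choice(['male', 'female'], size=1500)})

# The four approaches discussed above, wrapped as zero-argument callables
candidates = {
    'map': lambda: df['Sex'].map({'male': 0, 'female': 1}),
    'replace': lambda: df['Sex'].replace(['male', 'female'], [0, 1]),
    'np.where': lambda: np.where(df['Sex'] == 'male', 0, 1),
    'mask': lambda: (df['Sex'] == 'female').astype(int),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 100 calls each, converted to microseconds per call
    best = min(timeit.repeat(func, number=100, repeat=3)) / 100
    timings[name] = best * 1e6
    print(f'{name:>8}: {timings[name]:.1f} µs per call')
```

The absolute numbers will vary with the pandas version, the hardware and the size of the data, so only the relative ordering on your own data is worth reading into.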
Another solution is to use np.where.
Just a sample DataFrame:
>>> df
Sex
0 male
1 female
2 female
3 female
4 male
Create a new column new_Sex based on a condition:
>>> df['new_Sex'] = np.where(df['Sex'] == 'male', 0, 1)
>>> df
Sex new_Sex
0 male 0
1 female 1
2 female 1
3 female 1
4 male 0
Or:
>>> df['new_Sex'] = np.where(df['Sex'] != 'male', 1, 0)
>>> df
Sex new_Sex
0 male 0
1 female 1
2 female 1
3 female 1
4 male 0