Mapping string values to int in a Pandas DataFrame (e.g. for Gender) [duplicate]

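I have a DataFrame called df_base that looks like this. As you can see, there is a column called Sex whose values are male or female. I want to map these values to 0 and 1, respectively.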

+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
|   | PassengerId | Survived | Pclass |                       Name                        |  Sex   | Age | SibSp | Parch |      Ticket      |  Fare   | Cabin | Embarked |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
| 0 |           1 |        0 |      3 | Braund, Mr. Owen Harris                           | male   |  22 |     1 |     0 | A/5 21171        |    7.25 | NaN   | S        |
| 1 |           2 |        1 |      1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female |  38 |     1 |     0 | PC 17599         | 71.2833 | C85   | C        |
| 2 |           3 |        1 |      3 | Heikkinen, Miss. Laina                            | female |  26 |     0 |     0 | STON/O2. 3101282 |   7.925 | NaN   | S        |
| 3 |           4 |        1 |      1 | Futrelle, Mrs. Jacques Heath (Lily May Peel)      | female |  35 |     1 |     0 | 113803           |    53.1 | C123  | S        |
| 4 |           5 |        0 |      3 | Allen, Mr. William Henry                          | male   |  35 |     0 |     0 | 373450           |    8.05 | NaN   | S        |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+

I have seen a few approaches on StackOverflow, but I would like to know the most efficient way to perform the following mapping:

+---------+---------+
| Old Sex | New Sex |
+---------+---------+
| male    |       0 |
| female  |       1 |
| female  |       1 |
| female  |       1 |
| male    |       0 |
+---------+---------+

I am using this:

df_base['Sex'].replace(['male','female'],[0,1],inplace=True)

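...but I can't help feeling this is a bit shoddy. Is there a better way? There is also .loc, but it loops over the rows of the DataFrame, so it is less efficient, right?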

python pandas mapping
3 Answers

2 votes

My instinct would have been to suggest .map(), but I compared your solution against .map() on a DataFrame with 1500 random male/female values.

%timeit df_base['Sex_new'] = df_base['Sex'].map({'male': 0,'female': 1})
1000 loops, best of 3: 653 µs per loop

Edit based on coldspeed's comment, and because reassigning the result makes for a better comparison with the others:

%timeit df_base['Sex_new'] = df_base['Sex'].replace(['male','female'],[0,1])
1000 loops, best of 3: 968 µs per loop

So with reassignment, .replace() is actually slower than .map()!

So, based on this example, your 'shoddy' solution is not faster than .map() after all.

Edit

Pogo's solution:

%timeit df_base['Sex_new'] = np.where(df_base['Sex'] == 'male', 0, 1)
1000 loops, best of 3: 331 µs per loop

So that is faster!

Jezrael's solution with .astype(int):

%timeit df_base['Sex_new'] = (df_base['Sex'] == 'female').astype(int)
1000 loops, best of 3: 388 µs per loop

So this is also faster than both .map() and .replace().
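The timings above are only comparable if the approaches produce the same values. A minimal sketch, assuming a hypothetical stand-in for df_base with 1500 random entries (the frame used for the timings is not shown), that checks .map(), np.where and the boolean mask against each other:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_base: 1500 random 'male'/'female' values.
rng = np.random.default_rng(0)
df_base = pd.DataFrame({"Sex": rng.choice(["male", "female"], size=1500)})

via_map = df_base["Sex"].map({"male": 0, "female": 1})
via_where = np.where(df_base["Sex"] == "male", 0, 1)
via_mask = (df_base["Sex"] == "female").astype(int)

# The three results agree as long as the column holds only these two labels.
all_equal = bool(
    (via_map.to_numpy() == via_where).all()
    and (via_mask.to_numpy() == via_where).all()
)
print(all_equal)
```

The check only holds because the column contains nothing but 'male' and 'female'; with other values the approaches diverge.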


2 votes

I think it is better/faster here to use map with a dictionary, if only male and female exist in the Sex column:

df_base['Sex'] = df_base['Sex'].map(dict(zip(['male','female'],[0,1])))

Which is the same as:

df_base['Sex'] = df_base['Sex'].map({'male': 0,'female': 1})

Another solution, if only female and male values exist, is to convert the boolean mask to integers, mapping True/False to 1/0:

df_base['Sex'] = (df_base['Sex'] == 'female').astype(int)
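The "only male and female" condition matters because the approaches treat anything else differently. A small sketch, using a hypothetical Series that contains a missing entry and an unexpected label (neither appears in the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing entry and an unexpected label.
s = pd.Series(["male", "female", np.nan, "unknown"])

mapped = s.map({"male": 0, "female": 1})   # values not in the dict become NaN
masked = (s == "female").astype(int)       # anything that is not 'female' becomes 0
where = np.where(s == "male", 0, 1)        # anything that is not 'male' becomes 1

print(mapped.isna().tolist())  # [False, False, True, True]
print(masked.tolist())         # [0, 1, 0, 0]
print(where.tolist())          # [0, 1, 1, 1]
```

With .map() the unexpected rows stay visible as NaN (and the result becomes float), while the mask and np.where silently assign them to one of the two classes.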

Performance:

import numpy as np
import pandas as pd
import perfplot

np.random.seed(2019)

def ma(df):
    df = df.copy()
    df['Sex_new'] = df['Sex'].map({'male': 0,'female': 1})
    return df

def rep1(df):
    df = df.copy()
    df['Sex'] = df['Sex'].replace(['male','female'],[0,1])
    return df

def nwhere(df):
    df = df.copy()
    df['Sex_new'] = np.where(df['Sex'] == 'male', 0, 1)
    return df

def mask1(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'] == 'female').astype(int)
    return df

def mask2(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'].values == 'female').astype(int)
    return df


def make_df(n):
    df = pd.DataFrame({'Sex': np.random.choice(['male','female'], size=n)})

    return df

perfplot.show(
    setup=make_df,
    kernels=[ma,  rep1, nwhere, mask1, mask2],
    n_range=[2**k for k in range(2, 18)],
    logx=True,
    logy=True,
    equality_check=False,  # rows may appear in different order
    xlabel='len(df)')

[Plot: perfplot timings of the kernels above, log-log axes, x-axis len(df)]

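Conclusion:

When replacing only 2 values, replace is the slowest; numpy.where, map and the mask are similar. For better performance, compare against the NumPy array obtained with .values. All of this depends on the data, so it is best to test with your real data.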
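For testing on real data without perfplot or IPython, the standard-library timeit module is enough. A rough sketch, where the randomly generated df is a hypothetical placeholder to be swapped for the actual DataFrame:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical placeholder; substitute your real DataFrame here.
df = pd.DataFrame({"Sex": np.random.choice(["male", "female"], size=10_000)})

candidates = {
    "map": lambda: df["Sex"].map({"male": 0, "female": 1}),
    "np.where": lambda: np.where(df["Sex"] == "male", 0, 1),
    "mask": lambda: (df["Sex"] == "female").astype(int),
    "mask on .values": lambda: (df["Sex"].values == "female").astype(int),
}

timings = {}
for name, func in candidates.items():
    # Best of 5 repeats of 20 calls each, reported per call.
    best = min(timeit.repeat(func, number=20, repeat=5)) / 20
    timings[name] = best
    print(f"{name:>16}: {best * 1e6:8.1f} µs per call")
```

The absolute numbers depend on the machine, the pandas version and the column dtype, so only the relative ordering on your own data is meaningful.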


1 vote

Another solution you can use is np.where.

Just a sample DataFrame:

>>> df
      Sex
0    male
1  female
2  female
3  female
4    male

Create a new column new_Sex based on the condition:

>>> df['new_Sex'] = np.where(df['Sex'] == 'male', 0, 1)
>>> df
      Sex  new_Sex
0    male        0
1  female        1
2  female        1
3  female        1
4    male        0

Or:

>>> df['new_Sex'] = np.where(df['Sex'] != 'male', 1, 0)
>>> df
      Sex  new_Sex
0    male        0
1  female        1
2  female        1
3  female        1
4    male        0