Propagating NaN to the dummy variables when using OneHotEncoder


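I am using OneHotEncoder to create a set of dummy variables from a categorical variable. The problem is that missing values in the source column are not carried over as NaN into the resulting dummy variables.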

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Encode experience_level into dense dummy columns
oh = OneHotEncoder(min_frequency=0.0001, sparse_output=False)
data = oh.fit_transform(df[['experience_level']])
cols = oh.get_feature_names_out()
df_experience_level = pd.DataFrame(data, columns=cols)
# Count NaN values per dummy column
missing_values_count = df_experience_level.isnull().sum()
missing_values_count

Output:

experience_level_EN     0
experience_level_EX     0
experience_level_MI     0
experience_level_SE     0
experience_level_nan    0
dtype: int64

The code I am currently using is:

df.loc[df['experience_level'].isna(), 'experience_level_EN'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_EX'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_MI'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_SE'] = np.nan

However, this is tedious.

Trying the obvious one-liner:

df.loc[df['experience_level'].isna(), df_experience_level] = np.nan

results in:

ValueError: Index data must be 1-dimensional

Is there a way to propagate NaN from the parent variable to every dummy variable in a single statement?

python pandas nan one-hot-encoding
1 Answer

You can try using str.startswith to select the right subset of columns:

# Obviously replace exp_lvl with experience_level
subset = df_exp_lvl.columns.str.startswith('exp_lvl')
df_exp_lvl.loc[df['exp_lvl'].isna(), subset] = np.nan

Output:

>>> df_exp_lvl
    exp_lvl_EN  exp_lvl_EX  exp_lvl_MI  exp_lvl_SE  exp_lvl_nan
0          1.0         0.0         0.0         0.0          0.0
1          0.0         0.0         1.0         0.0          0.0
2          1.0         0.0         0.0         0.0          0.0
3          1.0         0.0         0.0         0.0          0.0
4          0.0         0.0         0.0         1.0          0.0
5          NaN         NaN         NaN         NaN          NaN
6          0.0         1.0         0.0         0.0          0.0
7          0.0         1.0         0.0         0.0          0.0
8          NaN         NaN         NaN         NaN          NaN
9          0.0         0.0         0.0         1.0          0.0
10         0.0         0.0         1.0         0.0          0.0
11         0.0         0.0         1.0         0.0          0.0
12         1.0         0.0         0.0         0.0          0.0
13         NaN         NaN         NaN         NaN          NaN
14         0.0         1.0         0.0         0.0          0.0
>>> df
   exp_lvl
0       EN
1       MI
2       EN
3       EN
4       SE
5      NaN
6       EX
7       EX
8      NaN
9       SE
10      MI
11      MI
12      EN
13     NaN
14      EX
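
For a version that can be run end to end, here is a minimal sketch using only pandas and NumPy. The toy data is made up, and pd.get_dummies with dummy_na=True is used as a stand-in for the OneHotEncoder output (an assumption, made to avoid depending on scikit-learn here). Because the dummy frame holds nothing but dummy columns, a full column slice is enough as the column selector. The ValueError in the question appears to come from passing the whole DataFrame, rather than column labels, as the column indexer.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the example above; the values are made up for illustration.
df = pd.DataFrame({"exp_lvl": ["EN", "MI", np.nan, "SE", "EX", np.nan]})

# Stand-in for the encoder output (assumption): one float column per
# category plus an exp_lvl_nan indicator column.
df_exp_lvl = pd.get_dummies(
    df["exp_lvl"], prefix="exp_lvl", dummy_na=True, dtype=float
)

# Boolean row mask from the parent column, converted to a NumPy array so
# that it is applied by position rather than by index label.
missing = df["exp_lvl"].isna().to_numpy()

# One statement: set every dummy column to NaN on the masked rows.
df_exp_lvl.loc[missing, :] = np.nan

print(df_exp_lvl)
```

Converting the mask with to_numpy() is a deliberate choice: pd.DataFrame(data, columns=cols) gets a fresh 0..n-1 index, which may not match df.index if df was filtered or re-indexed earlier, and a positional mask sidesteps that mismatch.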