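I'm using OneHotEncoder to create a set of dummy variables from a categorical variable. The problem is that missing values in the source column are not carried over to the resulting dummy variables.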
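import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One-hot encode experience_level into df_experience_level
oh = OneHotEncoder(min_frequency=0.0001, sparse_output=False)
data = oh.fit_transform(df[['experience_level']])
cols = oh.get_feature_names_out()
df_experience_level = pd.DataFrame(data, columns=cols)
# Count NaN values in each dummy column
missing_values_count = df_experience_level.isnull().sum()
missing_values_count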
Output:
experience_level_EN 0
experience_level_EX 0
experience_level_MI 0
experience_level_SE 0
experience_level_nan 0
dtype: int64
The code I'm currently using is:
df.loc[df['experience_level'].isna(), 'experience_level_EN'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_EX'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_MI'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_SE'] = np.nan
However, this is tedious.
Trying the obvious thing:
df.loc[df['experience_level'].isna(), df_experience_level] = np.nan
results in:
ValueError: Index data must be 1-dimensional
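Is there a way to propagate the NaN from the parent variable to every dummy variable in a single statement?
Use str.startswith to select the right subset of columns: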
# Obviously replace exp_lvl with experience_level
subset = df_exp_lvl.columns.str.startswith('exp_lvl')
df_exp_lvl.loc[df['exp_lvl'].isna(), subset] = np.nan
Output:
>>> df_exp_lvl
    exp_lvl_EN  exp_lvl_EX  exp_lvl_MI  exp_lvl_SE  exp_lvl_nan
0          1.0         0.0         0.0         0.0          0.0
1          0.0         0.0         1.0         0.0          0.0
2          1.0         0.0         0.0         0.0          0.0
3          1.0         0.0         0.0         0.0          0.0
4          0.0         0.0         0.0         1.0          0.0
5          NaN         NaN         NaN         NaN          NaN
6          0.0         1.0         0.0         0.0          0.0
7          0.0         1.0         0.0         0.0          0.0
8          NaN         NaN         NaN         NaN          NaN
9          0.0         0.0         0.0         1.0          0.0
10         0.0         0.0         1.0         0.0          0.0
11         0.0         0.0         1.0         0.0          0.0
12         1.0         0.0         0.0         0.0          0.0
13         NaN         NaN         NaN         NaN          NaN
14         0.0         1.0         0.0         0.0          0.0
>>> df
    exp_lvl
0        EN
1        MI
2        EN
3        EN
4        SE
5       NaN
6        EX
7        EX
8       NaN
9        SE
10       MI
11       MI
12       EN
13      NaN
14       EX
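
For anyone who wants to try this end to end, here is a minimal sketch of the same idea wrapped in a helper. Several things here are assumptions rather than part of the original post: the helper name `propagate_nan` is hypothetical, the sample data is made up, and `pd.get_dummies(..., dummy_na=True)` is used only as a stand-in for the `OneHotEncoder` output so the snippet needs nothing beyond pandas and NumPy. It also assumes the dummy frame has the same row order as the source column.

```python
import numpy as np
import pandas as pd


def propagate_nan(source: pd.Series, dummies: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Return a copy of `dummies` with NaN in every `prefix` column
    on the rows where `source` is missing (hypothetical helper)."""
    out = dummies.copy()
    # Boolean mask over the columns: True for columns that start with the prefix
    subset = out.columns.str.startswith(prefix)
    # Boolean mask over the rows, matched by position rather than by index label
    missing = source.isna().to_numpy()
    out.loc[missing, subset] = np.nan
    return out


# Made-up sample data with two missing values
df = pd.DataFrame({"experience_level": ["EN", "MI", np.nan, "SE", "EX", np.nan]})

# Stand-in for the OneHotEncoder output: float dummies plus a *_nan column
df_experience_level = pd.get_dummies(
    df["experience_level"], prefix="experience_level", dummy_na=True, dtype=float
)

result = propagate_nan(df["experience_level"], df_experience_level, "experience_level")
print(result)
```

The original attempt most likely failed because the DataFrame itself was passed as the column indexer; `.loc` expects column labels or a one-dimensional boolean mask there, which is what `str.startswith` provides.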