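My organization uses special codes for various employee attributes. We are migrating to a new system, and I have to map these codes to new codes according to certain logic.
Here is my mappings df, Mappings: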
State  Old_Mgmt  New_Mgmt  Old_ID  New_ID  New_Site
01     A001      A100      0000    0101    123
01     A002      A100      0000    0102
01     A003      A105      0000    0103    123
02     A001      A100      0000    0101
And here is EmployeeData:
State  Management  ID    Site
01     A001        0000  456
01     A002        0000  987
02     A002        0000  987
....
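For anyone who wants to reproduce this, here is a minimal sketch that builds the two frames from the sample rows above. The CSV layout and the variable names `mappings_csv` and `employee_csv` are assumptions made for illustration; only the values come from the tables. Reading with `dtype=str` keeps the leading zeros in codes such as `01` and `0000`, which would otherwise be parsed as numbers.

```python
import io

import pandas as pd

# Sample rows copied from the tables above; the CSV layout itself is an assumption.
mappings_csv = """State,Old_Mgmt,New_Mgmt,Old_ID,New_ID,New_Site
01,A001,A100,0000,0101,123
01,A002,A100,0000,0102,
01,A003,A105,0000,0103,123
02,A001,A100,0000,0101,
"""
employee_csv = """State,Management,ID,Site
01,A001,0000,456
01,A002,0000,987
02,A002,0000,987
"""

# dtype=str keeps leading zeros ("01", "0000"); empty New_Site cells become NaN.
Mappings = pd.read_csv(io.StringIO(mappings_csv), dtype=str)
EmployeeData = pd.read_csv(io.StringIO(employee_csv), dtype=str)

print(Mappings)
print(EmployeeData)
```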
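The mapping logic is to go through each row of EmployeeData and, if there is a match on State, Management, and ID, update Management and ID to the corresponding New_ values. For Site, however, the site ID is updated only when New_Site is not empty/NaN. The mapping should modify the original dataframe.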
Given the mappings above, the new EmployeeData would be:
State  Management  ID    Site
01     A100        0101  123   (modified this row)
01     A100        0102  987   (modified this row)
02     A002        0000  987
....
My initial idea was to do something like this:
for i, r in EmployeeData.iterrows():  # For each employee row
    # Create masks for the filters we are looking for
    mask_state = Mappings['State'] == r['State']
    mask_mgmt = Mappings['Old_Mgmt'] == r['Management']
    mask_id = Mappings['Old_ID'] == r['ID']
    # Filter mappings for the above 3 conditions
    MATCH = Mappings[mask_state & mask_mgmt & mask_id]
    if MATCH.empty:  # No matches found
        print("No matches found in mapping. No need to update. Skipping.")
        continue
    MATCH = MATCH.iloc[0]  # If a match is found, it will correspond to only 1 row
    EmployeeData.at[i, 'Management'] = MATCH['New_Mgmt']
    EmployeeData.at[i, 'ID'] = MATCH['New_ID']
    if pd.notna(MATCH['New_Site']):
        EmployeeData.at[i, 'Site'] = MATCH['New_Site']
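However, this seems rather inefficient, because I have to filter the mappings for every row. If I were mapping only one column, I would do something like this: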
# Make a dict mapping Old_Mgmt -> New_Mgmt
MGMT_MAPPING = pd.Series(Mappings['New_Mgmt'].values, index=Mappings['Old_Mgmt']).to_dict()
# Restrict the update to one state ('01' is just an example value)
mask_state = EmployeeData['State'] == '01'
EmployeeData.loc[mask_state, 'Management'] = EmployeeData.loc[mask_state, 'Management'].replace(MGMT_MAPPING)
But this does not work in my case, because I need to map multiple values.
Try:
# merge mappings to EmployeeData
out = EmployeeData.merge(
    Mappings,
    left_on=["State", "Management"],
    right_on=["State", "Old_Mgmt"],
    how="left",
)
# fill NaN values with old values
out["New_ID"] = out["New_ID"].fillna(out["ID"])
out["New_Mgmt"] = out["New_Mgmt"].fillna(out["Management"])
# create final dataframe, rename columns
out = out[["State", "New_Mgmt", "New_ID", "Site"]].rename(
    columns={"New_Mgmt": "Management", "New_ID": "ID"}
)
print(out)
Prints:
State Management ID Site
0 1 A100 101.0 456
1 1 A100 102.0 987
2 2 A002 0.0 987
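Note that the merge above joins on State and Management only and leaves Site as it was, whereas the question also asks for a match on ID and for Site to be replaced when New_Site is present. The following is one possible sketch of that fuller rule, not a definitive implementation. It assumes the codes are stored as strings (so leading zeros survive) and that each (State, Old_Mgmt, Old_ID) combination appears at most once in Mappings, which `validate="many_to_one"` checks. The name `merged` is introduced here for illustration.

```python
import pandas as pd

# Sample data from the question, kept as strings so leading zeros survive.
Mappings = pd.DataFrame({
    "State": ["01", "01", "01", "02"],
    "Old_Mgmt": ["A001", "A002", "A003", "A001"],
    "New_Mgmt": ["A100", "A100", "A105", "A100"],
    "Old_ID": ["0000", "0000", "0000", "0000"],
    "New_ID": ["0101", "0102", "0103", "0101"],
    "New_Site": ["123", None, "123", None],
})
EmployeeData = pd.DataFrame({
    "State": ["01", "01", "02"],
    "Management": ["A001", "A002", "A002"],
    "ID": ["0000", "0000", "0000"],
    "Site": ["456", "987", "987"],
})

# Left merge on all three keys. A left merge keeps the row order of
# EmployeeData, and validate= raises if a key matches several mapping rows.
merged = EmployeeData.merge(
    Mappings,
    left_on=["State", "Management", "ID"],
    right_on=["State", "Old_Mgmt", "Old_ID"],
    how="left",
    validate="many_to_one",
)

# Where there is no mapping (or no New_Site), fall back to the old value.
# .to_numpy() writes back by position, so index alignment cannot interfere.
EmployeeData["Management"] = merged["New_Mgmt"].fillna(merged["Management"]).to_numpy()
EmployeeData["ID"] = merged["New_ID"].fillna(merged["ID"]).to_numpy()
EmployeeData["Site"] = merged["New_Site"].fillna(merged["Site"]).to_numpy()

print(EmployeeData)
```

On the sample rows this updates the first two employees and leaves the third untouched, matching the expected table in the question. Because the new columns are assigned back onto EmployeeData, the original dataframe is modified in place rather than replaced by a new one.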