I have a sample CSV like this:
keys key_regex datatype detailed_datatype precedence val_regex val_regex_2 val_regex_3 max_words alpha_char_check
0 billingAddress original_billing_key_regex alphabetic address primary NaN NaN NaN NaN NaN
1 deliveryAddress original_delivery_key_regex alphabetic address primary NaN NaN NaN NaN NaN
2 notifyParty original_notify_party_regex alphabetic alphabetic primary NaN NaN NaN NaN NaN
3 originAddress original_seller_address_regex alphabetic address primary NaN NaN NaN NaN NaN
4 billingAddressAlt alternative_billing_key_regex alphabetic address tertiary NaN NaN NaN NaN NaN
5 deliveryAddressAlt alternative_delivery_key_regex alphabetic address tertiary NaN NaN NaN 5.0 1.0
6 originAddressAlt alternative_seller_key_regex alphabetic address tertiary NaN sample_val_re1 NaN NaN 0.0
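I am trying to replace each row whose keys value is a key of tertiary_row_replacement_dict with the row whose keys value is the corresponding dictionary value, then rename that row's keys value to the original key and change its precedence from 'tertiary' to 'primary', keeping the index position the same as before.
The expected output is: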
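keys key_regex datatype detailed_datatype precedence val_regex val_regex_2 val_regex_3 max_words alpha_char_check
0 billingAddress alternative_billing_key_regex alphabetic address primary NaN NaN NaN NaN NaN
1 deliveryAddress alternative_delivery_key_regex alphabetic address primary NaN NaN NaN 5.0 1.0
2 notifyParty original_notify_party_regex alphabetic alphabetic primary NaN NaN NaN NaN NaN
3 originAddress alternative_seller_key_regex alphabetic address primary NaN sample_val_re1 NaN NaN 0.0
There are 3 original CSVs. Each of them is large and contains many similar cases, i.e. keys with primary precedence and alternate keys with tertiary precedence. I use a dictionary of keys like this:
tertiary_row_replacement_dict = {
    "originAddress": "originAddressAlt",
    "deliveryAddress": "deliveryAddressAlt",
    # "totalAmount": "totalAmountAlt",
    "billingAddress": "billingAddressAlt"
    ....
}
Given that the keys and corresponding values of this dictionary will always be present in the CSV, I have this code:
for k, new_k in tertiary_row_replacement_dict.items():
    # index label of the alternate (tertiary) row
    t2 = df.loc[df['keys']==new_k].index[0]
    # overwrite the primary row with the alternate row's values, turning 'tertiary' into 'primary'
    df.loc[df.loc[df['keys']==k].index[0]] = [i if i!='tertiary' else 'primary' for i in df.loc[t2]]
    # rename the alternate key to the primary key and drop the alternate row
    df = df.replace([new_k, 'tertiary'], [k, 'primary']).drop([t2])
It does what I want. Running it on the test CSV alone takes about 0.034 seconds, and it is probably not the best or most optimized way to handle this case of replacing rows and cell values. Is there a faster alternative, given prior knowledge of which row replaces which (i.e. using the dictionary is not mandatory; it could be a list of tuples or a list of lists as a speed trade-off)?

You can use replace to map the tertiary keys to the primary keys, and then use groupby().first() to fill in the information: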
inverse_dict = {v:k for k,v in tertiary_row_replacement_dict.items()}
(df.groupby(df['keys'].replace(inverse_dict))
.first()
.reset_index(drop=True)
)
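For anyone who wants to try this on the sample data, here is a minimal sketch, not a definitive implementation. It rebuilds a trimmed copy of the frame (only a few of the original columns, which is my simplification), runs the groupby().first() approach, and compares it with a second, filter-and-rename variant. The helper name promote_alternates is hypothetical and assumes the keys column is unique. One thing worth checking in your own data: first() keeps the first non-null value per column within each group, so where both the primary and the alternate row have a value (key_regex here), the primary row's value is kept, whereas the expected output in the question shows the alternate row's value.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the sample data (only a few of the original columns).
df = pd.DataFrame({
    "keys": ["billingAddress", "deliveryAddress", "notifyParty", "originAddress",
             "billingAddressAlt", "deliveryAddressAlt", "originAddressAlt"],
    "key_regex": ["original_billing_key_regex", "original_delivery_key_regex",
                  "original_notify_party_regex", "original_seller_address_regex",
                  "alternative_billing_key_regex", "alternative_delivery_key_regex",
                  "alternative_seller_key_regex"],
    "precedence": ["primary"] * 4 + ["tertiary"] * 3,
    "val_regex_2": [np.nan] * 6 + ["sample_val_re1"],
    "max_words": [np.nan] * 5 + [5.0, np.nan],
})

mapping = {
    "originAddress": "originAddressAlt",
    "deliveryAddress": "deliveryAddressAlt",
    "billingAddress": "billingAddressAlt",
}

# Approach from the answer: group each alternate row with its primary row
# and keep the first non-null value per column.
inverse_dict = {alt: key for key, alt in mapping.items()}
grouped = (df.groupby(df["keys"].replace(inverse_dict))
             .first()
             .reset_index(drop=True))


def promote_alternates(frame, mapping):
    """Hypothetical helper: keep the alternate rows in place of the primary rows.

    Assumes every value in the keys column is unique.
    """
    inverse = {alt: key for key, alt in mapping.items()}
    # Original index label of every key, used to restore positions later.
    label_of = pd.Series(frame.index, index=frame["keys"])
    # Drop the primary rows that have an alternate.
    out = frame[~frame["keys"].isin(list(mapping))].copy()
    # Promote the alternate rows and rename them to the primary key.
    is_alt = out["keys"].isin(list(inverse))
    out.loc[is_alt, "precedence"] = "primary"
    out["keys"] = out["keys"].replace(inverse)
    # Give each row the index label its primary key had originally.
    out.index = out["keys"].map(label_of).to_numpy()
    return out.sort_index()


result = promote_alternates(df, mapping)
print(grouped)
print(result)
```

Both variants avoid the Python-level loop over the dictionary. Which one is faster on the real CSVs is something to measure (for example with timeit) rather than assume.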