I am trying to merge two tables using pandas.merge_asof.
The first table, administrators_system_with_schemes_sort:
salon_id | staff_id | date |
---|---|---|
872646 | 2715596 | 2024-10-02 00:00:00 |
872646 | 2715596 | 2024-10-03 00:00:00 |
872646 | 2715596 | 2024-10-06 00:00:00 |
872646 | 2715596 | 2024-10-07 00:00:00 |
872646 | 2715596 | 2024-10-10 00:00:00 |
872646 | 2715596 | 2024-10-11 00:00:00 |
872646 | 2715596 | 2024-10-14 00:00:00 |
872646 | 2715596 | 2024-10-15 00:00:00 |
The second table, bonus_and_penalty_for_staff_id_administrators_sort:
salon_id | staff_id | date | bonus | penalty |
---|---|---|---|---|
872646 | 2715596 | 2024-10-12 00:00:00 | 4070 | 0 |
My code:
astype_dict = {
    'salon_id': 'int64', 'staff_id': 'int64'
    , 'date': 'datetime64[ns]'
}
administrators_system_with_schemes['date'] = [pd.to_datetime(date).date() for date in administrators_system_with_schemes['date']]
bonus_and_penalty_for_staff_id_administrators['date'] = [pd.to_datetime(date).date() for date in bonus_and_penalty_for_staff_id_administrators['date']]
administrators_system_with_schemes_sort = (
    administrators_system_with_schemes.copy()
    .astype(astype_dict)
    .sort_values(by='date')
)
bonus_and_penalty_for_staff_id_administrators_sort = (
    bonus_and_penalty_for_staff_id_administrators.copy()
    .astype(astype_dict)
    .sort_values(by='date')
)
administrators_system_with_schemes_with_additional_bonus_penalty = (
    pd.merge_asof(
        left = administrators_system_with_schemes_sort
        , right = bonus_and_penalty_for_staff_id_administrators_sort
        , on = ['date']
        , by = ['salon_id', 'staff_id']
        , suffixes=['', '_y']
        , direction='nearest'
    ))
Result:
| salon_id | staff_id | date | bonus | penalty |
|-----------:|-----------:|:--------------------|--------:|----------:|
| 872646 | 2715596 | 2024-10-02 00:00:00 | 0 | 0 |
| 872646 | 2715596 | 2024-10-03 00:00:00 | 0 | 0 |
| 872646 | 2715596 | 2024-10-06 00:00:00 | 0 | 0 |
| 872646 | 2715596 | 2024-10-07 00:00:00 | 0 | 0 |
| 872646 | 2715596 | 2024-10-10 00:00:00 | 0 | 0 |
| 872646 | 2715596 | 2024-10-11 00:00:00 | 0 | 0 |
| 872646 | 2715596 | 2024-10-14 00:00:00 | 0 | 0 |
| 872646 | 2715596 | 2024-10-15 00:00:00 | 0 | 0 |
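The result is wrong, because the second table contains a matching value. I have tried many ways of changing the data types, but I still get this error.
Any ideas on how to fix this?
Thanks.
pandas version: 2.1.4 (the same error occurs on 2.2.3). Python version: 3.11.7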
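As a first check, it may help to run the same call on only the rows shown in the question. The following is a minimal sketch, assuming the real DataFrames contain nothing beyond those rows (which may well not hold); the names `left`, `right` and `nearest` are placeholders introduced here. With a single right-hand row and no `tolerance`, `direction='nearest'` should match every left row to that row, so if the real data still produces zeros, the cause is probably in rows or dtypes that are not shown.

```python
import pandas as pd

# Rebuild the two tables from the rows shown in the question (assumption:
# the real tables hold no additional rows).
left = pd.DataFrame({
    "salon_id": [872646] * 8,
    "staff_id": [2715596] * 8,
    "date": pd.to_datetime([
        "2024-10-02", "2024-10-03", "2024-10-06", "2024-10-07",
        "2024-10-10", "2024-10-11", "2024-10-14", "2024-10-15",
    ]),
})
right = pd.DataFrame({
    "salon_id": [872646],
    "staff_id": [2715596],
    "date": pd.to_datetime(["2024-10-12"]),
    "bonus": [4070],
    "penalty": [0],
})

# merge_asof needs both frames sorted by the `on` key, and the `on`/`by`
# columns must have the same dtype on both sides.
print(left["date"].is_monotonic_increasing, right["date"].is_monotonic_increasing)
print(left.dtypes)
print(right.dtypes)

nearest = pd.merge_asof(
    left,
    right,
    on="date",
    by=["salon_id", "staff_id"],
    direction="nearest",
)
print(nearest)
```

On this sample every row comes back with a bonus of 4070, not 0, which suggests comparing the dtypes and the full contents of the real tables against this reduced version.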
import pandas as pd


class DataMerger:
    def __init__(self, admins_df, bonuses_df):
        self.admins_df = admins_df
        self.bonuses_df = bonuses_df
        self.astype_dict = {
            'salon_id': 'int64',
            'staff_id': 'int64',
            'date': 'datetime64[ns]'
        }

    def preprocess_data(self):
        # Converting "date" columns to datetime format and ensuring the required data types
        self.admins_df['date'] = pd.to_datetime(self.admins_df['date'])
        self.bonuses_df['date'] = pd.to_datetime(self.bonuses_df['date'])
        # We apply typing and sorting
        self.admins_df = self.admins_df.astype(
            self.astype_dict).sort_values(by='date')
        self.bonuses_df = self.bonuses_df.astype(
            self.astype_dict).sort_values(by='date')

    def merge_data(self):
        # Using merge_asof to merge data
        merged_df = pd.merge_asof(
            left=self.admins_df,
            right=self.bonuses_df,
            on='date',
            by=['salon_id', 'staff_id'],
            suffixes=('', '_y'),
            direction='backward'  # Use backward to take into account the closest past values
        )
        # Fill NaN values with zeros for bonuses and penalties
        merged_df[['bonus', 'penalty']] = merged_df[[
            'bonus', 'penalty']].fillna(0).astype(int)
        return merged_df


if __name__ == "__main__":
    # Assume administrators_system_with_schemes and bonus_and_penalty_for_staff_id_administrators are your DataFrames
    administrators_system_with_schemes = pd.DataFrame({
        'salon_id': [872646] * 8,
        'staff_id': [2715596] * 8,
        'date': [
            '2024-10-02', '2024-10-03', '2024-10-06', '2024-10-07',
            '2024-10-10', '2024-10-11', '2024-10-14', '2024-10-15'
        ]
    })
    bonus_and_penalty_for_staff_id_administrators = pd.DataFrame({
        'salon_id': [872646],
        'staff_id': [2715596],
        'date': ['2024-10-12'],
        'bonus': [4070],
        'penalty': [0]
    })
    # Create an instance of the class and perform operations
    merger = DataMerger(administrators_system_with_schemes,
                        bonus_and_penalty_for_staff_id_administrators)
    merger.preprocess_data()
    print(merger.merge_data())
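
The choice of `direction` decides which rows pick up the bonus, so it is worth seeing the options side by side. Below is a rough sketch on the same sample data; the helper `merge_bonus` and the names `schedule` and `bonuses` are made up for this illustration, and the 36-hour `tolerance` is an arbitrary value chosen only to show the effect, not something taken from the question.

```python
import pandas as pd

schedule = pd.DataFrame({
    "salon_id": [872646] * 8,
    "staff_id": [2715596] * 8,
    "date": pd.to_datetime([
        "2024-10-02", "2024-10-03", "2024-10-06", "2024-10-07",
        "2024-10-10", "2024-10-11", "2024-10-14", "2024-10-15",
    ]),
})
bonuses = pd.DataFrame({
    "salon_id": [872646],
    "staff_id": [2715596],
    "date": pd.to_datetime(["2024-10-12"]),
    "bonus": [4070],
    "penalty": [0],
})


def merge_bonus(direction, tolerance=None):
    """Hypothetical helper: return the bonus column for a given direction."""
    merged = pd.merge_asof(
        schedule,
        bonuses,
        on="date",
        by=["salon_id", "staff_id"],
        direction=direction,
        tolerance=tolerance,
    )
    # Unmatched rows come back as NaN; treat them as "no bonus"
    return merged["bonus"].fillna(0).astype(int).tolist()


# Last bonus dated on or before each shift: only 10-14 and 10-15 match
backward = merge_bonus("backward")
# First bonus dated on or after each shift: everything up to 10-11 matches
forward = merge_bonus("forward")
# Closest bonus in either direction, with no limit: every shift matches
nearest = merge_bonus("nearest")
# Closest bonus, but only within 36 hours: only 10-11 matches
nearest_36h = merge_bonus("nearest", tolerance=pd.Timedelta(hours=36))

print(backward)     # [0, 0, 0, 0, 0, 0, 4070, 4070]
print(forward)      # [4070, 4070, 4070, 4070, 4070, 4070, 0, 0]
print(nearest)      # [4070, 4070, 4070, 4070, 4070, 4070, 4070, 4070]
print(nearest_36h)  # [0, 0, 0, 0, 0, 4070, 0, 0]
```

If the bonus should be attached to a single shift, or only to shifts close to the bonus date, combining `direction` with a `tolerance` is probably closer to that intent than `direction` alone.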