I have the following DataFrame:
| | Paid_Taxes | Lodging | Arrival_date | Departure_Date | Institution | USER_CODE | FACILITY_ID | COD_AUTH | MANAGER_TAXID | MANAGER_SURNAME | MANAGER_NAME | Booking_Source | Nights | Booking_ID | Guest_Code | Name | Year | Month | Period | Guests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | My-B&B | 2023-02-03 | 2023-02-05 | U432 | U432543453543 | 23294 | 4395832fh343j | RSSMRA80P10B602Z | Rossi | Mario | Hotel.com | 2 | 8796059 | (1641973827 2968541682) | 1 | Alessandra Rossi | 2023 | 2 | 1 |
| 1 | 0.0 | My-B&B | 2023-02-03 | 2023-02-05 | U432 | U432543453543 | 23294 | 4395832fh343j | RSSMRA80P10B602Z | Rossi | Mario | Hotel.com | 2 | 9874036 | (4987051672 3079823164) | 2 | Carlotta Bianchi | 2023 | 2 | 1 |
| 2 | 12.0 | My-B&B | 2023-03-09 | 2023-03-11 | U432 | U432543453543 | 23294 | 4395832fh343j | RSSMRA80P10B602Z | Rossi | Mario | Vacation.com | 4 | 3689027 | (7098192456 5219374608) | 1 | Irina Petrova | 2023 | 3 | 2 |
| 3 | 18.0 | My-B&B | 2023-03-27 | 2023-03-30 | U432 | U432543453543 | 23294 | 4395832fh343j | RSSMRA80P10B602Z | Rossi | Mario | Hotel.com | 6 | 3872106 | (5412683907 0931874256) | 1 | Cristina Sanchez | 2023 | 3 | 2 |
| 4 | 24.0 | My-B&B | 2023-03-15 | 2023-03-19 | U432 | U432543453543 | 23294 | 4395832fh343j | RSSMRA80P10B602Z | Rossi | Mario | Hotel.com | 8 | 4169508 | (7284310592 0653897214) | 1 | Stefan Nikolov | 2023 | 3 | 2 |
| 5 | 18.0 | My-B&B | 2023-03-23 | 2023-03-26 | U432 | U432543453543 | 23294 | 4395832fh343j | RSSMRA80P10B602Z | Rossi | Mario | Hotel.com | 6 | 5403918 | (6728159304 1738594286) | 1 | Mikkel Jensen | 2023 | 3 | 2 |
| 6 | 12.0 | My-B&B | 2023-03-31 | 2023-04-01 | U432 | U432543453543 | 23294 | 4395832fh343j | RSSMRA80P10B602Z | Rossi | Mario | Vacation.com | 2 | 8514967 | (6289413057 6403891725) | 1 | Davide Moretti | 2023 | 3 | 2 |
| 7 | 12.0 | My-B&B | 2023-04-01 | 2023-04-02 | U432 | U432543453543 | 23294 | 4395832fh343j | RSSMRA80P10B602Z | Rossi | Mario | Hotel.com | 2 | 8209315 | (0462815739 8410932576) | 1 | Davide Moretti | 2023 | 4 | 0 |
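I need to aggregate this DataFrame and create some new columns. I also want to drop a few columns that are no longer needed after the aggregation.
The code I use for the aggregation is: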
df = (
    df.groupby(
        ['Institution', 'USER_CODE', 'FACILITY_ID', 'Year', 'Period', 'Month', 'Guest_Code', 'Booking_Source'],
        as_index=False,
    )
    .agg(**column_map)
    .reset_index(drop=True)
)
Here Guest_Code identifies the type of guest, and Period identifies the quarter of the year.
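As a minimal sketch of the pattern in my aggregation code, here is the same idea on a small made-up frame: a dict of pd.NamedAgg entries is unpacked into .agg(), so each key becomes an output column. The names toy, toy_map and All_nights and all values are hypothetical, not my real data.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the pattern
toy = pd.DataFrame(
    {
        "Month": [2, 2, 3, 3],
        "Booking_Source": ["Hotel.com", "Hotel.com", "Hotel.com", "Vacation.com"],
        "Paid_Taxes": [6.0, 0.0, 18.0, 12.0],
        "Nights": [2, 2, 6, 4],
    }
)

# Each key is an output column; each value names one input column and a reducer
toy_map = {
    "Paid_Taxes": pd.NamedAgg(column="Paid_Taxes", aggfunc="sum"),
    "All_nights": pd.NamedAgg(column="Nights", aggfunc="sum"),
}

# Unpacking the dict is equivalent to writing the keyword arguments by hand
toy_agg = toy.groupby(["Month", "Booking_Source"], as_index=False).agg(**toy_map)
```

Each pd.NamedAgg refers to exactly one input column, which is the restriction that matters for the tax-exempt column.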
The column_map passed to .agg(**column_map) is defined here:
# columns to be deleted after the aggregation
cols_to_del = ['Booking_ID', 'Arrival_date', 'Departure_Date', 'Name', 'Guests', 'Nights']
# columns I want to keep
column_map = {col: pd.NamedAgg(column=col, aggfunc='first') for col in df.columns if col not in cols_to_del}
# The amount of taxes paid in the same month for each Guest_Code and Booking_Source
column_map['Paid_Taxes'] = pd.NamedAgg(column='Paid_Taxes', aggfunc='sum')
# total number of guests in the month, broken down by guest code and booking source
column_map['All_arrives_per_Month-GuestCode-Source'] = pd.NamedAgg(column='Guests', aggfunc='sum')
# total number of nights in the month, broken down by guest code and booking source
column_map['All_nights_per_Month-GuestCode-Source'] = pd.NamedAgg(column='Nights', aggfunc='sum')
# total number of guests who are exempt from city tax (i.e. rows with Guest_Code == 2)
column_map['All_tax-exempt_per_Month-Source'] = pd.NamedAgg(
    column='Guests',
    aggfunc=lambda x: x['Guests'].sum()
    if x[x['Guest_Code'] == 2].shape[0] > 1
    else x['Guests'].to_frame()
    if x[x['Guest_Code'] == 2].shape[0] == 1
    else pd.DataFrame([0], columns=['Guests']),
)
Everything works except the last entry:
column_map['All_tax-exempt_per_Month-Source'] = pd.NamedAgg(
    column='Guests',
    aggfunc=lambda x: x['Guests'].sum()
    if x[x['Guest_Code'] == 2].shape[0] > 1
    else x['Guests'].to_frame()
    if x[x['Guest_Code'] == 2].shape[0] == 1
    else pd.DataFrame([0], columns=['Guests'])
)
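I understand that it fails because x is a Series representing df['Guests'], so x['Guest_Code'] cannot be evaluated inside the lambda function.
It would be great if I could specify multiple columns in NamedAgg. Unfortunately that is not possible at the moment (and I am not the only one asking for this feature), but I really cannot see another way to obtain a column that holds: the sum of the exempt guests if the group contains more than one row with Guest_Code=2; the number of guests of that row if the group contains exactly one such row; or 0 if the group has no row with Guest_Code=2.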
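To make the limitation concrete, and to sketch one possible workaround, here is a small example on made-up data. The names toy, record_and_sum, Exempt_Guests and All_tax_exempt are hypothetical, and Guest_Code is assumed to hold the integers 1 and 2. The callable inside pd.NamedAgg only ever sees the Guests values of one group, so the Guest_Code filter has to be applied before grouping. All three cases described above (several exempt rows, exactly one, none) reduce to summing Guests over the rows where Guest_Code == 2, which a masked helper column can express.

```python
import pandas as pd

# Hypothetical toy data: Guest_Code 2 marks tax-exempt guests
toy = pd.DataFrame(
    {
        "Month": [2, 2, 3, 3, 3],
        "Booking_Source": ["Hotel.com", "Hotel.com", "Hotel.com", "Hotel.com", "Vacation.com"],
        "Guest_Code": [1, 2, 2, 2, 1],
        "Guests": [1, 1, 2, 1, 3],
    }
)
keys = ["Month", "Booking_Source"]

# 1) Record what the aggfunc is handed for each group
seen_types = set()

def record_and_sum(x):
    seen_types.add(type(x).__name__)
    return x.sum()

toy.groupby(keys).agg(total=pd.NamedAgg(column="Guests", aggfunc=record_and_sum))
# seen_types now holds the type names that were passed in

# 2) Possible workaround: keep Guests only on exempt rows, 0 elsewhere
toy["Exempt_Guests"] = toy["Guests"].where(toy["Guest_Code"] == 2, 0)

result = toy.groupby(keys, as_index=False).agg(
    All_guests=pd.NamedAgg(column="Guests", aggfunc="sum"),
    All_tax_exempt=pd.NamedAgg(column="Exempt_Guests", aggfunc="sum"),
)
```

On this toy frame the three groups cover the three cases: one exempt row, several exempt rows, and none, giving All_tax_exempt values of 1, 3 and 0. If Guest_Code stays among the grouping keys, each group contains a single code, so the helper column would simply sum to 0 for every non-exempt group.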
Here is the code to create the DataFrame:
import pandas as pd

df = pd.DataFrame(
    {
        'Paid_Taxes': [6.0, 0.0, 12.0, 18.0, 24.0, 18.0, 12.0, 12.0],
        'Lodging': ['My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B'],
        'Arrival_date': ['2023-02-03', '2023-02-03', '2023-03-09', '2023-03-27', '2023-03-15', '2023-03-23', '2023-03-31', '2023-04-01'],
        'Departure_Date': ['2023-02-05', '2023-02-05', '2023-03-11', '2023-03-30', '2023-03-19', '2023-03-26', '2023-04-01', '2023-04-02'],
        'Institution': ['U432', 'U432', 'U432', 'U432', 'U432', 'U432', 'U432', 'U432'],
        'USER_CODE': ['U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543'],
        'FACILITY_ID': [23294, 23294, 23294, 23294, 23294, 23294, 23294, 23294],
        'COD_AUTH': ['4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j'],
        'MANAGER_TAXID': ['RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z'],
        'MANAGER_SURNAME': ['Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi'],
        'MANAGER_NAME': ['Mario', 'Mario', 'Mario', 'Mario', 'Mario', 'Mario', 'Mario', 'Mario'],
        'Booking_Source': ['Hotel.com', 'Hotel.com', 'Vacation.com', 'Hotel.com', 'Hotel.com', 'Hotel.com', 'Vacation.com', 'Hotel.com'],
        'Nights': [2, 2, 4, 6, 8, 6, 2, 2],
        'Booking_ID': [8796059, 9874036, 3689027, 3872106, 4169508, 5403918, 8514967, 8209315],
        'Guest_Code': ['(1641973827 2968541682)', '(4987051672 3079823164)', '(7098192456 5219374608)', '(5412683907 0931874256)', '(7284310592 0653897214)', '(6728159304 1738594286)', '(6289413057 6403891725)', '(0462815739 8410932576)'],
        'Name': [1, 2, 1, 1, 1, 1, 1, 1],
        'Year': ['Alessandra Rossi', 'Carlotta Bianchi', 'Irina Petrova', 'Cristina Sanchez', 'Stefan Nikolov', 'Mikkel Jensen', 'Davide Moretti', 'Davide Moretti'],
        'Month': [2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023],
        'Period': [2, 2, 3, 3, 3, 3, 3, 4],
        'Guests': [1, 1, 1, 1, 1, 1, 1, 1],
    }
)
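Since the restriction comes from NamedAgg seeing one column at a time, another possible route is groupby(...).apply, which hands each group to the function as a DataFrame so that Guest_Code and Guests can be combined. The following is only a sketch on made-up data. The names toy, exempt_guests and All_tax_exempt are hypothetical, and Guest_Code is assumed to hold the integers 1 and 2.

```python
import pandas as pd

# Hypothetical toy data: Guest_Code 2 marks tax-exempt guests
toy = pd.DataFrame(
    {
        "Month": [2, 2, 3, 3, 3],
        "Booking_Source": ["Hotel.com", "Hotel.com", "Hotel.com", "Hotel.com", "Vacation.com"],
        "Guest_Code": [1, 2, 2, 2, 1],
        "Guests": [1, 1, 2, 1, 3],
    }
)

def exempt_guests(group: pd.DataFrame) -> int:
    """Sum Guests over the rows of one group whose Guest_Code is 2."""
    return int(group.loc[group["Guest_Code"] == 2, "Guests"].sum())

# Selecting the two needed columns keeps the grouping keys out of each sub-frame
exempt = (
    toy.groupby(["Month", "Booking_Source"])[["Guest_Code", "Guests"]]
    .apply(exempt_guests)
    .rename("All_tax_exempt")
    .reset_index()
)
```

The resulting frame has one row per group and could then be merged onto the NamedAgg result using the grouping keys. On large data this is likely slower than a vectorised sum, because the Python function runs once per group.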