Referencing two columns in a lambda function in the aggregate method of a grouped DataFrame

Question

I have the following DataFrame:

	Paid_Taxes	Lodging	Arrival_date	Departure_Date	Institution	USER_CODE	FACILITY_ID	COD_AUTH	MANAGER_TAXID	MANAGER_SURNAME	MANAGER_NAME	Booking_Source	Nights	Booking_ID	Guest_Code	Name	Year	Month	Period	Guests
0	6.0	My-B&B	2023-02-03	2023-02-05	U432	U432543453543	23294	4395832fh343j	RSSMRA80P10B602Z	Rossi	Mario	Hotel.com	2	8796059	(1641973827 2968541682)	1	Alessandra Rossi	2023	2	1
1	0.0	My-B&B	2023-02-03	2023-02-05	U432	U432543453543	23294	4395832fh343j	RSSMRA80P10B602Z	Rossi	Mario	Hotel.com	2	9874036	(4987051672 3079823164)	2	Carlotta Bianchi	2023	2	1
2	12.0	My-B&B	2023-03-09	2023-03-11	U432	U432543453543	23294	4395832fh343j	RSSMRA80P10B602Z	Rossi	Mario	Vacation.com	4	3689027	(7098192456 5219374608)	1	Irina Petrova	2023	3	2
3	18.0	My-B&B	2023-03-27	2023-03-30	U432	U432543453543	23294	4395832fh343j	RSSMRA80P10B602Z	Rossi	Mario	Hotel.com	6	3872106	(5412683907 0931874256)	1	Cristina Sanchez	2023	3	2
4	24.0	My-B&B	2023-03-15	2023-03-19	U432	U432543453543	23294	4395832fh343j	RSSMRA80P10B602Z	Rossi	Mario	Hotel.com	8	4169508	(7284310592 0653897214)	1	Stefan Nikolov	2023	3	2
5	18.0	My-B&B	2023-03-23	2023-03-26	U432	U432543453543	23294	4395832fh343j	RSSMRA80P10B602Z	Rossi	Mario	Hotel.com	6	5403918	(6728159304 1738594286)	1	Mikkel Jensen	2023	3	2
6	12.0	My-B&B	2023-03-31	2023-04-01	U432	U432543453543	23294	4395832fh343j	RSSMRA80P10B602Z	Rossi	Mario	Vacation.com	2	8514967	(6289413057 6403891725)	1	Davide Moretti	2023	3	2
7	12.0	My-B&B	2023-04-01	2023-04-02	U432	U432543453543	23294	4395832fh343j	RSSMRA80P10B602Z	Rossi	Mario	Hotel.com	2	8209315	(0462815739 8410932576)	1	Davide Moretti	2023	4	0

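I need to aggregate this DataFrame and create some new columns. I also want to drop some columns that are not needed after the aggregation.

The code I use for the aggregation is: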

df = (
    df.groupby(
        ['Institution', 'USER_CODE', 'FACILITY_ID', 'Year', 'Period', 'Month', 'Guest_Code', 'Booking_Source'],
        as_index=False,
    )
    .agg(**column_map)
    .reset_index(drop=True)
)

Here Guest_Code identifies the type of guest, and Period identifies the quarter of the year.

The column_map used in the code above is defined here:

# columns to be deleted after the aggregation
cols_to_del = ['Booking_ID', 'Arrival_date', 'Departure_Date', 'Name', 'Guests', 'Nights']

# columns I want to keep
column_map = {col: pd.NamedAgg(column=col, aggfunc='first') for col in df.columns if col not in cols_to_del}

# The amount of taxes paid in the same month for each Guest_Code and Booking_Source
column_map['Paid_Taxes'] = pd.NamedAgg(column='Paid_Taxes', aggfunc='sum')

# the total amount of guests in the month divided by guest code and booking source
column_map['All_arrives_per_Month-GuestCode-Source'] = pd.NamedAgg(column='Guests', aggfunc='sum')

# the total amount of nights in the month divided by guest code and booking source
column_map['All_nights_per_Month-GuestCode-Source'] = pd.NamedAgg(column='Nights', aggfunc='sum')

# the total amount of guest who are exempt to pay city taxes (basically who is guest_code=2)
column_map['All_tax-exempt_per_Month-Source'] = pd.NamedAgg(
    column='Guests',
    aggfunc=lambda x: x['Guests'].sum()
    if x[x['Guest_Code'] == 2].shape[0] > 1
    else x['Guests'].to_frame()
    if x[x['Guest_Code'] == 2].shape[0] == 1
    else pd.DataFrame([0], columns=['Guests']),
)

Everything works except the last statement:

column_map['All_tax-exempt_per_Month-Source'] = pd.NamedAgg(
    column='Guests',
    aggfunc=lambda x: x['Guests'].sum()
    if x[x['Guest_Code'] == 2].shape[0] > 1
    else x['Guests'].to_frame()
    if x[x['Guest_Code'] == 2].shape[0] == 1
    else pd.DataFrame([0], columns=['Guests'])
)

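I understand that it fails because x is a Series representing df['Guests'], so x['Guest_Code'] cannot be evaluated inside the lambda function. It would be great if I could specify more than one column in the NamedAgg class. Unfortunately, that is not possible at the moment (even though I am not the only one asking for this feature), and I really cannot see another way to get a column that is:

- the sum of the exempt guests, if the group under consideration has more than one row with Guest_Code=2;
- the number of guests, if the group has only one such row;
- 0, if the group has no rows with Guest_Code=2.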
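To make the limitation concrete, here is a minimal sketch on toy data (the values, the `toy` frame, and the `record_and_sum` helper are made up for illustration and are not part of my real code). It records what pandas hands to the aggregation function when a single column is named in NamedAgg:

```python
import pandas as pd

# Toy data (hypothetical values, not the real bookings).
toy = pd.DataFrame(
    {
        "Booking_Source": ["Hotel.com", "Hotel.com", "Vacation.com"],
        "Guest_Code": [1, 2, 1],
        "Guests": [1, 1, 1],
    }
)

seen_types = []


def record_and_sum(x):
    # Record the type of the object pandas passes to the aggregation function.
    seen_types.append(type(x).__name__)
    return x.sum()


out = toy.groupby("Booking_Source", as_index=False).agg(
    total_guests=pd.NamedAgg(column="Guests", aggfunc=record_and_sum)
)

# Every call received a Series holding only the 'Guests' values of one group,
# so there is no 'Guest_Code' column to look up inside the function.
print(set(seen_types))
print(out)
```

Since the function only ever sees the values of the named column, any condition on a second column has to be evaluated somewhere other than inside that function.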

This is the code that creates the DataFrame:

import pandas as pd

df = pd.DataFrame(
    {
        'Paid_Taxes': [6.0, 0.0, 12.0, 18.0, 24.0, 18.0, 12.0, 12.0],
        'Lodging': ['My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B'],
        'Arrival_date': ['2023-02-03', '2023-02-03', '2023-03-09', '2023-03-27', '2023-03-15', '2023-03-23', '2023-03-31', '2023-04-01'],
        'Departure_Date': ['2023-02-05', '2023-02-05', '2023-03-11', '2023-03-30', '2023-03-19', '2023-03-26', '2023-04-01', '2023-04-02'],
        'Institution': ['U432', 'U432', 'U432', 'U432', 'U432', 'U432', 'U432', 'U432'],
        'USER_CODE': ['U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543'],
        'FACILITY_ID': [23294, 23294, 23294, 23294, 23294, 23294, 23294, 23294],
        'COD_AUTH': ['4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j'],
        'MANAGER_TAXID': ['RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z'],
        'MANAGER_SURNAME': ['Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi'],
        'MANAGER_NAME': ['Mario', 'Mario', 'Mario', 'Mario', 'Mario', 'Mario', 'Mario', 'Mario'],
        'Booking_Source': ['Hotel.com', 'Hotel.com', 'Vacation.com', 'Hotel.com', 'Hotel.com', 'Hotel.com', 'Vacation.com', 'Hotel.com'],
        'Nights': [2, 2, 4, 6, 8, 6, 2, 2],
        'Booking_ID': [8796059, 9874036, 3689027, 3872106, 4169508, 5403918, 8514967, 8209315],
        'Guest_Code': ['(1641973827 2968541682)', '(4987051672 3079823164)', '(7098192456 5219374608)', '(5412683907 0931874256)', '(7284310592 0653897214)', '(6728159304 1738594286)', '(6289413057 6403891725)', '(0462815739 8410932576)'],
        'Name': [1, 2, 1, 1, 1, 1, 1, 1],
        'Year': ['Alessandra Rossi', 'Carlotta Bianchi', 'Irina Petrova', 'Cristina Sanchez', 'Stefan Nikolov', 'Mikkel Jensen', 'Davide Moretti', 'Davide Moretti'],
        'Month': [2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023],
        'Period': [2, 2, 3, 3, 3, 3, 3, 4],
        'Guests': [1, 1, 1, 1, 1, 1, 1, 1],
    }
)
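One possible way around the single-column limitation, offered as a sketch rather than a confirmed answer, is to evaluate the Guest_Code condition before grouping and store the result in a helper column, so that NamedAgg only needs one column. The example below uses toy data and assumes Guest_Code holds integers; the `Exempt_Guests` helper column and the output names are hypothetical. It also assumes the wanted value is simply the number of guests on rows with Guest_Code == 2, which covers all three cases at once (several rows, one row, no rows).

```python
import pandas as pd

# Toy data (hypothetical values); Guest_Code == 2 marks tax-exempt guests.
toy = pd.DataFrame(
    {
        "Booking_Source": ["Hotel.com", "Hotel.com", "Hotel.com", "Vacation.com"],
        "Guest_Code": [1, 2, 2, 1],
        "Guests": [1, 2, 3, 4],
    }
)

# Helper column (assumed name): keep Guests on exempt rows, 0 everywhere else.
toy["Exempt_Guests"] = toy["Guests"].where(toy["Guest_Code"] == 2, 0)

out = toy.groupby("Booking_Source", as_index=False).agg(
    All_guests=pd.NamedAgg(column="Guests", aggfunc="sum"),
    All_tax_exempt=pd.NamedAgg(column="Exempt_Guests", aggfunc="sum"),
)

print(out)
```

Summing the masked column gives 0 for groups without exempt rows, so no special case is needed. If the logic really does need several columns at once, groupby(...).apply with a function that receives the whole sub-DataFrame is another option, at the cost of speed.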
