Referring to two columns in a lambda function in the agg method of a grouped DataFrame


I have the following DataFrame:

Paid_Taxes Lodging Arrival_date Departure_Date Institution USER_CODE FACILITY_ID COD_AUTH MANAGER_TAXID MANAGER_SURNAME MANAGER_NAME Booking_Source Nights Booking_ID Guest_Code Name Period Guests
0 6.0 My-B&B 2023-02-03 2023-02-05 U432 U432543453543 23294 4395832fh343j RSSMRA80P10B602Z Rossi Mario Hotel.com 2 8796059 (1641973827 2968541682) 1 Alessandra Rossi 2023 2 1
1 0.0 My-B&B 2023-02-03 2023-02-05 U432 U432543453543 23294 4395832fh343j RSSMRA80P10B602Z Rossi Mario Hotel.com 2 9874036 (4987051672 3079823164) 2 Carlotta Bianchi 2023 2 1
2 12.0 My-B&B 2023-03-09 2023-03-11 U432 U432543453543 23294 4395832fh343j RSSMRA80P10B602Z Rossi Mario Vacation.com 4 3689027 (7098192456 5219374608) 1 Irina Petrova 2023 3 2
3 18.0 My-B&B 2023-03-27 2023-03-30 U432 U432543453543 23294 4395832fh343j RSSMRA80P10B602Z Rossi Mario Hotel.com 6 3872106 (5412683907 0931874256) 1 Cristina Sanchez 2023 3 2
4 24.0 My-B&B 2023-03-15 2023-03-19 U432 U432543453543 23294 4395832fh343j RSSMRA80P10B602Z Rossi Mario Hotel.com 8 4169508 (7284310592 0653897214) 1 Stefan Nikolov 2023 3 2
5 18.0 My-B&B 2023-03-23 2023-03-26 U432 U432543453543 23294 4395832fh343j RSSMRA80P10B602Z Rossi Mario Hotel.com 6 5403918 (6728159304 1738594286) 1 Mikkel Jensen 2023 3 2
6 12.0 My-B&B 2023-03-31 2023-04-01 U432 U432543453543 23294 4395832fh343j RSSMRA80P10B602Z Rossi Mario Vacation.com 2 8514967 (6289413057 6403891725) 1 Davide Moretti 2023 3 2
7 12.0 My-B&B 2023-04-01 2023-04-02 U432 U432543453543 23294 4395832fh343j RSSMRA80P10B602Z Rossi Mario Hotel.com 2 8209315 (0462815739 8410932576) 1 Davide Moretti 2023 4 0

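I need to aggregate this DataFrame and create some new columns. I also want to drop some columns that are no longer needed after the aggregation.

The code I use for the aggregation is: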

df = (
    df.groupby(
        ['Institution', 'USER_CODE', 'FACILITY_ID', 'Year', 'Period', 'Month', 'Guest_Code', 'Booking_Source'],
        as_index=False,
    )
    .agg(**column_map)
    .reset_index(drop=True)
)

where Guest_Code identifies the type of guest and Period identifies the quarter of the year.

The column_map used in the code above is defined here:

# columns to be deleted after the aggregation
cols_to_del = ['Booking_ID', 'Arrival_date', 'Departure_Date', 'Name', 'Guests', 'Nights']

# columns I want to keep
column_map = {col: pd.NamedAgg(column=col, aggfunc='first') for col in df.columns if col not in cols_to_del}

# The amount of taxes paid in the same month for each Guest_Code and Booking_Source
column_map['Paid_Taxes'] = pd.NamedAgg(column='Paid_Taxes', aggfunc='sum')

# the total amount of guests in the month divided by guest code and booking source
column_map['All_arrives_per_Month-GuestCode-Source'] = pd.NamedAgg(column='Guests', aggfunc='sum')

# the total amount of nights in the month divided by guest code and booking source
column_map['All_nights_per_Month-GuestCode-Source'] = pd.NamedAgg(column='Nights', aggfunc='sum')

# the total number of guests who are exempt from paying city taxes (basically those with Guest_Code == 2)
column_map['All_tax-exempt_per_Month-Source'] = pd.NamedAgg(
    column='Guests',
    aggfunc=lambda x: x['Guests'].sum()
    if x[x['Guest_Code'] == 2].shape[0] > 1
    else x['Guests'].to_frame()
    if x[x['Guest_Code'] == 2].shape[0] == 1
    else pd.DataFrame([0], columns=['Guests']),
)

Everything works except for the last entry:

column_map['All_tax-exempt_per_Month-Source'] = pd.NamedAgg(
    column='Guests',
    aggfunc=lambda x: x['Guests'].sum()
    if x[x['Guest_Code'] == 2].shape[0] > 1
    else x['Guests'].to_frame()
    if x[x['Guest_Code'] == 2].shape[0] == 1
    else pd.DataFrame([0], columns=['Guests'])
)

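I understand that it fails because x is a Series representing df['Guests'], so x['Guest_Code'] cannot be accessed inside the lambda function. It would be great if I could specify more than one column in the NamedAgg class. Unfortunately, that is not possible at the moment (even though I am not the only one asking for this feature), but I really cannot see another way to get a column that is: the sum of the exempt guests if the group under consideration has more than one row with Guest_Code=2; the number of guests of that row if the group has exactly one such row; or 0 if the group has no row with Guest_Code=2.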
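One possible way around the single-column limit, sketched here on a small made-up frame rather than the real data: compute a helper column before grouping that holds Guests on rows where Guest_Code == 2 and 0 elsewhere, then aggregate that column with a plain 'sum'. The three cases described above (several exempt rows, exactly one, none) all reduce to a sum over that masked column. The frame `toy` and the column name `Exempt_Guests` are assumptions made for illustration only, and the grouping keys are reduced to two for brevity.

```python
import pandas as pd

# Toy data (hypothetical, not the frame from the question).
# Guest_Code == 2 is assumed to mark tax-exempt guests.
toy = pd.DataFrame(
    {
        "Month": [2, 2, 3, 3, 3],
        "Booking_Source": ["Hotel.com", "Hotel.com", "Hotel.com", "Vacation.com", "Hotel.com"],
        "Guest_Code": [1, 2, 2, 1, 2],
        "Guests": [1, 1, 2, 1, 3],
    }
)

# Helper column (hypothetical name): Guests on exempt rows, 0 on all other rows.
toy["Exempt_Guests"] = toy["Guests"].where(toy["Guest_Code"] == 2, 0)

# Each NamedAgg still refers to a single column; the two-column logic
# has already been folded into Exempt_Guests.
result = toy.groupby(["Month", "Booking_Source"], as_index=False).agg(
    **{
        "All_arrives_per_Month-Source": pd.NamedAgg(column="Guests", aggfunc="sum"),
        "All_tax-exempt_per_Month-Source": pd.NamedAgg(column="Exempt_Guests", aggfunc="sum"),
    }
)
print(result)
```

The helper column can be dropped after the aggregation, or simply left out of column_map so that it never reaches the result.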

Here is the code to create the DataFrame:

import pandas as pd

df = pd.DataFrame(
    {
        'Paid_Taxes': [6.0, 0.0, 12.0, 18.0, 24.0, 18.0, 12.0, 12.0],
        'Lodging': ['My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B', 'My-B&B'],
        'Arrival_date': ['2023-02-03', '2023-02-03', '2023-03-09', '2023-03-27', '2023-03-15', '2023-03-23', '2023-03-31', '2023-04-01'],
        'Departure_Date': ['2023-02-05', '2023-02-05', '2023-03-11', '2023-03-30', '2023-03-19', '2023-03-26', '2023-04-01', '2023-04-02'],
        'Institution': ['U432', 'U432', 'U432', 'U432', 'U432', 'U432', 'U432', 'U432'],
        'USER_CODE': ['U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543', 'U432543453543'],
        'FACILITY_ID': [23294, 23294, 23294, 23294, 23294, 23294, 23294, 23294],
        'COD_AUTH': ['4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j', '4395832fh343j'],
        'MANAGER_TAXID': ['RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z', 'RSSMRA80P10B602Z'],
        'MANAGER_SURNAME': ['Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi', 'Rossi'],
        'MANAGER_NAME': ['Mario', 'Mario', 'Mario', 'Mario', 'Mario', 'Mario', 'Mario', 'Mario'],
        'Booking_Source': ['Hotel.com', 'Hotel.com', 'Vacation.com', 'Hotel.com', 'Hotel.com', 'Hotel.com', 'Vacation.com', 'Hotel.com'],
        'Nights': [2, 2, 4, 6, 8, 6, 2, 2],
        'Booking_ID': [8796059, 9874036, 3689027, 3872106, 4169508, 5403918, 8514967, 8209315],
        'Guest_Code': ['(1641973827 2968541682)', '(4987051672 3079823164)', '(7098192456 5219374608)', '(5412683907 0931874256)', '(7284310592 0653897214)', '(6728159304 1738594286)', '(6289413057 6403891725)', '(0462815739 8410932576)'],
        'Name': [1, 2, 1, 1, 1, 1, 1, 1],
        'Year': ['Alessandra Rossi', 'Carlotta Bianchi', 'Irina Petrova', 'Cristina Sanchez', 'Stefan Nikolov', 'Mikkel Jensen', 'Davide Moretti', 'Davide Moretti'],
        'Month': [2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023],
        'Period': [2, 2, 3, 3, 3, 3, 3, 4],
        'Guests': [1, 1, 1, 1, 1, 1, 1, 1],
    }
)
python pandas dataframe group-by aggregate