Is there a way to replicate the distribution of values from one dataframe in another, based on Pandas GroupBy counts?

Question (votes: 0, answers: 1)

I have two Pandas dataframes with the same structure but different shapes and values:

import pandas as pd

dataframe_1 = pd.DataFrame({'customer_id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'], 
                        'gender': ['M', 'M', 'F', 'F', 'F', 'F'],
                        'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25'],
                        'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Asia']})

dataframe_2 = pd.DataFrame({'customer_id': ['id11', 'id12', 'id13', 'id14', 'id15', 'id16', 'id17', 'id18', 'id19', 'id20', 'id21'], 
                        'gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F'],
                        'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25', '25-40', '60+', '18-25', '60+', '18-25'],
                        'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Europe', 'Africa', 'Australia', 'Asia', 'Europe', 'Asia']})

I ran a GroupBy on dataframe_1 to count the number of customers in each group, and got the distribution back as a dataframe:

pd.DataFrame(dataframe_1.groupby(['gender', 'age', 'region'])['customer_id'].count()).reset_index()

Output:

    gender  age     region  customer_id
0   F       18-25   America 1
1   F       18-25   Asia    2
2   F       60+     Europe  1
3   M       18-25   America 1
4   M       25-40   Africa  1

Is there a way to impose this distribution on dataframe_2, so that I get different customer_ids whose attributes match each group?

So for row 0 (['F', '18-25', 'America']), it would be id13, the only option in dataframe_2.

For row 1 (['F', '18-25', 'Asia']), it would be any 2 distinct ids out of [id14, id19, id21], and so on.
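To make the expected matches concrete, here is a minimal sketch that lists, for every group present in dataframe_1, the customer_ids in dataframe_2 that share the same values. The names `keys`, `groups` and `candidates` are illustrative only and are not part of the original code.

```python
import pandas as pd

dataframe_1 = pd.DataFrame({
    'customer_id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
    'gender': ['M', 'M', 'F', 'F', 'F', 'F'],
    'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25'],
    'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Asia'],
})
dataframe_2 = pd.DataFrame({
    'customer_id': ['id11', 'id12', 'id13', 'id14', 'id15', 'id16',
                    'id17', 'id18', 'id19', 'id20', 'id21'],
    'gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F'],
    'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25',
            '25-40', '60+', '18-25', '60+', '18-25'],
    'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Europe',
               'Africa', 'Australia', 'Asia', 'Europe', 'Asia'],
})

keys = ['gender', 'age', 'region']  # illustrative helper name

# Distinct (gender, age, region) combinations that occur in dataframe_1.
groups = dataframe_1[keys].drop_duplicates()

# For each of those combinations, collect the matching ids from dataframe_2.
candidates = (
    dataframe_2
    .merge(groups, on=keys, how='inner')
    .groupby(keys)['customer_id']
    .agg(list)
)
print(candidates)
```

The result is a Series indexed by (gender, age, region), so the candidates for a single group can be looked up with something like `candidates.loc[('F', '18-25', 'Asia')]`.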

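P.S. To add some context: I am trying to build group A based on the distribution of values in group B, in order to evaluate the results of an A/B test. I know how that sounds, but that is my task.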

python pandas group-by
1 answer
0
votes

Here is a possible solution to your problem:

import pandas as pd

# -- Input Dataframes ----------------------------------------------------------
dataframe_1 = pd.DataFrame(
    {
        'customer_id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'], 
        'gender': ['M', 'M', 'F', 'F', 'F', 'F'],
        'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25'],
        'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Asia'],
    }
)

dataframe_2 = pd.DataFrame(
    {
        'customer_id': ['id11', 'id12', 'id13', 'id14', 'id15', 'id16', 'id17', 'id18', 'id19', 'id20', 'id21'], 
        'gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F'],
        'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25', '25-40', '60+', '18-25', '60+', '18-25'],
        'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Europe', 'Africa', 'Australia', 'Asia', 'Europe', 'Asia'],
    }
)

# -- Customer counts by 'gender', 'age', and 'region' --------------------------
dataframe_3 = (
    dataframe_1
    .groupby(['gender', 'age', 'region'], as_index=False)['customer_id']
    .count()
    .rename(columns={"customer_id": "customer_count"})
)
# dataframe_3:
#
#   gender    age   region  customer_count
# 0      F  18-25  America               1
# 1      F  18-25     Asia               2
# 2      F    60+   Europe               1
# 3      M  18-25  America               1
# 4      M  25-40   Africa               1

# -- Selecting n customers from `dataframe_2` using 'customer_count' values ----
# Steps being applied:
# Step 1: We start by merging `dataframe_2` and `dataframe_3` using the columns
#         'gender', 'age', and 'region'. We specify `how='inner'` to join only
#         values with groups that exist in both dataframes.
# Step 2: When we apply the merge operation, the resulting dataframe will have
#         the same columns as `dataframe_2` with an added column 'customer_count'.
#         We then apply `.astype({'customer_count': 'int32'})` to make sure that the
#         values from 'customer_count' are all integers.
# Step 3: Next, we need to apply another `groupby` operation, using the same columns
#         as group keys that we used to create `dataframe_3`
# Step 4: Instead of selecting columns and applying a "normal" aggregate operation,
#         we use `.apply` to perform a custom operation on each group of rows that
#         share the same 'gender', 'age', and 'region'. In our case, we select the
#         first N rows, where N is that group's 'customer_count'.
# Step 5: Finally, we reset the indexes of the newly created dataframe.
final_dataframe = (
    dataframe_2
    # Step 1
    .merge(dataframe_3, on=['gender', 'age', 'region'], how='inner')
    # Step 2
    .astype({'customer_count': 'int32'})
    # Step 3
    .groupby(['gender', 'age', 'region'], as_index=False)
    # Step 4
    .apply(lambda grp: grp.head(grp["customer_count"].iloc[0]))
    # Step 5
    .reset_index(drop=True)
)
# final_dataframe:
#
#   customer_id gender    age   region  customer_count
# 0        id13      F  18-25  America               1
# 1        id14      F  18-25     Asia               2
# 2        id19      F  18-25     Asia               2
# 3        id15      F    60+   Europe               1
# 4        id11      M  18-25  America               1
# 5        id12      M  25-40   Africa               1

# Final validation: the sum of values from column "customer_count" of `dataframe_3` equals
#                   the number of lines that exist in `final_dataframe`
print(dataframe_3["customer_count"].sum() == final_dataframe.shape[0])
# Prints: True

Output:

customer_id  gender  age    region   customer_count
id13         F       18-25  America  1
id14         F       18-25  Asia     2
id19         F       18-25  Asia     2
id15         F       60+    Europe   1
id11         M       18-25  America  1
id12         M       25-40  Africa   1
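The code above always keeps the first rows of each group, whereas the question asks for any N ids per group. The sketch below is one possible variant, not the answerer's method: it shuffles dataframe_2 first and then keeps N rows per group using `groupby(...).cumcount()` instead of `groupby(...).apply`, which also sidesteps the changes in how recent pandas versions treat grouping columns inside `apply`. The names `keys`, `targets`, `sampled` and `shortfall`, and the seed `random_state=0`, are assumptions chosen for illustration.

```python
import pandas as pd

dataframe_1 = pd.DataFrame({
    'customer_id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
    'gender': ['M', 'M', 'F', 'F', 'F', 'F'],
    'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25'],
    'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Asia'],
})
dataframe_2 = pd.DataFrame({
    'customer_id': ['id11', 'id12', 'id13', 'id14', 'id15', 'id16',
                    'id17', 'id18', 'id19', 'id20', 'id21'],
    'gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F'],
    'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25',
            '25-40', '60+', '18-25', '60+', '18-25'],
    'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Europe',
               'Africa', 'Australia', 'Asia', 'Europe', 'Asia'],
})

keys = ['gender', 'age', 'region']

# Target number of customers per group, taken from dataframe_1.
targets = dataframe_1.groupby(keys).size().reset_index(name='customer_count')

# Shuffle dataframe_2 so the rows kept in each group are a random pick.
# random_state=0 is an arbitrary seed, used only to make the run repeatable.
shuffled = dataframe_2.sample(frac=1, random_state=0)

# Attach each row's group target; groups absent from dataframe_1 drop out.
merged = shuffled.merge(targets, on=keys, how='inner')

# Number the rows inside each group (0, 1, 2, ...) and keep the first
# `customer_count` of them.
rank = merged.groupby(keys).cumcount()
sampled = (
    merged[rank < merged['customer_count']]
    .sort_values(keys + ['customer_id'])
    .reset_index(drop=True)
)

# Groups for which dataframe_2 has fewer candidates than dataframe_1 requires.
picked = sampled.groupby(keys).size().reset_index(name='picked')
shortfall = targets.merge(picked, on=keys, how='left').fillna({'picked': 0})
shortfall = shortfall[shortfall['picked'] < shortfall['customer_count']]

print(sampled)
print(shortfall)
```

With the sample data, `shortfall` comes out empty because every group in dataframe_1 has enough candidates in dataframe_2. On real data a non-empty `shortfall` would show which groups cannot be fully matched, which is also the case in which the final validation check in the answer would print False.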