I have two Pandas DataFrames with the same structure but different shapes and values:
import pandas as pd
dataframe_1 = pd.DataFrame({'customer_id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
                            'gender': ['M', 'M', 'F', 'F', 'F', 'F'],
                            'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25'],
                            'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Asia']})
dataframe_2 = pd.DataFrame({'customer_id': ['id11', 'id12', 'id13', 'id14', 'id15', 'id16', 'id17', 'id18', 'id19', 'id20', 'id21'],
                            'gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F'],
                            'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25', '25-40', '60+', '18-25', '60+', '18-25'],
                            'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Europe', 'Africa', 'Australia', 'Asia', 'Europe', 'Asia']})
I ran a GroupBy on dataframe_1 to count the customers in each group and got the distribution as a DataFrame:
pd.DataFrame(dataframe_1.groupby(['gender', 'age', 'region'])['customer_id'].count()).reset_index()
Output:
  gender    age   region  customer_id
0      F  18-25  America            1
1      F  18-25     Asia            2
2      F    60+   Europe            1
3      M  18-25  America            1
4      M  25-40   Africa            1
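Is there a way to impose this distribution on dataframe_2 to get distinct customer_ids with matching parameters?
So for row 0 (['F', '18-25', 'America']), the only option in dataframe_2 would be id13.
For row 1 (['F', '18-25', 'Asia']), it would be any 2 unique ids out of [id14, id19, id21], and so on.
P.S. To add some context: I'm trying to build group A based on the distribution of values in group B, in order to evaluate the results of an A/B test. I know how that sounds, but that is my task.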
Here's a possible solution to your problem:
import pandas as pd
# -- Input Dataframes ----------------------------------------------------------
dataframe_1 = pd.DataFrame(
    {
        'customer_id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
        'gender': ['M', 'M', 'F', 'F', 'F', 'F'],
        'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25'],
        'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Asia'],
    }
)
dataframe_2 = pd.DataFrame(
    {
        'customer_id': ['id11', 'id12', 'id13', 'id14', 'id15', 'id16', 'id17', 'id18', 'id19', 'id20', 'id21'],
        'gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F'],
        'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25', '25-40', '60+', '18-25', '60+', '18-25'],
        'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Europe', 'Africa', 'Australia', 'Asia', 'Europe', 'Asia'],
    }
)
# -- Customer counts by 'gender', 'age', and 'region' --------------------------
dataframe_3 = (
    dataframe_1
    .groupby(['gender', 'age', 'region'], as_index=False)['customer_id']
    .count()
    .rename(columns={"customer_id": "customer_count"})
)
# dataframe_3:
#
#   gender    age   region  customer_count
# 0      F  18-25  America               1
# 1      F  18-25     Asia               2
# 2      F    60+   Europe               1
# 3      M  18-25  America               1
# 4      M  25-40   Africa               1
# -- Selecting n customers from `dataframe_2` using 'customer_count' values ----
# Steps being applied:
#   Step 1: We start by merging `dataframe_2` and `dataframe_3` on the columns
#           'gender', 'age', and 'region'. We specify `how='inner'` to keep only
#           the groups that exist in both dataframes.
#   Step 2: The merged dataframe has the same columns as `dataframe_2`, plus an
#           added column 'customer_count'. We then apply
#           `.astype({'customer_count': 'int32'})` to make sure that the
#           values in 'customer_count' are all integers.
#   Step 3: Next, we apply another `groupby` operation, using the same columns
#           as group keys that we used to create `dataframe_3`.
#   Step 4: Instead of selecting columns and applying a "normal" aggregate operation,
#           we use `.apply` to perform a custom operation on each group of rows that
#           share the same 'gender', 'age', and 'region'. In our case, we select the
#           first N rows, where N is determined by each group's 'customer_count'.
#   Step 5: Finally, we reset the index of the newly created dataframe.
final_dataframe = (
    dataframe_2
    # Step 1
    .merge(dataframe_3, on=['gender', 'age', 'region'], how='inner')
    # Step 2
    .astype({'customer_count': 'int32'})
    # Step 3
    .groupby(['gender', 'age', 'region'], as_index=False)
    # Step 4
    .apply(lambda grp: grp.head(grp["customer_count"].iloc[0]))
    # Step 5
    .reset_index(drop=True)
)
# final_dataframe:
#
#   customer_id  gender    age   region  customer_count
# 0        id13       F  18-25  America               1
# 1        id14       F  18-25     Asia               2
# 2        id19       F  18-25     Asia               2
# 3        id15       F    60+   Europe               1
# 4        id11       M  18-25  America               1
# 5        id12       M  25-40   Africa               1
# Final validation: the sum of the 'customer_count' values in `dataframe_3` equals
# the number of rows in `final_dataframe`
print(dataframe_3["customer_count"].sum() == final_dataframe.shape[0])
# Prints: True
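The total check above reports that something is missing, but not which group. Two things can make it fail: the inner merge in Step 1 silently drops groups that do not exist in `dataframe_2`, and `head(N)` returns fewer than N rows when a group is too small. A minimal sketch of a per-group check could look like this. The names `needed`, `pool`, and `report` are hypothetical, and the `pool` data is deliberately undersized for illustration; it is not the `dataframe_2` from the question.

```python
import pandas as pd

KEYS = ['gender', 'age', 'region']

# Target counts per group (same values as `dataframe_3`).
needed = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M'],
    'age': ['18-25', '18-25', '60+', '18-25', '25-40'],
    'region': ['America', 'Asia', 'Europe', 'America', 'Africa'],
    'customer_count': [1, 2, 1, 1, 1],
})

# Hypothetical candidate pool, too small on purpose:
# only one F/18-25/Asia row and no M/25-40/Africa row at all.
pool = pd.DataFrame({
    'customer_id': ['id13', 'id14', 'id15', 'id11'],
    'gender': ['F', 'F', 'F', 'M'],
    'age': ['18-25', '18-25', '60+', '18-25'],
    'region': ['America', 'Asia', 'Europe', 'America'],
})

# How many candidates the pool offers for each group.
available = pool.groupby(KEYS).size().reset_index(name='available')

# A left merge keeps every needed group, even those absent from the pool.
report = needed.merge(available, on=KEYS, how='left')
report['available'] = report['available'].fillna(0).astype(int)
report['shortfall'] = (report['customer_count'] - report['available']).clip(lower=0)

# Groups that cannot be filled completely.
short = report[report['shortfall'] > 0]
print(short)
```

With the sample data from the question every group in `dataframe_2` is large enough, so the same check would return an empty `short` and `final_dataframe` stays as shown.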
Output (`final_dataframe`):
customer_id | gender | age | region | customer_count |
---|---|---|---|---|
id13 | F | 18-25 | America | 1 |
id14 | F | 18-25 | Asia | 2 |
id19 | F | 18-25 | Asia | 2 |
id15 | F | 60+ | Europe | 1 |
id11 | M | 18-25 | America | 1 |
id12 | M | 25-40 | Africa | 1 |
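The question asks for any N ids per group, whereas `head` always takes the first N in row order. If a random pick is preferred, one possible variant shuffles `dataframe_2` first and then keeps the first N shuffled rows of each group using `cumcount`. This is only a sketch under assumptions: the helper name `sample_matching` and its `seed` parameter are hypothetical, and the fixed seed is there just to make the pick reproducible. As far as I know, it also sidesteps `groupby(...).apply`, which newer pandas releases warn about when the function touches the grouping columns.

```python
import pandas as pd

KEYS = ['gender', 'age', 'region']

dataframe_1 = pd.DataFrame({
    'customer_id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
    'gender': ['M', 'M', 'F', 'F', 'F', 'F'],
    'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25'],
    'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Asia'],
})
dataframe_2 = pd.DataFrame({
    'customer_id': ['id11', 'id12', 'id13', 'id14', 'id15', 'id16',
                    'id17', 'id18', 'id19', 'id20', 'id21'],
    'gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F'],
    'age': ['18-25', '25-40', '18-25', '18-25', '60+', '18-25',
            '25-40', '60+', '18-25', '60+', '18-25'],
    'region': ['America', 'Africa', 'America', 'Asia', 'Europe', 'Europe',
               'Africa', 'Australia', 'Asia', 'Europe', 'Asia'],
})

# Target number of customers per group, taken from dataframe_1.
target_counts = (
    dataframe_1.groupby(KEYS).size().reset_index(name='customer_count')
)


def sample_matching(source, counts, keys=KEYS, seed=0):
    """Randomly pick `customer_count` rows per group (hypothetical helper)."""
    # Put the candidate rows in random order.
    shuffled = source.sample(frac=1, random_state=seed)
    # Keep only the groups that appear in the target distribution.
    merged = shuffled.merge(counts, on=keys, how='inner')
    # Position of each row inside its group: 0, 1, 2, ...
    position = merged.groupby(keys).cumcount()
    # Keep the first N shuffled rows of each group.
    picked = merged[position < merged['customer_count']]
    return picked.sort_values(keys + ['customer_id']).reset_index(drop=True)


picked = sample_matching(dataframe_2, target_counts)
print(picked)
```

Because every pick comes from a single shuffled copy of `dataframe_2`, no id can be selected twice. A group for which `dataframe_2` has fewer rows than requested simply returns what is available, so a per-group count comparison is still worth running afterwards.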