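I want to randomly assign the individuals in an existing dataset to 3 different groups according to fixed daily percentages. Here is a sample of the dataset: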
Date        Customer_ID
1/3/2019    411
1/3/2019    414
1/3/2019    421
5/3/2019    431
5/3/2019    433
5/3/2019    441
6/3/2019    442
6/3/2019    443
6/3/2019    444
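I use the Python code below to create the groups. The overall traffic percentages come out right, but within each day the groups are not split according to the required percentages.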
Group %
A 10%
B 45%
C 45%
             Expected outcome               Actual outcome
Date         Group A  Group B  Group C      Group A  Group B  Group C
1/3/2019     10%      45%      45%          7%       2%       91%
1/4/2019     10%      45%      45%          12%      25%      63%
1/5/2019     10%      45%      45%          15%      50%      35%
1/6/2019     10%      45%      45%          20%      61%      19%
1/7/2019     10%      45%      45%          2%       7%       91%
1/8/2019     10%      45%      45%          1%       12%      87%
1/9/2019     10%      45%      45%          9%       21%      70%
1/10/2019    10%      45%      45%          13%      25%      62%
Overall      10%      45%      45%          10%      45%      45%
Current code:
import numpy as np

# Create 3 different groups that have traffic assigned 10%/45%/45%
# (df is the existing DataFrame with Date and Customer_ID columns)
df['Groups'] = df.groupby('Date')['Customer_ID']\
    .transform(lambda x: np.random.choice(['Group_A', 'Group_B', 'Group_C'],
                                          len(x), p=[0.1, 0.45, 0.45]))
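The code gives the required split only over the whole dataset, not for each day (as the actual outcome table shows).
What Python code can I use to create the three groups with the required distribution on every day?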
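Well, it seems I misunderstood the question at first (long day). As far as I can tell, your code works as intended (apologies, I generate plain numbers instead of dates):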
import numpy as np
import pandas as pd

rows = 10000
# Integers 0-9 stand in for dates.
dates = np.random.choice(range(10), size=rows)
# Unique customer IDs, drawn without replacement.
customer_ids = np.random.choice(range(2 * rows), size=rows, replace=False)
df = pd.DataFrame({"Date": dates, "Customer_ID": customer_ids})

# Draw each customer's group independently, one date at a time.
df['Groups'] = df.groupby('Date')['Customer_ID']\
    .transform(lambda x: np.random.choice(['Group_A', 'Group_B', 'Group_C'],
                                          len(x), p=[0.1, 0.45, 0.45]))

# Percentage of each group within each date.
print(df.groupby('Date')['Groups'].value_counts(normalize=True).mul(100))
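That said, every customer is drawn independently, so there is some randomness: an exact 10/45/45 split on any given day is unlikely, especially on days with few customers.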
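To get a feel for how much a daily split can drift from 10/45/45 by chance alone, here is a rough sketch based on the binomial standard deviation sqrt(p(1-p)/n) of a share. The helper name share_sd and the daily customer counts are my own assumptions for illustration, not something taken from your data.

```python
import numpy as np

TARGET = np.array([0.10, 0.45, 0.45])  # target shares for groups A, B, C

def share_sd(n, p=TARGET):
    """Standard deviation, in percentage points, of each group's daily share
    when n customers are assigned independently with probabilities p."""
    return 100 * np.sqrt(p * (1 - p) / n)

for n in (3, 100, 10000):  # hypothetical numbers of customers per day
    print(n, np.round(share_sd(n), 1))
```

With 3 customers per day, as in the sample data, the typical deviation is about 17 points for group A and about 29 points for groups B and C. With 10000 customers per day it drops to roughly 0.3 and 0.5 points.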
I suggest checking the distribution for one specific date by hand and comparing it with your "actual" table:
from collections import Counter

test_date = 1  # change this to '1/3/2019' for example
cntr = Counter(df[df["Date"] == test_date]["Groups"])
cntr_sum = sum(cntr.values())
print({k: np.round(100 * v / cntr_sum, 2)
       for k, v in cntr.items()})
P.S. You should get something similar to this:
{'Group_B': 43.35, 'Group_C': 46.23, 'Group_A': 10.42}
Hopefully I got it right this time!
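If the daily shares have to match 10/45/45 as closely as each day's head count allows, rather than only on average, one possible approach is to fix the number of labels per group for each day and shuffle them. This is only a sketch under my own assumptions: assign_exact is a hypothetical helper, the data is made up (200 customers on each of 3 days), and the DataFrame index is assumed to be unique.

```python
import numpy as np
import pandas as pd

GROUPS = np.array(["Group_A", "Group_B", "Group_C"])
SHARES = np.array([0.10, 0.45, 0.45])

def assign_exact(n, rng):
    """Return n shuffled labels whose counts are as close to SHARES as n allows."""
    counts = np.floor(SHARES * n).astype(int)
    # Customers left over after rounding down go to the groups
    # with the largest fractional remainders.
    remainder = SHARES * n - counts
    leftover = n - counts.sum()
    counts[np.argsort(-remainder)[:leftover]] += 1
    labels = np.repeat(GROUPS, counts)
    rng.shuffle(labels)  # random order, fixed counts
    return labels

rng = np.random.default_rng(0)
# Hypothetical data: 3 days with 200 customers each.
df = pd.DataFrame({"Date": np.repeat(["1/3/2019", "5/3/2019", "6/3/2019"], 200),
                   "Customer_ID": np.arange(600)})
df["Groups"] = ""
# Assign labels day by day (relies on a unique index).
for _, idx in df.groupby("Date").groups.items():
    df.loc[idx, "Groups"] = assign_exact(len(idx), rng)

# Percentage of each group within each date.
print(pd.crosstab(df["Date"], df["Groups"], normalize="index").mul(100))
```

Which customers land in which group is still random, but the counts per day are fixed. On a day with only 3 customers no assignment can reach 10/45/45, so the sketch simply picks the nearest whole-number counts.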