Unable to randomly assign the individuals in a dataset to 3 groups per day at the required percentages: 10% / 45% / 45%

Question · Votes: 2 · Answers: 1

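I want to randomly assign the individuals in an existing dataset to 3 different groups according to fixed daily percentages. Here is a sample dataset: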

 Date               Customer_ID
 1. 1/3/2019         411
 2. 1/3/2019         414
 3. 1/3/2019         421
 4. 5/3/2019         431
 5. 5/3/2019         433
 6. 5/3/2019         441
 7. 6/3/2019         442
 8. 6/3/2019         443
 9. 6/3/2019         444

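I use the Python code below to create the groups. Although the overall traffic percentages are correct, the groups are not assigned at the required percentages on each individual day.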

Group   %
 A    10%
 B    45%
 C    45%

              Expected outcome               Actual outcome
 Date      Group A  Group B Group C     Group A Group B Group C
  1/3/2019  10%      45%    45%           7%    2%       91%
  1/4/2019  10%      45%    45%           12%   25%      63%
  1/5/2019  10%      45%    45%           15%   50%      35%
  1/6/2019  10%      45%    45%           20%   61%      19%
  1/7/2019  10%      45%    45%           2%    7%       91%
  1/8/2019  10%      45%    45%           1%    12%      87%
  1/9/2019  10%      45%    45%           9%    21%      70%
  1/10/2019 10%      45%    45%           13%   25%      62%
  Overall   10%      45%    45%           10%   45%      45%

Current code:

# Create 3 different groups that have traffic assigned 10%/45%/45%
df['Groups'] = df.groupby('Date')['Customer_ID']\
.transform(lambda x: np.random.choice(['Group_A', 'Group_B', 'Group_C'],
                                      len(x),  p= [0.1,0.45,0.45]))

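The code produces the required split only over the whole dataset, not per day (as the actual-outcome table shows).

What Python code can I use to create the three groups so that each day follows the required distribution?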

python pandas random sample sampling
1 answer
1 vote

OK, it seems I misunderstood the question at first (long day). IMHO your code works as expected (sorry, I generated plain numbers instead of dates):

import pandas as pd
import numpy as np
rows = 10000
dates = np.random.choice(range(10), size = rows)
Customer_IDs = np.random.choice(range(2*rows), size = rows, replace = False)
data = np.vstack([dates, Customer_IDs]).T

df = pd.DataFrame(data, columns = ["Date", "Customer_ID"])

df['Groups'] = df.groupby('Date')['Customer_ID']\
    .transform(lambda x: np.random.choice(['Group_A', 'Group_B', 'Group_C'],
                                      len(x),  p= [0.1,0.45,0.45]))

# Percentage share of each group within each date
print(df.groupby('Date')['Groups'].value_counts(normalize=True).mul(100).round(2))

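Now, there will be some randomness, so hitting exactly 10/45/45 is unlikely: each customer is drawn independently, so each day's shares only scatter around the targets, and the fewer customers a day has, the wider the scatter.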
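To get a feel for how much scatter to expect, here is a rough sketch based on the binomial standard deviation of a share, sqrt(p(1-p)/n). The helper name `share_std_pct` and the daily sizes 3, 100 and 1000 are just my illustration, not something from your data:

```python
import math

def share_std_pct(p, n):
    """Standard deviation, in percentage points, of a group's daily share
    when each of n customers is assigned independently with probability p."""
    return 100 * math.sqrt(p * (1 - p) / n)

# Hypothetical daily customer counts, chosen only for illustration
for n in (3, 100, 1000):
    print(n, round(share_std_pct(0.10, n), 2), round(share_std_pct(0.45, n), 2))
```

If your days only have a handful of customers each, deviations of tens of percentage points would not be surprising.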

I suggest manually checking the distribution for each specific date and comparing it with your "Actual" table:

from collections import Counter
test_date = 1 # change this to '1/3/2019' for example
cntr = Counter(df[df["Date"]==test_date]["Groups"])
cntr_sum = sum(cntr.values())
print( {k: np.round(100 * v/cntr_sum, 2)
            for k,v in cntr.items()} )

PS. Hopefully you will get something like this:

{'Group_B': 43.35, 'Group_C': 46.23, 'Group_A': 10.42}
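If you really need each day to match 10/45/45 as closely as the day's size allows, one option (a different technique from the independent draws above, and only a sketch) is to fix the per-day counts first and then shuffle the labels. The names `assign_exact`, `LABELS` and `SHARES` and the toy data are my own assumptions:

```python
import numpy as np
import pandas as pd

LABELS = np.array(['Group_A', 'Group_B', 'Group_C'])
SHARES = np.array([0.10, 0.45, 0.45])

def assign_exact(n, rng):
    """Return n shuffled labels whose counts match SHARES as closely as possible."""
    raw = SHARES * n
    # Small epsilon guards against float error such as 89.999999 instead of 90
    counts = np.floor(raw + 1e-9).astype(int)
    # Hand the leftover slots to the groups with the largest fractional parts
    leftover = n - counts.sum()
    order = np.argsort(-(raw - counts), kind='stable')
    counts[order[:leftover]] += 1
    return rng.permutation(np.repeat(LABELS, counts))

rng = np.random.default_rng(0)  # fixed seed only so the run is reproducible
# Toy data (hypothetical): 3 integer "dates" with 200 customers each
df = pd.DataFrame({'Date': np.repeat([0, 1, 2], 200),
                   'Customer_ID': np.arange(600)})
df['Groups'] = df.groupby('Date')['Customer_ID'].transform(
    lambda x: assign_exact(len(x), rng))

# Percentage share of each group within each date
print(df.groupby('Date')['Groups'].value_counts(normalize=True).mul(100))
```

Note that with very small days (3 customers, as in your sample) 10% cannot be realized at all; the counts are simply rounded to the nearest feasible split.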

Hope I got it right this time!
