pandas conditional selection when counts are equal

Question (votes: 0, answers: 1)

The following dataset represents purchase behavior:

user_id, product_code, bought_date, time_spent, store_id, product_type, refurbished, unqiue_visit_id
001, e.12, 20120102, 104, 101, computer, yes, 1010
002, e.24, 20120201, 100, 101, infant-dress, no, 2001
003, s.32, 20130302, 230, 101, shoes, no, 2121
004, y.23, 20130404, 212, 103, computer, yes, 2422
005, s.43, 20130803, 104, 101, laptop, yes, 2342
001, a.12, 20120102, 104, 101, computer, yes, 1011
002, b.24, 20120201, 100, 101, infant-dress, no, 2001
003, c.32, 20130302, 230, 101, shoes, no, 2122
004, e.23, 20130404, 212, 103, computer, yes, 2424
005, f.43, 20130803, 104, 101, laptop, yes, 2340
001, g.12, 20120102, 104, 101, computer, yes, 1013
002, h.24, 20120201, 100, 101, infant-dress, no, 2031
003, l.32, 20130302, 230, 101, shoes, no, 2000
004, m.23, 20130404, 212, 103, computer, yes, 1422
005, d.43, 20130803, 104, 101, laptop, yes, 1142
001, d.12, 20120102, 104, 101, desk, yes, 1110
002, f.24, 20120201, 100, 101, glass, no, 1111
003, n.32, 20130302, 230, 101, liquid, no, 2021
004, t.23, 20130404, 212, 103, liquid, yes, 22
005, u.43, 20130803, 104, 101, dress, yes, 2942
001, d.12, 20120102, 104, 101, desk, yes, 1910
002, f.24, 20120201, 100, 101, glass, no, 2901
003, n.32, 20130302, 230, 101, liquid, no, 2921
004, t.23, 20130404, 212, 103, liquid, yes, 2922
005, u.43, 20130803, 104, 101, dress, yes, 2942
001, kk.12, 20120103, 105, 101, desk, yes, 410
003, n.32, 20130303, 230, 101, liquid, no, 2621

The end goal is to assign one product type to each user, using the following steps.

First, I group by user_id and product_type and get the number of times (count) each user visited each product_type.

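When a user's counts are tied across product types, pick the product type the user visited most recently and assign it to that user. If the visit dates are also equal, break the tie on refurbished (yes > no).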

visit_counts = merged_visits_df.groupby(['user_id','product_type'], as_index=False).agg({'unqiue_visit_id': 'nunique'})

The above gives the visit counts; I am trying to figure out the rest of the process.
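For reference, here is a minimal sketch of the counting step on a subset of the rows above (users 003 and 005 only, and only the columns the count needs). The name merged_visits_df and the misspelled column unqiue_visit_id come from the post; trimming the data this way is an assumption made to keep the example short.

```python
import pandas as pd

# Subset of the posted rows: users 003 and 005, columns needed for counting only.
merged_visits_df = pd.DataFrame(
    [
        ("003", "shoes", 2121),
        ("005", "laptop", 2342),
        ("003", "shoes", 2122),
        ("005", "laptop", 2340),
        ("003", "shoes", 2000),
        ("005", "laptop", 1142),
        ("003", "liquid", 2021),
        ("005", "dress", 2942),
        ("003", "liquid", 2921),
        ("005", "dress", 2942),  # repeated visit id, counted once by nunique
        ("003", "liquid", 2621),
    ],
    columns=["user_id", "product_type", "unqiue_visit_id"],
)

# Number of distinct visit ids per (user_id, product_type) pair.
visit_counts = (
    merged_visits_df
    .groupby(["user_id", "product_type"], as_index=False)
    .agg({"unqiue_visit_id": "nunique"})
)
print(visit_counts)
```

On this subset, user 003 has 3 distinct visits for both liquid and shoes, which is exactly the tie that needs breaking, while user 005 has a clear winner (laptop, 3 against 1 for dress).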

python pandas
1 Answer
1 vote

I think the following is what you are asking for (the column name is misspelled in the data you posted; I have kept it that way, i.e. 'unqiue_visit_id').

counts = (
    # sort by bought date
    merged_visits_df.sort_values('bought_date', ascending=False)
    # groupby desired cols
    .groupby(['user_id','product_type'],as_index=False)
    # apply desired aggregation functions
    .agg({'unqiue_visit_id': 'nunique', 'bought_date': 'first', 'refurbished': 'first'})
)

Then we can get the maximum visit count per user_id:

max_by_user = counts.groupby('user_id')['unqiue_visit_id'].max()

Finally, we filter to the rows whose visit count equals the user's maximum, sort by the desired columns, and take the first row per user.

result = (
    # filter to products with max visits by user
    counts[counts['user_id'].apply(max_by_user.get) == counts['unqiue_visit_id']]
    # sort bought_date descending (max on top), refurbished descending (yes above no)
    .sort_values(['bought_date', 'refurbished'], ascending=False)
    # groupby user id and select the first
    .groupby('user_id').nth(0)
)
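To check the steps above end to end, here is a minimal runnable sketch on a subset of the posted rows (users 001 and 004, needed columns only). Two substitutions are my own and not part of the answer as written: groupby(...).transform("max") replaces the apply(max_by_user.get) lookup, and drop_duplicates("user_id") replaces groupby('user_id').nth(0), because the shape of the nth result differs between pandas versions.

```python
import pandas as pd

# Subset of the posted rows: users 001 and 004, needed columns only.
merged_visits_df = pd.DataFrame(
    [
        ("001", 20120102, "computer", "yes", 1010),
        ("001", 20120102, "computer", "yes", 1011),
        ("001", 20120102, "computer", "yes", 1013),
        ("001", 20120102, "desk", "yes", 1110),
        ("001", 20120102, "desk", "yes", 1910),
        ("001", 20120103, "desk", "yes", 410),
        ("004", 20130404, "computer", "yes", 2422),
        ("004", 20130404, "computer", "yes", 2424),
        ("004", 20130404, "computer", "yes", 1422),
        ("004", 20130404, "liquid", "yes", 22),
        ("004", 20130404, "liquid", "yes", 2922),
    ],
    columns=["user_id", "bought_date", "product_type", "refurbished", "unqiue_visit_id"],
)

# Distinct visits plus the latest date (and its refurbished flag) per pair.
counts = (
    merged_visits_df.sort_values("bought_date", ascending=False)
    .groupby(["user_id", "product_type"], as_index=False)
    .agg({"unqiue_visit_id": "nunique", "bought_date": "first", "refurbished": "first"})
)

# Each row's per-user maximum count, aligned to the index of counts.
user_max = counts.groupby("user_id")["unqiue_visit_id"].transform("max")

result = (
    counts[counts["unqiue_visit_id"] == user_max]
    # latest date first, then "yes" before "no" (string order)
    .sort_values(["bought_date", "refurbished"], ascending=False)
    .drop_duplicates("user_id")  # keep the top row per user
    .set_index("user_id")["product_type"]
    .sort_index()
)
print(result.to_dict())  # {'001': 'desk', '004': 'computer'}
```

User 001 is tied at 3 visits for computer and desk, and desk wins because its latest visit (20120103) is more recent. User 004 has no tie, so computer wins on count alone.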

It may be slightly more intuitive to think about it this way:

Step 1: add the columns to sort by:

# initial question
visits_df = (
    merged_visits_df.groupby(['user_id','product_type'])
    .agg({'unqiue_visit_id': 'nunique'})
    .add_suffix('_count')
)
df_to_sort = merged_visits_df.merge(visits_df.reset_index())
# follow up question
df_to_sort['last_num'] = df_to_sort['store_id'] % 10

Then sort, group by user, and take the first row:

(
    df_to_sort
    # column names must be passed as strings
    .sort_values(['unqiue_visit_id_count', 'bought_date', 'last_num'], ascending=[False, False, True])
    .groupby(['user_id']).nth(0)
)
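A minimal sketch of this row-level variant, run on a subset of the posted rows (users 003 and 004, needed columns only). As assumptions of mine rather than the answer's code, it attaches the count with transform("nunique") instead of the merge, and takes the first row per user with drop_duplicates("user_id") instead of nth(0). The last_num column is kept as in the answer.

```python
import pandas as pd

# Subset of the posted rows: users 003 and 004, needed columns only.
merged_visits_df = pd.DataFrame(
    [
        ("003", 20130302, 101, "shoes", 2121),
        ("003", 20130302, 101, "shoes", 2122),
        ("003", 20130302, 101, "shoes", 2000),
        ("003", 20130302, 101, "liquid", 2021),
        ("003", 20130302, 101, "liquid", 2921),
        ("003", 20130303, 101, "liquid", 2621),
        ("004", 20130404, 103, "computer", 2422),
        ("004", 20130404, 103, "computer", 2424),
        ("004", 20130404, 103, "computer", 1422),
        ("004", 20130404, 103, "liquid", 22),
        ("004", 20130404, 103, "liquid", 2922),
    ],
    columns=["user_id", "bought_date", "store_id", "product_type", "unqiue_visit_id"],
)

df_to_sort = merged_visits_df.copy()
# Attach each row's (user_id, product_type) distinct-visit count.
df_to_sort["unqiue_visit_id_count"] = (
    df_to_sort.groupby(["user_id", "product_type"])["unqiue_visit_id"]
    .transform("nunique")
)
df_to_sort["last_num"] = df_to_sort["store_id"] % 10

picked = (
    df_to_sort
    .sort_values(
        ["unqiue_visit_id_count", "bought_date", "last_num"],
        ascending=[False, False, True],
    )
    .drop_duplicates("user_id")  # first row per user after sorting
    .set_index("user_id")["product_type"]
    .sort_index()
)
print(picked.to_dict())  # {'003': 'liquid', '004': 'computer'}
```

Because every row carries its group's count, a single sort is enough: the highest count comes first, then the most recent date, then the remaining tie-breaker.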