Finding duplicates in a pandas column of nested lists from previous rows with multiple conditions


I'm a bit stuck on how to code this.

I have a dataset like this:

rules   user_list       event_time          row_number
rule1   123,244,344     2024-09-20          1
rule1   125,346,421     2024-09-19          2
rule1   125,343,431     2024-09-18          3
rule2   125,344,423     2024-09-20          1
rule2   125,346,421     2024-09-19          2
rule3   125,348,331     2024-09-20          1
rule3   125,336,221     2024-09-19          2
Reproducible df:

import pandas as pd

data = {
    'rules': ['rule1', 'rule1', 'rule1', 'rule2', 'rule2', 'rule3', 'rule3'],
    'user_list': ['123,244,344', '125,346,421', '125,343,431', '125,344,423', '125,346,421', '125,348,331', '125,336,221'],
    'event_time': ['2024-09-20', '2024-09-19', '2024-09-18', '2024-09-20', '2024-09-19', '2024-09-20', '2024-09-19'],
    'row_number': [1, 2, 3, 1, 2, 1, 2]
}
data = pd.DataFrame(data)
data['event_time'] = pd.to_datetime(data['event_time'])

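I am trying to build another column that counts/finds the user_ids in the most recent rule row (where row_number = 1) that also appear in other rows within the past 1 day under a different rule (so it counts duplicate users that got fired on a different rule within the past day).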

The final table should look like this:

rules   user_list       event_time          row_number      dupe_users
rule1   123,244,344     2024-09-20          1               344
rule1   125,346,421     2024-09-19          2               125,125,346,421
rule1   125,343,431     2024-09-18          3               125
rule2   125,344,423     2024-09-20          1               125,344
rule2   125,346,421     2024-09-19          2               125,125,346,421
rule3   125,348,331     2024-09-20          1               125,125
rule3   125,336,221     2024-09-19          2               125,125

For example: user 344 appeared on rule1 on 2024-09-20 and also on rule2 on 2024-09-20.

python pandas
1 Answer

I'm not sure I understood the full logic, but as far as I can tell, you could use groupby.transform with a custom function, with the help of collections.Counter:

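from collections import Counter

def f(s):
    # one Counter of user ids per row in the group
    cnts = [Counter(x.split(',')) for x in s]
    # total occurrences of each user id across the whole group
    ref = sum(cnts, start=Counter())
    out = []
    for x in cnts:
        # occurrences in the other rows (Counter subtraction keeps only positive counts)
        diff = ref-x
        # repeat each of this row's users once per occurrence elsewhere
        out.append(','.join(y for val in x for y in [val]*(diff[val])))
    return out

# rows that share the same event_time are compared with each other
data['dupe_users'] = data.groupby('event_time')['user_list'].transform(f)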
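
To make the Counter arithmetic inside f easier to follow, here is a minimal standalone sketch of what happens for one event_time group. It uses the three 2024-09-20 user lists from the question; the variable names rows, others and dupes are only illustrative and do not appear in the answer's code.

```python
from collections import Counter

# The three user lists that share event_time 2024-09-20 in the question.
rows = ['123,244,344', '125,344,423', '125,348,331']

# One Counter per row, then the total occurrences across the group.
cnts = [Counter(r.split(',')) for r in rows]
ref = sum(cnts, start=Counter())

# Subtracting a row's own counts leaves the occurrences in the other rows.
# Counter subtraction drops keys whose count falls to zero or below.
others = ref - cnts[0]

# A missing key reads as 0, so users unique to this row contribute nothing.
dupes = [u for u in cnts[0] for _ in range(others[u])]
print(dupes)
```

For the first row only user 344 appears in another row, which is why that row ends up with a single entry.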

Output:

   rules    user_list event_time  row_number       dupe_users
0  rule1  123,244,344 2024-09-20           1              344
1  rule1  125,346,421 2024-09-19           2  125,125,346,421
2  rule1  125,343,431 2024-09-18           3                 
3  rule2  125,344,423 2024-09-20           1          125,344
4  rule2  125,346,421 2024-09-19           2  125,125,346,421
5  rule3  125,348,331 2024-09-20           1              125
6  rule3  125,336,221 2024-09-19           2          125,125
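
Note that the approach above compares rows sharing the same event_time and never looks at the rules column, and its output differs from the expected table in the question for rows 2 and 5. If a rule could appear more than once on the same date, one possible variant is to count only occurrences in rows with a different rule. The sketch below is an assumption about the intended logic, not something confirmed by the question: the helper name dupes_other_rules is hypothetical, and the one-day look-back window is still not handled.

```python
from collections import Counter

import pandas as pd


def dupes_other_rules(group: pd.DataFrame) -> pd.Series:
    """For each row, repeat its users once per occurrence under a different rule."""
    counts = [Counter(users.split(',')) for users in group['user_list']]
    rules = group['rules'].tolist()
    out = []
    for i, own in enumerate(counts):
        # Add up the counters of rows whose rule differs from this row's rule.
        others = sum(
            (c for j, c in enumerate(counts) if rules[j] != rules[i]),
            start=Counter(),
        )
        out.append(','.join(u for u in own for _ in range(others[u])))
    # Keep the original index so the result aligns on assignment.
    return pd.Series(out, index=group.index)


data = pd.DataFrame({
    'rules': ['rule1', 'rule1', 'rule1', 'rule2', 'rule2', 'rule3', 'rule3'],
    'user_list': ['123,244,344', '125,346,421', '125,343,431', '125,344,423',
                  '125,346,421', '125,348,331', '125,336,221'],
    'event_time': ['2024-09-20', '2024-09-19', '2024-09-18', '2024-09-20',
                   '2024-09-19', '2024-09-20', '2024-09-19'],
})

# Process each date separately and stitch the pieces back together by index.
parts = [dupes_other_rules(g) for _, g in data.groupby('event_time')]
data['dupe_users'] = pd.concat(parts)
print(data)
```

On the sample data every rule occurs at most once per date, so this gives the same dupe_users values as the output above; the two approaches only diverge when a rule repeats within a date.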