Aggregate values in a dataframe based on a condition, if not null

Question

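I'm having trouble building a custom aggregation. The catch is that my join keys are different on every row. Can anyone help?

I have been stuck on this problem for a while. I have a huge dataframe of transactions whose format is close to this: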

flat_data = {
    'year': [2022, 2022, 2022, 2023, 2023, 2023, 2023, 2023, 2023],
    'month': [1, 1, 2, 1, 2, 2, 3, 3, 3],
    'operator': ['A', 'A', 'B', 'A', 'B', 'B', 'C', 'C', 'C'],
    'value': [10, 15, 20, 8, 12, 15, 30, 40, 50],
    'attribute1': ['x', 'x', 'y', 'x', 'y', 'z', 'x', 'z', 'x'],
    'attribute2': ['apple', 'apple', 'banana', 'apple', 'banana', 'banana', 'apple', 'banana', 'banana'],
    'attribute3': ['dog', 'cat', 'dog', 'cat', 'rabbit', 'tutle', 'cat', 'dog', 'dog'],
}

I have more than 80 attributes.

On the other hand, I have a totals dataframe that looks like this:

totals= {
    'year': [2022, 2022, 2023, 2023, 2023],
    'month': [1, 2, 1, 2, 3],
    'operator': ['A', 'B', 'A', 'B', 'C'],
    'id': ['id1', 'id2', 'id1', 'id2', 'id3'], 
    'attribute1': [None, 'y', 'x', 'z', 'x'],
    'attribute2': ['apple', None, 'apple', 'banana', 'banana'],
}

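The totals dataframe only has attributes that I can also find in flat_data, plus an extra id. What I want is a result dataframe containing year, month, operator, id, and the value. To get it, I need to sum the values of all flat rows that match the filter's attributes, but matching only on the attributes that are not null.

My expected output looks like this: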

result= {
    'year': [2022, 2022, 2023, 2023, 2023],
    'month': [1, 2, 1, 2, 3],
    'operator': ['A', 'B', 'A', 'B', 'C'],
    'id': ['id1', 'id2', 'id1', 'id2', 'id3'],
    'sum': [25, 20, 8, 15, 50],
}

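where sum is the sum of all values of the flat rows that match the id's non-null attributes.

For example, id1 will match every row of 01/2022 with the same operator (operator A) and attribute2 = apple, regardless of attribute1 (rows 1 and 2), so the total for id1 for operator A in 01/2022 will be 25.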
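To make the rule concrete, here is a minimal sketch that checks the id1 example by hand. It uses only the first three rows of flat_data; the names filter_row, mask and id1_total are made up for the illustration.

```python
import pandas as pd

# First three rows of flat_data, enough to cover 01/2022 for operator A.
flat_data = pd.DataFrame({
    'year': [2022, 2022, 2022],
    'month': [1, 1, 2],
    'operator': ['A', 'A', 'B'],
    'value': [10, 15, 20],
    'attribute1': ['x', 'x', 'y'],
    'attribute2': ['apple', 'apple', 'banana'],
})

# The id1 filter for 01/2022: attribute1 is null, so it acts as a wildcard.
filter_row = {'year': 2022, 'month': 1, 'operator': 'A',
              'attribute1': None, 'attribute2': 'apple'}

# Start with every row selected, then narrow on each non-null field only.
mask = pd.Series(True, index=flat_data.index)
for column, wanted in filter_row.items():
    if wanted is not None:
        mask &= flat_data[column] == wanted

id1_total = int(flat_data.loc[mask, 'value'].sum())
print(id1_total)  # 25
```

Rows 1 and 2 both pass the mask, which gives the 25 quoted above.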

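I tried looping over the rows, but that is error-prone and memory-hungry. I wanted to try pyspark but couldn't figure out how to distribute the task. I have managed to do it row by row, meaning a join on the attributes followed by a groupby + sum. Where I'm stuck is that, because of the null constraint (a null in the filter matches everything), each row effectively has its own set of join keys, so I can't generalize the approach.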
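One possible way to generalize the join + groupby approach is to notice that filter rows sharing the same set of non-null attributes also share the same join keys. The sketch below groups the totals rows by that null pattern and runs one join per pattern instead of one per row. It assumes the number of distinct patterns is much smaller than the number of rows, which may not hold with 80+ attributes. The names keys, attrs, groups and pieces are illustrative.

```python
import pandas as pd

flat_data = pd.DataFrame({
    'year': [2022, 2022, 2022, 2023, 2023, 2023, 2023, 2023, 2023],
    'month': [1, 1, 2, 1, 2, 2, 3, 3, 3],
    'operator': ['A', 'A', 'B', 'A', 'B', 'B', 'C', 'C', 'C'],
    'value': [10, 15, 20, 8, 12, 15, 30, 40, 50],
    'attribute1': ['x', 'x', 'y', 'x', 'y', 'z', 'x', 'z', 'x'],
    'attribute2': ['apple', 'apple', 'banana', 'apple', 'banana',
                   'banana', 'apple', 'banana', 'banana'],
})
totals = pd.DataFrame({
    'year': [2022, 2022, 2023, 2023, 2023],
    'month': [1, 2, 1, 2, 3],
    'operator': ['A', 'B', 'A', 'B', 'C'],
    'id': ['id1', 'id2', 'id1', 'id2', 'id3'],
    'attribute1': [None, 'y', 'x', 'z', 'x'],
    'attribute2': ['apple', None, 'apple', 'banana', 'banana'],
})

keys = ['year', 'month', 'operator']
attrs = [c for c in totals.columns if c not in keys + ['id']]

# Bucket the totals rows by which attributes they actually constrain.
groups = {}
for idx, row in zip(totals.index, totals[attrs].notna().to_numpy()):
    active = tuple(a for a, keep in zip(attrs, row) if keep)
    groups.setdefault(active, []).append(idx)

pieces = []
for active, idx in groups.items():
    join_cols = keys + list(active)
    subset = totals.loc[idx, join_cols + ['id']]
    # Pre-aggregate flat_data on exactly the columns this pattern constrains.
    sums = flat_data.groupby(join_cols, as_index=False)['value'].sum()
    merged = subset.merge(sums, on=join_cols, how='left')
    pieces.append(merged[keys + ['id', 'value']])

result = (pd.concat(pieces, ignore_index=True)
            .rename(columns={'value': 'sum'})
            .sort_values(keys, ignore_index=True))
print(result)
```

Because the merge is a left join, a filter row that matches nothing is kept with a missing sum rather than dropped.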

Answer 1

import pandas as pd
flat_data = pd.DataFrame.from_dict({
    'year': [2022, 2022, 2022, 2023, 2023, 2023, 2023, 2023, 2023],
    'month': [1, 1, 2, 1, 2, 2, 3, 3, 3],
    'operator': ['A', 'A', 'B', 'A', 'B', 'B', 'C', 'C', 'C'],
    'value': [10, 15, 20, 8, 12, 15, 30, 40, 50],
    'attribute1': ['x', 'x', 'y', 'x', 'y', 'z', 'x', 'z', 'x'],
    'attribute2': ['apple', 'apple', 'banana', 'apple', 'banana', 'banana', 'apple', 'banana', 'banana'],
    'attribute3': ['dog', 'cat', 'dog', 'cat', 'rabbit', 'tutle', 'cat', 'dog', 'dog'],
})
print('The dataset below is your flat_data:')
print(flat_data)
totals= pd.DataFrame.from_dict({
    'year': [2022, 2022, 2023, 2023, 2023],
    'month': [1, 2, 1, 2, 3],
    'operator': ['A', 'B', 'A', 'B', 'C'],
    'id': ['id1', 'id2', 'id1', 'id2', 'id3'], 
    'attribute1': [None, 'y', 'x', 'z', 'x'],
    'attribute2': ['apple', None, 'apple', 'banana', 'banana'],
})
print('The dataset below is your totals:')
print(totals)
# Sum (not average) the value column per year, month, and operator.
flat_data_aggregated = flat_data.groupby(['year','month','operator']).agg({'value':'sum'}).rename(columns={'value':'sum'}).reset_index()
print('The dataset below sums the value column for each year, month, and operator, using your flat_data:')
print(flat_data_aggregated)
results = pd.merge(flat_data_aggregated, totals[['year','month','operator','id']],  how='left', on=['year','month','operator'])
print('This is the results dataset after merging the two datasets:')
print(results)
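Note that the code above aggregates over year, month, and operator only, so the attribute filters in totals are never applied. A minimal sketch of one way to add them, treating a null filter value as a wildcard, is shown here. This is not the answerer's method. It assumes the merged frame fits in memory, and the `_filter` suffix and the names pairs and keep are chosen for the illustration.

```python
import pandas as pd

flat_data = pd.DataFrame({
    'year': [2022, 2022, 2022, 2023, 2023, 2023, 2023, 2023, 2023],
    'month': [1, 1, 2, 1, 2, 2, 3, 3, 3],
    'operator': ['A', 'A', 'B', 'A', 'B', 'B', 'C', 'C', 'C'],
    'value': [10, 15, 20, 8, 12, 15, 30, 40, 50],
    'attribute1': ['x', 'x', 'y', 'x', 'y', 'z', 'x', 'z', 'x'],
    'attribute2': ['apple', 'apple', 'banana', 'apple', 'banana',
                   'banana', 'apple', 'banana', 'banana'],
})
totals = pd.DataFrame({
    'year': [2022, 2022, 2023, 2023, 2023],
    'month': [1, 2, 1, 2, 3],
    'operator': ['A', 'B', 'A', 'B', 'C'],
    'id': ['id1', 'id2', 'id1', 'id2', 'id3'],
    'attribute1': [None, 'y', 'x', 'z', 'x'],
    'attribute2': ['apple', None, 'apple', 'banana', 'banana'],
})

keys = ['year', 'month', 'operator']
attrs = ['attribute1', 'attribute2']

# Pair every flat row with every filter row of the same year/month/operator.
# Flat attribute columns keep their names; filter columns get '_filter'.
pairs = flat_data.merge(totals, on=keys, suffixes=('', '_filter'))

# A pair survives if each filter attribute is null or equals the flat value.
keep = pd.Series(True, index=pairs.index)
for a in attrs:
    wanted = pairs[a + '_filter']
    keep &= wanted.isna() | (wanted == pairs[a])

result = (pairs[keep]
          .groupby(keys + ['id'], as_index=False)['value'].sum()
          .rename(columns={'value': 'sum'}))
print(result)
```

With this sample data the sums come out as 25, 20, 8, 15, and 50. A filter row that matches no flat rows disappears from the result, and the intermediate pairs frame grows when several ids share the same year, month, and operator.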
