Groupby and aggregate analysis

Question

I am trying to build an aggregated analysis from a DataFrame. Here is a sample of my data.

input=pd.DataFrame({'ID':['A','A','A','A','B','B','B','B','C','C','C','C'],
              'Year':[2000,2000,2000,2000,2000,2000,2000,2000,2001,2001,2001,2001],
               'Item':['a1','a2','a3','a4','b1','b2','b3','b4','c1','c2','c3','c4'],
                'Price':[1,3,4,5,2,4,7,3,5,7,6,1],
                'Label 1':[1,1,0,0,1,1,1,0,0,0,0,1],
                'Label 2':[0,1,1,0,0,1,1,0,1,0,0,1]})

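My desired output is shown below. For each ID and Year, I want to count the rows where each Label equals 1, and I also want the total Price of those rows. For example, for company A, items "a1" and "a2" have "Label 1" equal to 1, so "Label 1 Count" = 2, and "Label 1 Total Price" is 1*1 + 1*3 = 4.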

output=pd.DataFrame({'ID':['A','B','C'],
              'Year':[2000,2000,2001],
               'Label 1 Count':[2,3,1],
             'Label 2 Count':[2,2,2],
             'Label 1 Total Price':[4,13,1],
             'Label 2 Total Price':[7,11,6],
              'Item Total':[4,4,4]})

What I have done so far:

input['Label 1 Price']=input['Label 1']*input['Price']
input['Label 2 Price']=input['Label 2']*input['Price']

input.groupby(['ID','Year']).agg({'Label 1':'sum',
                                  'Label 2':'sum',
                                  'Label 1 Price':'sum',
                                  'Label 2 Price':'sum',
                                  'Item':'count'})

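Is there a more elegant solution? Also, my original dataset has many more "Label" columns, so I would rather not create each "Label x Price" column by hand.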

Answer 1

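You can use DataFrame.filter to pick out the label columns automatically; like='Label' matches every column whose name contains "Label". So you could try something like this: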

import pandas as pd

# Sample data
input = pd.DataFrame({'ID':['A','A','A','A','B','B','B','B','C','C','C','C'],
                      'Year':[2000,2000,2000,2000,2000,2000,2000,2000,2001,2001,2001,2001],
                      'Item':['a1','a2','a3','a4','b1','b2','b3','b4','c1','c2','c3','c4'],
                      'Price':[1,3,4,5,2,4,7,3,5,7,6,1],
                      'Label 1':[1,1,0,0,1,1,1,0,0,0,0,1],
                      'Label 2':[0,1,1,0,0,1,1,0,1,0,0,1]})

Identify the label columns dynamically:

label_cols = input.filter(like='Label').columns
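One detail worth checking: like='Label' is a substring match, so it would also pick up a column whose name merely contains "Label". If only columns that start with "Label" are wanted, a regex anchor is one option. The sketch below uses a made-up frame, and the 'Old Label' column is hypothetical and not part of the question's data.

```python
import pandas as pd

# Small illustrative frame; 'Old Label' is a hypothetical extra column
df = pd.DataFrame({
    'ID': ['A', 'B'],
    'Price': [1, 2],
    'Label 1': [1, 0],
    'Label 2': [0, 1],
    'Old Label': [1, 1],
})

# like= keeps every column whose name contains the substring
contains = list(df.filter(like='Label').columns)

# regex= with ^ keeps only columns whose name starts with "Label"
starts_with = list(df.filter(regex=r'^Label').columns)

print(contains)
print(starts_with)
```

With the question's data both calls return the same two columns, so the difference only matters if other column names happen to contain "Label".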

Create the price columns dynamically:

for col in label_cols:
    input[col + ' Price'] = input[col] * input['Price']
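The loop can also be expressed as a single vectorized step. This is only a sketch of an alternative, assuming the label columns hold 0/1 values as in the question; the names df and price_cols are illustrative and not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A','A','A','A','B','B','B','B','C','C','C','C'],
                   'Year': [2000,2000,2000,2000,2000,2000,2000,2000,2001,2001,2001,2001],
                   'Item': ['a1','a2','a3','a4','b1','b2','b3','b4','c1','c2','c3','c4'],
                   'Price': [1,3,4,5,2,4,7,3,5,7,6,1],
                   'Label 1': [1,1,0,0,1,1,1,0,0,0,0,1],
                   'Label 2': [0,1,1,0,0,1,1,0,1,0,0,1]})

label_cols = list(df.filter(like='Label').columns)

# Multiply every label column by Price row-wise, then rename with a suffix
price_cols = df[label_cols].mul(df['Price'], axis=0).add_suffix(' Price')

# Attach the new columns without mutating the original frame
df_with_prices = df.join(price_cols)

print(df_with_prices.head())
```

Building the price columns in a separate frame keeps the original data untouched, which avoids picking up 'Label 1 Price' as a label column if the filter step is rerun later.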

Finally, aggregate the data and print the output:

# Note: the | operator for merging dicts requires Python 3.9 or later
output = input.groupby(['ID', 'Year']).agg(
    {col: 'sum' for col in label_cols} |  # Sum of Label counts
    {col + ' Price': 'sum' for col in label_cols} |  # Sum of Label prices
    {'Price': 'sum', 'Item': 'count'}  # Total Price and Item count
).rename(columns={'Item': 'Item Total', 'Price': 'Total Price'}).reset_index()

print(output)
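The result above keeps the column names 'Label 1' and 'Label 1 Price' and adds an overall 'Total Price', which differs slightly from the layout requested in the question. If the exact column names from the question are wanted, one possible variant is sketched below. It assumes 0/1 label values, and the names df, keys, counts, totals, items and result are illustrative choices rather than part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A','A','A','A','B','B','B','B','C','C','C','C'],
                   'Year': [2000,2000,2000,2000,2000,2000,2000,2000,2001,2001,2001,2001],
                   'Item': ['a1','a2','a3','a4','b1','b2','b3','b4','c1','c2','c3','c4'],
                   'Price': [1,3,4,5,2,4,7,3,5,7,6,1],
                   'Label 1': [1,1,0,0,1,1,1,0,0,0,0,1],
                   'Label 2': [0,1,1,0,0,1,1,0,1,0,0,1]})

keys = ['ID', 'Year']
label_cols = list(df.filter(regex=r'^Label').columns)

# Summing 0/1 labels per group gives the count of rows where the label is 1
counts = df.groupby(keys)[label_cols].sum().add_suffix(' Count')

# Label * Price per row, then summed per group
prices = df[keys].join(df[label_cols].mul(df['Price'], axis=0))
totals = prices.groupby(keys).sum().add_suffix(' Total Price')

# Number of items per group
items = df.groupby(keys)['Item'].count().rename('Item Total')

# All three pieces share the same (ID, Year) index, so they align on concat
result = pd.concat([counts, totals, items], axis=1).reset_index()

print(result)
```

Because the label columns are discovered by name, adding more "Label" columns to the data should not require any change to this code.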

Hope this helps!
