How can I efficiently select the top N columns, by group, for each row of a pandas DataFrame?

Question (votes: 0, answers: 1)

Suppose I have a pandas DataFrame representing coolness scores, by date, for each "contestant" in a hypothetical competition:

import numpy as np
import pandas as pd

rng = np.random.default_rng()
dates = pd.date_range('2024-08-01', '2024-08-07')
contestants = ['Alligator', 'Beryl', 'Chupacabra', 'Dandelion', 'Eggplant', 'Feldspar']
coolness_score = pd.DataFrame(rng.random((len(dates), len(contestants))), index=dates, columns=contestants)
            Alligator     Beryl  Chupacabra  Dandelion  Eggplant  Feldspar
2024-08-01   0.213901  0.952705    0.801651   0.511080  0.662109  0.486296
2024-08-02   0.495700  0.660502    0.379900   0.778438  0.038616  0.214174
2024-08-03   0.639337  0.036226    0.811501   0.281915  0.101850  0.437146
2024-08-04   0.238590  0.686965    0.357087   0.810922  0.907803  0.370247
2024-08-05   0.712564  0.800191    0.040616   0.503644  0.354333  0.742269
2024-08-06   0.916343  0.299557    0.405399   0.851161  0.336570  0.246618
2024-08-07   0.047052  0.645420    0.823397   0.198483  0.368888  0.168188

In addition, each contestant is mapped to a specific category, and each category has a limit:

category_mapping = {
    'Alligator': 'Animal',
    'Beryl': 'Mineral',
    'Chupacabra': 'Animal',
    'Dandelion': 'Vegetable',
    'Eggplant': 'Vegetable',
    'Feldspar': 'Mineral'
}

category_limits = {
    'Animal': 1,
    'Vegetable': 2,
    'Mineral': 1
}

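How can I select the highest-scoring contestants in each category on each date? Specifically, consider three scenarios:

  1. The best single score from each category

  2. The best N scores from each category, where N is the same across all categories

  3. The best scores from each category, with the limits defined by category_limits

Or, better yet, how can I set the losers' scores to zero?

Scenarios 1 and 2 are clearly special cases of scenario 3, but I thought there might be built-in functions that make them more efficient. Left to my own devices, I would probably iterate over the dates, but that seems like the slowest possible approach. Thanks for your help.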

pandas group-by ranking
1 answer (0 votes)
import numpy as np
import pandas as pd

rng = np.random.default_rng()
dates = pd.date_range('2024-08-01', '2024-08-07')
contestants = ['Alligator', 'Beryl', 'Chupacabra', 'Dandelion', 'Eggplant', 'Feldspar']
coolness_score = pd.DataFrame(rng.random((len(dates), len(contestants))), index=dates, columns=contestants)

## solution:
# melt dataframe(from wide to long data format)
df = coolness_score.reset_index().rename(columns={"index": "date"}).melt(id_vars="date", var_name="contestant", value_name="coolness_score")

# map contestant to category and insert category column after date column
category_mapping = {
    'Alligator': 'Animal',
    'Beryl': 'Mineral',
    'Chupacabra': 'Animal',
    'Dandelion': 'Vegetable',
    'Eggplant': 'Vegetable',
    'Feldspar': 'Mineral'
}
df.insert(1, "category", df["contestant"].map(category_mapping))

# sort by date and category for better readability
df = df.sort_values(by=["date", "category"], ignore_index=True)

# the id of the best single score from each category for each day
day_category_top = df.groupby(["date", "category"])[["coolness_score"]].idxmax().rename(columns={"coolness_score": "best_score_index"})

# 1. The best single score from each category
df.loc[day_category_top["best_score_index"]]
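# 2. The best N scores from each category, where N is consistent across all categories
# (nlargest returns at most the group size, so with two contestants per category N = 3 keeps every score)
N = 3
df.groupby(["date", "category"])["coolness_score"].apply(lambda s: list(s.nlargest(N)))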
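
The code above does not cover scenario 3 or the follow-up about setting the losers' scores to zero. One possible sketch, not part of the original answer, keeps the wide layout: rank each contestant within its category on every date, then zero out any score whose rank exceeds the category limit. The helper name keep_top_per_category and the fixed seed are assumptions added for illustration. Ties are broken by column order (method="first"), which is also an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed fixed here only for reproducibility (assumption)
dates = pd.date_range('2024-08-01', '2024-08-07')
contestants = ['Alligator', 'Beryl', 'Chupacabra', 'Dandelion', 'Eggplant', 'Feldspar']
coolness_score = pd.DataFrame(rng.random((len(dates), len(contestants))),
                              index=dates, columns=contestants)

category_mapping = {
    'Alligator': 'Animal', 'Beryl': 'Mineral', 'Chupacabra': 'Animal',
    'Dandelion': 'Vegetable', 'Eggplant': 'Vegetable', 'Feldspar': 'Mineral',
}
category_limits = {'Animal': 1, 'Vegetable': 2, 'Mineral': 1}


def keep_top_per_category(scores, mapping, limits):
    """Hypothetical helper: zero every score that is not among the top
    `limits` scores of its category on that date.

    `limits` is either a single int (scenarios 1 and 2) or a dict that
    maps category -> limit (scenario 3).
    """
    # Category of each column, in column order.
    categories = pd.Series(mapping).reindex(scores.columns)
    # Transpose so contestants are rows, rank within each category for
    # every date (1 = highest score), then transpose back.
    ranks = scores.T.groupby(categories).rank(ascending=False, method="first").T
    if isinstance(limits, int):
        per_column = pd.Series(limits, index=scores.columns)
    else:
        per_column = categories.map(limits)
    # Keep a score only when its rank is within its column's limit.
    keep = ranks.le(per_column, axis="columns")
    return scores.where(keep, 0.0)


best_single = keep_top_per_category(coolness_score, category_mapping, 1)             # scenario 1
best_n = keep_top_per_category(coolness_score, category_mapping, 2)                  # scenario 2
best_limited = keep_top_per_category(coolness_score, category_mapping, category_limits)  # scenario 3
```

Each result has the same shape as coolness_score, with the winners' scores untouched and everything else set to 0. No Python-level loop over dates is needed, since the grouped rank handles every date at once.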