Suppose I have a pandas DataFrame representing the coolness score of each "contestant" in a hypothetical contest, by date:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
dates = pd.date_range('2024-08-01', '2024-08-07')
contestants = ['Alligator', 'Beryl', 'Chupacabra', 'Dandelion', 'Eggplant', 'Feldspar']
coolness_score = pd.DataFrame(rng.random((len(dates), len(contestants))), index=dates, columns=contestants)
            Alligator     Beryl  Chupacabra  Dandelion  Eggplant  Feldspar
2024-08-01   0.213901  0.952705    0.801651   0.511080  0.662109  0.486296
2024-08-02   0.495700  0.660502    0.379900   0.778438  0.038616  0.214174
2024-08-03   0.639337  0.036226    0.811501   0.281915  0.101850  0.437146
2024-08-04   0.238590  0.686965    0.357087   0.810922  0.907803  0.370247
2024-08-05   0.712564  0.800191    0.040616   0.503644  0.354333  0.742269
2024-08-06   0.916343  0.299557    0.405399   0.851161  0.336570  0.246618
2024-08-07   0.047052  0.645420    0.823397   0.198483  0.368888  0.168188
In addition, each contestant is mapped to a specific category, and each category has a limit:
category_mapping = {
    'Alligator': 'Animal',
    'Beryl': 'Mineral',
    'Chupacabra': 'Animal',
    'Dandelion': 'Vegetable',
    'Eggplant': 'Vegetable',
    'Feldspar': 'Mineral'
}
category_limits = {
    'Animal': 1,
    'Vegetable': 2,
    'Mineral': 1
}
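How can I select the contestants with the highest scores in each category on each date? Specifically, I am considering three scenarios:
1. The best single score from each category
2. The best N scores from each category, where N is consistent across all categories
3. The best scores from each category, with the limits defined by category_limits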
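Or, better yet, how can I set the losers' scores to zero?
Scenarios 1 and 2 are obviously special cases of scenario 3, but I figured there might be some built-in functions that are more efficient here. Left to my own devices, I would probably iterate by date, but that seems like the absolute slowest approach. Thanks for your help.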
import numpy as np
import pandas as pd
rng = np.random.default_rng()
dates = pd.date_range('2024-08-01', '2024-08-07')
contestants = ['Alligator', 'Beryl', 'Chupacabra', 'Dandelion', 'Eggplant', 'Feldspar']
coolness_score = pd.DataFrame(rng.random((len(dates), len(contestants))), index=dates, columns=contestants)
## solution:
# melt dataframe(from wide to long data format)
df = coolness_score.reset_index().rename(columns={"index": "date"}).melt(id_vars="date", var_name="contestant", value_name="coolness_score")
# map contestant to category and insert category column after date column
category_mapping = {
    'Alligator': 'Animal',
    'Beryl': 'Mineral',
    'Chupacabra': 'Animal',
    'Dandelion': 'Vegetable',
    'Eggplant': 'Vegetable',
    'Feldspar': 'Mineral'
}
df.insert(1, "category", df["contestant"].map(category_mapping))
# sort by date and category for better readability
df = df.sort_values(by=["date", "category"], ignore_index=True)
# the id of the best single score from each category for each day
day_category_top = df.groupby(["date", "category"])[["coolness_score"]].idxmax().rename(columns={"coolness_score": "best_score_index"})
# 1. The best single score from each category
df.loc[day_category_top["best_score_index"]]
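# 2. The best N scores from each category, where N is consistent across all categories
# (each category has only 2 contestants in this example, so any N >= 2 returns every score)
N = 3
# select the score column before apply so the lambda only ever sees the scores
df.groupby(["date", "category"])["coolness_score"].apply(lambda s: list(s.nlargest(N)))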
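The code above does not cover scenario 3 or the request to set the losers' scores to zero. One possible way to handle both, staying in the original wide format, is to rank each contestant within its category on every date and keep only the ranks that fall within the category's limit. This is only a sketch under a few assumptions: the seed is fixed purely so the run is repeatable, ties are broken by column order (`method="first"`), and the names `ranks`, `limits`, `keep` and `winners_only` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed is an assumption, only for repeatability
dates = pd.date_range('2024-08-01', '2024-08-07')
contestants = ['Alligator', 'Beryl', 'Chupacabra', 'Dandelion', 'Eggplant', 'Feldspar']
coolness_score = pd.DataFrame(rng.random((len(dates), len(contestants))), index=dates, columns=contestants)

category_mapping = {
    'Alligator': 'Animal',
    'Beryl': 'Mineral',
    'Chupacabra': 'Animal',
    'Dandelion': 'Vegetable',
    'Eggplant': 'Vegetable',
    'Feldspar': 'Mineral'
}
category_limits = {'Animal': 1, 'Vegetable': 2, 'Mineral': 1}

# Rank contestants within their category for every date (1 = highest score).
# Transposing puts contestants on the index so the dict can act as the group key.
ranks = coolness_score.T.groupby(category_mapping).rank(method="first", ascending=False).T

# Per-contestant limit, looked up through the contestant's category.
limits = pd.Series(category_mapping).map(category_limits).reindex(coolness_score.columns)

# True where a contestant's rank is within its category limit on that date.
keep = ranks.le(limits, axis="columns")

# Same shape as the input, with the losers' scores replaced by zero.
winners_only = coolness_score.where(keep, 0.0)
print(winners_only)
```

Scenarios 1 and 2 should fall out of the same code by giving every category the same limit, for example `category_limits = dict.fromkeys(category_limits, N)`. No explicit loop over dates is needed, since the ranking is done for all dates in one grouped call.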