Group by two variables, then create a new column based on the value of another variable in Python (pandas)

Question

I can do this in R, but I can't figure out how to do it in Python.

I have data with the columns sbj, num_item, visit, and height. I want to create a baseline height column using pandas.

For example:

sbj	num_item	visit	height	baseline_height
1	1	Baseline	1.5	1.5
1	1	Day 7	2	1.5
1	1	Day 14	2.5	1.5
1	2	Baseline	1	1
1	2	Day 7	1.5	1
1	2	Day 14	2	1
2	1	Baseline	0.5	0.5
2	1	Day 7	1	0.5
2	1	Day 14	1.5	0.5
2	2	Baseline	3	3
2	2	Day 7	3.5	3
2	2	Day 14	4	3

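I want to group by two variables, sbj and num_item, and create a new column called baseline_height. For each combination of sbj and num_item, baseline_height should be set to the height value recorded at the Baseline visit.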

I have tried many different approaches, but none of them work:

    df['baseline_height'] = df.groupby(by = ['sbj', 'num_item']).height[['visit' == "Baseline"]]

    df['baseline_height'] = 0
    df = df.loc[df.groupby(by = ['sbj', 'num_item'])['baseline_height'].apply('visit' == 'Baseline') == df['height']]

    df['baseline_height'] = df.groupby(by = ['sbj', 'num_item']).apply(['height'][['visit'] == 'Baseline'])

    df['baseline_height'] = df.groupby(by = ['sbj', 'num_item']).apply(df['height'][df["visit"]=='Baseline'])

    df_grouped = df.groupby(by = ['sbj', 'num_item'])
    df['baseline_height'] = df_grouped.height[df_grouped["visit"]=='Baseline']

Answer 1

You're almost there.

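I would suggest grouping by sbj and num_item as you did, but only after sorting the values by sbj, num_item, and visit, so that "Baseline" always comes before the other visits ("Day 7", "Day 14", etc.). This works here because "Baseline" sorts alphabetically before any label starting with "Day":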

   df = df[['sbj', 'num_item', 'visit', 'height']]
   # sort_values returns a new DataFrame, so assign the result back
   df = df.sort_values(['sbj', 'num_item', 'visit'])
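If the visit labels in your real data do not happen to sort alphabetically with the baseline first, one option is to give the column an explicit order before sorting. The following is only a sketch: the sample rows and the visit_order list are assumptions made for illustration, so adjust them to your own labels.

```python
import pandas as pd

# Hypothetical sample mirroring the question's columns, deliberately out of order.
df = pd.DataFrame({
    "sbj": [1, 1, 1],
    "num_item": [1, 1, 1],
    "visit": ["Day 14", "Baseline", "Day 7"],
    "height": [2.5, 1.5, 2.0],
})

# Assumed visit order; change this list to match the labels in your data.
visit_order = ["Baseline", "Day 7", "Day 14"]
df["visit"] = pd.Categorical(df["visit"], categories=visit_order, ordered=True)

# Sorting now follows visit_order instead of alphabetical order.
df = df.sort_values(["sbj", "num_item", "visit"]).reset_index(drop=True)
```

With an ordered categorical, "Day 7" also sorts before "Day 14", which plain string sorting would not do.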

Then you can initialize baseline_height as a copy of height:

   df['baseline_height'] = df['height']

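Then it is just a matter of using the transform method suggested in this post, which broadcasts the first value of each group (the Baseline height, thanks to the sort) to every row of that group: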

   df['baseline_height'] = df.groupby(by = ['sbj', 'num_item']).baseline_height.transform('first')
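As an alternative that does not depend on row order at all, you could mask out the non-baseline heights and let transform pick the one remaining value per group. This is a sketch rather than the only way to do it: the DataFrame below is rebuilt from the table in the question, and it assumes each sbj/num_item pair has exactly one row whose visit is "Baseline".

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "sbj":      [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "num_item": [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    "visit":    ["Baseline", "Day 7", "Day 14"] * 4,
    "height":   [1.5, 2, 2.5, 1, 1.5, 2, 0.5, 1, 1.5, 3, 3.5, 4],
})

# Keep height only on Baseline rows; every other row becomes NaN.
baseline_only = df["height"].where(df["visit"].eq("Baseline"))

# 'first' takes the first non-null value in each group,
# and transform broadcasts it back to every row of that group.
df["baseline_height"] = (
    baseline_only.groupby([df["sbj"], df["num_item"]]).transform("first")
)
```

Because the mask selects the baseline row by its label, no prior sort_values call is needed. A group with no Baseline row would end up with NaN in baseline_height.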