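I have a Pandas dataframe containing multiple findings (medical history) per person. I want to collapse each person's medical history into a single row while retaining the order, but at the appointment-date level: for each appointment, all findings/exam results from the patient's past should appear in wide format.
I'm not sure how best to do this, because all the `groupby` methods require me to provide an `agg`, which then combines the values into one column by concatenation rather than creating new, separate columns for each past appointment.
Some columns will not be pivoted or used as the `groupby` index (`patientId, apptDate, age, bmi`).
One thing to consider is how best to handle the ordering of the generated medical history `mh_` columns, so that records are populated into the lower-numbered columns first (`mh_result1`, etc.).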
Original DF
| patientId | apptDate | age | bmi | examinationId | result | category | comment |
|-----------|------------|-----|-----|---------------|------------|----------------|---------------------------------------------|
| 1 | 2024-07-08 | 45 | 22 | 45 | Long Term | Cardiovascular | Cardiovascular defect finding, fup required |
| 1 | 2024-02-01 | 45 | 22 | 33 | None | None | None |
| 1 | 2023-11-14 | 45 | 22 | 12 | Short Term | Respiratory | Shortness of breath, med prescribed |
| 2 | 2023-12-23 | 32 | 12 | 18 | Short Term | Gastro | Recorded malnutrition |
| 2 | 2022-12-11 | 32 | 13 | 21 | Short Term | Gastro | None |
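For anyone wanting to reproduce this, the table above can be built roughly as follows. This is only a sketch: it assumes the `None` cells are missing values and that `apptDate` is stored as ISO-formatted strings, neither of which is stated explicitly.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: "None" cells are real missing values, not the string "None".
df = pd.DataFrame({
    "patientId": [1, 1, 1, 2, 2],
    "apptDate": ["2024-07-08", "2024-02-01", "2023-11-14",
                 "2023-12-23", "2022-12-11"],
    "age": [45, 45, 45, 32, 32],
    "bmi": [22, 22, 22, 12, 13],
    "examinationId": [45, 33, 12, 18, 21],
    "result": ["Long Term", None, "Short Term", "Short Term", "Short Term"],
    "category": ["Cardiovascular", None, "Respiratory", "Gastro", "Gastro"],
    "comment": [
        "Cardiovascular defect finding, fup required",
        None,
        "Shortness of breath, med prescribed",
        "Recorded malnutrition",
        None,
    ],
})
```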
Desired DF
| patientId | apptDate | age | bmi | examinationId | result | category | comment | mh_result1 | mh_category1 | mh_comment1 | mh_result2 | mh_category2 | mh_comment2 |
|-----------|------------|-----|-----|---------------|------------|----------------|---------------------------------------------|------------|--------------|-------------------------------------|------------|--------------|--------------|
| 1 | 2024-07-08 | 45 | 22 | 45 | Long Term | Cardiovascular | Cardiovascular defect finding, fup required | Short Term | Respiratory | Shortness of breath, med prescribed | None | None | None |
| 1 | 2024-02-01 | 45 | 22 | 33 | None | None | None | Short Term | Respiratory | Shortness of breath, med prescribed | None | None | None |
| 1 | 2023-11-14 | 45 | 22 | 12 | Short Term | Respiratory | Shortness of breath, med prescribed | None | None | None | None | None | None |
| 2 | 2023-12-23 | 32 | 12 | 18 | Short Term | Gastro | Recorded malnutrition | Short Term | Gastro | None | None | None | None |
| 2 | 2022-12-11 | 32 | 13 | 21 | Short Term | Gastro | None | None | None | None | None | None | None |
You could `pivot`, then `merge`:
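# df is the original DataFrame shown above
tmp = (df
   .sort_values(by='apptDate')
   # number each patient's appointments chronologically: 1, 2, 3, ...
   .assign(col=lambda x: x.groupby('patientId').cumcount().add(1))
   # one column per (field, appointment number) pair
   .pivot(index=['patientId', 'apptDate'], columns='col',
          values=['result', 'category', 'comment'])
   # order the columns by appointment number
   .sort_index(level=1, axis=1, sort_remaining=False)
   # carry earlier appointments forward, then shift so a row only sees its past
   .groupby(level='patientId').transform(lambda x: x.ffill().shift())
)
# flatten the MultiIndex columns into mh_result1, mh_category1, ...
tmp.columns = tmp.columns.map(lambda x: f'mh_{x[0]}{x[1]}')
# attach the history columns to the original rows
out = df.merge(tmp, left_on=['patientId', 'apptDate'], right_index=True, how='left')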
Output:
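| | patientId | apptDate | age | bmi | examinationId | result | category | comment | mh_result1 | mh_category1 | mh_comment1 | mh_result2 | mh_category2 | mh_comment2 | mh_result3 | mh_category3 | mh_comment3 |
|---|-----------|------------|-----|-----|---------------|------------|----------------|---------------------------------------------|------------|--------------|-------------------------------------|------------|--------------|-------------|------------|--------------|-------------|
| 0 | 1 | 2024-07-08 | 45 | 22 | 45 | Long Term | Cardiovascular | Cardiovascular defect finding, fup required | Short Term | Respiratory | Shortness of breath, med prescribed | NaN | NaN | NaN | None | None | None |
| 1 | 1 | 2024-02-01 | 45 | 22 | 33 | NaN | NaN | NaN | Short Term | Respiratory | Shortness of breath, med prescribed | NaN | NaN | NaN | None | None | None |
| 2 | 1 | 2023-11-14 | 45 | 22 | 12 | Short Term | Respiratory | Shortness of breath, med prescribed | None | None | None | NaN | NaN | NaN | None | None | None |
| 3 | 2 | 2023-12-23 | 32 | 12 | 18 | Short Term | Gastro | Recorded malnutrition | Short Term | Gastro | NaN | None | None | None | NaN | NaN | NaN |
| 4 | 2 | 2022-12-11 | 32 | 13 | 21 | Short Term | Gastro | NaN | None | None | NaN | None | None | None | NaN | NaN | NaN |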
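To see why the `ffill().shift()` step leaves each row with only its past appointments, here is a minimal sketch on a toy frame. The data and the names `toy`, `wide` and `history` are made up for illustration; it uses a single patient and a single field so the intermediate shape is easy to read.

```python
import pandas as pd

# Hypothetical toy data: one patient, three appointments, one field.
toy = pd.DataFrame({
    "patientId": [1, 1, 1],
    "apptDate": ["2023-01-01", "2023-06-01", "2024-01-01"],
    "result": ["A", "B", "C"],
})

wide = (
    toy.sort_values("apptDate")
    # appointment number within each patient: 1, 2, 3
    .assign(col=lambda x: x.groupby("patientId").cumcount().add(1))
    .pivot(index=["patientId", "apptDate"], columns="col", values="result")
)
# After the pivot each row holds only its own value, on the "diagonal":
#   appointment 1 -> column 1, appointment 2 -> column 2, ...

history = wide.groupby(level="patientId").transform(lambda x: x.ffill().shift())
# ffill copies each value down to the later appointments;
# shift then moves everything down one row, so a row never contains itself:
#   2023-01-01 -> NaN, NaN, NaN
#   2023-06-01 -> A,   NaN, NaN
#   2024-01-01 -> A,   B,   NaN
print(history)
```

Note that with this numbering the oldest appointment lands in column 1, which matches the output above; if the most recent past appointment should come first instead, the numbering step would need to change.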