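I have an extract in which I need to identify a certain type of surgery, X (see the Surg Type column).
I need to keep the medical appointments, each stored as a separate row, that fall within a window around the surgery: the 3 appointments before it (-3, -2, -1) and the 3 appointments after it (+1, +2, +3).
I must include this order as an additional column.
In addition, I need to exclude any appointment outside the window and any other Surg Type; in this example, any surgery denoted Z.
In this example I want to keep 7 of the 9 rows/records, plus an additional column, Prior Post.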
*** Updated example ***
Original Df
| Patient ID | Surg ID | Surg Type | Surg Date | Medical Appt | Medical Appt Date |
|------------|---------|-----------|------------|--------------|-------------------|
| 1 | 1 | X | 2022-09-03 | Y | 2022-01-01 |
| 1 | 1 | X | 2022-09-03 | Y | 2022-03-04 |
| 1 | 1 | X | 2022-09-03 | Y | 2022-05-04 |
| 1 | 1 | X | 2022-09-03 | N | NaT |
| 1 | 1 | X | 2022-09-03 | Y | 2022-11-04 |
| 1 | 1 | X | 2022-09-03 | Y | 2022-11-29 |
| 1 | 2 | Z | 2022-12-01 | N | NaT |
| 1 | 1 | X | 2022-09-03 | Y | 2023-01-02 |
| 1 | 1 | X | 2022-09-03 | Y | 2023-01-13 |
Desired Df
| Patient ID | Surg ID | Surg Type | Surg Date | Medical Appt | Medical Appt Date | Inclusion |
|------------|---------|-----------|------------|--------------|-------------------|-------------|
| 1 | 1 | X | 2022-09-03 | Y | 2022-01-01 | -3 |
| 1 | 1 | X | 2022-09-03 | Y | 2022-03-04 | -2 |
| 1 | 1 | X | 2022-09-03 | Y | 2022-05-04 | -1 |
| 1 | 1 | X | 2022-09-03 | N | NaT | |
| 1 | 1 | X | 2022-09-03 | Y | 2022-11-04 | +1 |
| 1 | 1 | X | 2022-09-03 | Y | 2022-11-29 | +2 |
| 1 | 2 | Z | 2022-12-01 | N | NaT | Exclude Row |
| 1 | 1 | X | 2022-09-03 | Y | 2023-01-02 | +3 |
| 1 | 1 | X | 2022-09-03 | Y | 2023-01-13 | Exclude row |
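You can filter for the X surgeries, then compute a rolling.max on the sorted dates to keep the ±N dates around each surgery (assuming the surgeries are the rows with NaT in Medical Appt Date):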
import pandas as pd

# number of medical appointments to keep before/after a surgery
N = 3
# columns to use as grouper
group_cols = ['Patient ID', 'Surg ID']

# ensure datetime
df[['Surg Date', 'Medical Appt Date']] = df[['Surg Date', 'Medical Appt Date']].apply(pd.to_datetime)

# filter out the non-X types
# sort by date, compute a groupby.rolling.max
# identify the rows to keep
keep = (
    df[df['Surg Type'].eq('X')]
    # surgery rows have no appointment date: place them at the surgery date
    .assign(date=lambda d: d['Medical Appt Date'].fillna(d['Surg Date']),
            surgery=lambda d: d['Medical Appt Date'].isna()
            )
    .sort_values(by=group_cols+['date'])
    .groupby(group_cols, sort=False)
    # centered window of 2N+1 rows: True if a surgery lies within ±N rows
    ['surgery'].rolling(2*N+1, center=True, min_periods=1)
    .max().astype(bool)
    .droplevel(group_cols)
)

# select the rows from the above list of indices to keep
out = df.loc[keep.index[keep]]
Output:
   Patient ID  Surg ID Surg Type  Surg Date Medical Appt Medical Appt Date
0           1        1         X 2022-09-03            Y        2022-01-01
1           1        1         X 2022-09-03            Y        2022-03-04
2           1        1         X 2022-09-03            Y        2022-05-04
3           1        1         X 2022-09-03            N               NaT
4           1        1         X 2022-09-03            Y        2022-11-04
5           1        1         X 2022-09-03            Y        2022-11-29
7           1        1         X 2022-09-03            Y        2023-01-02
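The output above keeps the right rows but does not yet carry the requested order column. One possible way to add it is sketched below, as an alternative to the rolling approach rather than an extension of it: appointments are split into those before and those on or after the surgery date, then numbered by distance from the surgery with `groupby.cumcount`. The frame is rebuilt from the question's example so the snippet runs on its own. The column name `Inclusion` is taken from the desired table, while counting an appointment on the surgery date itself as "after" is an assumption not stated in the question.

```python
import pandas as pd

N = 3
group_cols = ['Patient ID', 'Surg ID']

# example data copied from the question
df = pd.DataFrame({
    'Patient ID': [1] * 9,
    'Surg ID': [1, 1, 1, 1, 1, 1, 2, 1, 1],
    'Surg Type': ['X', 'X', 'X', 'X', 'X', 'X', 'Z', 'X', 'X'],
    'Surg Date': ['2022-09-03'] * 6 + ['2022-12-01'] + ['2022-09-03'] * 2,
    'Medical Appt': ['Y', 'Y', 'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y'],
    'Medical Appt Date': ['2022-01-01', '2022-03-04', '2022-05-04', None,
                          '2022-11-04', '2022-11-29', None,
                          '2023-01-02', '2023-01-13'],
})
date_cols = ['Surg Date', 'Medical Appt Date']
df[date_cols] = df[date_cols].apply(pd.to_datetime)

is_x = df['Surg Type'].eq('X')
has_appt = df['Medical Appt Date'].notna()

# appointments attached to an X surgery, split around the surgery date
appts = df[is_x & has_appt]
is_prior = appts['Medical Appt Date'] < appts['Surg Date']

# number appointments by distance from the surgery: -1 is the closest before,
# +1 the closest after (assumption: same-day appointments count as "after")
prior_pos = -(appts[is_prior]
              .sort_values('Medical Appt Date', ascending=False)
              .groupby(group_cols).cumcount() + 1)
post_pos = (appts[~is_prior]
            .sort_values('Medical Appt Date')
            .groupby(group_cols).cumcount() + 1)

# keep only positions inside the ±N window
position = pd.concat([prior_pos, post_pos])
position = position[position.abs().le(N)]

# keep the surgery rows themselves (X rows without an appointment date)
surgery_idx = df.index[is_x & ~has_appt]
out = df.loc[position.index.union(surgery_idx)].copy()

# nullable integer column: <NA> on the surgery row, -N..+N elsewhere
out['Inclusion'] = position.reindex(out.index).astype('Int64')
print(out)
```

On the example data this keeps the same seven rows as the rolling approach, with `Inclusion` running from -3 to 3 and left missing on the surgery row.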