我有一个大约每 15 秒一次的 df (df1) 和另一个大约每 5 分钟一次的数据 (df2),如下所示。我想让 df2 的“时间”变量与 df1 最接近的时间戳相匹配,然后将最接近的时间戳作为 5 分钟平均 df1 数据的中心。
df1:
time speed
0 2022-10-04 00:00:24 4.590
1 2022-10-04 00:00:41 4.389
2 2022-10-04 00:00:57 4.367
3 2022-10-04 00:01:14 4.539
4 2022-10-04 00:01:30 4.584
5 2022-10-04 00:01:48 4.523
6 2022-10-04 00:02:05 4.498
7 2022-10-04 00:02:21 4.625
8 2022-10-04 00:02:38 4.497
9 2022-10-04 00:02:54 4.406
10 2022-10-04 00:03:12 4.502
11 2022-10-04 00:03:28 4.494
12 2022-10-04 00:03:45 4.445
13 2022-10-04 00:04:01 4.438
14 2022-10-04 00:04:18 4.433
15 2022-10-04 00:04:36 4.441
16 2022-10-04 00:04:52 4.400
17 2022-10-04 00:05:09 4.221
18 2022-10-04 00:05:27 4.115
19 2022-10-04 00:05:43 4.009
20 2022-10-04 00:06:01 4.230
21 2022-10-04 00:06:18 4.360
22 2022-10-04 00:06:34 4.331
23 2022-10-04 00:06:51 4.178
24 2022-10-04 00:07:07 4.238
25 2022-10-04 00:07:25 4.125
26 2022-10-04 00:07:43 3.988
27 2022-10-04 00:08:17 3.573
28 2022-10-04 00:08:34 4.471
29 2022-10-04 00:08:50 4.567
30 2022-10-04 00:09:08 4.451
31 2022-10-04 00:09:25 4.311
32 2022-10-04 00:09:42 4.280
33 2022-10-04 00:09:59 4.439
34 2022-10-04 00:10:17 4.410
35 2022-10-04 00:10:35 4.335
36 2022-10-04 00:10:51 4.193
37 2022-10-04 00:11:08 4.140
38 2022-10-04 00:11:25 4.020
39 2022-10-04 00:11:43 3.872
40 2022-10-04 00:12:01 3.859
41 2022-10-04 00:12:17 4.062
42 2022-10-04 00:12:34 3.861
43 2022-10-04 00:12:51 3.780
44 2022-10-04 00:13:07 3.680
45 2022-10-04 00:13:25 3.909
46 2022-10-04 00:13:42 3.852
47 2022-10-04 00:13:58 3.867
48 2022-10-04 00:14:15 3.715
49 2022-10-04 00:14:32 3.534
50 2022-10-04 00:14:49 3.349
51 2022-10-04 00:15:06 3.213
52 2022-10-04 00:15:23 3.215
53 2022-10-04 00:15:39 3.246
54 2022-10-04 00:15:55 3.195
55 2022-10-04 00:16:14 3.164
56 2022-10-04 00:16:30 3.149
57 2022-10-04 00:16:47 3.281
58 2022-10-04 00:17:03 3.366
59 2022-10-04 00:17:20 3.295
60 2022-10-04 00:17:38 3.487
61 2022-10-04 00:17:54 3.534
62 2022-10-04 00:18:11 3.430
63 2022-10-04 00:18:27 3.474
64 2022-10-04 00:18:44 3.275
65 2022-10-04 00:19:01 3.584
66 2022-10-04 00:19:18 3.616
67 2022-10-04 00:19:34 3.506
68 2022-10-04 00:19:51 3.561
69 2022-10-04 00:20:08 3.316
70 2022-10-04 00:20:27 3.396
71 2022-10-04 00:20:43 3.536
72 2022-10-04 00:20:59 3.631
73 2022-10-04 00:21:16 3.573
74 2022-10-04 00:21:33 3.514
75 2022-10-04 00:21:50 3.603
76 2022-10-04 00:22:07 3.591
77 2022-10-04 00:22:23 3.591
78 2022-10-04 00:22:40 3.659
79 2022-10-04 00:23:14 4.056
df2:
time speed
0 2022-10-03 00:03:23 4.646689
1 2022-10-03 00:08:24 5.328516
2 2022-10-03 00:13:24 5.895778
3 2022-10-03 00:18:24 5.665014
4 2022-10-03 00:22:25 6.313763
我所知道的是,我可以使用 pandas.merge_asof 将时间与“tolerance”参数尽可能接近。从那里开始,我一直在尝试 pandas.groupby 和 pandas.rolling 的几种组合,但仍在努力获得所需的结果。我不知道如何继续,因此我们将不胜感激。
IIUC,你可以尝试这样的事情:
df1_rolling_mean = (
df1.rolling(window="300s", on="time", center=True)
.mean()
.rename(columns={"speed": "speed_avg"})
)
df1 = pd.merge(df1, df1_rolling_mean)
df = pd.merge_asof(
df1,
df2.rename(columns={"speed": "speed_df2"}),
on="time",
tolerance=pd.Timedelta(seconds=15),
)
df = df[df.loc[:, "speed_df2"].notna()]
这将导致:
time speed speed_avg speed_df2
11 2022-10-04 00:03:28 4.494 4.421765 4.646689
28 2022-10-04 00:08:34 4.471 4.265625 5.328516
45 2022-10-04 00:13:25 3.909 3.687167 5.895778
63 2022-10-04 00:18:27 3.474 3.410000 5.665014
78 2022-10-04 00:22:40 3.659 3.615000 6.313763
speed_avg
是 speed
中第 df1
列的 5 分钟平均值;df1
上的 df2
合并时,我们仅保留 time
和 merge_asof
具有 time
共同值的列,容差为 15 秒。