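I have two datasets (A, B) and need to merge them by date, id and nearest time (see MERGED DATASET below). The times in the two datasets do not match exactly: the time in dataset B is always 0 to 10 minutes later than the time in dataset A.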
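I have already tried left_join with within, between, overlaps, etc., but could not get it to work. I think I am doing something wrong. I can't share the real data, but I have made a simple example dataset.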
I would really appreciate any help.
Many thanks
DATASET A
DATETIME | ID | W
--------------------------------
2020-12-02 18:02:01 | 1 | 0.25
2020-12-02 19:06:21 | 1 | 0.35
2020-12-02 18:12:08 | 2 | 0.44
2020-12-03 10:03:03 | 3 | 0.98
DATASET B
DATETIME | ID | X1 | X3
--------------------------------------
2020-12-02 18:08:01 | 1 | 1.3 | 99.3
2020-12-02 18:21:11 | 2 | 4.2 | 33.2
2020-12-03 10:09:22 | 3 | 7.1 | 39.9
MERGED DATASET
DATETIME.x | ID.x | W | DATETIME.y | ID.y | X1 | X3
----------------------------------------------------------------------------
2020-12-02 18:02:01 | 1 | 0.25 | 2020-12-02 18:08:01 | 1 | 1.3 | 99.3
2020-12-02 19:06:21 | 1 | 0.35 | | | |
2020-12-02 18:12:08 | 2 | 0.44 | 2020-12-02 18:21:11 | 2 | 4.2 | 33.2
2020-12-03 10:03:03 | 3 | 0.98 | 2020-12-03 10:09:22 | 3 | 7.1 | 39.9
I use fuzzyjoin for joins like this:
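# Left join of a and b: ID must be equal and the two DATETIMEs
# must be at most 10 minutes apart (in either direction)
fuzzyjoin::fuzzy_left_join(a, b,
  by = c("ID" = "ID", "DATETIME" = "DATETIME"),
  match_fun = list(`==`, function(x, y) abs(difftime(x, y, units = "mins")) <= 10)
)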
Output:
DATETIME.x ID.x W DATETIME.y ID.y X1 X3
1 2020-12-02 18:02:01 1 0.25 2020-12-02 18:08:01 1 1.3 99.3
2 2020-12-02 19:06:21 1 0.35 <NA> NA NA NA
3 2020-12-02 18:12:08 2 0.44 2020-12-02 18:21:11 2 4.2 33.2
4 2020-12-03 10:03:03 3 0.98 2020-12-03 10:09:22 3 7.1 39.9
Data:
a <- structure(list(DATETIME = structure(c(1606932121, 1606935981,
1606932728, 1606989783), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
ID = c(1L, 1L, 2L, 3L), W = c(0.25, 0.35, 0.44, 0.98)), row.names = c("1",
"2", "3", "4"), class = "data.frame")
b <- structure(list(DATETIME = structure(c(1606932481, 1606933271,
1606990162), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
ID = 1:3, X1 = c(1.3, 4.2, 7.1), X3 = c(99.3, 33.2, 39.9)), row.names = c(NA,
-3L), class = "data.frame")
merged <- structure(list(DATETIME.x = structure(c(1606932121, 1606935981,
1606932728, 1606989783), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
ID.x = c(1L, 1L, 2L, 3L), W = c(0.25, 0.35, 0.44, 0.98),
DATETIME.y = structure(c(1606932481, NA, 1606933271, 1606990162
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), ID.y = c(1L,
NA, 2L, 3L), X1 = c(1.3, NA, 4.2, 7.1), X3 = c(99.3, NA,
33.2, 39.9)), row.names = c(NA, -4L), class = "data.frame")
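
One caveat with the fuzzy join above: because of abs(), it accepts B times up to 10 minutes earlier as well as later, and if several B rows fall inside the window, each of them produces its own output row. If that matters for the real data, a base R sketch along the following lines keeps only the closest B row that is 0 to 10 minutes later. The names nearest_later_join and max_mins are made up for illustration, and the data are retyped from the tables in the question (assuming UTC) so that the snippet runs on its own.

```r
# Example data retyped from the tables in the question (times assumed to be UTC)
a <- data.frame(
  DATETIME = as.POSIXct(c("2020-12-02 18:02:01", "2020-12-02 19:06:21",
                          "2020-12-02 18:12:08", "2020-12-03 10:03:03"), tz = "UTC"),
  ID = c(1L, 1L, 2L, 3L),
  W = c(0.25, 0.35, 0.44, 0.98)
)
b <- data.frame(
  DATETIME = as.POSIXct(c("2020-12-02 18:08:01", "2020-12-02 18:21:11",
                          "2020-12-03 10:09:22"), tz = "UTC"),
  ID = 1:3,
  X1 = c(1.3, 4.2, 7.1),
  X3 = c(99.3, 33.2, 39.9)
)

# Hypothetical helper: for each row of `a`, pick the row of `b` with the same ID
# whose DATETIME is the closest one lying 0 to `max_mins` minutes later.
nearest_later_join <- function(a, b, max_mins = 10) {
  idx <- vapply(seq_len(nrow(a)), function(i) {
    # Minutes from this A time to every B time (positive means B is later)
    lag <- as.numeric(difftime(b$DATETIME, a$DATETIME[i], units = "mins"))
    ok <- which(b$ID == a$ID[i] & lag >= 0 & lag <= max_mins)
    # No candidate gives NA, otherwise the candidate with the smallest lag
    if (length(ok) == 0L) NA_integer_ else ok[which.min(lag[ok])]
  }, integer(1))

  a_out <- a
  b_out <- b[idx, , drop = FALSE]  # an NA index yields an all-NA row
  rownames(b_out) <- NULL

  # Suffix the columns that exist in both inputs, as a left join would
  shared <- intersect(names(a), names(b))
  names(a_out) <- ifelse(names(a) %in% shared, paste0(names(a), ".x"), names(a))
  names(b_out) <- ifelse(names(b) %in% shared, paste0(names(b), ".y"), names(b))
  cbind(a_out, b_out)
}

res <- nearest_later_join(a, b)
print(res)
```

This loops over the rows of A, so it is meant for small or medium data; for large tables a dedicated rolling or inequality join would probably be a better fit.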