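I have a set of 52 spatial points. I am trying to split them into groups based on a distance of 11 km, i.e. divide all the points into groups so that the points within a group are no more than 11 km apart. I then want to assign each point to its group in a new column of the data frame.
I found several posts, but they didn't get me where I want to be because they require setting the number of clusters (which is not important here). I tried adapting this post, but it gave me 52 different groups, which doesn't seem right because many of the points are within 11 km of each other, so I am looking for other options.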
df <- structure(list(station = c("BE01", "BE02", "BEUWM01", "BL01",
"BL02", "PB01", "PB02", "PB03", "PB04", "PB05", "PB06", "PB07",
"PB09", "PB10", "PB11", "PB12", "PB13", "PB14", "PB15", "PB16",
"PB17", "PB18", "PB19", "PB20", "PB21", "PB22", "PB23", "PB24",
"PB25", "PB26", "PB27", "PB28", "PB29", "PB30", "PB4G01", "PB4G02",
"PBUWM01", "PBUWM02", "SA01", "SA02", "SA02b", "SA03", "SA04",
"SA05", "SA06", "SA07", "SA11", "SAUWM01", "VB01", "VB02", "VB03",
"VB04"), longitude = c(71.6546833333333, 71.6748333333333, 71.66293,
72.4337833333333, 72.4347, 71.7342, 71.7632, 71.7992, 71.8092,
71.8326916667, 71.8405, 71.8796, 71.96835, 71.9697666666667,
71.9727, 71.9745666666667, 71.9385075, 71.8685, 71.8524, 71.8414,
71.8294, 71.758275, 71.7578, 71.7468, 71.9733, 71.9795, 71.9741,
71.9209, 71.8959, 71.8228, 71.7498, 71.7323, 71.9068, 71.7474,
71.9398, 71.8329, 71.98115, 71.75197, 72.24793, 72.241, 72.23027,
72.2569, 72.2812, 72.1980666667, 72.2116, 72.221, 72.2636, 72.24754,
72.2155, 72.2405, 72.2156, 72.2488), latitude = c(-5.25671666666667,
-5.2662, -5.24915, -5.2579, -5.2432, -5.2815, -5.2459, -5.2461,
-5.2448, -5.23439583333, -5.2567, -5.2694, -5.24165, -5.33015,
-5.3344, -5.37878333333333, -5.39709575, -5.4271, -5.4229, -5.4308,
-5.4406, -5.456505, -5.3823, -5.3512, -5.2695, -5.3039, -5.3521,
-5.4126, -5.4243, -5.4644, -5.3957, -5.3181, -5.2668, -5.4261,
-5.2582, -5.445, -5.33995, -5.38898, -5.31183, -5.3056, -5.316745,
-5.2985, -5.3347, -5.35026666667, -5.3686, -5.3225, -5.3327,
-5.30114, -5.5452, -5.5207, -5.5247, -5.546)), row.names = c(NA,
-52L), class = "data.frame")
We can do this as follows. First, convert the x/y coordinates to sf points:
library(sf)
points <- st_as_sf(df, coords = c("longitude", "latitude"), crs = 4326)
Then we can get the 52 x 52 distance matrix, which gives the distance between each pair of stations:
adj <- st_distance(points)
Next, we can convert this into a binary matrix that tells us whether each pair of stations is within 11 km of each other:
adj <- matrix(as.numeric(as.numeric(adj) < 11000), nrow = nrow(adj))
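As a quick sanity check of the thresholding step, here is a minimal sketch on three made-up points on the equator (the coordinates and the names `toy`, `d` and `within_11km` are hypothetical, not part of the station data). It assumes `st_distance()` returns metres for longitude/latitude input, which is the default behaviour for `crs = 4326`.

```r
library(sf)

# Hypothetical points on the equator: 0.05 degrees of longitude is roughly
# 5.6 km, while 1 degree is roughly 111 km.
toy <- st_as_sf(data.frame(lon = c(0, 0.05, 1), lat = c(0, 0, 0)),
                coords = c("lon", "lat"), crs = 4326)

d <- st_distance(toy)  # 3 x 3 matrix of distances carrying units (metres)

# Drop the units, compare against 11 km, and restore the matrix shape
within_11km <- matrix(as.numeric(d) < 11000, nrow = nrow(d))
within_11km
```

The first two points should be flagged as neighbours and the third as isolated; the diagonal is always `TRUE` because every point is 0 m from itself.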
Note that this is an adjacency matrix, so we can easily turn it into a graph:
library(igraph)
g <- graph_from_adjacency_matrix(adj)
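If we plot this graph, we see that there are 4 connected components, each representing a cluster of stations that lie within 11 km of one another: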
plot(g)
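To make "connected components" concrete, here is a small sketch on a made-up 5 x 5 adjacency matrix (the matrix and the name `comp` are hypothetical; only `graph_from_adjacency_matrix()` and `components()` are the igraph functions used in this answer):

```r
library(igraph)

# Hypothetical "within threshold" matrix: stations 1-2-3 form a chain,
# stations 4 and 5 are linked only to each other.
toy_adj <- matrix(0, nrow = 5, ncol = 5)
toy_adj[cbind(c(1, 2, 4), c(2, 3, 5))] <- 1
toy_adj <- toy_adj + t(toy_adj)  # make the relation symmetric

toy_g <- graph_from_adjacency_matrix(toy_adj, mode = "undirected")
comp <- components(toy_g)

comp$no          # number of connected components
comp$membership  # component id of each station, in row order
```

Stations 1 and 3 are not directly linked, yet they land in the same component because both are linked to station 2.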
We can get the membership of these components and put it back into our original data frame:
df$group <- factor(components(g)$membership)
This correctly labels the stations that are within 11 km of each other, as we can see on the resulting map:
library(ggplot2)
ggplot(rnaturalearth::ne_countries(scale = 10, returnclass = 'sf')) +
geom_sf() +
geom_point(data = df, aes(x = longitude, y = latitude, color = group)) +
coord_sf(xlim = c(71.5, 72.5), ylim = c(-6, -5))
Created on 2023-09-26 with reprex v2.0.2
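One caveat on the graph approach: connected components chain neighbours together, so two stations in the same group can be more than 11 km apart as long as a path of sub-11 km hops joins them. If the requirement is read strictly (every pair inside a group at most 11 km apart), one possible alternative, not taken from the answer above, is complete-linkage hierarchical clustering cut at 11 km. A minimal base R sketch on hypothetical planar coordinates in metres:

```r
# Hypothetical points on a line (metres): gaps of 7 km, 8 km and 7 km
xy <- cbind(x = c(0, 7000, 15000, 22000), y = c(0, 0, 0, 0))
d <- dist(xy)  # pairwise Euclidean distances

# Single linkage cut at 11 km chains neighbours together, like the
# connected-components grouping: all four points end up in one group.
single_groups <- cutree(hclust(d, method = "single"), h = 11000)

# Complete linkage cut at 11 km only keeps merges whose farthest pair is
# at most 11 km apart, so every within-group distance respects the limit.
complete_groups <- cutree(hclust(d, method = "complete"), h = 11000)

single_groups
complete_groups
```

For the station data, the same calls would presumably start from `as.dist()` applied to the numeric distance matrix rather than from `dist()` on raw coordinates. The number of clusters still does not need to be fixed in advance, since only the cut height is specified.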
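The time needed to compute st_distance grows with the square of the number of points, so it can get slow: on my machine it takes 14 seconds for 3k points. The solution below is over 20 times faster for 3k points, at about 0.5 seconds (this may of course depend on other factors as well, but it is still worth noting):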
library(sf)
library(ggplot2)
## 3k random points in a hypothetical UTM zone
df <- data.frame(x = runif(3000, 500000, 640000),
                 y = runif(3000, 5000000, 5140000))
## convert to points in a hypothetical
## area in UTM zone 32 (epsg=32632)
points <- st_as_sf(df, coords = c("x", "y"), crs = 32632)
## buffer each point by 1.5 km; overlapping buffers then link
## points that are less than about 3 km apart
areas <- st_buffer(points, 1500, nQuadSegs = 4)
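## merge overlapping buffers, then split the result into one polygon per cluster
dissolved.areas <- st_cast(st_union(areas), "POLYGON")
dissolved.areas <- sf::st_as_sf(dissolved.areas)
dissolved.areas$id <- seq_len(nrow(dissolved.areas))
## tag each point with the id of the polygon it falls in
points.with.cluster.id <- st_join(points, dissolved.areas)
df$group <- factor(points.with.cluster.id$id)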
ggplot() +
  geom_sf(data = dissolved.areas) +
  geom_point(data = df, aes(x = x, y = y, color = group),
             size = 1, show.legend = FALSE) +
  theme_bw()
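To adapt the buffer approach to the 11 km threshold from the question, the buffer radius would be half the threshold, since two circular buffers overlap when their centres are closer than twice the radius. A minimal sketch on four made-up projected points (the coordinates and the names `toy`, `threshold` and `blobs` are hypothetical; longitude/latitude data such as the station coordinates would first need `st_transform()` to a metre-based CRS):

```r
library(sf)

# Hypothetical projected coordinates in metres (EPSG:32632, as above):
# three points 8 km apart in a row, and a fourth one 24 km further east.
toy <- data.frame(x = c(500000, 508000, 516000, 540000),
                  y = c(5000000, 5000000, 5000000, 5000000))
pts <- st_as_sf(toy, coords = c("x", "y"), crs = 32632)

threshold <- 11000
# Buffers of radius threshold / 2 overlap when two points are closer than
# roughly the threshold (the buffers are polygon approximations of circles).
blobs <- st_cast(st_union(st_buffer(pts, threshold / 2)), "POLYGON")
blobs <- st_sf(id = seq_along(blobs), geometry = blobs)

# Each point inherits the id of the merged polygon that contains it
toy$group <- st_join(pts, blobs)$id
toy
```

The first three points should share one group id and the fourth should get its own. As with the graph approach, the groups are chained: points 1 and 3 are 16 km apart but still end up together.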