我有一个包含 13170180 行的巨大数据框,并且我有一个包含 1107650 个点的网格。我想根据距每个网格点中心 2000 米的半径来过滤数据点。
library(geosphere)
library(dplyr)
library(pbapply)
library(ratser)
library(sf)
library(dplyr)
# Define the resolution of the grid (in degrees)
grid_resolution <- 1 / 60 # 1 arc-minute resolution
# Create an empty raster grid with the specified extent and resolution
grid <- raster(xmn = xmin, xmx = xmax, ymn = ymin, ymx = ymax,
resolution = grid_resolution, crs = "+proj=longlat +datum=WGS84")
# Extract grid cell centers (midpoints)
grid_centers <- coordinates(grid)
# Convert to a data frame for easy manipulation
grid_points<- data.frame(grid_centers)
colnames(grid_points) <- c("long_center", "lat_center")
# Define the radius for filtering (2000 meters)
radius <- 2000
# Initialize an empty list to store the points within 2000m for each grid center
points_within_2000m_list <- vector("list", nrow(grid_points))
# Use pblapply to loop through each grid center with a progress bar
points_within_2000m_list <- pblapply(1:nrow(grid_points), function(i) {
# Extract the grid center coordinates (latitude and longitude)
target_lat <- grid_points[i, 2]
target_long <- grid_points[i, 1]
# Calculate the distance between each data point and the current grid center
data$distance <- distHaversine(
matrix(c(data$long, data$lat), ncol = 2),
matrix(c(target_long, target_lat), ncol = 2)
)
# Filter data points within 2000 meters of the current grid center
points_within_2000m <- data %>%
filter(distance <= radius)
# Return the filtered points
return(points_within_2000m)
})
# Now, `points_within_2000m_list` contains a list of data frames,
# where each data frame holds the points within 2000 meters of each grid center
运行此代码需要 28 天才能完成。有什么办法可以跑得更快吗
不是一个完整的答案,但我想建议
sf::st_intersects()
函数来查找缓冲区内的点列表。
创建栅格作为网格的起点。
grid_resolution <- 1 / 60
grid <- terra::rast(xmin = 0, xmax = 4, ymin = 0, ymax = 4,
resolution = grid_resolution, crs = "+proj=longlat +datum=WGS84")
terra::values(grid) <- 1:terra::ncell(grid)
现在,在每个栅格单元网格周围创建缓冲区:
radius <- 2000
b <- as.data.frame(grid, xy = TRUE)[, 1:2] |>
terra::vect(geom = c("x", "y"), crs = terra::crs(grid)) |>
terra::buffer(radius)
现在,是一个样本点而不是您的数据框,但您可以使用
sf::st_as_sf(df, coords = c("X", "Y"), crs = ...)
在 {sf} 包中创建自己的数据框。
points <- terra::spatSample(grid, size = 40000,
method = "random", xy=TRUE)
points$lyr.1 <- row.names(points)
p <- terra::vect(points, geom = c("x", "y"), crs = terra::crs(grid))
terra::plot(grid)
terra::points(points, cex = 0.1, pch = 10)
现在一些 {sf} 包:我们将缓冲区和样本点转换为
sf
对象并运行 intersects()
来获取相交对象的列表:
b_sf <- sf::st_as_sf(b)
p_sf <- sf::st_as_sf(p)
tictoc::tic()
l_sf <- sf::st_intersects(b_sf, p_sf)
tictoc::toc()
#> 10.082 sec elapsed
l_sf
#> Sparse geometry binary predicate list of length 57600, where the
#> predicate was `intersects'
#> first 10 elements:
#> 1: (empty)
#> 2: 1338, 18981
#> 3: 1338, 16901, 20448
#> 4: 1338, 15566, 20448
#> 5: 20448
#> 6: 7345, 34196
#> 7: 4603, 7345, 14446
#> 8: 4603, 7345, 22396
#> 9: 4603, 32411
#> 10: 1159, 30164, 32411
10 秒用于 50+ k 缓冲区,对于懒惰的笔记本电脑来说还不错。
l_sf[1]
为空,l_sf[2]
包含点 1338
和 18981
等。 1016 号缓冲区的样图包含 5 个点:
terra::plot(grid,
xlim = c(sf::st_bbox(b_sf[1016, ,])$xmin[[1]]-0.02, sf::st_bbox(b_sf[1016, ,])$xmax[[1]]+0.02),
ylim = c(sf::st_bbox(b_sf[1016, ,])$ymin[[1]]-0.02, sf::st_bbox(b_sf[1016, ,])$ymax[[1]]+0.02))
b_sf |>
terra::plot(add = TRUE, lwd=0.5, lty = 3)
b_sf[1016, ,] |>
terra::plot(add = TRUE, lwd = 2)
p_sf |>
subset(lyr.1 %in% l_sf[1016][[1]]) |>
terra::plot(add = TRUE, pch = 10, col = "red")
您也可以使用 {terra} 搜索
points
来查找每个 buffer
,但在我的测试中,它需要更长的时间(按幅度左右)
l <- lapply(seq_len(nrow(b)), function(x) terra::values(terra::intersect(p, b[x, ]))[, 1])
为了进一步减少时间,我建议将网格和数据点划分为更小的(重叠至少 2000 m)块,并尝试 {future} 包或同时尝试更多 R 会话。
创建于 2024 年 10 月 14 日,使用 reprex v2.1.0