从网格点2000米半径内过滤大型数据集

问题描述 投票:0回答:1

我有一个包含 13170180 行的巨大数据框,并且我有一个包含 1107650 个点的网格。我想根据距每个网格点中心 2000 米的半径来过滤数据点。

library(geosphere)
library(dplyr)
library(pbapply)
library(ratser)
library(sf)
library(dplyr)  

# Define the resolution of the grid (in degrees)
grid_resolution <- 1 / 60  # 1 arc-minute resolution

# Create an empty raster grid with the specified extent and resolution
grid <- raster(xmn = xmin, xmx = xmax, ymn = ymin, ymx = ymax, 
               resolution = grid_resolution, crs = "+proj=longlat +datum=WGS84")

# Extract grid cell centers (midpoints)
grid_centers <- coordinates(grid)

# Convert to a data frame for easy manipulation
grid_points<- data.frame(grid_centers)
colnames(grid_points) <- c("long_center", "lat_center")

# Define the radius for filtering (2000 meters)
radius <- 2000

# Initialize an empty list to store the points within 2000m for each grid center
points_within_2000m_list <- vector("list", nrow(grid_points))

# Use pblapply to loop through each grid center with a progress bar
points_within_2000m_list <- pblapply(1:nrow(grid_points), function(i) {
  # Extract the grid center coordinates (latitude and longitude)
  target_lat <- grid_points[i, 2]
  target_long <- grid_points[i, 1]
  
  # Calculate the distance between each data point and the current grid center
  data$distance <- distHaversine(
    matrix(c(data$long, data$lat), ncol = 2),
    matrix(c(target_long, target_lat), ncol = 2)
  )
  
  # Filter data points within 2000 meters of the current grid center
  points_within_2000m <- data %>%
    filter(distance <= radius)
  
  # Return the filtered points
  return(points_within_2000m)
})

# Now, `points_within_2000m_list` contains a list of data frames, 
# where each data frame holds the points within 2000 meters of each grid center

运行此代码需要 28 天才能完成。有什么办法可以跑得更快吗

r performance optimization
1个回答
0
投票

不是一个完整的答案,但我想建议

sf::st_intersects()
函数来查找缓冲区内的点列表。

创建栅格作为网格的起点。

grid_resolution <- 1 / 60
grid <- terra::rast(xmin = 0, xmax = 4, ymin = 0, ymax = 4, 
               resolution = grid_resolution, crs = "+proj=longlat +datum=WGS84")
terra::values(grid) <- 1:terra::ncell(grid)

现在,在每个栅格单元网格周围创建缓冲区:

radius <- 2000

b <- as.data.frame(grid, xy = TRUE)[, 1:2] |>
  terra::vect(geom = c("x", "y"), crs = terra::crs(grid)) |>
  terra::buffer(radius)

现在,是一个样本点而不是您的数据框,但您可以使用

sf::st_as_sf(df, coords = c("X", "Y"), crs = ...)
在 {sf} 包中创建自己的数据框。

points <- terra::spatSample(grid, size = 40000,
                method = "random", xy=TRUE)
points$lyr.1 <- row.names(points)
p <- terra::vect(points, geom = c("x", "y"), crs = terra::crs(grid))

terra::plot(grid)
terra::points(points, cex = 0.1, pch = 10)

现在一些 {sf} 包:我们将缓冲区和样本点转换为

sf
对象并运行
intersects()
来获取相交对象的列表:

b_sf <- sf::st_as_sf(b)
p_sf <- sf::st_as_sf(p)

tictoc::tic()
l_sf <- sf::st_intersects(b_sf, p_sf)
tictoc::toc()
#> 10.082 sec elapsed

l_sf
#> Sparse geometry binary predicate list of length 57600, where the
#> predicate was `intersects'
#> first 10 elements:
#>  1: (empty)
#>  2: 1338, 18981
#>  3: 1338, 16901, 20448
#>  4: 1338, 15566, 20448
#>  5: 20448
#>  6: 7345, 34196
#>  7: 4603, 7345, 14446
#>  8: 4603, 7345, 22396
#>  9: 4603, 32411
#>  10: 1159, 30164, 32411

10 秒用于 50+ k 缓冲区,对于懒惰的笔记本电脑来说还不错。

l_sf[1]
为空,
l_sf[2]
包含点
1338
18981
等。 1016 号缓冲区的样图包含 5 个点:

terra::plot(grid, 
            xlim = c(sf::st_bbox(b_sf[1016, ,])$xmin[[1]]-0.02, sf::st_bbox(b_sf[1016, ,])$xmax[[1]]+0.02),
            ylim = c(sf::st_bbox(b_sf[1016, ,])$ymin[[1]]-0.02, sf::st_bbox(b_sf[1016, ,])$ymax[[1]]+0.02))

b_sf |>
  terra::plot(add = TRUE, lwd=0.5, lty = 3)
b_sf[1016, ,] |>
  terra::plot(add = TRUE, lwd = 2)

p_sf |>
  subset(lyr.1 %in% l_sf[1016][[1]]) |>
  terra::plot(add = TRUE, pch = 10, col = "red")

您也可以使用 {terra} 搜索

points
来查找每个
buffer
,但在我的测试中,它需要更长的时间(按幅度左右)

l <- lapply(seq_len(nrow(b)), function(x) terra::values(terra::intersect(p, b[x, ]))[, 1])

为了进一步减少时间,我建议将网格和数据点划分为更小的(重叠至少 2000 m)块,并尝试 {future} 包或同时尝试更多 R 会话。

创建于 2024 年 10 月 14 日,使用 reprex v2.1.0

© www.soinside.com 2019 - 2024. All rights reserved.