我在 R 中有一组矢量图层,我只想保留每层中与所有层相交的那些多边形。换句话说,只有当多边形与其他所有图层中的多边形相交时,才应保留该多边形。我有 8 层,每个 .gpkg 文件大约 33 MB。
以下代码在应用于一小部分图层时工作正常,但是当我尝试处理整个数据集时,它已经运行了 2 天多了而没有完成。
有谁知道可能导致速度减慢的原因或我如何提高计算速度?
# Function to check if a polygon intersects with all other layers
check_intersection_all_layers <- function(polygon, other_layers) {
for (other_layer in other_layers) {
intersection <- st_intersects(polygon, other_layer, sparse = FALSE)
if (!any(intersection)) {
return(FALSE)
}
}
return(TRUE)
}
# Function to filter a layer, keeping only polygons that intersect with at least one polygon in all other layers
filter_layer_by_intersections <- function(layer, other_layers) {
keep_indices <- sapply(1:nrow(layer), function(i) {
check_intersection_all_layers(layer[i, ], other_layers)
})
filtered_layer <- layer[keep_indices, ]
return(filtered_layer)
}
# Apply the filtering to all layers
filtered_layers_list <- lapply(1:length(filtered_polygons_class2_3), function(i) {
current_layer <- filtered_polygons_class2_3[[i]]
other_layers <- filtered_polygons_class2_3[-i]
filter_layer_by_intersections(current_layer, other_layers)
})
在这种情况下,我发现以较小的块处理数据有助于加快速度。此外,如果您的数据未进行投影,这会给某些 R 函数(例如
st_intersection()
)增加大量开销。例如,在该 reprex 中对 gpkg 文件进行子集化的 for()
循环仅在使用 WGS84/EPGS:4326 数据的前两次迭代中就花费了约 10 分钟,而 WGS84 Pseudo Mercator/EPGS:3857 等效耗时 < 3 mins for all four iterations.
此工作流程假设您的数据未被投影,因此包含 org_crs 和 prj_crs 变量。如果您的数据已投影,请将两者设置为与实际数据相同,或删除
st_transform()
的所有实例。您可能需要做的其他修改:
for()
循环之前将唯一的 id (poly_id) 添加到您的 gpkg 文件中(请参阅用于创建示例数据的代码以了解如何执行此操作)。需要此列来重新组合任何连续和不连续的分割多边形。要添加这些列,请将您的 gpkg 文件添加到 R 中,添加必要的列,然后将它们返回到您的工作目录。
与第2点相关,为了确保每个多边形/多边形都是单行,使用
st_read()
。如果您有面积列,则需要在最后重新计算它。
st_write()
的错误。如果是这样,请运行
summarise()
记下代码注释,特别是
! Loop 0 is not valid: Edge 1 crosses edge 4
和
sf_use_s2(FALSE)
第 1 步:加载包并创建一些与实际数据大小大致相同的示例 gpkg 文件
suppressWarnings({})
第2步:
以n块导入gpkg文件,并根据所有层重叠的程度进行子集化
suppressMessages({})
第 3 步:
library(sf)
library(dplyr)
# Define variables representing the CRS of your original data
org_crs <- 4326
# Define variable to project data e.g. WGS84 Pseudo Mercator/EPSG:3857
prj_crs <- 3857
# Skip the following if you already have gpkg files
set.seed(42)
x <- 170
y <- 85
cellsize <- .7
for(i in 1:7) {
tmp <- st_bbox(c(xmin = x * -1, xmax = x, ymin = y * -1, ymax = y), crs = org_crs) %>%
st_make_grid(cellsize = cellsize, square = FALSE) %>%
st_as_sf() %>%
mutate(poly_id = 1:n(), # IMPORTANT
var = runif(n(), 0, 1))
if (file.exists(paste0("data/sf_poly", i, ".gpkg"))) {
file.remove(paste0("data/sf_poly", i, ".gpkg"))
}
st_write(tmp, paste0("data/sf_poly", i, ".gpkg"))
x <- x - 5
y <- y - 5
cellsize <- cellsize - 0.03125
rm(tmp)
}
# Create sf_poly8 in RAM and subset to replicate sf with non-contiguous polygons
sf_poly8 <- st_bbox(c(xmin = x * -1, xmax = x, ymin = y * -1, ymax = y),
crs = org_crs) %>%
st_make_grid(cellsize = cellsize, square = FALSE) %>%
st_as_sf() %>%
mutate(poly_id = 1:n(), # IMPORTANT
var = runif(n(), 0, 1))
# Create sf to subset sf_poly8
sf_subset <- st_bbox(c(xmin = x * -1, xmax = x, ymin = y * -1, ymax = y),
crs = org_crs) %>%
st_make_grid(cellsize = 10, square = FALSE) %>%
st_as_sf()
# Filter the data frame based on the sampled rows
sf_subset <- sf_subset[sample(seq_len(nrow(sf_subset)), size = 50), ]
# Subset sf_poly8 and write to directory
sf_poly8 <- sf_poly8 %>%
mutate(tmp = as.character(st_intersects(., sf_subset))) %>%
filter(tmp != "integer(0)") %>%
select(-tmp)
file.remove("data/sf_poly8.gpkg")
st_write(sf_poly8, "data/sf_poly8.gpkg")
rm(sf_poly8)
会在输出中创建点和线串要素。这些将使用 # List all gpkg files in folder named "data" in your working directory
# You can either copy relevant gpkg files into a "data" folder you create in your
# working directory, or modify to point to the location of the gpkg files
gpkg_files <- list.files("data", pattern = ".gpkg$", full.names = TRUE)
# Define the bounding box coordinates (currently WGS84, edit to match your data)
xmins <- c(-180, -90, 0, 90)
#xmin <- -90
xmaxs <- c(-90, 0, 90, 180)
#xmax <- -0
ymin <- -90
ymax <- 90
# Important: run iter before running for() each time
iter <- 0
# Iterate over bounding box vectors to read subsetted gpkg
for(i in 1:length(xmins)) {
# Create a WKT bbox to subset sf objects
bbox_wkt <- sprintf("POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))",
xmins[i], ymin, xmins[i], ymax, xmaxs[i],
ymax, xmaxs[i], ymin, xmins[i], ymin)
# Subset gpkg files from directory, project to prj_crs
invisible(
capture.output({
suppressMessages({
suppressWarnings({
lapply(seq_along(gpkg_files), function(i) {
sf_obj <- st_read(gpkg_files[i], wkt_filter = bbox_wkt) %>%
st_transform(prj_crs) %>%
st_set_geometry("geometry")
# Add projected sf to global environment
assign(paste0("sf_poly", i), sf_obj, envir = .GlobalEnv)
})
})
})
})
)
# Use summarise() to create simplifed version of sf_poly* objects
sf_list <- lapply(mget(ls(pattern = "^sf_poly")), summarise)
# Combine all simplified sf_poly* objects, filter to only where all overlap
# Note suppressed warning
sf_all <- suppressWarnings({
bind_rows(sf_list) %>%
st_intersection() %>%
filter(n.overlaps == 8) %>%
st_cast("POLYGON")
})
# Set counter for each iteration
iter <- iter + 1
# Create a list of sf_poly* objects
sf_list <- mget(ls(pattern = "^sf_poly"))
# Subset each sf_poly* object using sf_all and st_intersection(), output results
# Note suppressed warning
invisible(suppressWarnings({
lapply(seq_along(sf_list), function(i) {
subset_sf <- st_intersection(sf_list[[i]], sf_all) # %>%
assign(paste0("sf", i, "_iter", iter), subset_sf, envir = .GlobalEnv)
})
}))
}
st_intersection()
、
filter()
和
st_collection_extract()
组合连续和非连续多边形的同时处理这些特征。
summarise()
检查一个组合 sf 的结果,注意最后n 行中的 NULL/NA 值。这些要素要么是分割多边形(现在是多边形),要么是几何集合。我喜欢保留这些列,因为它可以跟踪处理过程中发生的情况。如果不需要,请删除 n.overlaps 和 origins 列
st_union()