给出以下数据框:
df <- tibble::tribble(
~pass_id, ~km_ini, ~km_fin,
1L, 0.89, 2.39,
2L, 1.53, 3.03,
3L, 21.9, 23.4,
4L, 23.4, 24.9,
5L, 24, 25.5,
6L, 25.9, 27.4,
7L, 36.7, 38.2,
8L, 41.4, 42.9,
9L, 42.1, 43.6,
10L, 45.5, 47
)
由reprex package(v0.3.0)在2020-02-17创建
我需要与[df
的所有行的all符合以下条件的任何50个数字匹配的样本:
>= .750
<= 99.450
< km_ini - .750
> km_fin + .750
我的最佳镜头远远超出了我的期望。首先,我做了一个runif
,然后我enframe
d了,然后尝试了filter
,但是我只是让它在前两个条件下都可以工作。无论如何,我不一定需要将结果作为数据框,它可能是数字向量。
library(tidyverse, verbose = F)
set.seed(42)
sort(runif(100000, 0, 99.450)) %>%
enframe(., "ID", "km") %>%
filter(km >= .750 & km <= 99.450 - .750)
#> # A tibble: 98,467 x 2
#> ID km
#> <int> <dbl>
#> 1 763 0.750
#> 2 764 0.751
#> 3 765 0.751
#> 4 766 0.753
#> 5 767 0.753
#> 6 768 0.754
#> 7 769 0.754
#> 8 770 0.755
#> 9 771 0.755
#> 10 772 0.757
#> # … with 98,457 more rows
编辑:试图直观地显示问题
最终结果必须是一个数字矢量(或df),用于评估整个数据集,而不仅仅是单独的每一行。作为前两行的示例,请参见以下表示形式:
所以,看到了:
这样,马上,前4000米之内的随机数据集只能有3030 + .750的数字。
然后,问题是尝试以编程方式执行此操作,以便评估数据帧的所有行,并且所生成的数字不在上述所有条件之内。
我想我明白。您想要在由距离分隔的间隙中进行采样,而复杂的因素是您不能在标记距离的任何一侧进行750m的采样。
我认为对问题进行更清晰的视觉理解是很有用的。在此图中,x轴表示距离(y轴只是一个“虚拟”轴,因为我们只对x轴感兴趣)。黑条是我们无法采样的“排除区域”。在我们不希望采样的排除区域的两侧也有750m区域,在这里被涂成红色:
因此,本质上,我们希望从该图中x轴的非阴影区域获得均匀的样本。
我的解决方案是先合并重叠的片段,然后创建一个根据每个间隙的大小加权的样本空间,并从该空间中获取50个均匀样本。
这里,我已概括为允许任意限制和样本大小。
sample_space <- function(km_ini, km_fin, km_max = 99.45, buffer = 0.75, n = 50)
{
# Find and merge overlaps
i <- 1
km_ini <- km_ini - buffer
km_fin <- km_fin + buffer
while(i <= length(km_ini))
{
overlaps <- which(km_ini < km_fin[i] & km_fin > km_ini[i])
if(length(overlaps) < 2) {i <- i + 1; next;}
km_ini <- c(km_ini, min(km_ini[overlaps]))
km_fin <- c(km_fin, max(km_fin[overlaps]))
km_ini <- km_ini[-overlaps]
km_fin <- km_fin[-overlaps]
i <- 1
}
# Create a matrix of appropriate gaps
gaps <- cbind(sort(km_fin), c(sort(km_ini)[-1], km_max))
print(gaps)
# Create a weighted sample space
splits <- c(0, cumsum(apply(gaps, 1, diff)))
# Take a sample within that space
runifs <- runif(n, 0, max(splits))
# Add the appropriate starting value back on
index <- as.numeric(cut(runifs, splits))
runifs - splits[index] + gaps[index, 1]
}
现在我们可以做
sample_space(df$km_ini, df$km_fin)
#> [1] 93.107858 92.216660 83.597703 86.341198 72.258245 86.591883 18.572744
#> [8] 16.641163 73.344658 73.075426 78.230074 97.745802 52.654342 52.298444
#> [15] 70.029034 67.430346 95.328900 62.250864 79.144025 86.344868 7.063474
#> [22] 58.797335 79.304272 54.731057 32.137068 84.837576 94.226992 50.808135
#> [29] 65.987277 76.666933 29.225744 33.309866 13.013735 6.925277 65.207383
#> [36] 91.591293 61.614993 18.646062 97.550237 48.478875 12.860920 20.263471
#> [43] 34.980616 50.583291 15.813562 96.104448 91.310377 53.063613 17.376281
#> [50] 72.511153
[为了显示这是否达到了我们想要的效果,让我们在排除区域的图上绘制样本:
set.seed(69)
sample_df <- data.frame(x = sample_space(df$km_ini, df$km_fin),
y = runif(50, 0.45, 0.55))
ggplot(df) +
geom_rect(aes(xmin = km_ini - 0.75, xmax = km_fin + 0.75, ymin = 0, ymax = 1),
alpha = 0.5, fill = "red") +
geom_rect(aes(xmin = km_ini, xmax = km_fin, ymin = 0, ymax = 1), fill = "black") +
geom_rect(aes(xmin = 0, xmax = 0.75, ymin = 0, ymax = 1), alpha = 0.5) +
geom_rect(aes(xmin = 99.45, xmax = 100, ymin = 0, ymax = 1), alpha = 0.5) +
labs(x = "distance", y = "dummy") +
geom_point(data = sample_df, aes(x = x, y = y), col = "blue")
由reprex package(v0.3.0)在2020-03-01创建