How can I sample uniformly from a data frame with only one representative per group?

Question

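This is the data frame I am working with. I am trying to subsample it so that column V2 (position; min: 1130, max: 4406748) is covered uniformly, while the final sample contains only one representative of each value in column V4 (lineage). In other words, the sampled positions should be evenly distributed, and each group should appear exactly once in the whole sample.

I have tried sorting and binning the data, but I cannot figure out how to sample evenly from the bins in such a way that each lineage appears only once in the resulting data frame.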

library(dplyr)

# sort by position
sorted_barcodes <- tb_profiler_barcodes %>% arrange(V2)
# bin the data into N equal-width bins
binned_sorted <- sorted_barcodes %>%
  mutate(bin = cut(V2, breaks = 150, labels = FALSE))

Any help is much appreciated.

Answer 1

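I don't think you can satisfy both conditions perfectly (unique V4 and uniformly distributed V2). If you relax the "uniformly distributed" condition a little, you can guarantee that V4 is unique. One way is to use the strata function from the sampling package. This function performs stratified sampling, and you can tell it to draw exactly 1 row for each value of V4. The distribution of V2 should (in theory) be uniform, although any single random draw can still give a poor sample.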

tb_profiler_barcodes <- read.table("tbdb.barcode.bed", sep="\t")

library(sampling)
library(dplyr)

sorted_barcodes <- tb_profiler_barcodes %>% arrange(V2)

size <- length(unique(tb_profiler_barcodes$V4)) # number of strata
n <- nrow(tb_profiler_barcodes)

set.seed(123) # Omit in practice
s <- strata(sorted_barcodes, 
            stratanames="V4",
            size=rep(1, size),  # select only 1 row from each stratum
            method="srswor")

sample <- getdata(sorted_barcodes, s)

Check that V4 is unique:

any(duplicated(sample$V4))  # [1] FALSE

Check the distribution of V2:

plot(sorted_barcodes$V2, rep(1, n), pch=19, 
     xlab="", ylab="", yaxt="n", xaxt="n", col="red")
points(sample$V2, rep(0.8, size), pch=20, col="blue")
legend("topleft", legend=c("Data", "Sample"), 
       col=c("red", "blue"),
       pch=c(19,20), bty="n")

If the distribution does not look uniform enough, repeat the sampling until you get one that looks better.
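The "repeat until it looks better" step could also be automated. The following is only a minimal sketch in base R, not part of the sampling package: the helper names sample_one_per_group, uniformity_score and best_of_n_draws are hypothetical, and the use of the Kolmogorov-Smirnov distance to a uniform distribution as the score is an assumption about what "looks uniform" should mean. It draws one row per V4 value many times and keeps the draw whose V2 values are closest to uniform.

```r
# Hypothetical helper: draw one random row for each distinct value of V4.
sample_one_per_group <- function(df) {
  groups <- split(seq_len(nrow(df)), df$V4, drop = TRUE)
  idx <- vapply(groups,
                function(i) i[sample.int(length(i), 1)],
                integer(1))
  df[idx, , drop = FALSE]
}

# Hypothetical score: Kolmogorov-Smirnov distance between the sampled
# positions and Uniform(lo, hi). Smaller means more evenly spread.
uniformity_score <- function(pos, lo, hi) {
  u <- sort((pos - lo) / (hi - lo))
  n <- length(u)
  max(seq_len(n) / n - u, u - (seq_len(n) - 1) / n)
}

# Repeat the draw n_draws times and keep the most uniform candidate.
best_of_n_draws <- function(df, n_draws = 200) {
  lo <- min(df$V2)
  hi <- max(df$V2)
  best <- NULL
  best_score <- Inf
  for (k in seq_len(n_draws)) {
    cand <- sample_one_per_group(df)
    score <- uniformity_score(cand$V2, lo, hi)
    if (score < best_score) {
      best <- cand
      best_score <- score
    }
  }
  best
}
```

With the data from the question, this would be called as best_of_n_draws(tb_profiler_barcodes), and the result can be checked and plotted in the same way as above. Picking the best of many draws only improves the spread; it still cannot guarantee a perfectly uniform V2.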
