Counting unique groups based on overlapping ranges

Question

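My goal is to produce a timeline chart showing the number of unique insect species in different periods. The timeline will be divided into 500-year blocks, with dates in years Before Present (BP) (i.e. 0 - 500, 501 - 1000, etc.). For now, assume a span of 0 - 5000.

I have a dataset of insect samples, each with a species name and a date. The dates are ranges, with the "from" value in one column and the "to" value in another. My problem is that I need to compare each sample's date range against my timeline bins, find the bins that the range overlaps, and count that species only once per bin. I am counting species rather than samples, i.e. the presence of the grouping variable in each timeline bin.

An example of my dataset looks like this: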

species <- 
  c("Aphodius prodromus","Aphodius prodromus","Xestobium rufovillosum","Aphodius ater","Phyllobius calcaratus","Phyllobius calcaratus","Cercyon unipunctatus","Aphodius prodromus","Phyllobius calcaratus","Lochmaea suturalis","Micrelus ericae","Sericus brunneus")

age_older <- 
  c(1550,1550,1550,4100,2119,2119,4100,5000,4347,1476,3564,2567)

age_younger <- 
  c(400,400,400,3808,1974,1974,3808,4000,4051,867,3345,2035)

test = data.frame(species, age_older, age_younger)

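One potential complication is that, in the full dataset, negative age values represent years after 1950, so the more negative the value, the closer it is to today. There will also be some zero-length age ranges (i.e. some dates represent a single year, e.g. 3569 - 3569).

What I want is a data frame containing the predetermined date bins for the timeline chart (for this example: between 0 and 5000), with each row representing a 500-year slice, followed by a count of the unique insect species in that slice.

So something like this:

age_bin	species_count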
0	235
500	678
1000	1456
...	...

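I looked into the IRanges package, which seems very good for counting overlaps in general, but I am struggling to work out how to count according to a grouping variable.

I can produce overall overlap counts using the code below with the IRanges package: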

library(IRanges)
library(plyranges) #plyranges for using dplyr

#species date ranges
ir1 = with(test, IRanges(age_younger, age_older, names = species))

#timeline bin ranges, from 0 - 5000 BP
ir2 = IRanges(start = seq(0,4501,500), end = seq(500,5000,500))

#calculating number of overlaps
overlaps <- ir2 %>% mutate(n_overlaps = count_overlaps(., ir1))   
overlaps

IRanges object with 10 ranges and 1 metadata column:
           start       end     width | n_overlaps
       <integer> <integer> <integer> |  <integer>
   [1]         0       500       501 |          3
   [2]       500      1000       501 |          4
   [3]      1000      1500       501 |          4
   [4]      1500      2000       501 |          5
   [5]      2000      2500       501 |          3
   [6]      2500      3000       501 |          1
   [7]      3000      3500       501 |          1
   [8]      3500      4000       501 |          4
   [9]      4000      4500       501 |          4
  [10]      4500      5000       501 |          1

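While this is a step in the right direction, it does not solve my main problem of linking these overlaps to the number of unique species. Any help is greatly appreciated.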

Answer 1

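Starting from the data you provided, I generate a binned_data tibble and loop over the bins to fill in the information. Also, rather than using your test data directly, I create a test_distinct data frame with duplicate rows removed. This simplifies the process and avoids repeated unique operations.

As well as counting the number of species in each bin, I store the species names as a character vector, with the species in each bin separated by semicolons.

Finally, in my binning, age_bin_start and age_bin_end are both inclusive in the filter. You can adjust that as needed.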

test_distinct <- test |> 
  dplyr::distinct()

binned_data <- tibble::tibble(
  age_bin_start = c(0, seq(501, 4501, by = 500)),
  age_bin_end = seq(500, 5000, by = 500),
  n_distinct_species = 0,
  species_in_bin = NA_character_  # typed NA so the column is character from the start
)

for (i in seq_len(nrow(binned_data))) {
  b_start <- binned_data[i, "age_bin_start", drop = TRUE]  # `drop = TRUE` here to pull the value as a vector from the tibble
  b_end <- binned_data[i, "age_bin_end", drop = TRUE]
  
  species_in_bin <- test_distinct |> 
    dplyr::filter(age_younger <= b_end & age_older >= b_start) |> 
    dplyr::pull(species) |> 
    unique()  # a species with several different date ranges in one bin is still counted once
  
  binned_data[i, "n_distinct_species"] <- length(species_in_bin)
  
  binned_data[i, "species_in_bin"] <- paste(species_in_bin, collapse = ";")
}

Result:

> binned_data
# A tibble: 10 × 4
   age_bin_start age_bin_end n_distinct_species species_in_bin                                                             
           <dbl>       <dbl>              <dbl> <chr>                                                                      
 1             0         500                  2 Aphodius prodromus;Xestobium rufovillosum                                  
 2           501        1000                  3 Aphodius prodromus;Xestobium rufovillosum;Lochmaea suturalis               
 3          1001        1500                  3 Aphodius prodromus;Xestobium rufovillosum;Lochmaea suturalis               
 4          1501        2000                  3 Aphodius prodromus;Xestobium rufovillosum;Phyllobius calcaratus            
 5          2001        2500                  2 Phyllobius calcaratus;Sericus brunneus                                     
 6          2501        3000                  1 Sericus brunneus                                                           
 7          3001        3500                  1 Micrelus ericae                                                            
 8          3501        4000                  4 Aphodius ater;Cercyon unipunctatus;Aphodius prodromus;Micrelus ericae      
 9          4001        4500                  4 Aphodius ater;Cercyon unipunctatus;Aphodius prodromus;Phyllobius calcaratus
10          4501        5000                  1 Aphodius prodromus
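
To make the inclusive boundaries concrete, here is a small sketch of the overlap test used in the filter, written as a standalone helper. The name overlaps_bin, the example values and the extra negative bin are illustrative assumptions, not part of the original data. It suggests that zero-length ranges and negative (post-1950) ages need no special handling, as long as the bins are extended to cover them.

```r
# Hypothetical helper: TRUE when the sample range [younger, older] shares
# at least one year with the bin [b_start, b_end] (both ends inclusive).
overlaps_bin <- function(younger, older, b_start, b_end) {
  younger <= b_end & older >= b_start
}

# A zero-length range (3569 - 3569) falls in exactly one bin.
overlaps_bin(3569, 3569, 3501, 4000)  # TRUE
overlaps_bin(3569, 3569, 3001, 3500)  # FALSE

# A sample dated exactly on a boundary year lands in one bin only,
# because the bins 0 - 500 and 501 - 1000 do not share a year.
overlaps_bin(500, 500, 0, 500)     # TRUE
overlaps_bin(500, 500, 501, 1000)  # FALSE

# Negative ages (after 1950) only match if some bin reaches below zero,
# e.g. an assumed extra bin running from -100 to -1.
overlaps_bin(-60, -20, 0, 500)    # FALSE
overlaps_bin(-60, -20, -100, -1)  # TRUE
```

If you would rather have half-open bins, changing one of the comparisons to a strict < or > is one way to do that.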

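I am using tibble and dplyr as dependencies here, but you do not need these packages. This can be solved in base R, just not as cleanly (in my opinion). You could also extract the unique values with the IRanges package, but I think that package is overkill for this problem.

I hope this helps!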
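
For reference, a base R version might look roughly like the sketch below. It is one possible translation of the loop, not the only way to write it. The names bin_start, bin_end, species_per_bin and binned_base are my own, and the sample data is repeated so the block runs on its own.

```r
# Sample data from the question
species <- c("Aphodius prodromus", "Aphodius prodromus", "Xestobium rufovillosum",
             "Aphodius ater", "Phyllobius calcaratus", "Phyllobius calcaratus",
             "Cercyon unipunctatus", "Aphodius prodromus", "Phyllobius calcaratus",
             "Lochmaea suturalis", "Micrelus ericae", "Sericus brunneus")
age_older   <- c(1550, 1550, 1550, 4100, 2119, 2119, 4100, 5000, 4347, 1476, 3564, 2567)
age_younger <- c(400, 400, 400, 3808, 1974, 1974, 3808, 4000, 4051, 867, 3345, 2035)
test <- data.frame(species, age_older, age_younger)

# Same bins as above: 0 - 500, 501 - 1000, ..., 4501 - 5000
bin_start <- c(0, seq(501, 4501, by = 500))
bin_end   <- seq(500, 5000, by = 500)

# For each bin, keep the unique species whose range overlaps it (inclusive ends)
species_per_bin <- Map(function(b_start, b_end) {
  hit <- test$age_younger <= b_end & test$age_older >= b_start
  unique(test$species[hit])
}, bin_start, bin_end)

binned_base <- data.frame(
  age_bin_start      = bin_start,
  age_bin_end        = bin_end,
  n_distinct_species = lengths(species_per_bin),
  species_in_bin     = vapply(species_per_bin, paste, character(1), collapse = ";")
)
binned_base
```

Because unique() is applied per bin, removing duplicate rows beforehand is not needed in this version.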
