Creating data.table rows for the groups/supersets at every level of a hierarchy

I have a taxonomic dataset that looks like this:

library(data.table)

sample_table <- data.table(
  Phylum   = c("Arthropoda", "Arthropoda", "Arthropoda"),
  Class    = c("Arachnida", "Insecta", "Insecta"),
  Order    = c("Acariformes", "Coleoptera", "Coleoptera"),
  Family   = c(NA, "Staphylinidae", "Staphylinidae"),
  Genus    = c(NA, "Staphylininae", "Staphylininae"),
  Species  = c(NA, NA, "Philonthus"),
  Site     = c(5, 6, 6),
  Distance = c(0, 0, 0),
  N        = c(58, 1, 5)
)

# Phylum       Class      Order        Family           Genus           Species     Site     Distance     N
# Arthropoda   Arachnida  Acariformes  <NA>             <NA>            <NA>        5        0            58
# Arthropoda   Insecta    Coleoptera   Staphylinidae    Staphylininae   <NA>        6        0            1
# Arthropoda   Insecta    Coleoptera   Staphylinidae    Staphylininae   Philonthus  6        0            5

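Phylum, Class, Order, Family, Genus, and Species are hierarchical taxonomic ranks. Site and Distance describe where the samples were captured. N is the number of samples captured at the given site and distance within the given taxon.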

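The N value of row 2 (R2) should include the N value of R3, because R2 is a superset that contains R3. At the moment, however, it only counts the samples that could not be identified beyond the Genus level (those where Species is NA).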

I would like my dataset to look like the following:

# Phylum       Class      Order        Family          Genus           Species     Site    Distance  N
# Arthropoda   <NA>       <NA>         <NA>            <NA>            <NA>        5       0         58
# Arthropoda   Arachnida  <NA>         <NA>            <NA>            <NA>        5       0         58
# Arthropoda   Arachnida  Acariformes  <NA>            <NA>            <NA>        5       0         58
# Arthropoda   <NA>       <NA>         <NA>            <NA>            <NA>        6       0         6
# Arthropoda   Insecta    <NA>         <NA>            <NA>            <NA>        6       0         6
# Arthropoda   Insecta    Coleoptera   <NA>            <NA>            <NA>        6       0         6
# Arthropoda   Insecta    Coleoptera   Staphylinidae   Staphylininae   <NA>        6       0         6
# Arthropoda   Insecta  Coleoptera     Staphylinidae   Staphylininae   Philonthus  6       0         5

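As you can see, there is now one row for every taxonomic superset within each site-distance pair. The N value of each row counts the samples identified exactly to that row's most specific taxonomic rank plus the samples identified beyond that rank.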
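To make the counting rule concrete, here is a minimal check of the genus-level total at Site 6. It is only a sketch: the table obs and the variable genus_total are illustrative names, and obs is a cut-down copy of the sample data holding just the columns needed.

```r
library(data.table)

# Reduced copy of the sample data (illustrative name, not from the original post).
obs <- data.table(
  Genus   = c(NA, "Staphylininae", "Staphylininae"),
  Species = c(NA, NA, "Philonthus"),
  Site    = c(5, 6, 6),
  N       = c(58, 1, 5)
)

# Genus-level total at Site 6: samples identified only to genus (N = 1)
# plus samples identified further, down to species (N = 5).
genus_total <- obs[Site == 6 & Genus == "Staphylininae", sum(N)]
genus_total
```

The two matching rows carry N = 1 and N = 5, so the genus-level row should report 6, while the species-level row keeps its own 5.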

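I have looked into tidyr::complete() and cross joins, but I am not sure they are what I am looking for. Unlike most examples I can find, I want to complete the dataset without crossing taxonomic groups that sit on separate branches, essentially by adding rows in which the taxonomic columns are right-padded with NA (though there must be a better way than doing this manually). I am also not sure how to use these strategies to solve my grouping/summing problem.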

r data.table hierarchical-data taxonomy
1 Answer

A tidyverse approach:

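library(dplyr)
library(tidyr)

# Sum N for every taxon at the rank named by the last element of g.
# Site and Distance are grouping keys, so Distance is kept, not summed.
my_summ <- \(d, g) {
  d |>
    drop_na(all_of(last(g))) |>
    summarise(N = sum(N), .by = all_of(c(g, "Site", "Distance")))
}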

# Summarise at every rank prefix, from Phylum alone down to Species.
bind_rows(
  my_summ(sample_table, "Phylum"),
  my_summ(sample_table, c("Phylum", "Class")),
  my_summ(sample_table, c("Phylum", "Class", "Order")),
  my_summ(sample_table, c("Phylum", "Class", "Order", "Family")),
  my_summ(sample_table, c("Phylum", "Class", "Order", "Family", "Genus")),
  my_summ(sample_table, c("Phylum", "Class", "Order", "Family", "Genus", "Species"))
) |>
  select(all_of(names(sample_table))) |>  # restore the original column order
  distinct() |>
  arrange(Site, desc(N))

Output:

      Phylum     Class       Order        Family         Genus    Species Site Distance  N
1 Arthropoda      <NA>        <NA>          <NA>          <NA>       <NA>    5        0 58
2 Arthropoda Arachnida        <NA>          <NA>          <NA>       <NA>    5        0 58
3 Arthropoda Arachnida Acariformes          <NA>          <NA>       <NA>    5        0 58
4 Arthropoda      <NA>        <NA>          <NA>          <NA>       <NA>    6        0  6
5 Arthropoda   Insecta        <NA>          <NA>          <NA>       <NA>    6        0  6
6 Arthropoda   Insecta  Coleoptera          <NA>          <NA>       <NA>    6        0  6
7 Arthropoda   Insecta  Coleoptera Staphylinidae          <NA>       <NA>    6        0  6
8 Arthropoda   Insecta  Coleoptera Staphylinidae Staphylininae       <NA>    6        0  6
9 Arthropoda   Insecta  Coleoptera Staphylinidae Staphylininae Philonthus    6        0  5
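
Since the question is tagged data.table, the same roll-up can also be sketched without the tidyverse and without writing one call per rank. This is a minimal sketch, not a tested general solution: ranks, rollup_level, and res are names made up for the illustration, and it assumes the taxonomic columns are ordered from the most general rank to the most specific.

```r
library(data.table)

sample_table <- data.table(
  Phylum   = c("Arthropoda", "Arthropoda", "Arthropoda"),
  Class    = c("Arachnida", "Insecta", "Insecta"),
  Order    = c("Acariformes", "Coleoptera", "Coleoptera"),
  Family   = c(NA, "Staphylinidae", "Staphylinidae"),
  Genus    = c(NA, "Staphylininae", "Staphylininae"),
  Species  = c(NA, NA, "Philonthus"),
  Site     = c(5, 6, 6),
  Distance = c(0, 0, 0),
  N        = c(58, 1, 5)
)

# Taxonomic columns, most general first (assumed ordering).
ranks <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")

# Hypothetical helper: total N per taxon at the k-th rank, per site and distance.
rollup_level <- function(dt, ranks, k) {
  keep    <- !is.na(dt[[ranks[k]]])                 # rows identified at least to rank k
  by_cols <- c(ranks[seq_len(k)], "Site", "Distance")
  dt[keep, .(N = sum(N)), by = by_cols]
}

# One summary per rank; fill = TRUE pads the deeper ranks with NA.
res <- rbindlist(
  lapply(seq_along(ranks), function(k) rollup_level(sample_table, ranks, k)),
  fill = TRUE
)

setcolorder(res, names(sample_table))  # original column order
setorder(res, Site, -N)                # same ordering as the tidyverse answer
res
```

On the sample data this should give the same nine rows as the output above. Grouping by Site and Distance at every level keeps separate branches and separate capture locations from being merged.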