I have a taxonomic dataset that looks like this:
library(data.table)

sample_table <- data.table(Phylum = c("Arthropoda", "Arthropoda", "Arthropoda"),
                           Class = c("Arachnida", "Insecta", "Insecta"),
                           Order = c("Acariformes", "Coleoptera", "Coleoptera"),
                           Family = c(NA, "Staphylinidae", "Staphylinidae"),
                           Genus = c(NA, "Staphylininae", "Staphylininae"),
                           Species = c(NA, NA, "Philonthus"),
                           Site = c(5, 6, 6),
                           Distance = c(0, 0, 0),
                           N = c(58, 1, 5))
# Phylum     Class     Order       Family        Genus         Species    Site Distance N
# Arthropoda Arachnida Acariformes <NA>          <NA>          <NA>       5    0        58
# Arthropoda Insecta   Coleoptera  Staphylinidae Staphylininae <NA>       6    0        1
# Arthropoda Insecta   Coleoptera  Staphylinidae Staphylininae Philonthus 6    0        5
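Phylum, Class, Order, Family, Genus, and Species are nested taxonomic ranks. Site and Distance identify where the specimens were captured. N is the number of specimens captured at a given site and distance within a given taxon.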
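The N value in row 2 (R2) should include the N value of R3, because R2 is a superset that contains R3. At the moment, however, it only counts the specimens that could not be identified beyond the Genus level (those where Species is NA).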
I would like my dataset to look something like this:
# Phylum     Class     Order       Family        Genus         Species    Site Distance N
# Arthropoda <NA>      <NA>        <NA>          <NA>          <NA>       5    0        58
# Arthropoda Arachnida <NA>        <NA>          <NA>          <NA>       5    0        58
# Arthropoda Arachnida Acariformes <NA>          <NA>          <NA>       5    0        58
# Arthropoda <NA>      <NA>        <NA>          <NA>          <NA>       6    0        6
# Arthropoda Insecta   <NA>        <NA>          <NA>          <NA>       6    0        6
# Arthropoda Insecta   Coleoptera  <NA>          <NA>          <NA>       6    0        6
# Arthropoda Insecta   Coleoptera  Staphylinidae Staphylininae <NA>       6    0        6
# Arthropoda Insecta   Coleoptera  Staphylinidae Staphylininae Philonthus 6    0        5
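As you can see, there is now one row for every taxonomic superset within each site-distance pair. Each row's N counts the specimens identified exactly to that row's most specific rank plus the specimens identified more specifically than that rank.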
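I have looked into tidyr::complete() and cross joins, but I am not sure they are what I need. Unlike most examples I can find, I want to complete the dataset without crossing taxonomic groups that sit on separate branches, essentially by adding rows that right-pad the taxonomic columns with NA (though there must be a better way than doing this by hand). I am also not sure how to use these strategies to handle the grouping and summing part of the problem.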
A tidyverse approach:
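library(dplyr)
library(tidyr)

# Sum N for one prefix `g` of the rank columns, per Site and Distance.
my_summ <- \(d, g) {
  d |>
    # keep only rows identified at least down to the last rank in g
    drop_na(all_of(last(g))) |>
    # Distance is part of the capture location, so group by it instead of summing it
    summarise(N = sum(N), .by = c(all_of(g), Site, Distance))
}

# One summary per rank prefix; bind_rows() fills the ranks a piece lacks with NA.
bind_rows(
  my_summ(sample_table, "Phylum"),
  my_summ(sample_table, c("Phylum", "Class")),
  my_summ(sample_table, c("Phylum", "Class", "Order")),
  my_summ(sample_table, c("Phylum", "Class", "Order", "Family")),
  my_summ(sample_table, c("Phylum", "Class", "Order", "Family", "Genus")),
  my_summ(sample_table, c("Phylum", "Class", "Order", "Family", "Genus", "Species"))
) |>
  select(all_of(names(sample_table))) |>
  distinct() |>
  arrange(Site, -N)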
  Phylum     Class     Order       Family        Genus         Species    Site Distance N
1 Arthropoda <NA>      <NA>        <NA>          <NA>          <NA>       5    0        58
2 Arthropoda Arachnida <NA>        <NA>          <NA>          <NA>       5    0        58
3 Arthropoda Arachnida Acariformes <NA>          <NA>          <NA>       5    0        58
4 Arthropoda <NA>      <NA>        <NA>          <NA>          <NA>       6    0        6
5 Arthropoda Insecta   <NA>        <NA>          <NA>          <NA>       6    0        6
6 Arthropoda Insecta   Coleoptera  <NA>          <NA>          <NA>       6    0        6
7 Arthropoda Insecta   Coleoptera  Staphylinidae <NA>          <NA>       6    0        6
8 Arthropoda Insecta   Coleoptera  Staphylinidae Staphylininae <NA>       6    0        6
9 Arthropoda Insecta   Coleoptera  Staphylinidae Staphylininae Philonthus 6    0        5
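The six hand-written calls above all follow one pattern: take the first i rank columns, drop rows that are NA at rank i, and sum N. A minimal sketch of generating those prefixes in a loop is shown below. It assumes R >= 4.1 and dplyr >= 1.1.0 (for `.by`), rebuilds the sample data as a tibble so it runs on its own, and treats Distance as a grouping key. The names `ranks`, `rollup_to`, and `rolled` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question, rebuilt as a tibble so the sketch is self-contained.
sample_table <- tibble(
  Phylum   = c("Arthropoda", "Arthropoda", "Arthropoda"),
  Class    = c("Arachnida", "Insecta", "Insecta"),
  Order    = c("Acariformes", "Coleoptera", "Coleoptera"),
  Family   = c(NA, "Staphylinidae", "Staphylinidae"),
  Genus    = c(NA, "Staphylininae", "Staphylininae"),
  Species  = c(NA, NA, "Philonthus"),
  Site     = c(5, 6, 6),
  Distance = c(0, 0, 0),
  N        = c(58, 1, 5)
)

# Rank columns, ordered from broadest to most specific (assumed to be complete).
ranks <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")

# Hypothetical helper: roll N up to the first `i` ranks.
rollup_to <- function(d, i) {
  keep <- ranks[seq_len(i)]
  d |>
    drop_na(all_of(keep[i])) |>   # only rows identified at least down to rank i
    summarise(N = sum(N), .by = c(all_of(keep), Site, Distance))
}

rolled <- lapply(seq_along(ranks), \(i) rollup_to(sample_table, i)) |>
  bind_rows() |>                  # ranks missing from a piece are filled with NA
  select(all_of(names(sample_table))) |>
  arrange(Site, Distance, desc(N))

print(rolled)
```

Adding or removing a rank then only means editing `ranks`, rather than adding or deleting a call by hand.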