Filter a variable by number of records in a data frame with multiple variables


I have a data frame of 2461 observations and 80 variables retrieved from BOLD.

Scleractinia <- read_tsv("http://www.boldsystems.org/index.php/API_Public/combined?taxon=Scleractinia&format=tsv")

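I am in the middle of filtering this data frame. So far I have filtered it by "markercode" and "nucleotides". I would like to filter it further by keeping only the "species_name" values that have more than 5 records.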

Scleractinia.COI5P <- Scleractinia %>%
  filter(markercode == "COI-5P") %>%
  filter(str_detect(nucleotides, "[ACGT]"))
#This is a subset of the main dataset that includes only records with the marker code "COI-5P" and nucleotide sequences.

unique(Scleractinia.COI5P$species_name)
#There are 479 unique species present in this dataset. This is too many to work with so we are going to filter out species that don't have more than 5 records. 

SpeciesCount <- table(Scleractinia.COI5P$species_name)
#This creates a table of species and the number of records available in the dataset for this species. 

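I created "SpeciesCount" to decide on the 5-record threshold, because many species have only 1 record. I don't know how to filter Scleractinia.COI5P so that all 80 variables (i.e. columns) remain available.

I tried: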

test <- Scleractinia.COI5P %>%
  filter(table(Scleractinia.COI5P$species_name) > 5)

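But this resulted in 0 observations of 80 variables. Basically, I want to keep all 80 variables so that I can explore further what needs filtering, but I want Scleractinia.COI5P to contain only species with 5 or more records.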

r
1 Answer

With dplyr, you only need to change your pipeline slightly: group by species name, then filter.

library(tidyverse)

##Filter first
Scleractinia.COI5P <- Scleractinia %>%
  filter(markercode == "COI-5P") %>%
  filter(str_detect(nucleotides, "[ACGT]"))


## Group by species, then keep only groups with at least 5 records
filtered_data_frame <- Scleractinia.COI5P %>%
  group_by(species_name) %>%
  filter(n() >= 5)

## Check that only species with at least 5 records remain
## (count() on the still-grouped data gives one row per species)
total_species <- count(filtered_data_frame, sort = TRUE)
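To see why the original attempt cannot work as a row filter, and how the table from the question could be used instead, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for illustration and are not BOLD data; only `species_name` and `markercode` mirror column names from the question.

```r
library(dplyr)

# Hypothetical stand-in for Scleractinia.COI5P (illustrative values only)
toy <- data.frame(
  species_name = c(rep("Acropora tenuis", 6),
                   rep("Porites lobata", 5),
                   "Favia fragum"),
  markercode   = "COI-5P",
  record_id    = seq_len(12),
  stringsAsFactors = FALSE
)

# table() returns one count per species, not one value per row,
# so its result does not line up with the rows that filter() tests
species_count <- table(toy$species_name)
length(species_count)  # 3 (one entry per species)
nrow(toy)              # 12 (one entry per record)

# Base R: look up which species reach the threshold, then subset rows;
# all columns are kept
keep_species <- names(species_count)[species_count >= 5]
kept_base <- toy[toy$species_name %in% keep_species, ]

# dplyr: the same rows via group_by() + n(), dropping the grouping afterwards
kept_dplyr <- toy %>%
  group_by(species_name) %>%
  filter(n() >= 5) %>%
  ungroup()
```

Both approaches keep every column and differ only in how the per-species count is obtained. One thing to watch for on real data: `group_by()` treats `NA` as its own group, so if `species_name` has missing values you may want to drop them first with `filter(!is.na(species_name))`.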