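I have two tables containing DNA sequences from an early stage (TIMEPOINT_1) and a later stage (TIMEPOINT_2). I want to filter the TIMEPOINT_2 table down to the sequences that match a sequence in the TIMEPOINT_1 table at a similarity threshold of 95%. I tried using the "stringdistmatrix" function to build a similarity matrix, but I did not get the expected result. Is there a way to do this in R?
Here is an example of the table structure: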
# Creating df TIMEPOINT_1
sequences <- c(
"ACCTTCAGGCAACCTTCAGGCA",
"ACCTTCGAGCAGCCATCAGGCA",
"ACCCGTCCTAGGATCGATCAGGCA",
"TCGAAGTGCATGCATGCTTACGTA",
"CGTGCAAAGCGTGACGTTAGCGT")
sequence_names <- c("time1_seq1", "time1_seq2", "time1_seq3", "time1_seq4", "time1_seq5")
TIMEPOINT_1 <- data.frame(name = sequence_names, sequence = sequences)
# Creating df TIMEPOINT_2
sequences <- c(
"ACCTTCGGGCAACCTTCAGGCA",
"ACCTTCGTGCGGGCCATCAGGCA",
"ACCCGTCCTAGGATCGATCAGGCA",
"TCGAAGTGCATGCATGCTTAAGTA",
"CGTGCAAAGCGTGACTGCACGTGGT")
sequence_names <- c("time2_seq1", "time2_seq2", "time2_seq3", "time2_seq4", "time2_seq5")
TIMEPOINT_2 <- data.frame(name = sequence_names, sequence = sequences)
Expected result: the TIMEPOINT_2 table, containing only the sequences that match a sequence in the TIMEPOINT_1 table.
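If I understand your goal correctly, I would do a simple inner merge (note that this keeps only sequences that are exactly identical in both tables):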
# Inner join on the sequence column: keeps rows whose sequence appears in both tables
df <- merge(TIMEPOINT_1, TIMEPOINT_2, by = "sequence", all = FALSE)
df
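Since the merge above only finds exact matches, here is a minimal sketch of one way the 95% threshold could be applied. It rests on an assumption that the question does not confirm: "similarity" is taken to be 1 minus the Levenshtein distance divided by the length of the longer sequence. It uses base R's `adist` instead of `stringdistmatrix` so that no extra package is needed. The names `threshold`, `similarity`, and `matched` are only illustrative.

```r
# Self-contained copy of the example data from the question.
TIMEPOINT_1 <- data.frame(
  name = paste0("time1_seq", 1:5),
  sequence = c("ACCTTCAGGCAACCTTCAGGCA",
               "ACCTTCGAGCAGCCATCAGGCA",
               "ACCCGTCCTAGGATCGATCAGGCA",
               "TCGAAGTGCATGCATGCTTACGTA",
               "CGTGCAAAGCGTGACGTTAGCGT"),
  stringsAsFactors = FALSE
)
TIMEPOINT_2 <- data.frame(
  name = paste0("time2_seq", 1:5),
  sequence = c("ACCTTCGGGCAACCTTCAGGCA",
               "ACCTTCGTGCGGGCCATCAGGCA",
               "ACCCGTCCTAGGATCGATCAGGCA",
               "TCGAAGTGCATGCATGCTTAAGTA",
               "CGTGCAAAGCGTGACTGCACGTGGT"),
  stringsAsFactors = FALSE
)

# Levenshtein distance between every TIMEPOINT_2 sequence (rows)
# and every TIMEPOINT_1 sequence (columns).
d <- adist(TIMEPOINT_2$sequence, TIMEPOINT_1$sequence)

# Assumption: normalise by the longer sequence of each pair to get a 0..1 similarity.
len <- outer(nchar(TIMEPOINT_2$sequence), nchar(TIMEPOINT_1$sequence), pmax)
similarity <- 1 - d / len

threshold <- 0.95

# For each TIMEPOINT_2 sequence, find its closest TIMEPOINT_1 sequence.
best_idx <- apply(similarity, 1, which.max)
best_sim <- apply(similarity, 1, max)
keep <- best_sim >= threshold

# TIMEPOINT_2 rows that have a TIMEPOINT_1 match at or above the threshold.
matched <- data.frame(
  TIMEPOINT_2[keep, ],
  matched_to = TIMEPOINT_1$name[best_idx[keep]],
  similarity = best_sim[keep],
  stringsAsFactors = FALSE
)
matched
```

With sequences of 22 to 25 bases, a 95% threshold under this definition tolerates at most one edit, so the result depends heavily on how similarity is normalised. For long sequences or large tables, a dedicated alignment tool would likely scale better than an all-against-all distance matrix.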