我对 Perl 和 Bioperl 相当陌生,我正在尝试编写一个脚本来识别相同序列的实例。为了实现这一目标,我设想了一个脚本,它需要 2 个内文件,第一个是 fasta 格式的多重对齐,第二个是将 fasta id 链接到其他相关信息的附件文件。我的方法是使用 Bio::SeqIO 读取多重比对,并将文件内容放入哈希中,其中序列是键,id 是值,或者在序列共享的情况下,id 数组是值.
我认为它应该看起来像这样:
“AATTTGTTGTGTGTACC”=>('Seq1','Seq13'),
“TTTCTCTTTCCCAAAG”=>“Seq2”,
目前,我相信我陷入困境是因为在序列共享的情况下尝试将第二个 id 推入数组时出现错误(即上例中的“Seq13”)。
这是我正在使用的测试多重对齐:
>Seq1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>Seq2
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>Seq13
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
下面是我迄今为止编写的代码:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Data::Dumper;
my $seqs = shift @ARGV or die "please provide a multiple alignment file and an accesory information file: $!\n";
my $info = shift @ARGV or die "please provide a multiple alignment file and an accesory information file: $!\n";
#open(INFO, '<', $info);
my $inseq = Bio::SeqIO->new(
-file => $seqs,
-format => "fasta",
);
my %hts;
while (my $seq = $inseq->next_seq) {
# print $seq->seq(), "\t", $seq->id, "\n";
if (defined $hts{$seq->seq()}) {
print "Sequence already in hash:\t$seq->id\n";
push @{$hts{$seq->seq()}}, ${$seq->id};
}
else {
$hts{$seq->seq()} = $seq->id;
}
print Dumper \%hts
}
因此,我希望得到一些帮助
我收到一个错误,我不太明白,但相信与push语句有关-->
在 ht_sharing.pl 第 24 行第 3 行使用“严格引用”时,不能使用字符串(“Seq1”)作为 ARRAY 引用。
2)当 if 循环之外的 print 语句处于活动状态时,它会按照我认为应该的方式打印 id (即 Seq1),但在 if 循环内的 print 语句中,相同的调用 $seq->id 会生成一个引用(即
Bio::Seq=HASH(0x19e7210)->id
)。为什么是这样?我不明白为什么打印 $seq->id 在同一个 while 循环中有不同的输出。
您的代码非常接近,但有几个小问题。第一个是你想使用语法
if (exists $hash{$key}) { ... }
来查看某个键是否存在,defined
告诉你该值是否已定义。第二件事是您无缘无故地取消引用您的 $seq
对象。
当您在 Bio::SeqIO 对象上调用“next_seq”方法时,它会返回一个 Bio::Seq 对象。如果您对该 Bio::Seq 对象调用“id”方法,它会按预期返回 ID,因此无需遵循任何内容。另外,不需要显式导入 Bio::Seq (这只是注释,不是问题)。
其他评论:
print Dumper %hts;
调用放在 while (my $seq ...)
循环之后(即,在遍历完所有 seq 对象之后)。在浏览文件时转储哈希值在这里并不能提供太多信息。$hts{$seq->seq}++
,然后查看排序后的值以查看是否有重复项。这样会更快。