我正在计算序列的log-odds分数并返回给出最大分数的motif(序列的小部分)。我有代码计算我的文件中每个序列的最高分数,我在存储提供该分数的主题时遇到问题。请参阅我的其他文章的文件格式,对数赔率分数的一般计算,等Perl: Creating and manipulating hash of arrays for log-odds scores of DNA sequences。我的代码如下:
use strict;
use warnings;
use List::Util 'max';
use Data::Dumper;
#USER SPECIFICATIONS
#User specifies motif width
my $width = 3;
#User enters the filename that contains the sequence data
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and store each dna seq in hash
my %id2seq = ();
my %HoA = ();
my %loscore = ();
my %maxscore = ();
my %maxmot = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
if($_ =~ /^>(.+)/)
{
$id = $1; #Stores 'Sequence 1' as the first $id, for example
}
else
{
$HoA{$id} = [ split(//) ]; #Splits the contents to allow for position reference later
$id2seq{$id} .= $_; #Creates a hash with each seq associated to an id number
$maxmot{$id} = (); #Creates empty hash to push motifs to
foreach $id (keys %HoA)
{
for my $len (0..(length($HoA{$id})-$width-1))
{
push @{ $loscore{$id} }, 0;
}
}
push @{ $maxscore{$id} }, -30; #Creates a HoA with each id number to have a maxscore (initial score -30)
}
}
close FILE;
foreach $id (keys %id2seq)
{
print "$id2seq{$id}\n\n";
}
print "\n\n";
#EXPECTATION
#Create log-odds table of motifs
my %logodds;
$logodds{'A'}[0] = 0.1;
$logodds{'A'}[1] = 0.2;
$logodds{'A'}[2] = 0.3;
$logodds{'C'}[0] = 0.2;
$logodds{'C'}[1] = 0.5;
$logodds{'C'}[2] = 0.2;
$logodds{'G'}[0] = 0.3;
$logodds{'G'}[1] = 0.2;
$logodds{'G'}[2] = 0.4;
$logodds{'T'}[0] = 0.4;
$logodds{'T'}[1] = 0.1;
$logodds{'T'}[2] = 0.1;
#MAXIMIZATION
#Determine location for each sequence that maximally
#aligns to the motif pattern
foreach $id (keys %HoA)
{
for my $pos1 (0..length($HoA{$id})-$width-1) #Look through all positions the motif can start at
{
for my $pos2 ($pos1..$pos1+($width-1)) #Define the positions within the motif (0 to width-1)
{
for my $base (qw( A C G T))
{
if ($HoA{$id}[$pos2] eq $base) #If the character matches a base:
{
for my $pos3 (0..$width-1) #Used for position reference in %logodds
{
#Calculate the log-odds score at each location
$loscore{$id}[$pos2] += $logodds{$base}[$pos3];
#Calculate the maximum log-odds score for each sequence
#Find the motif that gives the maximum score for each sequence
$maxscore{$id} = max( @{ $loscore{$id} });
if ($loscore{$id}[$pos2] == $maxscore{$id})
{
push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; #NOT SURE THAT THIS IS THE CORRECT WAY TO PUSH WHAT I WANT
}
}
}
}
}
}
}
print "\n\n";
print Dumper(\%maxmot);
%maxmot
的预期输出应该是这样的:
'Sequence 11' => [ 'C', 'A', 'T'],
'Sequence 14' => ['C', 'T', 'G'], etc.
由于$width = 3
,预期产量应该有三个基数。我得到的输出给了我每个基数的倍数,而不是任何明显的顺序(或至少,我不能注意到一个模式):
'Sequence 11' => [ 'C', 'C', 'C', 'A', 'A', 'A', 'A', 'T', 'A', 'A', 'T', 'T', 'T'],
'Sequence 14' => ['C', 'C', 'T', 'T', 'C', 'C', 'T', 'T', 'T', 'T', 'T'], etc.
我认为这个问题植根于push @{ $maxmot{$id} }, $HoA{$id}[$pos3];
步骤,但我不太确定如何修复它。我试过推动$HoA{$id}[$pos2]
,但这似乎也没有解决我的问题。一如既往,任何和所有的帮助表示赞赏!如果需要,我可以澄清一下,我知道我的问题有点令人费解。先感谢您。
push()
不是问题。从运行你的代码开始,很明显条件$loscore{$id}[$pos2] == $maxscore{$id}
比你期望的更频繁true
。
以下是我在代码审查中会提出的一些问题:
length()
?它不返回数组的长度。for my $base (qw( A C G T)) { if ($HoA{$id}[$pos2] eq $base) {...
只是一种低效的等效my $base = $HoA{$id}[$pos2];
方式?$pos2
的计算执行$pos2 + 1
次,即一次为0
,两次为1
,...即后来的位置获得更高的分数。$loscore{$id}[$pos2]
的一个计算是@{ $logodds{$base} }
的总和,即在计算时忽略位置$pos2 + $pos3
的基数$maxscore{$id}
,然后使用条件中的变化值$width
基地长,但你的代码只存储单个基地到%maxmot
我正在做一个有根据的猜测,并建议以下是正确的算法。我正在使用您在上一个问题中给出的3个序列。我还倾倒了其他2个哈希值,以便计算结果可见。
我冒昧地重写你的代码,使其更加简洁明了。但是您应该能够将新代码中的行追溯到原始代码。
#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'max';
use Data::Dumper;
my $width = 3;
my %HoA;
my %maxpos;
my %loscore;
my $id = '';
while (<DATA>) {
if (/^>(.+)/) {
$id = $1;
} else {
$HoA{$id} = [ split(//) ];
$maxpos{$id} = @{ $HoA{$id} } - $width - 1;
$loscore{$id} = [ (0) x ($maxpos{$id} + 1) ];
}
}
my %logodds = (
A => [0.1, 0.2, 0.3],
C => [0.2, 0.5, 0.2],
G => [0.3, 0.2, 0.4],
T => [0.4, 0.1, 0.1],
);
#MAXIMIZATION
my %maxscore;
my %maxmot;
# Calculate the log-odds score at each location
foreach $id (keys %HoA) {
for my $index (0..$maxpos{$id}) {
for my $offset (0..$width-1) {
# look at base in sequence $id at $offset after $index
my $base = $HoA{$id}[$index + $offset];
$loscore{$id}[$index] += $logodds{$base}[$offset];
}
}
}
# Calculate the maximum log-odds score for each sequence
foreach $id (keys %HoA) {
$maxscore{$id} = max( @{ $loscore{$id} });
}
# Find the motif that gives the maximum score for each sequence
foreach $id (keys %HoA) {
for my $index (0..$maxpos{$id}) {
if ($loscore{$id}[$index] == $maxscore{$id}) {
# motif of length $width
my $motif = join('', @{ $HoA{$id} }[$index..$index + $width - 1]);
$maxmot{$id}->{$motif}++;
}
}
}
print Data::Dumper->Dump([\%loscore, \%maxscore, \%maxmot],
[qw(*loscore *maxscore *maxmot)]);
exit 0;
__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
测试运行:
$ perl dummy.pl
%loscore = (
'Sequence_1' => [
'1.2',
'0.8',
'0.6',
'0.8',
'0.5',
'0.8',
'1',
'0.8',
'0.4',
'0.5',
'0.8',
'0.7',
'0.5',
'0.9',
'0.6',
'0.4',
'0.3',
'0.6',
'0.8',
'0.7',
'0.4',
'1.2',
'0.5',
'0.3',
'0.6',
'0.7',
'1.1',
'0.8',
'0.4',
'0.7',
'1',
'0.5',
'1.1',
'1',
'0.6',
'0.7',
'0.5',
'1.1',
'0.8'
],
'Sequence_2' => [
'0.9',
'1',
'0.6',
'1',
'0.6',
'1.1',
'0.8',
'0.5',
'1',
'1.1',
'0.6',
'1',
'0.9',
'0.8',
'0.5',
'1.1',
'0.8',
'0.5',
'1.1',
'0.9',
'0.9',
'1.1',
'0.8',
'0.6',
'0.6',
'1.2',
'0.6',
'0.7',
'0.7',
'0.9',
'0.7',
'0.7',
'0.7',
'1',
'0.6',
'0.6',
'1.1',
'0.8',
'0.7'
],
'Sequence_3' => [
'1.3',
'0.7',
'0.7',
'0.8',
'0.9',
'0.8',
'0.5',
'1',
'0.7',
'1',
'0.8',
'0.8',
'0.5',
'0.8',
'0.8',
'0.6',
'0.7',
'0.4',
'1.2',
'0.8',
'0.7',
'0.9',
'0.8',
'0.7',
'0.8',
'1',
'0.6',
'0.9',
'0.8',
'0.4',
'0.6',
'1.2',
'0.8',
'0.5',
'1',
'1',
'0.8',
'0.7',
'0.7',
'1.1',
'0.7',
'0.7'
]
);
%maxscore = (
'Sequence_1' => '1.2',
'Sequence_3' => '1.3',
'Sequence_2' => '1.2'
);
%maxmot = (
'Sequence_3' => {
'TCG' => 1
},
'Sequence_2' => {
'TCA' => 1
},
'Sequence_1' => {
'TCA' => 2
}
);
这看起来更接近您的预期输出。但当然,我的猜测可能会完全消失......
如果我正确理解了对数分数计算,那么每个主题的得分是一个常数,因此可以预先计算。这将导致以下更直接的方法:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $width = 3;
my %logodds = (
A => [0.1, 0.2, 0.3],
C => [0.2, 0.5, 0.2],
G => [0.3, 0.2, 0.4],
T => [0.4, 0.1, 0.1],
);
# calculate log score for each motif combination
my $motif_score = {'' => 0}; # start with a 0-length motif
foreach my $offset (0..$width - 1) {
my %scores;
# for all motifs...
foreach my $prefix (keys %{ $motif_score }) {
my $base_score = $motif_score->{$prefix};
# ... add another base to the motif
for my $base (qw(A G C T)) {
$scores{"${prefix}${base}"} = $base_score + $logodds{$base}[$offset];
}
}
# store the scores for the new sequences
$motif_score = \%scores;
}
#print Data::Dumper->Dump([$motif_score], [qw(motif_score)]);
my $id;
my %maxmot;
while (<DATA>) {
if (/^>(.+)/) {
$id = $1;
} else {
chomp(my $sequence = $_);
my $max = -1;
# run a window of length $width over the sequence
foreach my $index (0..length($sequence) - $width - 1) {
# extract the motif at $index from sequence
my $motif = substr($sequence, $index, $width);
my $score = $motif_score->{$motif};
# update max score if the motif has a higher score
if ($score > $max) {
$max = $score;
$maxmot{$id} = $motif;
}
}
}
}
print Data::Dumper->Dump([\%maxmot], [qw(*maxmot)]);
exit 0;
__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
测试运行:
$ perl dummy.pl
%maxmot = (
'Sequence_2' => 'TCA',
'Sequence_3' => 'TCG',
'Sequence_1' => 'TCA'
);
由于每个基序的logscore是常数,按logscore顺序排序的motif列表也将是常量。鉴于该列表,我们只需要找到与给定序列匹配的第一个主题。因此,我们可以在问题上应用高度优化的正则表达式引擎。根据您的实际问题大小,这可能是更有效的解决方案:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw(first pairs);
use Data::Dumper;
my $width = 3;
my %logodds = (
A => [0.1, 0.2, 0.3],
C => [0.2, 0.5, 0.2],
G => [0.3, 0.2, 0.4],
T => [0.4, 0.1, 0.1],
);
my @bases = keys %logodds;
# calculate log score for each motif combination
my $motif_logscore = {'' => 0}; # start with a 0-length motif
foreach my $offset (0..$width - 1) {
my %score;
# for all motifs...
foreach my $prefix (keys %{ $motif_logscore }) {
my $base_score = $motif_logscore->{$prefix};
# ... add another base to the motif
for my $base (@bases) {
$score{"${prefix}${base}"} = $base_score + $logodds{$base}[$offset];
}
}
# update hash ref to new motif scores
$motif_logscore = \%score;
}
#print Data::Dumper->Dump([$motif_logscore], [qw(motif_logscore)]);
my @motifs_sorted =
# list of [<motif>, <regular expression>] array refs
map { [$_->[0], qr/$_->[0]/] }
# sort in descending numerical score order
sort { $b->[1] cmp $a->[1] }
# list of [<motif>, <score>] array refs
pairs %{ $motif_logscore };
#print Data::Dumper->Dump([\@motifs_sorted], [qw(*motifs_sorted)]);
my $id;
my %maxmot;
while (<DATA>) {
if (/^>(.+)/) {
$id = $1;
} else {
my $sequence = $_;
# find the first pair where the regex matches -> store motif
$maxmot{$id} = (
first { ($sequence =~ $_->[1])[0] } @motifs_sorted
)->[0];
}
}
print Data::Dumper->Dump([\%maxmot], [qw(*maxmot)]);
exit 0;
__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
也许你不需要一个数组但是哈希?
将推送更改为
undef $maxmot{$id}{ $HoA{$id}[$pos3] };
它创建一个哈希值(具有未定义的值,因此只有键很重要)。在输出中,我看不到链接问题输入中每个序列的3个以上的键。