[Bioperl-l] Problems parsing blast reports

Chris Fields cjfields at illinois.edu
Mon May 16 20:58:10 UTC 2011


Agreed, though I would rather screen based on the actual statistic I wanted than assume the first one will always give the best of anything (I'm a bit of a devil's advocate in that regard).  Following is some completely untested and probably incorrect code (batteries not included, offer void in all 50 states), but you get the idea:

=================================

use List::Utils qw(reduce);
...

while( my $result = $in->next_result) {
   my $best_hit = reduce {$a->significance < $b->significance ? $a : $b } $result->hits;
   if (defined($best_hit)) {
       my $best_hsp = reduce {$a->evalue < $b->evalue ? $a : $b } $best_hit->hsps;
       # do earth-shattering stuff here
   }
}

=================================

chris

On May 16, 2011, at 3:14 PM, Dave Messina wrote:

> Hi Lorenzo,
> 
> 
> Hsps for a given hit are also ordered by score so is it really necessary to
>> loop?.
>> 
> 
> If you know you'll always be interested in the highest-scoring hsp for a
> hit, and if you're not using anything in the hsp object. then yes, it should
> always come first and I don't think you'll need the hsp loop.
> 
> Glad to hear you've got it working!
> 
> Best,
> Dave
> 
> 
> 
> 
> 
> 
>> Best,
>> Lorenzo
>> PS: Sorry, I forgot to reply all in my last message
>> 
>> while( my $result = $in->next_result)
>>>      {
>>>      $total++;
>>>      my $query = $result->query_name();
>>>          while (my $hit = $result->next_hit)
>>>          {
>>>              my $hitname = $hit->name();
>>>            my $bits = $hit->bits();
>>>            my $evalue = $hit->significance();
>>>                if ( $result->num_hits == 1 and $query eq $hitname )
>>>                {
>>>                $cs++;
>>>                $singletons{ $cs } = "$query;$hitname";
>>>                last;
>>> 
>>>                }
>>>                if ( $query ne $hitname and $bits >= $minimumbitsscore and
>>> $evalue <= $thresholdevalue )
>>>                {
>>>                $cb++;
>>>                $besthits{ $cb } = "$query;$hitname";
>>>                last;
>>>                }
>>>                elsif    ( $query ne $hitname and $bits <=
>>> $minimumbitsscore and $evalue >= $thresholdevalue)
>>>                {
>>>                $cd++;
>>>                $diverged { $cd } = "$query;$hitname";
>>>                last;
>>>                }
>>> #            while( my $hsp = $hit->next_hsp )    {
>>> #            #last;
>>> #            }
>>>        }
>>>    }
>>> 
>>> On Mon, May 16, 2011 at 2:40 PM, Dave Messina <David.Messina at sbc.su.se>wrote:
>> 
>>> Hi Lorenzo,
>>> 
>>> Please remember to reply all so the mailing list sees the whole
>>> discussion.
>>> 
>>> I looked at your code and used your data, and the reason you're not always
>>> getting the best hit in your output is that subsequent hits which also pass
>>> your filter are writing over the best hit in your hash. So it's not a random
>>> hit that's getting saved, it's the last one that matched your criteria.
>>> 
>>> To fix this, you need to short-circuit out of the loop once you've found a
>>> hit matching your criteria.
>>> 
>>> Label the next_result loop like:
>>> 
>>>    RESULT: while ( my $result = $in->next_result ) {
>>> 
>>> and then have the last line inside your filtering if block be
>>> 
>>>    next RESULT;
>>> 
>>> Assuming "best hit" to you means best e-value, the above works since hits
>>> are sorted by best to worst e-value.
>>> 
>>> 
>>> I'd also like to get the queries only returning itself as hits
>>>> (singletons) but I must make run besthits properly before (any suggestion?).
>>> 
>>> 
>>> For this, I'd suggest that you actually save the best hit regardless of
>>> whether it's a self-match, and then if there's a hit passing your criteria
>>> that isn't a self-match, you test for that and replace the previously saved
>>> self-match in %besthits. (Again, short-circuiting out of the loop once
>>> you've got that hit.)
>>> 
>>> 
>>> For the moment I'm only testing bitscore and thresholdevalue so I don't
>>>> need to loop over hsp (do I?).
>>> 
>>> 
>>> No, remember each line in the -m8 output is an hsp, not a hit. So if you
>>> only loop over the hits, you'll only see the first hsp. In your example
>>> dataset, there weren't any with multiple hsps, so that's why you didn't run
>>> into this problem.
>>> 
>>> 
>>> Dave
>>> 
>>> 
>>> 
>>> 
>>> On Sat, May 14, 2011 at 22:17, Lorenzo Carretero <lcpaulet at googlemail.com
>>>> wrote:
>>> 
>>>> Hi David,
>>>> Thanks for your reply.
>>>> My scripts is within a subroutine within a .pm file called from a .pl
>>>> script. I checked the arguments are being passed correctly. For the moment
>>>> I'm only testing bitscore and thresholdevalue so I don't need to loop over
>>>> hsp (do I?). Of course, I'm using the strict and warning pragmas and use
>>>> Bio::SearchIO; I tried with a reduced version of my BLASTP output (attached
>>>> as a file). This is what I get in %besthits:
>>>> 
>>>>> gnl|Alyrata|AL2G21220,gnl|Alyrata|AL6G21100
>>>>> gnl|Alyrata|AL0G11620,gnl|Alyrata|AL1G37580
>>>>> gnl|Alyrata|AL6G05070,gnl|Alyrata|AL3G15690
>>>>> gnl|Alyrata|AL2G12090,gnl|Alyrata|AL4G18800
>>>>> gnl|Alyrata|AL1G15460,gnl|Alyrata|AL0G01870
>>>>> 
>>>> 
>>>> My script only returns the correct hits for AL6G05070 and AL0G11260,
>>>> while the 3d for AL2G21220, the 17th for AL2G12090 and the 13th for
>>>> AL1G15460. Note that the best non-self hit for AL2G12090 is not the second
>>>> one but the first as it shows better score than the self-hit. I'd also like
>>>> to get the queries only returning itself as hits (singletons) but I must
>>>> make run besthits properly before (any suggestion?).
>>>> Thanks again,
>>>> Lorenzo
>>>> 
>>>> On Sat, May 14, 2011 at 6:15 PM, Dave Messina <David.Messina at sbc.su.se>wrote:
>>>> 
>>>>> Hi Lorenzo,
>>>>> 
>>>>> Hmm, I tried your code, and it worked for me.
>>>>> 
>>>>> I set your input variables set as follows:
>>>>> my ( $filename, $format, $minimumbitsscore, $thresholdevalue,
>>>>> $minimumidentity,
>>>>>    $maximumredundancy, $minimumalnlength )
>>>>>  = ('2008.blasttable', 'blasttable', 300, '1e-100', 10, 0, 10);
>>>>> 
>>>>> where the file is t/data/2008.blasttable that comes with the BioPerl
>>>>> distro. With those settings, the top hit goes into %besthits and the third
>>>>> hit goes into %diverged.
>>>>> 
>>>>> I also added at the top:
>>>>> use strict;
>>>>> use warnings;
>>>>> use Bio::SearchIO;
>>>>> 
>>>>> You *are* using strict and warnings, aren't you? :)
>>>>> 
>>>>> One thing that may be an issue is that you're doing this at the hit
>>>>> level, and remember that in -m8 format each line represents an HSP.
>>>>> 
>>>>> I'd need your input — really, a neat little test case and the
>>>>> corresponding erroneous output — to help further.
>>>>> 
>>>>> Dave
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sat, May 14, 2011 at 17:29, Lorenzo Carretero <
>>>>> lcpaulet at googlemail.com> wrote:
>>>>> 
>>>>>> Hi all,
>>>>>> I'm trying to parse blasttable '-m 8 format' reports from whole
>>>>>> intra-genome
>>>>>> comparisons of all vs all to get the best non-self hit and diverged
>>>>>> best
>>>>>> non-self hit as separate hashes (key=query=>value=non_self_hit). The
>>>>>> following script seems to run ok but it returns unexpected results
>>>>>> (i.e. it
>>>>>> doesn't catch the best hit but a random hit apparently). I assume hits
>>>>>> are
>>>>>> iterated over the while loop (while (my $hit = $result->next_hit) as
>>>>>> returned in the blast results (i.e. sorted by hits bit scores).
>>>>>> Any help would be much appreciated.
>>>>>> Cheers,
>>>>>> Lorenzo
>>>>>> 
>>>>>> 
>>>>>> my ( $filename, $format, $minimumbitsscore, $thresholdevalue,
>>>>>>> $minimumidentity, $maximumredundancy, $minimumalnlength ) = @_;
>>>>>>> my ( %besthits, %diverged, %redundant ,%genefusion ) = ();
>>>>>>> my ( $refbh, $refdiv, $cb, $cd, $refred, $reffus, $cr, $cf);
>>>>>>> my $total = 0;
>>>>>>> my $in = new Bio::SearchIO (     -file   => $filename,
>>>>>>>                                 -format => $format,
>>>>>>>                                 #-verbose => -1,
>>>>>>>                                  )
>>>>>>>                                 or die "No $filename BLAST file with
>>>>>>> $format found";
>>>>>>>      while( my $result = $in->next_result)    {
>>>>>>>      $total++;
>>>>>>>          while (my $hit = $result->next_hit)    {
>>>>>>>          my $query = $result->query_name();
>>>>>>>          my $hitname = $hit->name();
>>>>>>>          my $bits = $hit->bits();
>>>>>>>          my $evalue = $hit->significance();
>>>>>>>            if ($query ne $hitname and $bits >= $minimumbitsscore and
>>>>>>> $evalue <= $thresholdevalue )    {
>>>>>>>                $besthits{ $query } = $hitname;
>>>>>>>            }
>>>>>>>            elsif    ( $bits <= $minimumbitsscore and $evalue >=
>>>>>>> $thresholdevalue){
>>>>>>>                $diverged { $query } = $hitname;
>>>>>>>            }
>>>>>>> 
>>>>>>> #                while( my $hsp = $hit->next_hsp )    {
>>>>>>> #                my $querylen = $hsp->length( 'query' );
>>>>>>> #                my $hitlen = $hsp->length( 'hit' );
>>>>>>> #                my $alnlen = $hsp->length( 'total' );
>>>>>>> #                my $identity = $hsp->percent_identity();
>>>>>>> #                    if ( $identity > $maximumredundancy )    {
>>>>>>> #                    $redundant{ $query } = $hitname;
>>>>>>> #                    }
>>>>>>> #                    elsif ( ($querylen <= ($alnlen / 2) ) ||
>>>>>> ($hitlen <=
>>>>>>> ($alnlen / 2) ) || ($alnlen <= $minimumalnlength) )    {
>>>>>>> #                    $genefusion{ $query } = $hitname;
>>>>>>> #                    }
>>>>>>> #                    elsif ( ($evalue >= $thresholdevalue) ||
>>>>>> ($identity <=
>>>>>>> $minimumidentity) || ($bits <= $minimumbitsscore) )    {
>>>>>>> #                    $diverged{ $query } = $hitname;
>>>>>>> #                    }
>>>>>>> #                    else    {
>>>>>>> #                    $besthits{ $query } = $hitname;
>>>>>>> #                    }
>>>>>>> #                }
>>>>>>>            }
>>>>>>>    }
>>>>>>> $refbh = \%besthits;
>>>>>>> $refdiv = \%diverged;
>>>>>>> $refred = \%redundant;
>>>>>>> $reffus = \%genefusion;
>>>>>>> $cb = keys ( %besthits );
>>>>>>> $cd = keys( %diverged );
>>>>>>> $cr = keys( %redundant );
>>>>>>> $cf = keys( %genefusion );
>>>>>>> 
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l





More information about the Bioperl-l mailing list