[Bioperl-l] Problems with Bio::SearchIO

Dan Bolser dan.bolser at gmail.com
Mon Nov 10 22:29:08 UTC 2008


2008/11/7 Chris Fields <cjfields at illinois.edu>:
> On Nov 7, 2008, at 8:27 AM, Dan Bolser wrote:
>
>> Hi,
>>
>> I'm new to BioPerl, so apologies if I'm doing something wrong. Please
>> just let me know what I'm doing wrong or what you need.
>>
>>
>> I downloaded the latest BioPerl via SVN (revision 14980), and did the,
>>
>> perl Build.PL
>> ./Build test
>> ./Build install
>>
>> (note that I have lib::local installed, pointing things at ~/perl5)
>>
>> I ignored the failed tests, as-per the "everyone who is l88t does
>> this" instructions ;-)
>
> SearchIO (specifically, HSP methods used to calculate seq start/end
> correctly) are being revised in svn, so they are failing at the moment.  My
> local (not-yet-committed) revisions are failing much more, but that's b/c
> the tests are wrong; these should be added in relatively soon once I work
> out more kinks in the code.
>
>> I am trying to parse the output of a "blastall -p blastn" job
>> (blastall 2.2.18) using either the default format or tabular format. I
>> copied the "search_overview.PLS" script found here:
>>
>> ...
>>
>> Looking closer I found that $parser->result_count() only gets set
>> after calling $parser->next_result. Any way to force this? In some
>> Perl objects I've seen a 'parse' method that kicks the object into
>> (silently) calling all its get methods. Is there an equivalent (but
>> apparently undocumented) method? Actually, I think it should kick
>> itself when called... or not? Certainly the docs do not suggest that
>> is won't return a the number of results ("Function: Gets the number of
>> Blast results that have been parsed.") So I think this is a bug.
>
> We could make it so that the result_count() is eager (parses the results and
> reports the total back).  Not sure, but we could optionally cache the
> already-parsed Result objects (that could run into memory issues if one is
> parsing a ton of reports, so it needs to be off by default).

I see (I think). Anyone first calling result_count() and *then*
iterating over the results is getting a performance hit by effectively
parsing the results twice? I would suggest that you make this function
eager, but document the potential performance issue so that people can
choose not to call it first. However, I don't think I can have
understood correctly. How can its value be set correctly after calling
next() only once?


>> Actually I just checked against version 1.4 and "result_count()" fails
>> with the default, the XML or the tabular format results (all three
>> representing the same query against the same database with the same
>> blast version). Note that trying with the XML format in the latest
>> version of BioPerl results in the outright failure:
>>
>> ------------- EXCEPTION: Bio::Root::NotImplemented -------------
>> MSG: Abstract method "Bio::SearchIO::result_count" is not implemented
>> by package Bio::SearchIO::blastxml.
>> This is not your fault - author of Bio::SearchIO::blastxml should be
>> blamed!
>>
>> ...
>
> That's a bug.  Could you file this so we can track it?
>
> http://bugzilla.open-bio.org/

http://bugzilla.open-bio.org/show_bug.cgi?id=2650


>> While parsing and comparing the different formats (and versions) I
>> noticed a couple of other problems that I could not find reported
>> anywhere, so I'll report them here.
>>
>> The appropriate parts of the code are:
>>
>> while( my $r = $parser->next_result ) {
>>  print $r->query_length, "\n";
>>  while(my $h = $r->next_hit ) {
>>   print $h->length, "\n";
>>   my $p  = $h->hsp('best');
>>   print $h->significance, "\t", $h->score, "\t", $h->bits, "\n";
>>  }
>> }
>>
>> It seems that both the "$r->query_length" and the "$h->length" are
>> missing when using the tabular format blast results (using either
>> BioPerl version). I get confused here because there is also some
>> undocumented behavior used in the above script that means a 'hit'
>> (apparently) returns the start and end point of the two most extreme
>> HSPs (and the length of that range).
>
> The HSPs are tiled prior to returning values for these methods.  The tiling
> algorithm works for the most part but still has a few odd issues (one is
> reported in bugzilla).  I am not familiar with that bit of code so can't
> comment, but Sendu or Steve might answer.

I'll also log a bug.


> I try to stay away from using the various hit methods unless necessary and
> usually go with simple HSP coordinates.

OK.


>> Secondly, I found in all but the default format (using either BioPerl
>> version), the "$p->score" is missing (set to an empty string). It
>> shows up fine using the default format. The significance and the
>> bit-score show up fine... or at least they show up... The values look
>> wrong now I come to check. e.g. the score is equal to the bit-score
>> when using the default format with version 1.4, and is often smaller
>> than the bit-score using the latest version (or is $h->bits not the
>> bit score?)
>
> There is some discussion about this in the mail list archives:
>
> http://thread.gmane.org/gmane.comp.lang.perl.bio.general/16273/focus=16296
>
> NCBI BLAST has the hit summary table reporting the raw score(), whereas in
> WU_BLAST the hit table reports bit_score(); in the HSPs I think everything
> is defined.  If you have a minimal hit constructed (data only reported in
> the hit table, no HSP data is reported) some may be undefined.  The hit
> should be calculating the best overall score values from the contained HSPs
> and falling back to directly-set hit table data (set as above).  It is
> possible this may not be happening when parsing other formats so it may be a
> bug (and would be worth filing so we can look into it).

OK. Also I'll have to double check the actual query report! It could
be the problem is there.


>> The closest hits in the mailing list that I could find to these probemes
>> were:
>>
>> http://lists.open-bio.org/pipermail/bioperl-l/2002-May/007936.html
>> http://lists.open-bio.org/pipermail/bioperl-l/2002-September/009586.html
>>
>> but I don't think that they are relevant.
>>
>> Since it comes up here, how is the 'best' HSP defined? it isn't
>> documented as far as I can tell.
>
> 'best' - when comparing HSP data to the summary hit table (in text output
> only), the highest scoring HSP represent the hit (highest score/raw_score,
> lowest evalue).

Which?


>> About the documentation... looking here:
>>
>> http://search.cpan.org/~birney/bioperl-1.4/Bio/SearchIO.pm
>>
>>
>> Several of the structured methods 'blocks' are followed by a "See
>> Bio::..." link to other pages in CPAN. However the 'next_result'
>> method is followed by a link to
>> http://search.cpan.org/~birney/bioperl-1.4/Bio/Root/RootI.pm - I think
>> it should be a link to
>> http://search.cpan.org/~birney/bioperl-1.4/Bio/Search/Result/ResultI.pm
>>
>> Also, it would be nice (especially for noobs) if the full list of
>> accepted format codes were given on that page. The current text "#
>> format can be 'fasta', 'blast', 'exonerate', ..." is extremely
>> frustrating for a beginner "... what?!". I now realize that each
>> format code is matched by a "Bio::SearchIO::formatcode" module, but I
>> didn't know that from reading the above.
>>
>> While I'm at it, on page
>> http://search.cpan.org/~birney/bioperl-1.4/Bio/Search/Hit/HitI.pm -
>> the phrase "Equivalent to raw_score()" appearing under the heading
>> "score" is a broken link. In fact every "See also : $this->method()"
>> type link on that page is broken (there are about 25). Also the link
>> to "See also : BUGS" is broken.
>
> The pdoc documentation is better and more up-to-date (unfortunately the
> bioperl-1.4 CPAN docs are out-of-date but always come up first, I think b/c
> of the stable release status).
>
>>> User feedback is an integral part of the evolution of this and other
>>> Bioperl modules. Send your comments and suggestions preferably to one of the
>>> Bioperl mailing lists. Your participation is much appreciated.
>>
>> Thank you for your participation! I hope the above can help in some
>> way, and I hope its not down to me making trivial mistakes! If these
>> look like genuine bugs, should I report them on RT?
>
> No, use the bugzilla set up.  We do not use CPAN's RT and generally redirect
> any bugs to bugzilla.
>
>> Out of interest, I did get some fail while testing, specifically (or
>> perhaps coincidentally) some related to SearchIO...
>>
>> ./Build test verbose=1 > test.results.dump &> test.results.dump
>>
>> ...
>> Dan.
>
> Those are due to the changes I have been making (using svn code is bleeding
> edge!).
>
>> P.S. I've also been attacking the wiki, so please undo any mess that I
>> may have made there.


Thanks very much for the detailed reply. Overall, would you recommend
that I use SVN or 1.4 or 1.5.2?

All the best,

Dan.


>> --
>> http://network.nature.com/profile/dan
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign
>
>
>
>
>



-- 
http://network.nature.com/profile/dan



More information about the Bioperl-l mailing list