[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

Wed Jul 19 10:34:48 EDT 2006

The Bio::SearchIO modules are supposed to work like a SAX parser, where results
are returned as the report is parsed b/c of the occurrence of specific
'events' (start_element, end_element, and so on).  However, the actual
behaviour for each module changes depending on the report type and the
author's intention.  

There was a thread about a month ago on HMMPFAM report parsing where there
was some contention as to how to build hits(models)/HSPs(domains).  HMMPFAM
output has one HSP per hit and is sorted on the sequence length so a
particular hit can appear more than once, depending on how many times it
hits along the sequence length itself.  So, to gather all the HSPs together
under one hit you would have to parse the entire report and build up a
Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
everything.  Currently it just reports Hit/HSP pairs and it is up to the
user to build that tree.
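
For anyone who does need the tree, the grouping step is small.  Here is a
rough sketch in Python (only to keep it short and self-contained; in Perl
this would be a hash of arrays).  The model names and coordinates are made
up for illustration, and group_domains is a hypothetical helper, not
anything in SearchIO:

```python
def group_domains(pairs):
    """Collect (model, domain) pairs into {model: [domain, ...]}.

    The whole list of pairs has to be seen before any one model's
    list is known to be complete, which is why this cannot be done
    while streaming through the report.
    """
    tree = {}
    for model, domain in pairs:
        tree.setdefault(model, []).append(domain)
    return tree


# Hypothetical domain hits in the order they appear along the sequence:
# (model name, (seq_from, seq_to))
pairs = [
    ("zf-C2H2", (10, 32)),
    ("Homeobox", (40, 75)),
    ("zf-C2H2", (80, 102)),
]

tree = group_domains(pairs)
for model, domains in tree.items():
    print(model, domains)
```

The catch is the one described above: the second zf-C2H2 domain only shows
up after the Homeobox one, so nothing can be handed back as a finished hit
until the last pair has been read.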

In contrast, BLAST output should be capable of throwing hit/HSP clusters on
the fly based on the report output, but is quite slow (even the XML output
crawls).  Jason thinks it's b/c of object inheritance and instantiation; I
think it's probably more complicated than that (there are a ton of method
calls which tend to slow things down quite a bit as well).  

I would say try using SearchIO, but instead of relying on the object handler
calls that create Hit/HSP objects through an object factory (which is where
I think the majority of the speed is lost), build the data internally on the
fly in start_element/end_element, then return plain hashes whenever
end_element fires for the element type you care about.
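
To make the idea concrete, here is a minimal sketch of that pattern.  It is
in Python rather than Perl purely so it is short and runnable as-is; the
callbacks map onto start_element/end_element in an XML::SAX handler.  The
sample input is a cut-down BLAST-XML-like fragment, and HitHandler and
next_hits are names I made up for the example.  It assumes the parser can
be fed a chunk at a time, which Python's expat-based SAX reader allows:

```python
import io
import xml.sax
from collections import deque


class HitHandler(xml.sax.ContentHandler):
    """Builds plain dicts in the SAX callbacks; no Hit/HSP objects."""

    def __init__(self):
        super().__init__()
        self.done = deque()  # finished hits waiting to be pulled
        self.hit = None      # hit currently being built
        self.hsp = None      # HSP currently being built
        self.text = []       # character data of the current element

    def startElement(self, name, attrs):
        self.text = []
        if name == "Hit":
            self.hit = {"hsps": []}
        elif name == "Hsp":
            self.hsp = {}

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        self.text = []
        if name == "Hit":
            self.done.append(self.hit)  # hit is complete, queue it
            self.hit = None
        elif name == "Hsp":
            self.hit["hsps"].append(self.hsp)
            self.hsp = None
        elif value:
            # Leaf element: store it on the innermost open record.
            target = self.hsp if self.hsp is not None else self.hit
            if target is not None:
                target[name] = value


def next_hits(stream, chunk_size=4096):
    """Pull-style iterator: feed the push parser one chunk at a time
    and yield each hit as soon as its end tag has been seen."""
    handler = HitHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        while handler.done:
            yield handler.done.popleft()
    parser.close()
    while handler.done:
        yield handler.done.popleft()


sample = b"""<Iteration_hits>
  <Hit><Hit_id>gi|1</Hit_id><Hit_hsps>
    <Hsp><Hsp_evalue>1e-30</Hsp_evalue></Hsp>
    <Hsp><Hsp_evalue>2e-05</Hsp_evalue></Hsp>
  </Hit_hsps></Hit>
  <Hit><Hit_id>gi|2</Hit_id><Hit_hsps>
    <Hsp><Hsp_evalue>0.001</Hsp_evalue></Hsp>
  </Hit_hsps></Hit>
</Iteration_hits>"""

# A tiny chunk size forces many feed() calls, as a large file would.
hits = list(next_hits(io.BytesIO(sample), chunk_size=16))
```

Only the hit being built and the queue of finished hits are held in memory,
and the caller still gets a next_hit-style loop without threads or a second
process.  Whether this is actually faster than the object factory route
would need to be benchmarked.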

As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX
(using XML::SAX::ExpatXS/expat) and plan on switching it over to using
hashes at some point, possibly starting off with a different SearchIO plugin
module.  If you have other suggestions (XML parser of choice, ways to speed
up parsing or data retrieval) we would be glad to hear them.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Robert Buels
> Sent: Tuesday, July 18, 2006 7:06 PM
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get
> complicated
> 
> Hi all,
> 
> Here's a kind of abstract question about Bioperl and XML parsing:
> 
> I'm thinking about writing a bioperl parser for genomethreader XML, and
> I'm sort of mulling over the 'impedance mismatch' between the way
> bioperl Bio::*IO::* modules work and the way all of the current XML
> parsers work.  Bioperl uses a 'pull' model, where every time you want a
> new chunk of stuff, you call $io_object->next_thing.  All the XML
> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
> 'push' model, where every time they parse a chunk, they call _your_
> code, usually via a subroutine reference you've given to the XML parser
> when you start it up.
> 
> From what I can tell, current Bioperl IO modules that parse XML are
> using push parsers to parse the whole document, holding stuff in memory,
> then spoon-feeding it in chunks to the calling program when it calls
> next_*().  This is fine until the input XML gets really big, in which
> case you can quickly run out of memory.
> 
> Does anybody have good ideas for nice, robust ways of writing a bioperl
> IO module for really big input XML files?  There don't seem to be any
> perl pull parsers for XML.  All I've dug up so far would be having the
> XML push parser running in a different thread or process, pushing chunks
> of data into a pipe or similar structure that blocks the progress of the
> push parser until the pulling bioperl code wants the next piece of data,
> but there are plenty of ugly issues with that, whether one were to use
> perl threads for it (aaagh!) or fork and push some kind of intermediate
> format through a pipe or socket between the two processes (eek!).
> 
> So, um, if you've read this far, do you have any ideas?
> 
> Rob
> 
> --
> Robert Buels
> SGN Bioinformatics Analyst
> 252A Emerson Hall, Cornell University
> Ithaca, NY  14853
> Tel: 503-889-8539
> rmb32 at cornell.edu
> http://www.sgn.cornell.edu