[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

Wed Jul 19 12:20:21 EDT 2006

There are a lot of different XML processing strategies. Most fall into 
two categories: stream-based and tree-based.

With the stream-based strategy, the parser continuously alerts a program 
to patterns in the XML. The parser functions like a pipeline, taking XML 
markup on one end and pumping out processed nuggets of data to your program.
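
A minimal sketch of the stream-based approach, assuming XML::Parser is 
installed from CPAN. The <report>/<hit>/<score> markup and the 
collect_scores name are made up for illustration and do not come from 
any real report format.

```perl
use strict;
use warnings;
use XML::Parser;   # CPAN module, assumed to be installed

# Hypothetical helper: collect the text of every <score> element while
# the parser streams through the document.
sub collect_scores {
    my ($xml) = @_;
    my (@scores, $in_score, $text);

    my $parser = XML::Parser->new(
        Handlers => {
            # Called at every opening tag.
            Start => sub {
                my ($expat, $element, %attr) = @_;
                if ($element eq 'score') { $in_score = 1; $text = ''; }
            },
            # Character data may arrive in several pieces, so append.
            Char => sub {
                my ($expat, $string) = @_;
                $text .= $string if $in_score;
            },
            # Called at every closing tag: the nugget is complete here.
            End => sub {
                my ($expat, $element) = @_;
                if ($element eq 'score') { push @scores, $text; $in_score = 0; }
            },
        },
    );
    $parser->parse($xml);
    return @scores;
}

my $xml = '<report><hit name="a"><score>42</score></hit>'
        . '<hit name="b"><score>7</score></hit></report>';
print join(',', collect_scores($xml)), "\n";   # prints 42,7
```

Apart from the collected scores, only the text of the element currently 
being read is held in memory.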

With the tree-based strategy, the parser keeps the data to itself until 
the very end, when it presents a complete model of the document to your 
program. The whole point of this strategy is that your program can pull 
out any data it needs, in any order.

Most of the time I use tree-based strategies because they place all of 
the data into a structure that lets me access any internal node 
using array/hash references. The simplest module for this is XML::Simple, 
using XML::Parser as its 'preferred parser' (XML::Parser is built on top of 
XML::Parser::Expat, which is a wrapper around the expat library).
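
A minimal sketch of that tree-based access, assuming XML::Simple and 
XML::Parser are both installed from CPAN. The report layout and the 
parse_report name are hypothetical, made up for this example.

```perl
use strict;
use warnings;
use XML::Simple;   # CPAN module, assumed to be installed; exports XMLin

# Ask XML::Simple to use XML::Parser (expat) underneath.
$XML::Simple::PREFERRED_PARSER = 'XML::Parser';

# Hypothetical helper: parse a whole report into nested hashes and arrays.
sub parse_report {
    my ($xml) = @_;
    # ForceArray keeps <hit> an array ref even when there is only one hit;
    # KeyAttr => [] stops XML::Simple from folding hits into a hash by name.
    return XMLin($xml, ForceArray => ['hit'], KeyAttr => []);
}

my $xml = <<'XML';
<report program="demo">
  <hit name="a"><score>42</score></hit>
  <hit name="b"><score>7</score></hit>
</report>
XML

my $report = parse_report($xml);

# The whole document is in memory, so any node is reachable in any order.
my $last_hit = $report->{hit}[-1];
print "$report->{program}: $last_hit->{name} scored $last_hit->{score}\n";
# prints: demo: b scored 7
```

The price of this convenience is that the complete document has to fit 
in memory before the first value can be read.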

More advanced parsers (both stream-based and tree-based) are:

* XML::LibXML (a wrapper for libxml2's C library)
* XML::Grove (takes a tree and changes it into an object hierarchy. Each 
node type is represented by a different class)
* XML::PYX (for repackaging XML as a stream of easily recognizable and 
transmutable symbols)
* XML::SimpleObject (changes a hierarchy of lists into a hierarchy of 
objects)
* XML::XPath (for writing expressions that pinpoint specific pieces of 
documents)

There are also some standards-based solutions like:

* XML::SAX (Simple API for XML) for event streams.
* XML::DOM (Document Object Model) for tree processing.
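
For the SAX case, a handler is a class whose methods are called as 
events occur. A minimal sketch, assuming XML::SAX is installed from CPAN 
(it ships with a pure-Perl parser). The HitCounter class, the count_hits 
name and the <hit> element are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical handler class: counts <hit> elements as they stream past.
package HitCounter;
use base 'XML::SAX::Base';   # supplies default handlers for all other events

sub start_element {
    my ($self, $el) = @_;
    # $el is a hash ref describing the element; Name is its qualified name.
    $self->{hits}++ if $el->{Name} eq 'hit';
}

package main;
use XML::SAX::ParserFactory;

sub count_hits {
    my ($xml) = @_;
    my $handler = HitCounter->new;
    # The factory picks whichever SAX parser is configured on this system.
    my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_string($xml);
    return $handler->{hits} || 0;
}

print count_hits('<report><hit/><hit/><other/></report>'), "\n";   # prints 2
```

Because the handler only depends on the SAX interface, the underlying 
parser can be swapped without touching this code.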

Your strategy of choice depends a lot on the type of XML files you want 
to parse. Understanding the structure of the files and deciding which 
data you want to extract from them is a fundamental step in choosing 
the appropriate method/parser.

Just my 2 cents :)

Regards,
Mauricio.

Chris Fields wrote:
> The Bio::SearchIO modules are supposed to work like a SAX parser, where results
> are returned as the report is parsed b/c of the occurrence of specific
> 'events' (start_element, end_element, and so on).  However, the actual
> behaviour for each module changes depending on the report type and the
> author's intention.  
> 
> There was a thread about a month ago on HMMPFAM report parsing where there
> was some contention as to how to build hits(models)/HSPs(domains).  HMMPFAM
> output has one HSP per hit and is sorted on the sequence length so a
> particular hit can appear more than once, depending on how many times it
> hits along the sequence length itself.  So, to gather all the HSPs together
> under one hit you would have to parse the entire report and build up a
> Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
> everything.  Currently it just reports Hit/HSP pairs and it is up to the
> user to build that tree.
> 
> In contrast, BLAST output should be capable of throwing hit/HSP clusters on
> the fly based on the report output, but is quite slow (even the XML output
> crawls).  Jason thinks it's b/c of object inheritance and instantiation; I
> think it's probably more complicated than that (there are a ton of method
> calls which tend to slow things down quite a bit as well).  
> 
> I would say try using SearchIO, but instead of relying directly on object
> handler calls to create Hit/HSP objects using an object factory (which is
> where I think a majority of the speed is lost), build the data internally on
> the fly using start_element/end_element, then return hashes instead based on
> the element type triggered using end_element.  
> 
> As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX
> (using XML::SAX::ExpatXS/expat) and plan on switching it over to using
> hashes at some point, possibly starting off with a different SearchIO plugin
> module.  If you have other suggestions (XML parser of choice, ways to speed
> up parsing/retrieve data) we would be glad to hear them.
> 
> Chris
> 
> 
> 
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Robert Buels
>> Sent: Tuesday, July 18, 2006 7:06 PM
>> To: bioperl-l at bioperl.org
>> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get
>> complicated
>>
>> Hi all,
>>
>> Here's a kind of abstract question about Bioperl and XML parsing:
>>
>> I'm thinking about writing a bioperl parser for genomethreader XML, and
>> I'm sort of mulling over the 'impedance mismatch' between the way
>> bioperl Bio::*IO::* modules work and the way all of the current XML
>> parsers work.  Bioperl uses a 'pull' model, where every time you want a
>> new chunk of stuff, you call $io_object->next_thing.  All the XML
>> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
>> 'push' model, where every time they parse a chunk, they call _your_
>> code, usually via a subroutine reference you've given to the XML parser
>> when you start it up.
>>
>> From what I can tell, current Bioperl IO modules that parse XML are
>> using push parsers to parse the whole document, holding stuff in memory,
>> then spoon-feeding it in chunks to the calling program when it calls
>> next_*().  This is fine until the input XML gets really big, in which
>> case you can quickly run out of memory.
>>
>> Does anybody have good ideas for nice, robust ways of writing a bioperl
>> IO module for really big input XML files?  There don't seem to be any
>> perl pull parsers for XML.  All I've dug up so far would be having the
>> XML push parser running in a different thread or process, pushing chunks
>> of data into a pipe or similar structure that blocks the progress of the
>> push parser until the pulling bioperl code wants the next piece of data,
>> but there are plenty of ugly issues with that, whether one were to use
>> perl threads for it (aaagh!) or fork and push some kind of intermediate
>> format through a pipe or socket between the two processes (eek!).
>>
>> So, um, if you've read this far, do you have any ideas?
>>
>> Rob
>>
>> --
>> Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY  14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 

-- 
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM