[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

Wed Jul 19 13:48:55 UTC 2006

There are third-generation XML "pull" parsers (also called "StAX", for 
Streaming API for XML), but they still seem to be stuck in Java land (e.g. 
"MXP1").

You could probably use POE to set up a state machine that uses XML::Twig to 
"push" units of XML content onto a stack, to be read by your "next_*" pull 
method (with the XML::Twig push stalling until the "next_*" method is 
called, and vice versa).
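
For what it's worth, here is a minimal sketch of that stall-and-resume idea. 
It is in Python rather than Perl, and it uses a background thread plus a 
one-slot queue instead of POE and XML::Twig, purely so that it runs with the 
standard library alone. The names PullReader and next_record, and the "hit" 
element used below, are made up for illustration; treat this as a sketch of 
the pattern, not as a Bioperl design.

```python
import io
import queue
import threading
import xml.sax


class _RecordPusher(xml.sax.ContentHandler):
    """Push side: the SAX parser calls these methods as it parses."""

    def __init__(self, record_tag, out_queue):
        super().__init__()
        self.record_tag = record_tag  # hypothetical per-record element name
        self.out_queue = out_queue
        self.current = None           # dict for the record being built
        self.field = None             # child element currently being read
        self.text = []

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.current = dict(attrs.items())
        elif self.current is not None:
            self.field, self.text = name, []

    def characters(self, content):
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == self.record_tag:
            # put() blocks while the one-slot queue is full, so the parser
            # stalls here until the consumer pulls the previous record.
            self.out_queue.put(self.current)
            self.current = None
        elif name == self.field:
            self.current[name] = "".join(self.text).strip()
            self.field = None


class PullReader:
    """Pull side: next_record() plays the role of a next_*() method."""

    _DONE = object()  # sentinel marking the end of the document

    def __init__(self, stream, record_tag):
        self._queue = queue.Queue(maxsize=1)
        handler = _RecordPusher(record_tag, self._queue)
        self._thread = threading.Thread(
            target=self._run, args=(stream, handler), daemon=True
        )
        self._thread.start()

    def _run(self, stream, handler):
        try:
            xml.sax.parse(stream, handler)
        finally:
            self._queue.put(self._DONE)

    def next_record(self):
        item = self._queue.get()  # blocks until the parser pushes a record
        if item is self._DONE:
            self._queue.put(self._DONE)  # keep returning None afterwards
            return None
        return item


if __name__ == "__main__":
    xml_bytes = b"<hits><hit id='1'><score>10</score></hit></hits>"
    reader = PullReader(io.BytesIO(xml_bytes), "hit")
    while (record := reader.next_record()) is not None:
        print(record)
```

Because the queue holds only one record, memory use stays bounded by the 
size of a single record rather than the whole document. The cost is a 
second thread, which is exactly the part Rob was worried about.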

-Aaron

bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM:

> Hi all,
> 
> Here's a kind of abstract question about Bioperl and XML parsing:
> 
> I'm thinking about writing a bioperl parser for genomethreader XML, and 
> I'm sort of mulling over the 'impedance mismatch' between the way 
> bioperl Bio::*IO::* modules work and the way all of the current XML 
> parsers work.  Bioperl uses a 'pull' model, where every time you want a 
> new chunk of stuff, you call $io_object->next_thing.  All the XML 
> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a 
> 'push' model, where every time they parse a chunk, they call _your_ 
> code, usually via a subroutine reference you've given to the XML parser 
> when you start it up.
> 
> From what I can tell, current Bioperl IO modules that parse XML are 
> using push parsers to parse the whole document, holding stuff in memory, 
> then spoon-feeding it in chunks to the calling program when it calls 
> next_*().  This is fine until the input XML gets really big, in which 
> case you can quickly run out of memory.
> 
> Does anybody have good ideas for nice, robust ways of writing a bioperl 
> IO module for really big input XML files?  There don't seem to be any 
> perl pull parsers for XML.  All I've dug up so far would be having the 
> XML push parser running in a different thread or process, pushing chunks 
> of data into a pipe or similar structure that blocks the progress of the 
> push parser until the pulling bioperl code wants the next piece of data, 
> but there are plenty of ugly issues with that, whether one were to use 
> perl threads for it (aaagh!) or fork and push some kind of intermediate 
> format through a pipe or socket between the two processes (eek!).
> 
> So, um, if you've read this far, do you have any ideas?
> 
> Rob
> 
> -- 
> Robert Buels
> SGN Bioinformatics Analyst
> 252A Emerson Hall, Cornell University
> Ithaca, NY  14853
> Tel: 503-889-8539
> rmb32 at cornell.edu
> http://www.sgn.cornell.edu