[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

Wed Jul 19 13:40:47 UTC 2006

In the past, the way this was done for potentially big XML files was to
use regex-based extraction of the chunks that correspond to an object you
want to return per call to next_XXX(). Each chunk would then be
passed on to the XML parser under the hood.
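
A minimal sketch of that chunking approach, written in Python purely to illustrate the technique (it is not taken from any BioPerl module). The class name ChunkedXMLReader, the method next_record(), and the bufsize parameter are made up for this example. It assumes the record elements are not nested inside one another, are never written in self-closing form, and need no namespace declarations from enclosing elements.

```python
import io
import re
import xml.etree.ElementTree as ET


class ChunkedXMLReader:
    """Pull-style reader: each next_record() call parses one element.

    Hypothetical helper for illustration only.
    """

    def __init__(self, handle, tag, bufsize=65536):
        self.handle = handle      # any text-mode file-like object
        self.bufsize = bufsize    # characters to read per refill
        self.buffer = ""
        self.eof = False
        # Anchor on the angle bracket so the tag name cannot match plain
        # text, and require whitespace or '>' after the name so that
        # <hit> does not also match <hits>.
        name = re.escape(tag)
        self.pattern = re.compile(
            r"<%s(?:\s[^>]*)?>.*?</%s\s*>" % (name, name), re.DOTALL
        )

    def next_record(self):
        """Return the next element, or None when the input is exhausted."""
        while True:
            match = self.pattern.search(self.buffer)
            if match:
                # Drop everything up to the end of this chunk, then hand
                # only the chunk itself to the real XML parser.
                self.buffer = self.buffer[match.end():]
                return ET.fromstring(match.group(0))
            if self.eof:
                return None
            data = self.handle.read(self.bufsize)
            if data:
                self.buffer += data
            else:
                self.eof = True
```

A caller would construct the reader with an open file handle and an element name, for example ChunkedXMLReader(fh, "hit"), and call next_record() in a loop until it returns None. Memory use is then bounded by the largest single chunk plus one read buffer, not by the size of the whole document.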

This only becomes a problem once even the individual chunks are huge, or
when the name of the element that encloses your chunk can also occur in
your text content. The latter is unlikely, though, if you include the
angle brackets in the pattern.
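
To illustrate why the angle brackets matter, here is a small Python sketch (the function name count_candidates and the sample record are invented for this example). In well-formed XML a literal "<" in character data has to be escaped as &lt;, so a pattern anchored on "<name" can only be fooled by CDATA sections or comments, whereas the bare name can turn up anywhere in the text.

```python
import re


def count_candidates(text, tag, with_bracket):
    """Count the places where a naive scanner would see the start of a record."""
    name = re.escape(tag)
    if with_bracket:
        # Opening tag only: '<', the name, then whitespace or '>'.
        pattern = r"<%s[\s>]" % name
    else:
        # Bare element name, which also matches ordinary words in the text.
        pattern = name
    return len(re.findall(pattern, text))


# Sample record whose text content happens to mention the element name.
record = (
    "<match><note>no match for exon 2; "
    "see &lt;match&gt; docs</note></match>"
)

bare_hits = count_candidates(record, "match", with_bracket=False)
tag_hits = count_candidates(record, "match", with_bracket=True)
```

Here the bare name is found four times, while the bracketed pattern finds only the single real opening tag.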

I believe this is how at least some bioperl parsers for XML-based  
formats were written, and it seemed to work fine.

	-hilmar

On Jul 18, 2006, at 8:06 PM, Robert Buels wrote:

> Hi all,
>
> Here's a kind of abstract question about Bioperl and XML parsing:
>
> I'm thinking about writing a bioperl parser for genomethreader XML, and
> I'm sort of mulling over the 'impedance mismatch' between the way
> bioperl Bio::*IO::* modules work and the way all of the current XML
> parsers work.  Bioperl uses a 'pull' model, where every time you want a
> new chunk of stuff, you call $io_object->next_thing.  All the XML
> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
> 'push' model, where every time they parse a chunk, they call _your_
> code, usually via a subroutine reference you've given to the XML parser
> when you start it up.
>
> From what I can tell, current Bioperl IO modules that parse XML are
> using push parsers to parse the whole document, holding stuff in memory,
> then spoon-feeding it in chunks to the calling program when it calls
> next_*().  This is fine until the input XML gets really big, in which
> case you can quickly run out of memory.
>
> Does anybody have good ideas for nice, robust ways of writing a bioperl
> IO module for really big input XML files?  There don't seem to be any
> perl pull parsers for XML.  All I've dug up so far would be having the
> XML push parser running in a different thread or process, pushing chunks
> of data into a pipe or similar structure that blocks the progress of the
> push parser until the pulling bioperl code wants the next piece of data,
> but there are plenty of ugly issues with that, whether one were to use
> perl threads for it (aaagh!) or fork and push some kind of intermediate
> format through a pipe or socket between the two processes (eek!).
>
> So, um, if you've read this far, do you have any ideas?
>
> Rob
>
> -- 
> Robert Buels
> SGN Bioinformatics Analyst
> 252A Emerson Hall, Cornell University
> Ithaca, NY  14853
> Tel: 503-889-8539
> rmb32 at cornell.edu
> http://www.sgn.cornell.edu

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================