[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
Robert Buels
rmb32 at cornell.edu
Tue Jul 18 20:06:02 EDT 2006
Hi all,
Here's a kind of abstract question about Bioperl and XML parsing:
I'm thinking about writing a Bioperl parser for GenomeThreader XML, and
I'm mulling over the 'impedance mismatch' between the way Bioperl
Bio::*IO::* modules work and the way all of the current XML parsers
work. Bioperl uses a 'pull' model: every time you want a new chunk of
stuff, you call $io_object->next_thing. All the XML parsers (including
XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a 'push' model: every
time they parse a chunk, they call _your_ code, usually via a subroutine
reference you've given to the XML parser when you start it up.
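To make the contrast concrete, here's roughly what the two styles look
like (Bio::SeqIO and XML::SAX::ParserFactory are the real modules; the
file names and the MyHandler class are invented for illustration):

    # Pull style: the caller drives the loop.
    use Bio::SeqIO;
    my $in = Bio::SeqIO->new( -file => 'seqs.fasta', -format => 'fasta' );
    while ( my $seq = $in->next_seq ) {
        print $seq->id, "\n";
    }

    # Push style: the parser drives the loop and calls your handler.
    use XML::SAX::ParserFactory;
    my $handler = MyHandler->new;   # some SAX handler class you write
    my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
    $parser->parse_uri('results.xml');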
From what I can tell, current Bioperl IO modules that parse XML are
using push parsers to parse the whole document, holding stuff in memory,
then spoon-feeding it in chunks to the calling program when it calls
next_*(). This is fine until the input XML gets really big, in which
case you can quickly run out of memory.
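In other words, something like this sketch (XML::Twig is real; the
package name, the next_result method, and the <alignment> tag are
invented for illustration):

    package My::HypotheticalIO;
    use XML::Twig;

    sub new {
        my ( $class, %args ) = @_;
        my $self = bless { buffer => [] }, $class;
        my $twig = XML::Twig->new(
            twig_handlers => {
                # called by the push parser for each <alignment> element
                alignment => sub {
                    my ( $t, $elt ) = @_;
                    push @{ $self->{buffer} }, $elt->simplify;
                    $t->purge;   # frees the twig's tree, but the buffer
                                 # still grows with the document size
                },
            },
        );
        $twig->parsefile( $args{-file} );   # parses the WHOLE file up front
        return $self;
    }

    # spoon-feed the buffered records one at a time
    sub next_result { shift @{ $_[0]->{buffer} } }

    1;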
Does anybody have good ideas for nice, robust ways of writing a Bioperl
IO module for really big input XML files? There don't seem to be any
pull parsers for XML in Perl. All I've dug up so far is the idea of
running the XML push parser in a different thread or process, pushing
chunks of data into a pipe or similar structure that blocks the push
parser's progress until the pulling Bioperl code wants the next piece
of data; but there are plenty of ugly issues with that, whether one
were to use Perl threads for it (aaagh!) or to fork and push some kind
of intermediate format through a pipe or socket between the two
processes (eek!).
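The fork flavor might look something like the following, just to show
where the blocking would come from. Everything here except IO::Pipe and
Storable is invented: run_push_parser is a stand-in for whatever wraps
the SAX/Twig handlers, the file name is made up, and the throttling is
courtesy of the OS pipe buffer, which stalls the child's writes once
the parent stops reading:

    use IO::Pipe;
    use Storable qw(freeze thaw);

    my $pipe = IO::Pipe->new;
    my $pid  = fork;
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {    # child: run the push parser
        $pipe->writer;
        run_push_parser(    # hypothetical wrapper around the push parser
            file      => 'gthreader_output.xml',
            on_record => sub {
                my ($record) = @_;
                my $frozen = freeze($record);
                # length-prefixed frames so the reader can find record
                # boundaries in the byte stream
                print {$pipe} pack( 'N', length $frozen ), $frozen;
            },
        );
        exit 0;
    }

    $pipe->reader;    # parent: pull records on demand
    sub next_record {
        my $len_buf;
        read( $pipe, $len_buf, 4 ) == 4 or return;    # EOF
        read( $pipe, my $frozen, unpack( 'N', $len_buf ) );
        return thaw($frozen);
    }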
So, um, if you've read this far, do you have any ideas?
Rob
--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu