[Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
Robert Buels
rmb32 at cornell.edu
Wed Jul 19 15:30:28 EDT 2006
POE is a really neat thing; I didn't know about it before. Something
tells me, however, that I would have trouble convincing people to
install POE as a dependency for a genomethreader output parser. ;-) I
hope I'll have the opportunity to use it sometime.
For the curious, here's a nice intro to POE:
http://perl.com/pub/a/2001/01/poe.html
And the POE main site:
http://poe.perl.org/
Rob
aaron.j.mackey at GSK.COM wrote:
> There are 3rd generation XML "Pull" parsers (also called "StAX" for
> Streaming API for XML), but they seem to still be stuck in Java land (e.g.
> "MXP1")
>
> You could probably use POE to set up a state machine that used XML::Twig to
> "push" units of XML content onto a stack, to be read by your "next_*" pull
> method (where the XML::Twig push "stalled" until the "next_*" method was
> called, and vice versa).
>
> -Aaron
>
> bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM:
>
>> Hi all,
>>
>> Here's a kind of abstract question about Bioperl and XML parsing:
>>
>> I'm thinking about writing a bioperl parser for genomethreader XML, and
>> I'm sort of mulling over the 'impedance mismatch' between the way
>> bioperl Bio::*IO::* modules work and the way all of the current XML
>> parsers work. Bioperl uses a 'pull' model, where every time you want a
>> new chunk of stuff, you call $io_object->next_thing. All the XML
>> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
>> 'push' model, where every time they parse a chunk, they call _your_
>> code, usually via a subroutine reference you've given to the XML parser
>> when you start it up.
>>
>> From what I can tell, current Bioperl IO modules that parse XML are
>> using push parsers to parse the whole document, holding stuff in memory,
>> then spoon-feeding it in chunks to the calling program when it calls
>> next_*(). This is fine until the input XML gets really big, in which
>> case you can quickly run out of memory.
>>
>> Does anybody have good ideas for nice, robust ways of writing a bioperl
>> IO module for really big input XML files? There don't seem to be any
>> perl pull parsers for XML. All I've dug up so far would be having the
>> XML push parser running in a different thread or process, pushing chunks
>> of data into a pipe or similar structure that blocks the progress of the
>> push parser until the pulling bioperl code wants the next piece of data,
>> but there are plenty of ugly issues with that, whether one were to use
>> perl threads for it (aaagh!) or fork and push some kind of intermediate
>> format through a pipe or socket between the two processes (eek!).
>>
>> So, um, if you've read this far, do you have any ideas?
>>
>> Rob
>>
>> --
>> Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY 14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>
>
>
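For illustration, here is a minimal sketch of the thread-plus-blocking-queue
idea from the quoted question: a push (SAX) parser runs in its own thread and
stalls on a bounded queue until the caller pulls the next record. It is
written in Python rather than Perl purely to keep it short and
self-contained, so it is not Bioperl code. The names PullReader and
next_thing, and the <hits>/<hit> element names, are hypothetical; they are
not genomethreader's actual schema or any existing API.

```python
import io
import queue
import threading
import xml.sax

_DONE = object()  # sentinel: the parser reached the end of the input


class _RecordHandler(xml.sax.ContentHandler):
    """Push side: the SAX parser calls these methods as it reads."""

    def __init__(self, out_queue, record_tag):
        super().__init__()
        self._queue = out_queue
        self._record_tag = record_tag  # hypothetical record element name
        self._text = None  # None while outside a record element

    def startElement(self, name, attrs):
        if name == self._record_tag:
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._record_tag:
            # put() blocks while the queue is full, which stalls the parser
            # until the consumer asks for another record.
            self._queue.put("".join(self._text))
            self._text = None


class PullReader:
    """Pull side: wraps the push parser behind a next_thing() method."""

    def __init__(self, source, record_tag, maxsize=1):
        # A small maxsize bounds how many records sit in memory at once.
        self._queue = queue.Queue(maxsize=maxsize)
        self._finished = False
        handler = _RecordHandler(self._queue, record_tag)
        self._thread = threading.Thread(
            target=self._run, args=(source, handler), daemon=True
        )
        self._thread.start()

    def _run(self, source, handler):
        try:
            xml.sax.parse(source, handler)
            self._queue.put(_DONE)
        except Exception as exc:
            # Hand parse errors to the consumer instead of losing them.
            self._queue.put(exc)

    def next_thing(self):
        """Return the next record's text, or None once input is exhausted."""
        if self._finished:
            return None
        item = self._queue.get()  # blocks until the parser produces a record
        if item is _DONE:
            self._finished = True
            return None
        if isinstance(item, Exception):
            self._finished = True
            raise item
        return item


if __name__ == "__main__":
    data = io.BytesIO(b"<hits><hit>a</hit><hit>b</hit><hit>c</hit></hits>")
    reader = PullReader(data, "hit")
    while (record := reader.next_thing()) is not None:
        print(record)
```

With maxsize=1 the parser runs at most one record ahead of the caller (one
queued, one being handed over), so memory use stays roughly constant however
large the input is. The sketch sidesteps the "ugly issues" rather than solving
them: if the caller stops pulling early, the parser thread simply stays
blocked, and it is marked as a daemon thread only so that it does not keep
the process alive.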