[Bioperl-l] Next-gen modules

Chris Fields cjfields at illinois.edu
Wed Jul 8 21:54:16 EDT 2009


On Jul 8, 2009, at 5:45 PM, Robert Buels wrote:

> Giles Weaver wrote:
>> takes about 15 minutes, so adapter removal is definitely the  
>> bottleneck. I'm
>> confident that some relatively simple developments in Bioperl and/ 
>> or EMBOSS
>> will yield some big performance improvements - if you see my sample  
>> code in
>
> Apropos this kind of thing, have you guys already discussed using  
> lazy object creation for objects returned from bioperl parsers?  Not  
> really relevant in the short term, but it could be a useful avenue  
> to pursue for addressing some performance concerns people (like ebi)  
> have.

There are some lazy parsers for SearchIO, but each of those has  
specific classes geared towards the SearchIO format, an issue I worry  
about.  I'm not sure about going down the path of having a  
Bio::Search::Result::FooResult, Bio::Search::Hit::FooHit, and  
Bio::Search::HSP::FooHSP for each 'Foo' format. The same thing could  
occur with SeqIO, TreeIO, etc.  A possible maintenance nightmare.

What I would like to see are generic lazy implementations for some of
the various classes, primarily Seq, AnnotationCollection,
FeatureHolder/Collection, etc., with parsers passing in just the
necessary data (lazy implies file points or stream points).  This may
not be terribly hard to do if using iterators, but (as you may have
seen) many of the current methods are greedily defined, so new
interface methods would need to be drawn up (and older ones refactored
to work with the newer ones).
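
To make the "file points plus iterators" idea concrete, here is a minimal sketch, written in Python rather than Perl purely for illustration. The names LazySeq and lazy_fasta are hypothetical and are not part of BioPerl; the sketch assumes a seekable binary stream holding FASTA records.

```python
import io


class LazySeq:
    """Hypothetical lazy record: stores byte offsets, parses on first access."""

    def __init__(self, stream, start, end):
        self._stream = stream
        self._start = start
        self._end = end
        self._parsed = None  # filled in by _load()

    def _load(self):
        if self._parsed is None:
            pos = self._stream.tell()           # remember the iterator's position
            self._stream.seek(self._start)
            text = self._stream.read(self._end - self._start).decode("ascii")
            self._stream.seek(pos)              # restore it for the iterator
            header, _, body = text.partition("\n")
            self._parsed = (header[1:].strip(), "".join(body.split()))
        return self._parsed

    @property
    def header(self):
        return self._load()[0]

    @property
    def seq(self):
        return self._load()[1]


def lazy_fasta(stream):
    """Yield one LazySeq per '>' record; only offsets are recorded here."""
    start = None
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            break
        if line.startswith(b">"):
            if start is not None:
                yield LazySeq(stream, start, pos)
            start = pos
    if start is not None:
        yield LazySeq(stream, start, stream.tell())


# Usage: nothing is parsed until .header or .seq is touched.
records = list(lazy_fasta(io.BytesIO(b">a desc\nACGT\nAC\n>b\nGG\n")))
```

The trade-off is that the stream has to stay open and seekable for as long as the lazy objects live, which is one reason the existing greedy interfaces would need rethinking.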

> In very vague terms, one would probably implement this by defining a  
> very light-weight role/class  called something like  
> Bio::LazyInflator, that would provide only an `inflate` method.   
> Parsers would parse into lightweight structures (probably arrayrefs)  
> that implement LazyInflator and users could choose between grabbing  
> data out of the uninflated arrayref directly, or they could call  
> inflate() on it to transform it into a real object (like a  
> Bio::Annotation or Bio::Seq or something).

I would go one step further and reimplement the various  
AnnotationCollection/FeatureHolder methods in terms of a completely  
lazy implementation (i.e. parses the file or stream into a lazy Seq).   
See SwissKnife for instance.
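
For reference, the inflate idea quoted above could look roughly like the following. This is a sketch in Python for illustration only: RawFeature, Feature and parse_features are hypothetical names, and the three-column input format is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Stand-in for a 'real' heavyweight object (hypothetical)."""
    primary_tag: str
    start: int
    end: int


class RawFeature(list):
    """Lightweight parsed row: [primary_tag, start, end], all still strings."""

    def inflate(self):
        # Conversion and validation cost is only paid when this is called.
        tag, start, end = self
        return Feature(tag, int(start), int(end))


def parse_features(lines):
    """Parse whitespace-separated rows into uninflated RawFeature lists."""
    for line in lines:
        if line.strip():
            yield RawFeature(line.split())


# Usage: read the raw fields directly, or inflate only the rows that matter.
rows = list(parse_features(["gene 10 50", "CDS 20 40"]))
tags = [row[0] for row in rows]   # cheap access, no objects built
cds = rows[1].inflate()           # full object on demand
```

A fully lazy collection would go further than this and defer even the row parsing, but the caller-facing choice between raw data and a real object is the same.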

> The exact implementation of this would vary depending on whether  
> Moose is being used.

This may be an area where optimization via Moose may not matter as  
much.  It would be best to attempt some of this initially in bioperl,  
then port to Moose/Bio::Moose.

> This could potentially also be compatible with having some of the  
> tight parsing loops be implemented in XS.
>
> Rob

That's where it'll get a little trickier; you would probably need a  
decent grammar to get everything out the way you want it, or at least  
parse everything event-based, and other grammars would have to have  
similarly named rules/tokens so the same action could be tied to the  
data being parsed.  I had a first go at generic parsing in the  
gbdriver/embldriver/swissdriver modules, which just pass data chunks  
to the handler object (which could do anything it wants with the  
data).  The only thing not passed in yet are file points.  That needs  
to be fleshed out more when I have the tuits, but you are more than  
welcome to look.
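
The driver/handler split described here can be sketched as follows, in Python for illustration. This is not the actual gbdriver code: the drive function, the DictHandler class, the NAME/DATA chunk keys and the simplified "KEY value" line format with "//" record terminators are all assumptions made for the example.

```python
class DictHandler:
    """Hypothetical handler: collects each record's chunks into a dict."""

    def __init__(self):
        self.records = []
        self._current = {}

    def data_handler(self, chunk):
        self._current[chunk["NAME"]] = chunk["DATA"]

    def end_record(self):
        self.records.append(self._current)
        self._current = {}


def drive(lines, handler):
    """Split a flat file into named chunks and pass each to the handler.

    The driver knows nothing about what the handler builds from the data.
    """
    name, data = None, []

    def flush():
        nonlocal name, data
        if name is not None:
            handler.data_handler({"NAME": name, "DATA": " ".join(data)})
        name, data = None, []

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):        # record terminator
            flush()
            handler.end_record()
        elif line[:1].isspace():         # continuation of the current chunk
            data.append(line.strip())
        elif line:                       # a new "KEY value" chunk begins
            flush()
            key, _, rest = line.partition(" ")
            name, data = key, [rest.strip()]
    flush()


# Usage: swap in a different handler to build something else entirely.
handler = DictHandler()
drive(["LOCUS  AB000001", "DEFINITION  first line",
       "            second line", "//"], handler)
```

Passing file points would amount to adding the start and end offsets of each chunk to the dict handed to data_handler, so that a lazy handler could skip storing DATA altogether.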

Also, just to note (and something to think about): Perl6 has this  
'solved' to a large degree with grammar/action combinations, where you  
define a grammar for a particular format and attach an Action class to  
process everything:

my $action = MyActionClass.new();

while Bio::Grammar::Fasta.parse($filehandle, :action($action)) {
    # do interesting things with data from $action
}

In this case the Action class could create a Seq out of all the data,  
or possibly create something much more lightweight and lazily  
evaluated (for instance, use the file points instead of the actual  
text).  The grammar in this case would essentially be C- or PIR-based  
I believe.

Note the quotes above with 'solved'; with Rakudo you can almost do  
this now, however some of the Perl 6 specification needs to be fleshed  
out re: Grammars, and the grammar engine for Parrot (PGE) needs to be  
properly set up for iteration through a stream.  There is enough  
interest that I think things could be worked out fairly quickly (e.g.  
months, not years).

chris


