[Bioperl-l] Next-gen modules
Chris Fields
cjfields at illinois.edu
Wed Jul 8 21:54:16 EDT 2009
On Jul 8, 2009, at 5:45 PM, Robert Buels wrote:
> Giles Weaver wrote:
>> takes about 15 minutes, so adapter removal is definitely the bottleneck.
>> I'm confident that some relatively simple developments in Bioperl and/or
>> EMBOSS will yield some big performance improvements - if you see my
>> sample code in
>
> Apropos this kind of thing, have you guys already discussed using
> lazy object creation for objects returned from bioperl parsers? Not
> really relevant in the short term, but it could be a useful avenue
> to pursue for addressing some performance concerns people (like ebi)
> have.
There are some lazy parsers for SearchIO, but each of those has
specific classes geared towards the SearchIO format, an issue I worry
about. I'm not sure about going down the path of having a
Bio::Search::Result::FooResult, Bio::Search::Hit::FooHit, and
Bio::Search::HSP::FooHSP for each 'Foo' format. The same thing could
occur with SeqIO, TreeIO, etc. A possible maintenance nightmare.
What I would like to see are generic lazy implementations for some of
the various classes, primarily Seq, AnnotationCollection,
FeatureHolder/Collection, etc., with parsers passing in just the
necessary data (lazy implies file points or stream points). This may
not be terribly hard to do if using iterators, but (as you may have
seen) many of the current methods are greedily defined, so new
interface methods would need to be drawn up (and older ones refactored
to work with newer ones).
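
To make the "file points plus iterators" idea concrete, here is a rough sketch, in Python rather than Perl purely for brevity. LazySeq and lazy_fasta are hypothetical names, not existing BioPerl classes, and the sketch assumes a seekable binary FASTA handle. The iterator records only the id and byte offsets; the sequence text is read from the file the first time it is asked for.

```python
import io


class LazySeq:
    """Hypothetical lazy record: holds an id and byte offsets, not the text."""

    def __init__(self, handle, seq_id, start, end):
        self._handle = handle
        self.id = seq_id
        self._start = start  # offset of the first sequence line
        self._end = end      # offset just past the last sequence line
        self._seq = None     # filled in on first access

    @property
    def seq(self):
        if self._seq is None:
            pos = self._handle.tell()          # remember where the parser is
            self._handle.seek(self._start)
            raw = self._handle.read(self._end - self._start)
            self._handle.seek(pos)             # put the parser back
            self._seq = b"".join(raw.split()).decode("ascii")
        return self._seq


def lazy_fasta(handle):
    """Yield a LazySeq per FASTA record without storing sequence lines."""
    seq_id = start = None
    while True:
        line_start = handle.tell()
        line = handle.readline()
        if line and not line.startswith(b">"):
            continue  # sequence line: note nothing, just move on
        # We are at a header line or at end of file: close the open record.
        record = None
        if seq_id is not None:
            record = LazySeq(handle, seq_id, start, line_start)
        if line:
            fields = line[1:].split()
            seq_id = fields[0].decode("ascii") if fields else ""
            start = handle.tell()
        if record is not None:
            yield record
        if not line:
            return


data = io.BytesIO(b">a desc\nACGT\nAC\n>b\nGG\n")
records = list(lazy_fasta(data))
print([r.id for r in records], records[0].seq)
```

The cost of this design is that the handle has to stay open and seekable for as long as the lazy objects are alive, which is part of why the greedy interface methods would need rethinking.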
> In very vague terms, one would probably implement this by defining a
> very light-weight role/class called something like
> Bio::LazyInflator, that would provide only an `inflate` method.
> Parsers would parse into lightweight structures (probably arrayrefs)
> that implement LazyInflator and users could choose between grabbing
> data out of the uninflated arrayref directly, or they could call
> inflate() on it to transform it into a real object (like a
> Bio::Annotation or Bio::Seq or something).
I would go one step further and reimplement the various
AnnotationCollection/FeatureHolder methods in terms of a completely
lazy implementation (i.e. the parser turns the file or stream into a
lazy Seq). See SwissKnife for instance.
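
For reference, the inflate idea Rob describes might look roughly like the following. This is a sketch in Python rather than Perl, and RawRecord, Seq and parse_fasta are made-up names for illustration, not BioPerl API. The parser emits plain list-based records; callers either index into them directly or call inflate() to get a full object.

```python
class Seq:
    """Stand-in for a full-weight sequence object (hypothetical)."""

    def __init__(self, display_id, desc, seq):
        self.display_id = display_id
        self.desc = desc
        self.seq = seq

    def length(self):
        return len(self.seq)


class RawRecord(list):
    """Lightweight parse result laid out as [display_id, desc, seq]."""

    def inflate(self):
        # Only here do we pay for building the heavier object.
        return Seq(*self)


def _make_record(header, chunks):
    display_id, _, desc = header.partition(" ")
    return RawRecord([display_id, desc, "".join(chunks)])


def parse_fasta(lines):
    """Yield one RawRecord per FASTA entry from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            # Any sequence text seen before the first header is dropped here.
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


records = list(parse_fasta([">s1 first\n", "ACG\n", "T\n", ">s2\n", "GG\n"]))
print(records[0][2])                  # cheap access to the raw sequence
print(records[1].inflate().length())  # full object only when wanted
```

Note this only defers object construction; the text is still read eagerly, so it is a weaker form of laziness than passing file points around.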
> The exact implementation of this would vary depending on whether
> Moose is being used.
This may be an area where optimization via Moose may not matter as
much. It would be best to attempt some of this initially in bioperl,
then port to Moose/Bio::Moose.
> This could potentially also be compatible with having some of the
> tight parsing loops be implemented in XS.
>
> Rob
That's where it'll get a little trickier; you would probably need a
decent grammar to get everything out the way you want it, or at least
parse everything event-based, and other grammars would have to have
similarly named rules/tokens so the same action could be tied to the
data being parsed. I had a first go at generic parsing in the
gbdriver/embldriver/swissdriver modules, which just pass data chunks
to the handler object (which could do anything it wants with the
data). The only things not passed in yet are file points. That needs
to be fleshed out more when I have the tuits, but you are more than
welcome to look.
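
As a toy illustration of the driver/handler split (Python rather than Perl, with invented names; this is not the actual gbdriver/embldriver code), the driver below only cuts the input into named chunks and hands them to a handler. The per-format tag map is what gives two formats "similarly named" events, so one handler serves both. Continuation lines are simply ignored in this sketch.

```python
class CollectingHandler:
    """Hypothetical handler: gathers chunks into one dict per record."""

    def __init__(self):
        self.records = []
        self._current = {}

    def handle_chunk(self, name, data):
        self._current.setdefault(name, []).append(data)

    def end_record(self):
        self.records.append(self._current)
        self._current = {}


def drive(lines, handler, tag_map):
    """Toy driver for 'TAG  value' flat files terminated by '//' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line == "//":
            handler.end_record()
            continue
        if not line.strip() or line.startswith(" "):
            continue  # blank or continuation line: skipped in this sketch
        tag, _, value = line.partition(" ")
        # Translate the format-specific tag into a shared event name.
        handler.handle_chunk(tag_map.get(tag, tag), value.strip())


# Format-specific tags mapped onto shared event names (toy subsets).
EMBL_TAGS = {"ID": "id", "DE": "description"}
GENBANK_TAGS = {"LOCUS": "id", "DEFINITION": "description"}

embl = ["ID   SEQ1\n", "DE   example protein\n", "//\n"]
genbank = ["LOCUS       SEQ1\n", "DEFINITION  example protein\n", "//\n"]

h1, h2 = CollectingHandler(), CollectingHandler()
drive(embl, h1, EMBL_TAGS)
drive(genbank, h2, GENBANK_TAGS)
print(h1.records == h2.records)
```

A handler that wanted to be lazy would receive offsets alongside (or instead of) the chunk text, which is the piece still missing.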
Also, just to note (and something to think about): Perl6 has this
'solved' to a large degree with grammar/action combinations, where you
define a grammar for a particular format and attach an Action class to
process everything:
    my $action = MyActionClass.new();
    while Bio::Grammar::Fasta.parse($filehandle, :action($action)) {
        # do interesting things with data from $action
    }
In this case the Action class could create a Seq out of all the data,
or possibly create something much more lightweight and lazily
evaluated (for instance, use the file points instead of the actual
text). The grammar in this case would essentially be C- or PIR-based
I believe.
Note the quotes above with 'solved'; with Rakudo you can almost do
this now, however some of the Perl 6 specification needs to be fleshed
out re: Grammars, and the grammar engine for Parrot (PGE) needs to be
properly set up for iteration through a stream. There is enough
interest that I think things could be worked out fairly quickly (i.e.
months, not years).
chris