[Bioperl-l] Asking for advice on full EMBL extraction

Chris Fields cjfields at illinois.edu
Fri May 8 13:45:09 EDT 2009

On May 7, 2009, at 6:25 PM, Jason Stajich wrote:

> It parses from a stream or file, one sequence at a time so it only  
> reads a single sequence out at a time, but it does have to parse  
> that whole sequence record which is where feature rich sequences  
> might be causing problems.
> I think per your other mention of Tie::File - the whole file is not  
> going into memory so that is not the problem, it is the creation of  
> many objects that it does as it parses the sequence that is likely  
> the problem.  It will read up to the first "//" from that Tie::File  
> anyways, that becomes an entire string which is then parsed to pull  
> out the relevant features so you don't gain anything with Tie::File  
> -- what would be the way to solve it is if the objects could be  
> created and reside in a DB on disk rather than in-memory.  I'd  
> really enjoy seeing more indexed and hashed data to objects stored  
> on disk when mem requirements are such so that very large datasets  
> can be handled more nimbly.

Or maybe implement some lazy iterator-based methods.  We have brought  
up the subject of the SwissKnife modules here before...
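
To make the lazy-iterator idea concrete, here is a minimal sketch in Python rather than Perl. It is not BioPerl code and not how Bio::SeqIO works today. The function name iter_features, the dict layout, and the SAMPLE record are all hypothetical, and the parsing is simplified: it assumes the usual EMBL feature-table layout (feature key starting in column 6, location in column 22) and keeps continuation lines as raw strings. The point is only that features are handed out one at a time as the stream is read, so a feature-rich record never has to exist as a full set of objects in memory.

```python
import io
from typing import Iterator, TextIO


def iter_features(handle: TextIO) -> Iterator[dict]:
    """Lazily yield one feature at a time from the current EMBL record.

    Hypothetical helper for illustration only. Reading stops at the
    record terminator "//", so the handle is left at the next record.
    """
    current = None
    for line in handle:
        if line.startswith("//"):       # end of this record
            break
        if not line.startswith("FT"):   # skip non-feature lines
            continue
        key = line[5:20].strip()        # columns 6-20: feature key
        value = line[21:].strip()       # column 22 onward: location or qualifier
        if key:
            # A new feature begins; hand out the one we were building.
            if current is not None:
                yield current
            current = {"key": key, "location": value, "qualifier_lines": []}
        elif current is not None:
            # Continuation line, kept raw (simplifying assumption).
            current["qualifier_lines"].append(value)
    if current is not None:
        yield current


# Hypothetical two-feature record, built with explicit column padding.
SAMPLE = "\n".join([
    "ID   X00001; SV 1; linear; genomic DNA; STD; PRO; 300 BP.",
    f"FT   {'gene':<16}1..200",
    f"FT{'':19}/gene=\"abc\"",
    f"FT   {'CDS':<16}10..190",
    f"FT{'':19}/product=\"example\"",
    "//",
]) + "\n"

if __name__ == "__main__":
    for feature in iter_features(io.StringIO(SAMPLE)):
        print(feature["key"], feature["location"], feature["qualifier_lines"])
```

Because the generator holds only the feature currently being assembled, memory use stays flat no matter how many features a record carries. A caller that really needs everything at once can still wrap the call in list().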

> I think there have been several attempts to simplify, but it  
> basically means a dedicated developer to really overhaul or map to a  
> new system.  What we've tried to build is a decent API so a new  
> implementation can be done without affecting the 'next_seq' and  
> 'write_seq' API.
> Notwithstanding the seeming API confusion caused by _ancient_  
> decisions on giving function names of Bio::SeqFeatureI 'seq' and  
> Bio::PrimarySeq 'seq' which return different types -- don't forget  
> that Lincoln's Bio::DB::Fasta uses the 'seq' method to return a  
> sequence as a string as well so major API changes in general here  
> will create in all likelihood a big split between the branches that  
> will make any new Bioperl not match up well with existing scripts or  
> libraries that use it - hence the reason for no "great realigning"  
> to a completely well-planned out API rather than the organically  
> grown whims of several generations of devs.  I say this in jest a  
> bit - I do want to see changes, but I think it really will have to  
> be called something else besides BioPerl to avoid confusion and the  
> fact that a lot of things will break that depend on the current  
> APIs.  BioPerl2 or something indicating a Perl6 association.
> -jason

Just thought of this: doesn't the feature iterator in  
Bio::DB::SeqFeature::Store use next_seq for features?  Yikes...

Anyway, I think if we set a decent enough deprecation schedule, users  
would adjust, but that's generally for small changes.

Dramatic large-scale changes (such as Moose integration and conversion  
of interfaces to roles) should be done in a separate project.   
Similarly, as mentioned before, perl6 is a different (yet related)  
beast to perl5, and so a bioperl-related project using perl6 shouldn't  
be called BioPerl 2.0.

The nice aspect of this: we can take what we like from BioPerl now and  
refactor it for either project, along the way making sure only the  
most critical modules get in.


