[Bioperl-l] Asking for advice on full EMBL extraction

Chris Fields cjfields at illinois.edu
Fri May 8 17:45:09 UTC 2009


On May 7, 2009, at 6:25 PM, Jason Stajich wrote:

> It parses from a stream or file, one sequence at a time so it only  
> reads a single sequence out at a time, but it does have to parse  
> that whole sequence record which is where feature rich sequences  
> might be causing problems.
>
> I think per your other mention of Tie::File - the whole file is not  
> going into memory so that is not the problem, it is the creation of  
> many objects that it does as it parses the sequence that is likely  
> the problem.  It will read up to the first "//" from that Tie::File  
> anyways, that becomes an entire string which is then parsed to pull  
> out the relevant features so you don't gain anything with Tie::File  
> -- what would be the way to solve it is if the objects could be  
> created and reside in a DB on disk rather than in-memory.  I'd  
> really enjoy seeing more indexed and hashed data to objects stored  
> on disk when mem requirements are such so that very large datasets  
> can be handled more nimbly.

Or maybe implement some lazy iterator-based methods.  We have brought  
up the subject of the SwissKnife modules here before...
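
To make the lazy-iterator idea a bit more concrete, here is a minimal
sketch. It is written in Python purely for illustration and is not BioPerl
or SwissKnife code; the names iter_records and iter_feature_keys and the
SAMPLE records are made up. It assumes an EMBL-style flat file where each
record ends with a "//" line and each feature starts on an FT line with
its key in column 6.

```python
import io
from typing import Iterator, List, TextIO

# Made-up sample data in EMBL-like layout (not real database entries).
SAMPLE = """\
ID   TEST1; SV 1; linear; genomic DNA; STD; UNC; 30 BP.
FT   source          1..30
FT                   /organism="hypothetical"
FT   CDS             4..27
//
ID   TEST2; SV 1; linear; genomic DNA; STD; UNC; 12 BP.
FT   source          1..12
//
"""


def iter_records(handle: TextIO) -> Iterator[List[str]]:
    """Yield one record at a time as a list of raw lines.

    Only the record currently being read is held in memory.
    """
    record: List[str] = []
    for line in handle:
        line = line.rstrip("\n")
        if line.strip() == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line)
    if record:  # tolerate a missing final "//"
        yield record


def iter_feature_keys(record: List[str]) -> Iterator[str]:
    """Lazily yield feature keys from one record's FT lines.

    No feature objects are built; a caller that stops early
    triggers no further parsing.
    """
    for line in record:
        # A new feature has its key in column 6 (index 5);
        # qualifier continuation lines leave that column blank.
        if line.startswith("FT") and len(line) > 5 and line[5] != " ":
            yield line[5:].split()[0]


if __name__ == "__main__":
    for rec in iter_records(io.StringIO(SAMPLE)):
        print(rec[0].split()[1], list(iter_feature_keys(rec)))
```

The point of the design is that nothing is built until the caller asks
for it: a script that only needs the CDS features of each entry never
pays for objects describing everything else in a feature-rich record.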

> I think there have been several attempts to simplify, but it  
> basically means a dedicated developer to really overhaul or map to a  
> new system.  What we've tried to build is a decent API so a new  
> implementation can be done without affecting the 'next_seq' and  
> 'write_seq' API.
>
> Notwithstanding the seeming API confusion caused by _ancient_  
> decisions on giving function names of Bio::SeqFeatureI 'seq' and  
> Bio::PrimarySeq 'seq' which return different types -- don't forget  
> that Lincoln's Bio::DB::Fasta uses the 'seq' method to return a  
> sequence as a string as well so major API changes in general here  
> will create in all likelihood a big split between the branches that  
> will make any new Bioperl not match up well with existing scripts or  
> libraries that use it - hence the reason for no "great realigning"  
> to a completely well-planned out API rather than the organically  
> grown whims of several generations of devs.  I say this in jest a  
> bit - I do want to see changes, but I think it really will have to  
> be called something else besides BioPerl to avoid confusion and the  
> fact that a lot of things will break that depend on the current  
> APIs.  BioPerl2 or something indicating a Perl6 association.
>
> -jason

Just thought of this: doesn't the feature iterator in  
Bio::DB::SeqFeature::Store use next_seq for features?  Yikes...

Anyway, I think if we set a decent enough deprecation schedule, users  
would adjust, though that really only works for small changes.

Dramatic large-scale changes (such as Moose integration and conversion  
of interfaces to roles) should be done in a separate project.   
Similarly, as mentioned before, Perl 6 is a different (yet related)  
beast from Perl 5, so a BioPerl-related project using Perl 6 shouldn't  
be called BioPerl 2.0.

The nice aspect of this: we can take what we like from BioPerl now and  
refactor it for either project, along the way making sure only the  
most critical modules get in.

chris



