[Bioperl-l] Bio::Index::Fasta vs Bio::DB::Fasta

Lincoln Stein lstein@cshl.org
Mon, 21 Jan 2002 10:01:08 -0500


Hi,

Just getting back from the Perl Whirl Geek Cruise, all relaxed,
suntanned, and ready to answer 1000 e-mail messages!

Ewan Birney writes:
 > On Fri, 11 Jan 2002, Lincoln Stein wrote:
 > 
 > > Hi Folks,
 > > 
 > > I've just recently become aware that Bio::Index::Fasta has very heavy 
 > > overlapping functionality with Bio::DB::Fasta, and this is likely to lead to 
 > > some user confusion down the road.
 > > 
 > > I would remove Bio::DB::Fasta in favor of the Bio::Index version, except that 
 > > I don't think that Bio::Index::Fasta does the thing that first motivated 
 > > Bio::DB::Fasta, which was the ability to retrieve subsequences efficiently.  
 > > I have big (tens of megabytes) fasta files that contain 
 > > whole C. elegans chromosomes, and want to fetch a few base pairs from the 
 > > middle of them without reading the whole record into memory.  Can 
 > > Bio::Index::Fasta do this?
 > 
 > 
 > I am pretty sure it can't do this (which is why I believe you checked in
 > DB::Fasta in the first place). Does DB::Fasta make assumptions about line
 > length so it can SEEK to the right place?

As DB::Fasta is reading the FASTA files it stores information about
the line lengths it encounters.  So each FASTA file can have a
different line length, and indeed each entry within each FASTA file
can have a different line length (but line lengths must be uniform
within an entry).
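
To make the seek arithmetic concrete, here is a minimal sketch in plain
Perl (core only, no BioPerl). It is not the actual Bio::DB::Fasta code:
build_index, fetch_subseq, and the hash fields (offset, bases, bytes,
length, short) are hypothetical names chosen for illustration. It assumes
each entry has a uniform line width, with only the last line allowed to be
shorter, and it keeps the index in memory rather than in a DBM file.

```perl
use strict;
use warnings;

# Scan a FASTA file once. For each entry record the byte offset of its
# first sequence byte, the bases per line, the bytes per line (bases plus
# line terminator), and the total sequence length.
sub build_index {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my (%index, $id);
    while (defined(my $line = <$fh>)) {
        if ($line =~ /^>(\S+)/) {
            $id = $1;
            # tell() now points just past the header line.
            $index{$id} = { offset => tell($fh), bases => 0,
                            bytes  => 0, length => 0, short => 0 };
            next;
        }
        next unless defined $id;
        (my $seq = $line) =~ s/[\r\n]+\z//;
        next unless length $seq;
        my $e = $index{$id};
        if (!$e->{bases}) {
            # The first sequence line sets the layout for this entry.
            $e->{bases} = length $seq;
            $e->{bytes} = length $line;
        }
        # Only the final line of an entry may be shorter than the rest.
        die "Non-uniform line length in entry $id\n"
            if $e->{short} || length($seq) > $e->{bases};
        $e->{short} = 1 if length($seq) < $e->{bases};
        $e->{length} += length $seq;
    }
    close $fh;
    return \%index;
}

# Fetch bases $start..$stop (1-based, inclusive) by seeking straight to
# the first byte instead of reading the whole record.
sub fetch_subseq {
    my ($path, $index, $id, $start, $stop) = @_;
    my $e = $index->{$id} or die "Unknown id $id\n";
    die "Bad range $start..$stop for $id\n"
        if $start < 1 || $stop < $start || $stop > $e->{length};

    # Map a 0-based base position to a byte position in the file.
    my $byte_of = sub {
        my ($base) = @_;
        return $e->{offset}
             + int($base / $e->{bases}) * $e->{bytes}
             + $base % $e->{bases};
    };
    my $first = $byte_of->($start - 1);
    my $last  = $byte_of->($stop - 1);

    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    seek($fh, $first, 0) or die "Cannot seek in $path: $!";
    my $got = read($fh, my $buf, $last - $first + 1);
    close $fh;
    die "Short read from $path\n"
        unless defined $got && $got == $last - $first + 1;

    $buf =~ tr/\r\n//d;    # drop line terminators inside the span
    return $buf;
}
```

With an index built once by build_index($path), a call such as
fetch_subseq($path, $index, 'chrA', 9, 13) reads only the bytes spanning
those five bases, however large the entry is.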

 > Clearly merging the two pieces would be great. It is not something I am
 > overly worried about but it would be nice. 
 > 
 > 
 > Two routes:
 > 
 > (I am assuming that we are still calling it Bio::Index::Fasta...)
 > 
 >   (a)
 > 
 >      Bio::Index::Fasta gives back a Bio::SeqI compliant object which is
 > actually a new thing called Bio::Seq::LargeFastaFixedLineLength (silly
 > name...). This object does not load the sequence into memory but executes
 > 
 >      $seq->subseq(100000,1000020);
 > 
 >      with a SEEK.
 > 
 > 
 >   (b) Bio::Index::Fasta will accept gets on slices
 > 
 > 
 > Reading the documentation of Bio::DB::Fasta I notice that you have put
 > nearly every access in (!) ---- I am always *so* impressed by your modules
 > Lincoln, they nearly always have every route into them first off.
 > 
 > 
 > 
 > So --- you have carte blanche to rearrange this area. As long as you are
 > convinced that you won't be affecting existing FASTA indexes you can do
 > what you like with Bio::Index::Fasta before 1.0 ---- it should work
 > however with existing indexes - (i.e., don't change the hash key
 > representations etc).

Scary.  For the time being I've removed Bio::DB::Fasta dependencies
from Bio::DB::GFF and LDAS.  I think I'll leave the big reorganization
until after 1.0. So much to do before the conference....

Lincoln

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org                                   Cold Spring Harbor, NY
Positions available at my lab: see http://stein.cshl.org/#hire
========================================================================