[Bioperl-l] Bio::Index::Fasta vs Bio::DB::Fasta

Tony Cox avc@sanger.ac.uk
Mon, 14 Jan 2002 08:51:26 +0000 (GMT)


On Sat, 12 Jan 2002, Ewan Birney wrote:

Just a note that I have a _lot_ of code and time invested internally in the
Bio::Index::Fasta modules. It forms a fairly major plank of our internal
sequence fetching architecture here at Sanger (along with the more complex
functionality of SRS). Most of the time it is used for "normal" sequence
fetching (EMBL clones etc) and not for chr-sized DNA chunks where DB::Fasta
really wins. It also complements the Fastq modules that can be used to get
matching quality data if it exists.

In short, does Index::Fasta _have_ to go?

Tony


+>On Fri, 11 Jan 2002, Lincoln Stein wrote:
+>
+>> Hi Folks,
+>> 
+>> I've just recently become aware that Bio::Index::Fasta has very heavy 
+>> overlapping functionality with Bio::DB::Fasta, and this is likely to lead to 
+>> some user confusion down the road.
+>> 
+>> I would remove Bio::DB::Fasta in favor of the Bio::Index version, except that 
+>> I don't think that Bio::Index::Fasta does the thing that first motivated 
+>> Bio::DB::Fasta, which was the ability to retrieve subsequences efficiently.  
+>> I have big (tens of megabytes) fasta files that contain 
+>> whole C. elegans chromosomes, and want to fetch a few base pairs from the 
+>> middle of them without reading the whole record into memory.  Can 
+>> Bio::Index::Fasta do this?
+>
+>
+>I am pretty sure it can't do this (which is why I believe you checked in
+>DB::Fasta in the first place). Does DB::Fasta make assumptions about line
+>length so it can SEEK to the right place?
+>
+>
+>Clearly merging the two pieces would be great. It is not something I am
+>overly worried about but it would be nice. 
+>
+>
+>Two routes:
+>
+>(I am assuming that we are still calling it Bio::Index::Fasta...)
+>
+>  (a)
+>
+>     Bio::Index::Fasta gives back a Bio::SeqI compliant object which is
+>actually a new thing called Bio::Seq::LargeFastaFixedLineLength (silly
+>name...). This object does not load the sequence into memory but executes
+>
+>     $seq->subseq(100000,1000020);
+>
+>     with a SEEK.
+>
+>
+>  (b) Bio::Index::Fasta will accept gets on slices
+>
+>
+>Reading the documentation of Bio::DB::Fasta I notice that you have put
+>nearly every access in (!) ---- I am always *so* impressed by your modules
+>Lincoln, they nearly always have every route into them first off.
+>
+>
+>
+>So --- you have carte blanche to rearrange this area. As long as you are
+>convinced that you won't be affecting existing FASTA indexes you can do
+>what you like with Bio::Index::Fasta before 1.0 ---- it should work
+>however with existing indexes - (ie, don't change the hash key
+>representations etc).
+>
+>
+>If you want to do a more serious reorganisation then it has got to be post
+>1.0.
+>
+>
+>
+>Your choice of options and code.
+>
+>
+>> 
+>> Lincoln
+>> 
+>> 
+>

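To illustrate the line-length question raised in the quoted thread: if every
sequence line in a FASTA record holds the same number of residues, a sequence
coordinate maps to a byte offset by simple arithmetic, and a subsequence can
be read with a single seek instead of loading the whole record. The following
is only a minimal sketch under that assumption. The names base_to_offset and
fetch_subseq are hypothetical and are not part of Bio::Index::Fasta or
Bio::DB::Fasta, and nothing here claims to reflect how either module is
actually implemented. It also assumes the byte offset of the first residue
and the line terminator length are already known (for example, stored in an
index).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not a BioPerl function): map a 1-based sequence
# coordinate to a byte offset in a FASTA file whose sequence lines all
# hold the same number of residues.
#   $data_start  byte offset of the first residue (just after the header)
#   $line_len    residues per sequence line
#   $term_len    bytes in the line terminator (1 for "\n", 2 for "\r\n")
sub base_to_offset {
    my ($pos, $data_start, $line_len, $term_len) = @_;
    my $zero  = $pos - 1;                    # 0-based residue index
    my $lines = int($zero / $line_len);      # complete lines before $pos
    return $data_start + $zero + $lines * $term_len;
}

# Hypothetical helper: fetch residues $start..$end (1-based, inclusive)
# by seeking, so only the bytes spanning the request are read.
sub fetch_subseq {
    my ($fh, $start, $end, $data_start, $line_len, $term_len) = @_;
    my $from = base_to_offset($start, $data_start, $line_len, $term_len);
    my $to   = base_to_offset($end,   $data_start, $line_len, $term_len);
    my $want = $to - $from + 1;

    seek($fh, $from, 0) or die "seek failed: $!";
    my $buf = '';
    my $got = read($fh, $buf, $want);
    die "short read" unless defined $got && $got == $want;

    $buf =~ tr/\r\n//d;                      # drop embedded line terminators
    return $buf;
}

# Demo on a small temporary file: 35 residues, 10 per line.
my $header = ">demo\n";
my $seq    = join '', map { ('A', 'C', 'G', 'T')[$_ % 4] } 0 .. 34;

my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $header;
print {$out} "$1\n" while $seq =~ /(.{1,10})/g;
close $out or die "close failed: $!";

open my $in, '<:raw', $path or die "open failed: $!";
my $got = fetch_subseq($in, 8, 23, length $header, 10, 1);
close $in;

# Compare against the same slice taken from the in-memory string.
print $got eq substr($seq, 7, 16) ? "match\n" : "mismatch\n";
```

The arithmetic breaks down as soon as one line in the record is shorter or
longer than the rest, so any real implementation would need to verify the
line length while indexing, which is presumably what the question about
"assumptions about line length" is getting at.
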
******************************************************
Tony Cox			Email:avc@sanger.ac.uk
Sanger Institute		WWW:www.sanger.ac.uk
Wellcome Trust Genome Campus	Webmaster
Hinxton				Tel: +44 1223 834244
Cambs. CB10 1SA			Fax: +44 1223 494919
******************************************************