[BioPython] help with retrieving seq

Thomas Sicheritz-Ponten thomas@cbs.dtu.dk
02 Mar 2001 08:03:25 +0100


Dinakar Desai <Desai.Dinakar@mayo.edu> writes:

> Hello:
> 
> It works well for the sequences that are closer to the start of the file,
> but if the sequence is towards the end, it takes almost forever (I mean it
> is slow). Is there any indexing technique? I was thinking I should create
> some sort of index, because I will be doing this quite often and that way
> the search can be really fast. Or is there any efficient method of searching
> an EST database? Does anyone have any suggestions regarding indexing?

Look at my example in biopython/Doc/examples/getgene.py
This script indexes a SWISS + TREMBL flatfile for fast retrieval of
entries. (It scans all entries and records the start and stop position of
each one in a gdbm db, like "yank" from TIGR.)


Create the index:
setenv PYPHY /opt/bio/databases
getgene.py --index nr.dat

Look up an entry:
getgene.py EFTU_ECOLI


The indexing takes some time (anything from 15 minutes to 3 hours), but
the retrieval of individual entries is really fast (even for those entries
at the end of the file). It would be very easy to modify this script for
FASTA files.
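
For FASTA, the same idea could look roughly like the sketch below. This is
not the code from getgene.py: the function names build_fasta_index and
fetch_record are made up for illustration, the standard-library dbm module
stands in for gdbm, and the first word after ">" is assumed to be a unique
key for each record.

```python
import dbm


def build_fasta_index(fasta_path, index_path):
    """Scan the FASTA file once and store 'start end' byte offsets per record."""
    # Binary mode keeps tell()/seek() offsets exact.
    with open(fasta_path, "rb") as fasta, dbm.open(index_path, "n") as index:
        name, start = None, 0
        while True:
            pos = fasta.tell()
            line = fasta.readline()
            if not line or line.startswith(b">"):
                # A new header (or end of file) closes the previous record.
                if name is not None:
                    index[name] = b"%d %d" % (start, pos)
                if not line:
                    break
                # Assumption: the first word after ">" is the lookup key.
                fields = line[1:].split()
                name = fields[0] if fields else None
                start = pos


def fetch_record(fasta_path, index_path, name):
    """Jump straight to one record using the stored offsets."""
    with dbm.open(index_path, "r") as index:
        start, end = map(int, index[name.encode()].split())
    with open(fasta_path, "rb") as fasta:
        fasta.seek(start)
        return fasta.read(end - start).decode()
```

Run build_fasta_index("est.fasta", "est.idx") once, then call
fetch_record("est.fasta", "est.idx", "some_id") as often as needed. Each
lookup is one dbm read plus one seek, so the position of the record in the
file no longer matters. The index has to be rebuilt whenever the FASTA file
changes, since the stored offsets would otherwise point at the wrong bytes.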

good luck,
-thomas
-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...