[BioPython] help with retrieving seq
Jeffrey Chang
jchang@SMI.Stanford.EDU
Thu, 1 Mar 2001 18:30:58 -0800 (PST)
Hi Dinakar,
It sounds like you want to create a dictionary to help retrieve records
from a FASTA-formatted sequence file in real time. There's code for that
in the Bio/Fasta package in Biopython.
Look at the Fasta.index_file function along with the Dictionary
class. The code iterates through the FASTA file and saves the file
offset of each record, so that you can later jump straight to any record
without rescanning the file. Please let me know if you would like me to
send you an example!
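To make the idea concrete, here is a rough sketch in plain Python. It is not the Bio/Fasta code itself: the names build_offset_index and fetch_record are made up for illustration, and it assumes the record ID is the first word after the ">" on each header line.

```python
def build_offset_index(path):
    """Scan the file once and map each record ID to the byte offset
    of its header line. (Hypothetical helper, not a Biopython call.)"""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()      # position of the line about to be read
            line = handle.readline()
            if not line:                # end of file
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # Assumption: the ID is the first word of the header.
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_record(path, index, record_id):
    """Seek directly to a record and return (header, sequence)."""
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        header = handle.readline().decode("ascii").rstrip()
        parts = []
        for line in handle:
            if line.startswith(b">"):   # the next record starts here
                break
            parts.append(line.decode("ascii").strip())
    return header[1:], "".join(parts)
```

The scan in build_offset_index only has to happen once. After that, fetch_record does a single seek, so a record near the end of the file should cost about the same as one near the start.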
Jeff
On Thu, 1 Mar 2001, Dinakar wrote:
> Brad and others:
>
> Thank you very much. Today I tried to create a dictionary with est_id as
> the key and a file marker as the value, i.e. the position of the record in
> the database file (this may not be the best solution). It took about 1 hour
> on our Beowulf cluster with 4 GB of memory (most of the time was spent
> reading the file, I guess). I used Brad's example of the FASTA parser to
> create the dictionary, which is about 60 MB. I used cPickle to load it. It
> takes about 4 minutes to load the dictionary, look up a key, seek to that
> location in the database file, and retrieve the sequence (I tried a
> sequence at the end of the file). There must be a better algorithm for
> searching such a big file. A friend of mine suggested using a database to
> store the keys and file locations. Someone else suggested GDBM (from GNU).
> Does anyone have a better solution than what I am doing now? (I am sure
> there are better solutions.)
>
> I hope to hear from you soon.
>
> Thank you.
>
> Dinakar
>
>
> Brad Chapman wrote:
>
> > Hi Dinakar;
> >
> > [Finding records in FASTA files]
> > > It works well for sequences that are close to the start of the file,
> > > but if the sequence is towards the end, it takes almost forever (I mean
> > > it is slow).
> >
> > Yup, definitely true. If you have really big files, this probably
> > isn't the best approach.
> >
> > > Is there any indexing technique? I was thinking I should create
> > > some sort of index, because I will be doing this quite often and that
> > > way the search can be really fast. Or is there an efficient method of
> > > searching an EST database? Does anyone have any suggestions regarding
> > > indexing?
> >
> > You probably want to check out the next section in the Tutorial:
> >
> > 2.4.4. FASTA files as Dictionaries
> >
> > The example there actually indexes a FASTA file by accession
> > number. This sounds really close to what you need. Let us know if you
> > have problems modifying the example to fit your actual case. BTW,
> > the example code is in Doc/examples/fasta_dictionary.py if you want to
> > start from that.
> >
> > Hope this helps,
> > Brad
>