[BioPython] help with retrieving seq

Dinakar Desai Desai.Dinakar@mayo.edu
Thu, 01 Mar 2001 13:24:22 -0600


Hello:

It works well for sequences that are near the start of the file, but if
the sequence is towards the end, it takes almost forever (I mean it is
very slow). Is there any indexing technique? I was thinking I should
create some sort of index, because I will be doing this quite often and
that way the search can be really fast. Or is there a more efficient
method of searching an EST database? Does anyone have any suggestions
regarding indexing?
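
To make the index idea concrete, here is a minimal sketch of the kind of
thing I have in mind. It is not part of Biopython: the names
build_offset_index and fetch_record are made up for illustration, it uses
only the Python 3 standard library, and it assumes the record ID is the
first whitespace-delimited token on each ">" header line (for NCBI-style
"gi|...|emb|..." headers the key would need to be split further). The
file is scanned once to record the byte offset of every header, and later
lookups seek straight to the record instead of reading the whole file.

```python
def build_offset_index(fasta_path):
    """Scan the FASTA file once and map each record ID to its byte offset."""
    index = {}
    # Binary mode keeps tell()/seek() offsets exact byte positions.
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                # Assumption: the ID is the first token after ">".
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_record(fasta_path, index, record_id):
    """Return one record as FASTA text, or None if the ID is not indexed."""
    offset = index.get(record_id)
    if offset is None:
        return None
    lines = []
    with open(fasta_path, "rb") as handle:
        handle.seek(offset)
        lines.append(handle.readline())  # the ">" header line
        for line in handle:
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

Building the index still costs one full pass over the file, so this only
pays off if the dictionary is kept around, for example by saving it to
disk with the json or dbm modules and rebuilding it only when the EST
file changes.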

Thanks a lot.

Dinakar


Brad Chapman wrote:
> 
> Hi Dinakar;
> 
> > We have an EST database at our local site in FASTA format. I was
> > wondering whether there is any method in Biopython that will retrieve
> > an EST sequence given the est_id. If you need more information, please
> > let me know. I tried to look through the code but could not make it out.
> 
> Probably what you need is just a slight modification of the FASTA
> parser described in 2.4.3 of the Tutorial. Here's a quick function
> that I think does what you want:
> 
> import string
> from Bio import Fasta
> 
> def locate_est(fasta_to_parse, id_to_find):
>     """Find an EST with a given id.
> 
>     Arguments:
> 
>     o fasta_to_parse - The FASTA file containing all ESTs to search.
> 
>     o id_to_find - The id of the EST record we want to retrieve.
> 
>     Returns the FASTA formatted record, or None if the record could not
>     be found.
>     """
>     # parse fasta files into FASTA Record classes
>     parser = Fasta.RecordParser()
>     fasta_handle = open(fasta_to_parse, 'r')
>     # iterator to iterate over all FASTA records in the file
>     iterator = Fasta.Iterator(fasta_handle, parser)
> 
>     while 1:
>         # get the next record from the iterator
>         cur_record = iterator.next()
> 
>         # if we ran out of records, we didn't find the id
>         if cur_record is None:
>             fasta_handle.close()
>             return None
> 
>         # search for the ID in the title
>         id_pos = string.find(cur_record.title, id_to_find)
> 
>         # if we found the string, return this record in FASTA format
>         if id_pos != -1:
>             fasta_handle.close()
>             return str(cur_record)
> 
> $ python
> Python 2.1a2 (#1, Feb  3 2001, 15:37:56)
> [GCC 2.95.2 19991024 (release/franzo)] on linux2
> Type "copyright", "credits" or "license" for more information.
> >>> from example import locate_est
> >>> est_record = locate_est("/home/chapmanb/bioppjx/biopython/Doc/examples/ls_orchid.fasta", "Z78532.1")
> >>> print est_record
> >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
> CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATA
> TGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCG
> TGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTA
> CGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAA
> AAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGG
> ATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGC
> AGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAG
> GGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATC
> GCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTC
> TAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCAT
> AAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAA
> TTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCC
> CAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC
> 
> Hope this helps.
> 
> Brad