[BioPython] help with retrieving seq

Thu, 1 Mar 2001 00:45:30 -0500

Hi Dinakar;

> We have an EST database at our local site in FASTA format. I was wondering
> whether there is any method in Biopython that will retrieve an EST sequence
> given the est_id. If you need more information, please let me know. I tried
> to look through the code but I cannot make it out.

Probably what you need is just a slight modification of the FASTA
parser example described in section 2.4.3 of the Tutorial. Here's a
quick function that I think does what you want:

import string
from Bio import Fasta

def locate_est(fasta_to_parse, id_to_find):
    """Find an EST with a given id.

    Arguments:

    o fasta_to_parse - The FASTA file containing all ESTs to search.

    o id_to_find - The id of the EST record we want to retrieve.

    Returns the FASTA formatted record, or None if the record could not
    be found.
    """
    # parse fasta files into FASTA Record classes
    parser = Fasta.RecordParser()
    fasta_handle = open(fasta_to_parse, 'r')
    # iterator to iterate over all FASTA records in the file
    iterator = Fasta.Iterator(fasta_handle, parser)

    while 1:
        # get the next record from the iterator
        cur_record = iterator.next()

        # if we ran out of records, we didn't find the id
        if cur_record is None:
            fasta_handle.close()
            return None

        # search for the ID in the title
        id_pos = string.find(cur_record.title, id_to_find)

        # if we found the string, return this record in FASTA format
        if id_pos != -1:
            fasta_handle.close()
            return str(cur_record)
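
One caveat with the substring search: string.find will also match an id
that is only part of a longer one (searching for "Z78532.1" would also hit
a title containing "Z78532.10"). Below is a minimal sketch of a stricter
check. It assumes NCBI-style titles like the one in this post, where the
identifier fields are separated by "|" and the description follows after a
space. title_has_id is a hypothetical helper name, not part of Biopython.

```python
def title_has_id(title, id_to_find):
    """Return True if id_to_find is one whole field of a FASTA title.

    Assumes NCBI-style titles such as
    'gi|2765657|emb|Z78532.1|CCZ78532 C.californicum ...'.
    """
    # the identifier block is everything before the first whitespace
    fields = title.split(None, 1)
    if not fields:
        return False
    # drop a leading '>' if present, then compare each '|' field exactly
    return id_to_find in fields[0].lstrip(">").split("|")
```

In locate_est, a call to this helper could stand in for the string.find
test if partial matches turn out to be a problem for your ids.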

With locate_est saved in a file called example.py, you can use it like this:

$ python
Python 2.1a2 (#1, Feb  3 2001, 15:37:56) 
[GCC 2.95.2 19991024 (release/franzo)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> from example import locate_est
>>> est_record = locate_est("/home/chapmanb/bioppjx/biopython/Doc/examples/ls_orchid.fasta", "Z78532.1")
>>> print est_record
>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATA
TGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCG
TGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTA
CGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAA
AAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGG
ATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGC
AGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAG
GGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATC
GCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTC
TAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCAT
AAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAA
TTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCC
CAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC
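
The same lookup can also be sketched without any Biopython dependency,
which may help if the Bio.Fasta interface differs in your installed
version. This is only a sketch using the Python standard library;
find_fasta_record is a hypothetical name, and it applies the same substring
test to the title line that locate_est uses.

```python
def find_fasta_record(fasta_path, id_to_find):
    """Return the first FASTA record whose title contains id_to_find.

    Returns the record as text (title line plus sequence lines), or None
    if no title matches.
    """
    record_lines = None  # lines of the matching record, once found
    with open(fasta_path) as fasta_handle:
        for line in fasta_handle:
            if line.startswith(">"):
                # a new title line ends the record we were collecting
                if record_lines is not None:
                    break
                if id_to_find in line:
                    record_lines = [line.rstrip("\n")]
            elif record_lines is not None and line.strip():
                # sequence line belonging to the matching record
                record_lines.append(line.rstrip("\n"))
    if record_lines is None:
        return None
    return "\n".join(record_lines)
```

Like locate_est, this reads the file from the top on every call and stops
as soon as the matching record has been read completely.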

Hope this helps.

Brad