[BioPython] help with retrieving seq
Brad Chapman
chapmanb@arches.uga.edu
Thu, 1 Mar 2001 00:45:30 -0500
Hi Dinakar;
> We have est database at localsite in fasta format. I was wondering is
> there any method in biopython that will retrieve est sequence given the
> est_id. If you need more information, please let me know. I tried to
> look through the code but I can not makeout.
Probably what you need is just a slight modification of the FASTA
parser described in section 2.4.3 of the Tutorial. Here's a quick
function that I think does what you want:
import string
from Bio import Fasta

def locate_est(fasta_to_parse, id_to_find):
    """Find an EST with a given id.

    Arguments:
    o fasta_to_parse - The FASTA file containing all ESTs to search.
    o id_to_find - The id of the EST record we want to retrieve.

    Returns the FASTA formatted record, or None if the record could not
    be found.
    """
    # parse fasta files into FASTA Record classes
    parser = Fasta.RecordParser()
    fasta_handle = open(fasta_to_parse, 'r')

    # iterator to iterate over all FASTA records in the file
    iterator = Fasta.Iterator(fasta_handle, parser)

    while 1:
        # get the next record from the iterator
        cur_record = iterator.next()

        # if we ran out of records, we didn't find the id
        if cur_record is None:
            fasta_handle.close()
            return None

        # search for the ID in the title
        id_pos = string.find(cur_record.title, id_to_find)

        # if we found the string, return this record in FASTA format
        if id_pos != -1:
            fasta_handle.close()
            return str(cur_record)
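
If the Bio.Fasta module isn't available in your installation, the same
linear scan can be roughly sketched with plain Python. This is only an
illustration, not Biopython code: find_fasta_record and the file name
ests.fasta are made-up names, and it assumes every record starts with a
'>' header line.

```python
def find_fasta_record(lines, id_to_find):
    """Return the first FASTA record whose header contains id_to_find.

    lines is any iterable of text lines (an open file works).
    Returns the record as a string, or None if nothing matched.
    Hypothetical helper for illustration; not part of Biopython.
    """
    header = None    # header line of the record currently being read
    seq_lines = []   # sequence lines collected for that record

    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # a new record begins; stop if the previous one matched
            if header is not None and id_to_find in header:
                break
            header, seq_lines = line, []
        elif header is not None and line:
            seq_lines.append(line)

    # covers both the early break and a match in the last record
    if header is not None and id_to_find in header:
        return "\n".join([header] + seq_lines)
    return None


# Usage (ests.fasta is an assumed file name):
#     with open("ests.fasta") as handle:
#         record = find_fasta_record(handle, "Z78532.1")
```

Like locate_est, this reads the file from the top on every call, so it
is fine for occasional lookups but slow for many lookups in a big file.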
If you save locate_est in example.py, you can use it like this:

$ python
Python 2.1a2 (#1, Feb 3 2001, 15:37:56)
[GCC 2.95.2 19991024 (release/franzo)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> from example import locate_est
>>> est_record = locate_est("/home/chapmanb/bioppjx/biopython/Doc/examples/ls_orchid.fasta", "Z78532.1")
>>> print est_record
>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATA
TGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCG
TGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTA
CGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAA
AAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGG
ATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGC
AGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAG
GGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATC
GCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTC
TAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCAT
AAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAA
TTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCC
CAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC
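
One thing to watch for: locate_est does a substring search on the
title, so a shorter id such as "Z7853" would also match the record
above. If that matters for your EST ids, a stricter test might look
something like the sketch below. It assumes NCBI-style titles like the
one in the output, with '|' between the identifier fields and the
description after the first space; title_has_id is a made-up helper
name, not something Biopython provides.

```python
def title_has_id(title, wanted_id):
    """True if wanted_id equals one whole '|'-separated field of the title.

    Hypothetical helper; assumes titles shaped like
    'gi|2765657|emb|Z78532.1|CCZ78532 description text'.
    """
    # drop a leading '>' in case the raw header line was passed in
    title = title.lstrip(">")
    # the identifier part is everything before the first whitespace
    parts = title.split(None, 1)
    id_part = parts[0] if parts else ""
    # compare against whole fields rather than substrings
    return wanted_id in id_part.split("|")
```

In locate_est you could then test title_has_id(cur_record.title,
id_to_find) in place of the string.find check.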
Hope this helps.
Brad