[BioPython] GenBank file parsing question
Lee R. Shekter
lee at epigenomix.com
Mon Nov 17 14:48:22 EST 2003
All -
I have been using BioPython to parse GenBank files. So far I extract the
accession number, the date, and the sequence, and write them out in
FASTA format as input to a sequence analysis program. I would also like
to grab the country of origin and the strain. I think these can be
parsed, since they are documented in Bio/GenBank/genbank_format.py under
a section called feature_key_names. However, I can't figure out the
syntax for getting at them after loading and parsing my file like this:
#!/usr/bin/env python
from Bio import GenBank,Writer
import sys,string

def GenBank2FASTA():
    parser = GenBank.RecordParser()
    genbankfile = open(sys.argv[1],'r')
    iterator = GenBank.Iterator(genbankfile,parser)
    while 1:
        cur_record = iterator.next()
        if cur_record is None:
            break
        outfile = open(sys.argv[2],'a')
        cur_record.accession = " ".join(cur_record.accession)
        outfile.write('>%s|%s\n' %(cur_record.accession,cur_record.date))
        seqwidth = 60
        for i in range(0,len(cur_record.sequence),seqwidth):
            outfile.write('%s\n' % cur_record.sequence[i:i+seqwidth])
etc.,etc.,
This results in a file that looks like this:
>AAQ10924|24-SEP-2003
MAGRSGDDDKELLKAVKIIKILYQSNPYPEPKGSRQARKNRRRRWRARQRQIDSISERIL
STCLGRPTEPVPLQLPPLERLHLDSREDCGTSGTQQSQGVETGVGRPQISVESSGVLGSR
TET
>AAQ10915|24-SEP-2003
MAGRSGDNDEELLKAVRIIKILYKSNPYPEPKGSRQARKNRRRRWRARQRQIDSISERIL
STYLGRSTEPVPLQLPPLERLHLDCREDCGTSGTQQSQGVETGVGRPQISVESPVILGSR
TKN
>AAQ10906|24-SEP-2003
MAGRSGDGDEGILPTVKIIQILYPSHPYPEPKGSRQARKNRRRRWRARQKQIDSISERIL
STCLGRPAEPVPLQLPPLERLHLDSREDCGTSGTQQSQGVETGVGRPQISVESSGVLGSR
TET
What I'd like is to have this, plus the country and strain in the header
line:
>AAQ10906|24-SEP-2003 | country | strain
MAGRSGDGDEGILPTVKIIQILYPSHPYPEPKGSRQARKNRRRRWRARQKQIDSISERIL
STCLGRPAEPVPLQLPPLERLHLDSREDCGTSGTQQSQGVETGVGRPQISVESSGVLGSR
TET
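To make the target concrete, here is a rough sketch of the kind of helper
I have in mind. It rests on assumptions I have not been able to confirm:
that each parsed record has a features list, that each feature has a key
(such as "source") and a list of qualifiers, and that each qualifier has
a key like "/country=" and a quoted value. The names find_qualifier and
fasta_header are my own, and the stand-in record at the bottom is built
by hand with placeholder values rather than read from a real file.

```python
from types import SimpleNamespace


def find_qualifier(record, name, feature_key="source"):
    """Return the first value of qualifier `name` on a matching feature, or None.

    ASSUMPTION: record.features is a list of objects with .key and
    .qualifiers, and each qualifier has .key ("/country=") and .value.
    """
    wanted = "/%s=" % name
    for feature in record.features:
        if feature.key.strip() != feature_key:
            continue
        for qualifier in feature.qualifiers:
            if qualifier.key == wanted:
                # Values are assumed to arrive wrapped in double quotes.
                return qualifier.value.strip('"')
    return None


def fasta_header(record):
    """Build '>accession|date | country | strain' for one record."""
    accession = " ".join(record.accession)
    country = find_qualifier(record, "country") or "unknown"
    strain = find_qualifier(record, "strain") or "unknown"
    return ">%s|%s | %s | %s" % (accession, record.date, country, strain)


# Hand-built stand-in for a parsed record (placeholder values, not real data).
example_record = SimpleNamespace(
    accession=["AAQ10906"],
    date="24-SEP-2003",
    features=[
        SimpleNamespace(
            key="source",
            qualifiers=[
                SimpleNamespace(key="/country=", value='"country"'),
                SimpleNamespace(key="/strain=", value='"strain"'),
            ],
        )
    ],
)

print(fasta_header(example_record))
```

If the real record objects look anything like this, the header line in my
loop could come from fasta_header(cur_record) instead of the format
string I use now.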
I'm sure there's a pretty simple solution to my problem. Any help with
using BioPython to parse the country of origin and strain fields would
be appreciated.
Lee Shekter