[BioPython] GenBank file parsing question

Lee R. Shekter lee at epigenomix.com
Mon Nov 17 14:48:22 EST 2003


All -

I have been using BioPython to parse GenBank files. So far I have been
pulling out the accession number, the date, and the sequence, and writing
them out in FASTA format as input for a sequence analysis program. I also
want to grab the country of origin and the strain. I think these can be
parsed, since they are documented in Bio/GenBank/genbank_format.py under a
section called feature_key_names. However, I can't figure out the syntax
to get at them after loading and parsing my file like this:

#!/usr/bin/env python
from Bio import GenBank, Writer
import sys, string

def GenBank2FASTA():
    parser = GenBank.RecordParser()
    genbankfile = open(sys.argv[1], 'r')
    # Open the output file once instead of re-opening it for every record.
    outfile = open(sys.argv[2], 'a')
    iterator = GenBank.Iterator(genbankfile, parser)
    seqwidth = 60
    while 1:
        cur_record = iterator.next()
        if cur_record is None:
            break
        # Join the accession entries into a single string for the header.
        accession = " ".join(cur_record.accession)
        outfile.write('>%s|%s\n' % (accession, cur_record.date))
        # Write the sequence wrapped at seqwidth characters per line.
        for i in range(0, len(cur_record.sequence), seqwidth):
            outfile.write('%s\n' % cur_record.sequence[i:i+seqwidth])
    # etc., etc.

This results in a file that looks like this:
>AAQ10924|24-SEP-2003
MAGRSGDDDKELLKAVKIIKILYQSNPYPEPKGSRQARKNRRRRWRARQRQIDSISERIL
STCLGRPTEPVPLQLPPLERLHLDSREDCGTSGTQQSQGVETGVGRPQISVESSGVLGSR
TET
>AAQ10915|24-SEP-2003
MAGRSGDNDEELLKAVRIIKILYKSNPYPEPKGSRQARKNRRRRWRARQRQIDSISERIL
STYLGRSTEPVPLQLPPLERLHLDCREDCGTSGTQQSQGVETGVGRPQISVESPVILGSR
TKN
>AAQ10906|24-SEP-2003
MAGRSGDGDEGILPTVKIIQILYPSHPYPEPKGSRQARKNRRRRWRARQKQIDSISERIL
STCLGRPAEPVPLQLPPLERLHLDSREDCGTSGTQQSQGVETGVGRPQISVESSGVLGSR
TET

What I'd like is to have this, plus the country and strain in the header:
>AAQ10906|24-SEP-2003 | country | strain
MAGRSGDGDEGILPTVKIIQILYPSHPYPEPKGSRQARKNRRRRWRARQKQIDSISERIL
STCLGRPAEPVPLQLPPLERLHLDSREDCGTSGTQQSQGVETGVGRPQISVESSGVLGSR
TET
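
To make the target format concrete, here is a minimal sketch of how such a
header line could be assembled once the country and strain strings are in
hand. The function names fasta_header and wrap_sequence are hypothetical,
and the "unknown" placeholder for a missing field is an assumption, not
something specified above.

```python
def fasta_header(accession, date, country=None, strain=None):
    """Build a header like '>ACC|DATE | country | strain'.

    Assumption: a missing country or strain is written as 'unknown'.
    """
    fields = [accession + "|" + date]
    for value in (country, strain):
        fields.append(value if value else "unknown")
    return ">" + " | ".join(fields)


def wrap_sequence(sequence, width=60):
    """Split a sequence string into lines of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]
```

Calling fasta_header("AAQ10906", "24-SEP-2003", "country", "strain") gives
the header line shown in the example above.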

I'm sure there's a pretty simple solution to my problem. Any help with
using BioPython to parse the country of origin and strain fields would be
appreciated.
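
For reference, the kind of lookup being asked about might be sketched as
follows. This is not a confirmed answer. It assumes each parsed record has
a features list, that each feature has a key and a list of qualifiers, and
that each qualifier stores its key in the form "/country=" with a quoted
value, as the Bio.GenBank.Record classes appear to do. The helper name
source_qualifier is hypothetical, and the demo record is a stand-in with
made-up values rather than real parser output.

```python
from types import SimpleNamespace


def source_qualifier(record, name):
    """Return the value of /name= on the 'source' feature, or None.

    Assumes record.features[i].qualifiers[j] has .key (e.g. '/strain=')
    and .value (e.g. '"XYZ"'); surrounding quotes are stripped.
    """
    wanted = "/" + name + "="
    for feature in record.features:
        if feature.key.strip() != "source":
            continue
        for qualifier in feature.qualifiers:
            if qualifier.key == wanted:
                return qualifier.value.strip('"')
    return None


# Stand-in record with placeholder values, only to exercise the helper.
demo = SimpleNamespace(features=[
    SimpleNamespace(key="source", qualifiers=[
        SimpleNamespace(key="/country=", value='"Exampleland"'),
        SimpleNamespace(key="/strain=", value='"XYZ"'),
    ]),
])
```

With a real record from GenBank.RecordParser, the same call would be
source_qualifier(cur_record, "country"), provided the attribute layout
assumed above holds.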

Lee Shekter




