[BioPython] GenBank file parsing question
Brad Chapman
chapmanb at uga.edu
Mon Nov 24 12:15:03 EST 2003
Hi Lee;
Sorry for the delay in getting back to you.
> I also want to grab the country of origin and
> the strain. I think these are available to be parsed as they are
Yes, they are definitely parsed. To get them you will need to go
into the feature table of the GenBank Record object. I'm not sure
where country and strain are always stored, as I don't normally
look at bacterial GenBank entries, but from a quick glance they look
to be in the "source" feature. If this is generally true, then the
following code should work:
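
def get_country_and_strain(record):
    """Return (country, strain) from the "source" feature of a Record."""
    country = "No country"
    strain = "No strain"
    for feature in record.features:
        if feature.key == "source":
            for qualifier in feature.qualifiers:
                # The Record parser may keep the flat-file form of the key
                # (e.g. "/country=") and the quotes around the value, so
                # strip both before comparing.
                name = qualifier.key.strip("/=")
                if name == "country":
                    country = qualifier.value.strip('"')
                elif name == "strain":
                    strain = qualifier.value.strip('"')
    return country, strain
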
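To try the lookup without a GenBank file on hand, here is a minimal sketch that walks the same features/qualifiers structure. The namedtuple classes and the name source_qualifiers are hypothetical stand-ins for illustration, not Biopython classes; they only mimic the attributes used above. It assumes qualifier keys may be stored in flat-file form ("/country=") with quoted values, and normalizes both, so it also works if they are stored bare. The organism, strain, and country values are made up.

```python
from collections import namedtuple

# Hypothetical stand-ins that mimic only the attributes used here;
# they are not Biopython classes.
Qualifier = namedtuple("Qualifier", ["key", "value"])
Feature = namedtuple("Feature", ["key", "qualifiers"])
Record = namedtuple("Record", ["features"])


def source_qualifiers(record, wanted=("country", "strain")):
    """Return a dict of the wanted qualifiers found in "source" features."""
    found = {}
    for feature in record.features:
        if feature.key != "source":
            continue
        for qualifier in feature.qualifiers:
            # Assumption: keys may look like "/country=" and values may be
            # wrapped in double quotes, so normalize both before matching.
            name = qualifier.key.strip("/=")
            if name in wanted:
                found[name] = qualifier.value.strip('"')
    return found


# Made-up record with one "source" feature and one unrelated feature.
record = Record(features=[
    Feature("source", [Qualifier("/organism=", '"Escherichia coli"'),
                       Qualifier("/strain=", '"K-12"'),
                       Qualifier("/country=", '"USA"')]),
    Feature("CDS", [Qualifier("/gene=", '"thrA"')]),
])
result = source_qualifiers(record)
print(result.get("country", "No country"), result.get("strain", "No strain"))
```

Using dict.get with a default gives the same "No country" / "No strain" fallbacks when a record has no such qualifier.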
Then you should be able to modify your code to something like the
following (I also modified it to use Fasta Records, which write the
sequences prettily for you automatically):

#!/usr/bin/env python
from Bio import GenBank, Fasta
import sys

def GenBank2FASTA():
    parser = GenBank.RecordParser()
    genbankfile = open(sys.argv[1], 'r')
    # Open the output once, rather than reopening it for every record.
    outfile = open(sys.argv[2], 'a')
    iterator = GenBank.Iterator(genbankfile, parser)
    while 1:
        cur_record = iterator.next()
        # The iterator returns None when there are no more records.
        if cur_record is None:
            break
        accession = " ".join(cur_record.accession)
        country, strain = get_country_and_strain(cur_record)
        fasta_rec = Fasta.Record()
        fasta_rec.title = "%s|%s|%s|%s" % (accession,
                cur_record.date, country, strain)
        fasta_rec.sequence = cur_record.sequence
        outfile.write(str(fasta_rec) + "\n")
    outfile.close()
    genbankfile.close()

if __name__ == "__main__":
    GenBank2FASTA()

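For a sense of what each written entry looks like, here is a rough sketch of the formatting in plain Python: a ">" title line followed by the sequence wrapped to a fixed width. The function name format_fasta is hypothetical, the 60-column width is an assumption rather than a confirmed Fasta Record default, and the accession, date, country, and strain values are made up.

```python
def format_fasta(title, sequence, width=60):
    """Return a FASTA entry: a '>' title line, then wrapped sequence lines."""
    lines = [">" + title]
    # Slice the sequence into chunks of at most `width` characters.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines)


# Made-up values, joined the same way as the title in the script above.
title = "%s|%s|%s|%s" % ("AB000001", "01-JAN-2000", "USA", "K-12")
entry = format_fasta(title, "ACGT" * 40)  # 160 bases
print(entry)
```

The "|" separators in the title make it easy to split the accession, date, country, and strain back out of the FASTA header later.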
Hopefully this makes sense and helps. Sorry again about the delay in
responding.
Brad