[BioPython] GenBank file parsing question

Mon Nov 24 12:15:03 EST 2003

Hi Lee;
Sorry for the delay in getting back to you.

> I also want to grab the country of origin and
> the strain. I think these are available to be parsed as they are

Yes, they definitely are parsed. To get them you will need to go
into the feature table of the GenBank Record object. I'm not sure
where country and strain are always stored, as I don't normally
look at bacterial GenBank entries, but from a quick glance they look
to be in the "source" feature. If that is true generally, then the
following code should work:

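def get_country_and_strain(record):
    """Return (country, strain) from the record's "source" feature."""
    country = "No country"
    strain = "No strain"
    for feature in record.features:
        if feature.key == "source":
            for qualifier in feature.qualifiers:
                # Record-style qualifiers appear to keep the GenBank
                # decoration (key '/country=', value '"USA"'), so strip it
                # before comparing; this is harmless if it is not there.
                key = qualifier.key.strip("/=")
                value = qualifier.value.strip('"')
                if key == "country":
                    country = value
                elif key == "strain":
                    strain = value

    return country, strain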
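
If you want to check the lookup logic without a GenBank file on hand, here is a minimal sketch that runs on its own. The Qualifier, Feature and Record classes below are hypothetical stand-ins that only mimic the attributes used above (features, key, qualifiers, value); they are not the Biopython classes. It also assumes that qualifier keys may arrive decorated as "/strain=" with quoted values, and strips that decoration before comparing.

```python
from collections import namedtuple

# Hypothetical stand-ins for the parsed record objects. They only mimic the
# attributes the lookup needs and are NOT the real Biopython classes.
Qualifier = namedtuple("Qualifier", ["key", "value"])
Feature = namedtuple("Feature", ["key", "qualifiers"])
Record = namedtuple("Record", ["features"])


def find_source_qualifiers(record, wanted=("country", "strain")):
    """Return {name: value} for the wanted qualifiers found on "source" features."""
    found = {}
    for feature in record.features:
        if feature.key.strip() != "source":
            continue
        for qualifier in feature.qualifiers:
            # Assumption: keys may look like '/strain=' and values like '"K-12"'.
            name = qualifier.key.strip().strip("/=")
            if name in wanted:
                found[name] = qualifier.value.strip().strip('"')
    return found


# Made-up demo data: one "source" feature and one unrelated "gene" feature.
demo = Record(features=[
    Feature("source", [Qualifier("/strain=", '"K-12"'),
                       Qualifier("/country=", '"USA"')]),
    Feature("gene", [Qualifier("/gene=", '"thrL"')]),
])

print(find_source_qualifiers(demo))
```

Returning a dictionary instead of a fixed pair makes it easy to ask for other qualifiers later by changing the wanted tuple; missing qualifiers are simply absent from the result.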

Then you should be able to modify your code to something like this (I
also modified it to use Fasta Records, which write the sequences
prettily for you automatically):

#!/usr/bin/env python
import sys
from Bio import GenBank, Fasta

def GenBank2FASTA():
    parser = GenBank.RecordParser()
    genbankfile = open(sys.argv[1], 'r')
    # open the output once, rather than re-opening it for every record
    outfile = open(sys.argv[2], 'a')
    iterator = GenBank.Iterator(genbankfile, parser)
    while 1:
        cur_record = iterator.next()
        if cur_record is None:
            break
        accession = " ".join(cur_record.accession)
        country, strain = get_country_and_strain(cur_record)
        fasta_rec = Fasta.Record()
        fasta_rec.title = "%s|%s|%s|%s" % (accession,
           cur_record.date, country, strain)
        fasta_rec.sequence = cur_record.sequence
        outfile.write(str(fasta_rec) + "\n")
    outfile.close()
    genbankfile.close()

if __name__ == "__main__":
    GenBank2FASTA()
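
To give a rough idea of what "writing the sequences prettily" amounts to, here is a minimal pure-Python sketch of FASTA formatting: a ">" title line followed by the sequence broken into fixed-width lines. The function name format_fasta and the 60-column width are assumptions for illustration, not the Fasta Record implementation, and the title and sequence are made-up placeholder values.

```python
def format_fasta(title, sequence, width=60):
    """Return one FASTA entry: a '>' title line, then the sequence in
    lines of at most `width` characters (width is an assumed default)."""
    lines = [">" + title]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Placeholder values in the accession|date|country|strain layout used above.
entry = format_fasta("ACC0001|01-JAN-2000|USA|K-12", "ACGT" * 40)
print(entry)
```

The title here follows the same pipe-separated layout as the script, so the output of this sketch should look broadly like what ends up in the output file.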

Hopefully this makes sense and helps. Sorry again about the delay in
responding.
Brad