[Biopython-dev] About a new GenBankWriter class with SeqIO interface

Tue May 15 20:36:05 UTC 2007

Hello everyone,

   I started using Biopython in my research and I needed a way to
write GenBank files from a SeqRecord (which was parsed from other
GenBank/etc files). So I wrote something up. It uses the SeqIO
interface and behaves like the fasta writer.

The SeqIO.write(record, handle, "genbank") interface accepts "record"
as either a SeqRecord generator with multiple records or a single
record from SeqRecord. So record = SeqRecord or record =
SeqRecord.next() both work. (I'm a relatively new to Python, so please
excuse any bad terminology or stylistic deficiencies).

The changes are: a new file called GenBankWriter.py in Bio/GenBank.
Small changes to the __init__.py of Bio/GenBank. Changes to the
_feed_first_line function of Scanner.py of Bio/GenBank.

I had to change the way Bio/GenBank/Scanner.py reads the Locus line of
a GenBank file in order to handle missing data and newer molecule
types (e.g. ss-RNA, ds-DNA, mt-DNA, etc). I also add/change a couple
of lines in __init__.py to store whether a sequence was linear or
circular and to store the string that encodes its molecule type
(ss-RNA, etc). The output of SeqIO.write(record,handle,"genbank") is
functionally identical to a GenBank file from NCBI except for some
spacing and word wrap issues.

What is the best way to submit new code for review? Whom do I send it
to and should I send only the modified files?

I've included one of my test scripts below just to show how it works.
(Does anyone suggest any changes in the interface?)

Thank you.

Sincerely,

   Howard Salis
   Postdoctoral Scholar
   UC San Francisco

#ASimpleTest.py
"""A vigorous exercise of the GenBankWriter class and the SeqIO interface."""

from Bio import SeqIO
from Bio import GenBank

working_dir = "E:\\Plasmids\\"

#Get some arbitrarily chosen GenBank files (these are relatively small ones)
gi_list = GenBank.search_for("EF470550 OR EF470551")
print gi_list
ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank")

#Write the pair of strings to a single file.
handle = open(working_dir + "Source.gb","w")
for gi in gi_list:
    handle.write(str(ncbi_dict[gi]))
handle.close()

#Parse the Source file into a SeqRecord generator (two records)
handle = open(working_dir + "Source.gb","r")
records = SeqIO.parse(handle,"genbank")

#write many records to a single GenBank file
file = open(working_dir + "ManyRecords.gb","w")
SeqIO.write(records,file,"genbank")
file.close()
handle.close()

#----

#Parse the Source file into a SeqRecord generator (two records)
handle = open(working_dir +"Source.gb","r")
records = SeqIO.parse(handle,"genbank")

#Write individual records into their own GenBank file
counter=0
for record in records:
    counter+=1
    file = open(working_dir + "OneFile_" + str(counter) + ".gb","w")
    SeqIO.write(record,file,"genbank")
    file.close()

handle.close()

#Open then back up again, parse them, and write them to a single file
handle = open(working_dir + "ManyRecords_Out.gb","w")
for num in range(1,counter+1):
    print num
    file = open(working_dir +"OneFile_" + str(num) + ".gb","r")
    records = SeqIO.parse(file,"genbank")
    SeqIO.write(records,handle,"genbank")
    file.close()

handle.close()

#Compare the original GenBank file in Source.gb to the GenBankWriter'd one.
original = open(working_dir +"Source.gb","r")
newone = open(working_dir + "ManyRecords_Out.gb","r")

records_original = SeqIO.parse(original,"genbank")
records_newone = SeqIO.parse(newone,"genbank")

for (record_original,record_newone) in zip(records_original,records_newone):
    print str(record_original)
    print str(record_newone)

original.close()
newone.close()

print "Done"