[Biopython-dev] [Bug 1909] Format issue with GenBank with segmented
BACs (eg GI:55276707)
bugzilla-daemon at portal.open-bio.org
Tue Dec 20 07:32:41 EST 2005
http://bugzilla.open-bio.org/show_bug.cgi?id=1909
biopython-bugzilla at maubp.freeserve.co.uk changed:
          What |Removed                     |Added
----------------------------------------------------------------------------
        Status |NEW                         |RESOLVED
    Resolution |                            |INVALID
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2005-12-20 07:32 -------
A GenBank format entry for GI:55276707 can be downloaded from here:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=55276707
It's a 401 kb GenBank file containing THREE separate GenBank records (three
segments), starting:
LOCUS AY643842S1 12998 bp DNA linear PLN 17-NOV-2004
DEFINITION Hordeum vulgare subsp. vulgare clone BAC 519K7 hardness locus
region.
ACCESSION AY643842
VERSION AY643842.1 GI:55276708
KEYWORDS .
SEGMENT 1 of 3
..
Using the old Martel GenBank parser (e.g. BioPython 1.41) the following works
perfectly:
print "Method 1 - Using for record in Iterator"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file = open(gbk_filename, "r")
for gb_record in GenBank.Iterator(input_file, GenBank.RecordParser()) :
    print "Loaded GenBank record %s" % gb_record.locus
print "Done"
input_file.close()
Or:
print "Method 2 - Using Iterator.next()"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file = open(gbk_filename, "r")
gb_iterator = GenBank.Iterator(input_file, GenBank.RecordParser())
while True:
    gb_record = gb_iterator.next()
    if gb_record is None : break
    print "Loaded GenBank record %s" % gb_record.locus
print "Done"
input_file.close()
This bit of code will reproduce the error reported:
print "Method 3 - No Iterator object, this fails"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file = open(gbk_filename, "r")
gb_record = GenBank.RecordParser().parse(input_file)
..
The error message says "unparsed text remains" beyond position 18263 because
there are actually two more records in the file.
Your text editor may have a "goto character" command (TextPad does; it can be
tried from www.textpad.com, but it does cost money).
The following snippet of code is another way to see where a Martel parser is
failing, given a position in the file (in this case 18263):
print "Debug:"
input_file = open(gbk_filename, "r")
raw_text = "".join(input_file.readlines())
input_file.close()
print raw_text[18263:18263+100] + "..."
Debug:
LOCUS AY643842S2 129099 bp DNA linear PLN 17-NOV-2004
DEFINITION Hordeum ...
i.e. it's complaining about the presence of a second record (i.e. the LOCUS
line onwards) in the GenBank file.
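One quick way to check in advance how many records a file holds is to look for
every line that starts with LOCUS. The following is only a minimal sketch in
plain Python 3 with no Biopython dependency; the helper name locus_offsets and
the two-record sample text are made up for illustration and are not real
GenBank data.

```python
import re


def locus_offsets(raw_text):
    """Return the character offset of every line that starts with LOCUS."""
    pattern = re.compile(r"^LOCUS\b", flags=re.MULTILINE)
    return [match.start() for match in pattern.finditer(raw_text)]


# Made-up stand-in for a multi-record file (hypothetical, not real data).
sample = (
    "LOCUS       DEMO1\n"
    "ORIGIN\n"
    "//\n"
    "LOCUS       DEMO2\n"
    "ORIGIN\n"
    "//\n"
)

offsets = locus_offsets(sample)
print("Found %d record(s) at offsets %s" % (len(offsets), offsets))
```

If more than one offset comes back, the file holds more than one record, and
each offset after the first marks a place where a single-record parse would
stop.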
Resolution
==========
If you can't be sure in advance that there is only one record, always use the
GenBank.Iterator object.
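To illustrate what reading record by record means, here is a rough sketch in
plain Python 3 that does not use Biopython at all. It assumes only that each
GenBank record ends with a line containing just "//"; the function name
iter_genbank_chunks and the sample text are hypothetical, and this is not how
GenBank.Iterator is implemented.

```python
import io


def iter_genbank_chunks(handle):
    """Yield the raw text of each record in turn.

    Assumes every record is terminated by a line containing only '//'.
    """
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []
    # Hand back any trailing text that lacks a terminator, if non-blank.
    if any(item.strip() for item in lines):
        yield "".join(lines)


# Made-up two-record input (hypothetical, not real GenBank data).
sample_handle = io.StringIO(
    "LOCUS       DEMO1\nORIGIN\n//\nLOCUS       DEMO2\nORIGIN\n//\n"
)
chunks = list(iter_genbank_chunks(sample_handle))
for chunk in chunks:
    print("Loaded record starting: %s" % chunk.splitlines()[0])
```

Because the loop keeps going until the handle is exhausted, it copes with one
record or many, which is the same reason for preferring the iterator over a
single parse call.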
Note
====
Using the current version of the GenBank parser (in CVS, not yet released),
method 3 above will work and give you just the first record. It does not warn
you in any way that there is a second or third record available.
P.S.
====
My testing and the original report were done on Windows. If you run this on
Unix, the exact position of the second record will change slightly because of
the different line endings.
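As a small illustration of the line ending effect, the sketch below (plain
Python 3) builds the same made-up two-record text with LF and with CRLF line
endings and compares where the second LOCUS line begins. It assumes the
reported position counts raw characters including carriage returns; the
helper name second_locus_offset and the sample text are hypothetical.

```python
def second_locus_offset(text):
    """Return the offset of the first LOCUS that follows a newline, or -1."""
    pos = text.find("\nLOCUS")
    return pos + 1 if pos != -1 else -1


# Made-up sample (hypothetical, not real GenBank data).
lf_text = "LOCUS       DEMO1\nORIGIN\n//\nLOCUS       DEMO2\nORIGIN\n//\n"
crlf_text = lf_text.replace("\n", "\r\n")

lf_pos = second_locus_offset(lf_text)
crlf_pos = second_locus_offset(crlf_text)
print("LF offset: %d, CRLF offset: %d" % (lf_pos, crlf_pos))
print("Difference: %d" % (crlf_pos - lf_pos))
```

The difference equals the number of lines before the second record, one extra
character per line, so for a real file the shift grows with the length of the
first record.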
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.