[Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid
accessions and locus lines
bugzilla-daemon at portal.open-bio.org
Wed Nov 9 07:09:11 EST 2005
http://bugzilla.open-bio.org/show_bug.cgi?id=1762
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2005-11-09 07:09 -------
The limitation with truncated LOCUS lines is still present after the switch to
my non-Martel parser (see bug 1747), and that switch also means the original
patch no longer applies.
The new parser seems to be OK with this style of ACCESSION line:
ACCESSION U00096 AE000111-AE000510
e.g. using the RecordParser() it is accessible as cur_record.accession ==
['U00096', 'AE000111-AE000510']
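To make the expected result concrete, here is a minimal standalone sketch of splitting such a line into the list shown above. The helper name split_accession_line is hypothetical and is not part of Bio.GenBank; the real parser may well do this differently.

```python
def split_accession_line(line):
    """Return the accession tokens from a GenBank ACCESSION line.

    Hypothetical helper for illustration only; not Bio.GenBank code.
    """
    keyword = "ACCESSION"
    if not line.startswith(keyword):
        raise ValueError("Not an ACCESSION line: %r" % line)
    # Everything after the keyword is a whitespace-separated list of
    # accessions; a range such as AE000111-AE000510 stays as one token.
    return line[len(keyword):].split()


accessions = split_accession_line("ACCESSION U00096 AE000111-AE000510")
assert accessions == ["U00096", "AE000111-AE000510"]
```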
Test script:
===========================================================
import time
from Bio import GenBank
#gb_file = "/tmp/U00096_full_locus.gbk"
gb_file = "/tmp/U00096_truncated_locus.gbk"
feature_parser = GenBank.FeatureParser()
gb_handle = open(gb_file, 'r')
start_time = time.time()
gb_iterator = GenBank.Iterator(gb_handle, feature_parser)
count = 0
while 1:
    print "Starting...",
    cur_record = gb_iterator.next()
    print "Done"
    if cur_record is None:
        break
    count = count + 1
    # now do something with the record
    print count, cur_record.name, len(cur_record.features), len(cur_record.seq)
    if 'data_file_division' in cur_record.annotations :
        print cur_record.annotations['data_file_division']
    if 'date' in cur_record.annotations :
        print cur_record.annotations['date']
job_time = time.time() - start_time
print "Time elapsed %0.2f seconds for %s" % (job_time, gb_file)
==============================================================
Test script output for the unmodified GenBank file from the NCBI website
(saved to a file; I just searched for U00096 on Google):
==============================================================
Starting... Done
1 U00096 8877 4639675
BCT
08-SEP-2005
Starting... Done
Time elapsed 79.05 seconds for /tmp/U00096_full_locus.gbk
==============================================================
Test script output for the truncated locus line version, where I edited the
first line by hand from:
LOCUS U00096 4639675 bp DNA circular BCT
to:
LOCUS U00096
==============================================================
Starting...
Traceback (most recent call last):
  File "/tmp/U00096_test.py", line 17, in -toplevel-
    cur_record = gb_iterator.next()
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 129, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 219, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 1382, in feed
    assert False, \
AssertionError: Did not recognise the LOCUS line layout:
LOCUS U00096
==============================================================
The patch should be straightforward (I'll try to do it this afternoon), but
note Jan T. Kim's warning:
> I haven't checked whether missing division / length /
> DNA/RNA/protein / circular/linear information results in
> appropriate defaults in the objects created by parsing.
> As long as the corresponding members are not used, there
> should not be any problem.
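One way to address that warning is to give every optional LOCUS field an explicit default, so a truncated line still yields a complete set of values. The following is only a standalone sketch under that assumption, not the actual Bio.GenBank scanner or the planned patch: the function name parse_locus_line, the LOCUS_DEFAULTS dictionary, its key names and the choice of None as the default are all hypothetical.

```python
import re

# Assumed defaults for fields that a truncated LOCUS line may omit.
LOCUS_DEFAULTS = {
    "length": None,
    "residue_type": None,
    "topology": None,
    "division": None,
}


def parse_locus_line(line):
    """Parse a LOCUS line token by token, tolerating missing fields.

    Hypothetical helper for illustration only; not Bio.GenBank code.
    """
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "LOCUS":
        raise ValueError("Did not recognise the LOCUS line layout:\n" + line)
    info = dict(LOCUS_DEFAULTS)
    info["name"] = tokens[1]
    rest = tokens[2:]
    # The length, if present, is a number followed by its unit.
    if len(rest) >= 2 and rest[0].isdigit() and rest[1] in ("bp", "aa"):
        info["length"] = int(rest[0])
        rest = rest[2:]
    for token in rest:
        if token in ("circular", "linear"):
            info["topology"] = token
        elif re.match(r"^(ss-|ds-|ms-)?[a-zA-Z]*(DNA|RNA)$", token):
            info["residue_type"] = token
        elif len(token) == 3 and token.isupper():
            info["division"] = token
        # Any other token (for example a date) is ignored in this sketch.
    return info


full = parse_locus_line("LOCUS U00096 4639675 bp DNA circular BCT")
truncated = parse_locus_line("LOCUS U00096")
assert full["length"] == 4639675 and full["division"] == "BCT"
assert truncated["name"] == "U00096" and truncated["length"] is None
```

With defaults like these, code that consumes the parsed record would still need to check for None before using the length, topology or division, which is exactly the situation the warning describes.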
Peter