[Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid accessions and locus lines

Wed Nov 9 07:09:11 EST 2005

http://bugzilla.open-bio.org/show_bug.cgi?id=1762

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2005-11-09 07:09 -------
The limitation with truncated LOCUS lines is still present after the switch to
my non-Martel parser (see bug 1747), which also means the original patch no
longer applies.

The new parser seems to be OK with this style ACCESSION line:

ACCESSION   U00096 AE000111-AE000510

e.g. using the RecordParser() it is accessible as cur_record.accession ==
['U00096', 'AE000111-AE000510']
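
As a rough illustration of the result above (this is not the parser's actual
code; split_accession_line is a hypothetical helper made up for this example),
whitespace-splitting everything after the ACCESSION keyword gives the same
list:

```python
def split_accession_line(line):
    """Return the accession tokens from a GenBank ACCESSION line.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    keyword, _, rest = line.partition(" ")
    if keyword != "ACCESSION":
        raise ValueError("Not an ACCESSION line: %r" % line)
    # Tokens are separated by runs of spaces; a range such as
    # AE000111-AE000510 contains no spaces, so it stays as one token.
    return rest.split()


accessions = split_accession_line("ACCESSION   U00096 AE000111-AE000510")
```

Here accessions holds the primary accession followed by the secondary range,
matching the list shown above.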

Test script:
===========================================================
import time
from Bio import GenBank
#gb_file = "/tmp/U00096_full_locus.gbk"
gb_file = "/tmp/U00096_truncated_locus.gbk"

feature_parser = GenBank.FeatureParser()

gb_handle = open(gb_file, 'r')

start_time = time.time()

gb_iterator = GenBank.Iterator(gb_handle, feature_parser)

count = 0
while 1:
     print "Starting...",
     cur_record = gb_iterator.next()
     print "Done"

     if cur_record is None:
         break

     count = count + 1

     # now do something with the record
     print count, cur_record.name, len(cur_record.features), len(cur_record.seq)
     if 'data_file_division' in cur_record.annotations :
         print cur_record.annotations['data_file_division']
     if 'date' in cur_record.annotations :
         print cur_record.annotations['date']

job_time = time.time() - start_time

print "Time elapsed %0.2f seconds for %s" % (job_time, gb_file)
==============================================================

Test script output for the undoctored GenBank file from the NCBI's website
(saved to a file; I found it by searching for U00096 on Google):

==============================================================
Starting... Done
1 U00096 8877 4639675
BCT
08-SEP-2005
Starting... Done
Time elapsed 79.05 seconds for /tmp/U00096_full_locus.gbk
==============================================================

Test script output for the truncated locus line version, where I edited the
first line by hand from:

LOCUS       U00096               4639675 bp    DNA     circular BCT

to:

LOCUS       U00096

==============================================================
Starting...

Traceback (most recent call last):
  File "/tmp/U00096_test.py", line 17, in -toplevel-
    cur_record = gb_iterator.next()
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 129, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 219, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 1382, in feed
    assert False, \
AssertionError: Did not recognise the LOCUS line layout:
LOCUS       U00096
==============================================================

The patch should be straightforward (I'll try to do it this afternoon), but
note Jan T. Kim's warning:

> I haven't checked whether missing division / length /
> DNA/RNA/protein / circular/linear information results in
> appropriate defaults in the objects created by parsing.
> As long as the corresponding members are not used, there
> should not be any problem.
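
To make the defaults question concrete, here is a minimal sketch of a tolerant
LOCUS line reader. It is not the eventual patch and not Biopython's scanner:
parse_locus_line and its dictionary keys are hypothetical names chosen for
this example. It assumes whitespace-separated fields, that the length (when
present) is a number followed by "bp" or "aa", and that any remaining tokens
appear in the order molecule type, division, date. Anything missing is left as
None, which is exactly the case that downstream code would need to cope with.

```python
def parse_locus_line(line):
    """Parse a LOCUS line, leaving absent fields as None.

    Hypothetical helper for illustration only; not Biopython's scanner.
    """
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "LOCUS":
        raise ValueError("Not a LOCUS line: %r" % line)
    record = {"name": tokens[1], "size": None, "residue_type": None,
              "topology": None, "division": None, "date": None}
    rest = tokens[2:]
    # Assumption: the length is a number followed by its unit.
    if len(rest) >= 2 and rest[0].isdigit() and rest[1] in ("bp", "aa"):
        record["size"] = int(rest[0])
        rest = rest[2:]
    others = []
    for token in rest:
        if token in ("circular", "linear"):
            record["topology"] = token
        else:
            others.append(token)
    # Assumption: remaining tokens are molecule type, division, date, in order.
    for key, value in zip(("residue_type", "division", "date"), others):
        record[key] = value
    return record


full = parse_locus_line(
    "LOCUS       U00096               4639675 bp    DNA     circular BCT")
truncated = parse_locus_line("LOCUS       U00096")
```

With the full line every field except the date is filled in; with the
truncated line only the name is set and the rest stay None. Note that the
positional guess goes wrong if the molecule type is missing but the division
is present, so a real fix would need stricter checks.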

Peter
