[BioPython] GenBank parser stops at keyword-like entries in /note block

Wolfgang Schueler wolfgang@proceryon.at
Wed, 10 Oct 2001 16:07:07 +0200


Hi BioPy(th)oneers,

Processing the NCBI GenBank nucleotide entry AL138972 (GI:6946668)

with the GenBank parser like this:

   feature_parser = GenBank.FeatureParser()
   gb_iterator = GenBank.Iterator(gb_file, feature_parser)
 
   while 1:
      cur_record = gb_iterator.next()
...

results in the following error (other entries parse fine):

Traceback (most recent call last):
  File "/home/people/wolfgang/gag/EXT_DATA/bio_wsDB.py", line 683, in ?
    present(extract_GEN_summary_by_keywords(sys.argv[2],sys.argv[3:]))
  File "/home/people/wolfgang/gag/EXT_DATA/bio_wsDB.py", line 565, in extract_GEN_summary_by_keywords
    cur_record = gb_iterator.next()
  File "/home/people/wolfgang/lib/python/Bio/GenBank/__init__.py", line 182, in next
    return self._parser.parse(File.StringHandle(data))
  File "/home/people/wolfgang/lib/python/Bio/GenBank/__init__.py", line 260, in parse
    self._scanner.feed(handle, self._consumer)
  File "/home/people/wolfgang/lib/python/Bio/GenBank/__init__.py", line 1108, in feed
    self._parser.parseFile(handle)
  File "/home/people/wolfgang/lib/python/Martel/Parser.py", line 226, in parseFile
    self.parseString(fileobj.read())
  File "/home/people/wolfgang/lib/python/Martel/Parser.py", line 254, in parseString
    self._err_handler.fatalError(result)
  File "/var/tmp/python-root//usr/lib/python2.0/xml/sax/handler.py", line 38, in fatalError
Martel.Parser.ParserPositionException: error parsing at or beyond character 3217
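
The character offset in the exception can be mapped back to a line of
the record to see where parsing stopped. The following is a minimal
sketch in plain Python (written for a current Python 3 interpreter),
assuming the text of the single failing record is available as a
string and that the reported offset counts characters from the start
of that record. context_at is a hypothetical helper name, not a
Biopython or Martel function.

```python
def context_at(text, offset, width=2):
    """Return (line_number, nearby_lines) for a character offset.

    line_number is 1-based; nearby_lines holds up to `width` lines
    before and after the line that contains `offset`.
    Hypothetical helper for illustration only.
    """
    # Clamp the offset so out-of-range values do not raise.
    offset = max(0, min(offset, len(text)))
    # Every newline before the offset starts a new line.
    line_number = text.count("\n", 0, offset) + 1
    lines = text.splitlines()
    first = max(0, line_number - 1 - width)
    last = min(len(lines), line_number + width)
    return line_number, lines[first:last]


if __name__ == "__main__":
    # Tiny stand-in text; with real data, pass the record text and 3217.
    sample = "a\nbb\nccc\ndddd\n"
    number, nearby = context_at(sample, 5, width=1)
    print(number)
    for nearby_line in nearby:
        print(nearby_line)
```

If the assumption about the offset holds, the line reported for
offset 3217 should be the first one the parser could not match.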



Examination of the record shows that the double-quoted, multi-line
/note value contains keyword-like entries such as /prediction and
/match at the beginning of some of its continuation lines.
In this position the parser reads them as qualifier keywords and stops.
If you shift these lines one position to the right, e.g. by adding a
blank, the keyword-like entries are read as text within the /note.
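
The manual workaround described above (adding one blank in front of
the offending lines) could be automated as a pre-processing step before
the record is handed to the parser. The following is only a sketch
under stated assumptions, not a fix inside Biopython: it assumes that a
qualifier value is opened and closed by a double quote, that an
embedded double quote is written doubled, and that shifting a line by
one blank is acceptable to the parser, as observed above. The name
shift_keyword_like_lines is hypothetical, and the code is written for a
current Python 3 interpreter.

```python
import re

# A line that opens a qualifier: leading blanks, then /name=
QUALIFIER_START = re.compile(r"^\s+/[A-Za-z_][\w-]*=")


def shift_keyword_like_lines(lines):
    """Indent '/'-led continuation lines inside quoted qualifier values.

    Hypothetical helper for illustration; returns a new list of lines.
    """
    inside_quote = False
    result = []
    for line in lines:
        if inside_quote:
            # Continuation line of an open quoted value: if it looks
            # like a keyword, shift it one position to the right.
            if line.lstrip().startswith("/"):
                line = " " + line
        elif not QUALIFIER_START.match(line):
            # Outside any qualifier value: pass through untouched.
            result.append(line)
            continue
        # An odd number of double quotes opens or closes the value;
        # doubled (escaped) quotes contribute an even count.
        if line.count('"') % 2 == 1:
            inside_quote = not inside_quote
        result.append(line)
    return result


if __name__ == "__main__":
    pad = " " * 21
    # Shortened sample modelled on the excerpt; the closing line of the
    # /note value is invented for the example.
    sample = [
        pad + '/gene="EG:BACR25B3.11"',
        pad + "/note=\"/prediction=(method:''genefinder'',",
        pad + "version:''084'', score:''105.71'');",
        pad + "/prediction=(method:''genscan'', version:''1.0'');",
        pad + "/match=(desc:''EXAMPLE'')\"",
        pad + '/protein_id="CAB72284.1"',
    ]
    for fixed_line in shift_keyword_like_lines(sample):
        print(fixed_line)
```

Only continuation lines inside an open quoted value are touched, so
real qualifiers such as /protein_id keep their column. A shifted line
may become one character longer than the usual GenBank line width.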

I couldn't figure out how to fix that in the code (these are my first
steps with Python and Biopython), so I would be grateful for advice.

Wolfgang



Attached below: part of the FEATURES table of the entry mentioned above

...
FEATURES             Location/Qualifiers
     source          1..154329
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
                     /clone="BAC BACR25B3"
     gene            complement(22148..27773)
                     /gene="EG:BACR25B3.11"
     CDS             complement(join(22148..22299,22375..22791,22860..23560,
                     23630..24555,24616..24888,25024..25178,26677..27009,
                     27623..27773))
                     /gene="EG:BACR25B3.11"
                     /note="/prediction=(method:''genefinder'',
===================  version:''084'', score:''105.71'');
parsing stops here=> /prediction=(method:''genscan'', version:''1.0'');
===================  /match=(desc:''BASEMENT MEMBRANE-SPECIFIC HEPARAN SULFATE
                     PROTEOGLYCAN CORE PROTEIN PRECURSOR (HSPG) (PERLECAN)
                     (PLC)'', species:''Homo sapiens (Human)'',
                     ranges:(query:24292..24549,
                     target:SWISS-PROT::P98160:3713..3628, score:''201.00''),
                     (query:24016..24291, target:SWISS-PROT::P98160:3815..3724,
                     score:''139.00''), (query:23857..24006,
...
                     /protein_id="CAB72284.1"
                     /db_xref="GI:6946669"
                     /translation="MACNCNQSMIYQSNERRDYNCPGAPQYPYNRFKGGVSLKDTPCM
...