[BioPython] GenBank parser stops at keyword-like entries in /note block

Brad Chapman chapmanb@arches.uga.edu
Thu, 11 Oct 2001 08:42:22 -0400


--cWoXeonUoKmBZSoM
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi Wolfgang;
Thanks for reporting the problem. Thanks also for the precise bug
report; it made it very easy to track down the error.

> on processing GenBank the NCBI nucleotide entry:   AL138972 GI:6946668
> results in following error (while other entries work fine):
[...]
> Examination of the record shows that the double quoted multiline data
> block of /note contains keyword-like entries like /prediction and 
> /match which are standing at the beginning of some lines within 
> the /note block.
> In this position the parser reads them as keywords and stops, 
> whereas if you  shift these lines e.g. one position to the 
> right by adding a blank, the keyword-like entries appear as 
> text in the /note.

This analysis is right on target. I have attached a patch to
Bio/GenBank/genbank_format.py which adds "match" and "prediction" to
the list of recognized keywords, so the parser can now get through
this file.

The "fix" is not 100% correct though, as the parser will now parse
the following block from that file:

/note="/prediction=(method:''genefinder'',
version:''084'', score:''105.71'');
/prediction=(method:''genscan'', version:''1.0'');

as two separate keyword/value entries, like:

/note  "/prediction=(method:''genefinder'', version:''084'', score:''105.71'');
and
/prediction (method:''genscan'', version:''1.0'');
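
To make the split concrete, here is a minimal sketch of the behaviour
described above. It is not the actual Bio.GenBank code: it assumes that
any line starting with "/" is read as a keyword and that names missing
from the keyword list are rejected. The names QUALIFIER_LINE,
split_qualifiers and known_keys are made up for this illustration.

```python
import re

# Hypothetical, simplified stand-in for the parser's keyword handling;
# the real grammar in Bio/GenBank/genbank_format.py is more involved.
QUALIFIER_LINE = re.compile(r"^/(\w+)=?(.*)$")


def split_qualifiers(lines, known_keys):
    """Group feature lines into (key, value) pairs.

    Assumption: a line starting with "/" always opens a new keyword, and
    a keyword that is not in known_keys is an error. Any other line is
    appended to the value of the most recent keyword.
    """
    pairs = []
    for line in lines:
        text = line.strip()
        match = QUALIFIER_LINE.match(text)
        if match:
            key, value = match.groups()
            if key not in known_keys:
                raise ValueError("unknown keyword: /%s" % key)
            pairs.append((key, value))
        elif pairs:
            key, value = pairs[-1]
            pairs[-1] = (key, value + " " + text)
        else:
            raise ValueError("continuation line before any keyword: %r" % text)
    return pairs


# The block quoted above, one entry per line of the record.
NOTE_BLOCK = """\
/note="/prediction=(method:''genefinder'',
version:''084'', score:''105.71'');
/prediction=(method:''genscan'', version:''1.0'');
"""
note_lines = NOTE_BLOCK.splitlines()

# With "prediction" in the list, the block splits into two entries.
for key, value in split_qualifiers(note_lines, {"note", "prediction", "match"}):
    print("/%s  %s" % (key, value))
```

Calling split_qualifiers(note_lines, {"note"}) raises ValueError on the
third line in this sketch, which mirrors the pre-patch failure; with
"prediction" added it returns one /note pair and one /prediction pair.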

This is not really the "intention" of the GenBank file, which is for
the entire thing to be parsed as a single /note value. The only way
to do this completely right would be to track the opening and closing
double quotes in the file, which is also fraught with complexity and
is likely to break (especially with "GenBank-like" files that come
from other sources).
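
For anyone who wants to experiment with the quote-based approach, here
is a rough sketch of the idea, not a proposed patch. It assumes a value
that opens with a double quote stays open until the running count of
double quotes is even (so embedded quotes would have to be doubled).
The function name, the closing quote and the /gene line in the sample
are invented for the illustration.

```python
def split_qualifiers_quote_aware(lines):
    """Group lines into (key, value) pairs, keeping quoted values whole.

    Assumption: while the current value holds an odd number of double
    quotes it is still open, so a "/word" at the start of a line is
    treated as text rather than as a new keyword.
    """
    pairs = []
    open_quote = False
    for line in lines:
        text = line.strip()
        if open_quote or not text.startswith("/"):
            if not pairs:
                raise ValueError("continuation line before any keyword: %r" % text)
            key, value = pairs[-1]
            pairs[-1] = (key, value + " " + text)
        else:
            key, _, value = text[1:].partition("=")
            pairs.append((key, value))
        # Re-check whether the value we just extended is still open.
        open_quote = pairs[-1][1].count('"') % 2 == 1
    return pairs


# Sample input: the note from above with a closing quote added, followed
# by a made-up /gene line to show where the next keyword starts.
SAMPLE = """\
/note="/prediction=(method:''genefinder'',
version:''084'', score:''105.71'');
/prediction=(method:''genscan'', version:''1.0'');"
/gene="hypothetical"
"""

for key, value in split_qualifiers_quote_aware(SAMPLE.splitlines()):
    print("/%s  %s" % (key, value))
```

Here the whole note comes back as a single /note value and /gene is the
second pair. The fragility is easy to see, though: a single unbalanced
double quote anywhere in a value makes everything after it part of that
value.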

In my opinion, it is not that bad to be "wrong" on this, because
this note is painfully formatted and hard to "parse" even by eye.
Additionally, this is probably closer to the author's original
intention: I imagine they used this keywords-inside-of-a-note
hack because /prediction and /match aren't officially approved
keywords (which is the actual reason the parser broke in the
first place).

Anyway, that's enough rambling justification :-). Please let us
know if this doesn't work for you. Suggestions on how to make the
parser handle this 100% correctly are welcome.

Thanks again for the helpful bug report.
Brad
-- 
PGP public key available from http://pgp.mit.edu/

--cWoXeonUoKmBZSoM
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="genbank_format.py.diff"

*** genbank_format.py.orig	Thu Oct 11 07:57:26 2001
--- genbank_format.py	Thu Oct 11 08:23:10 2001
***************
*** 494,499 ****
--- 494,500 ----
                        #   between macronuclear and micronuclear stages,
                        #   this qualifier is used to denote that the
                        #   sequence is from macronuclear DNA.
+     "match",
      "map",            # Map position of the feature in free-format text
      "mitochondrion",  # Organelle type from which the sequence was obtained
      "mod_base",       # Abbreviation for a modified nucleotide base
***************
*** 511,516 ****
--- 512,518 ----
      "phenotype",      # Phenotype conferred by the feature
      "plasmid",        # Name of plasmid from which sequence was obtained
      "pop_variant",    # Population variant from which the sequence was obtained
+     "prediction",
      "product",        # Name of a product encoded by a coding region (CDS)
                        #   feature
      "protein_id",     # Protein Identifier, issued by International

--cWoXeonUoKmBZSoM--