[Biopython-dev] [Bug 2591] GenBank files misparsed for long organism names

Thu Dec 18 11:07:16 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2591


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2008-12-18 06:07 EST -------
(In reply to comment #5)
> I received the following response to my followup.  It now appears that the bug
> is with Biopython, since GenBank has changed its definition.  It seems likely
> that all Bio* flatfile parsers will be affected.

Thanks for chasing this up Joel :)

> I just received the wording that will appear in Section 3.4.2 of gbrel.txt 
> for this month's release:
> >
> >   ORGANISM     - Formal scientific name of the organism (first line)
> >                  and taxonomic classification levels (second and subsequent lines).
> >                  Mandatory subkeyword in all annotated entries/two or more records.
> >
> >   In the event that the organism name exceeds 68 characters (80-13+1)
> >   in length, it will be line-wrapped and continue on a second line,
> >   prior to the taxonomic classification. Unfortunately, very long 
> >   organism names were not anticipated when the fixed-length GenBank
> >   flatfile format was defined in the 1980s. The possibility of linewraps
> >   makes the job of flatfile parsers more difficult: essentially, one
> >   cannot be sure that the second line is truly a classification/lineage
> >   unless it consists of multiple tokens, delimited by semi-colons.
> >   The long-term solution to this problem is to introduce an additional
> >   subkeyword, probably 'LINEAGE'. This might occur sometime in 2009
> >   or 2010.


It looks like my guess was right, see comment #1:
> Let's wait and hear what the NCBI says - I expect they will have to change the
> file format definition slightly.
> 
> If they say this is a valid file, I hope they will also explain officially
> how we should split up the species and its lineage.  One option would be
> some thing like looking for semi-colons in the following text as indicative
> of the lineage (rather than as more of the ORGANISM).

Now that we've had the NCBI recommend the semi-colon approach, I've fixed our
parser in CVS:
Bio/GenBank/Record.py revision 1.14
Bio/GenBank/Scanner.py revision 1.26
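
For anyone wanting to see the idea without reading the CVS diff, here is a
minimal sketch of the semi-colon heuristic. It is not the code committed to
Scanner.py; the function name split_organism and its input convention (the
ORGANISM value lines with the 12-column indent already stripped) are
assumptions made for illustration only.

```python
def split_organism(lines):
    """Split ORGANISM value lines into (organism_name, lineage_list).

    Hypothetical helper, not part of Biopython. `lines` holds the text
    of the ORGANISM field, one entry per flatfile line, indent removed.
    """
    if not lines:
        return "", []

    # The first line is always part of the organism name.
    organism_parts = [lines[0].strip()]
    lineage_parts = []
    in_lineage = False

    for line in lines[1:]:
        text = line.strip()
        # Once a line containing a semi-colon is seen, treat it and
        # everything after it as lineage; earlier lines are assumed to
        # be a wrapped continuation of the organism name.
        if in_lineage or ";" in text:
            in_lineage = True
            lineage_parts.append(text)
        else:
            organism_parts.append(text)

    organism = " ".join(organism_parts)
    # The lineage is semi-colon delimited and normally ends with a period.
    joined = " ".join(lineage_parts).rstrip(".")
    lineage = [tok.strip() for tok in joined.split(";") if tok.strip()]
    return organism, lineage
```

Note that a lineage consisting of a single token with no semi-colon would be
folded into the organism name by this sketch, which is exactly the ambiguity
the NCBI wording above describes.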

Peter

