[Biopython-dev] [Bug 2591] GenBank files misparsed for long organism names
bugzilla-daemon at portal.open-bio.org
Thu Dec 18 11:07:16 UTC 2008
http://bugzilla.open-bio.org/show_bug.cgi?id=2591
biopython-bugzilla at maubp.freeserve.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2008-12-18 06:07 EST -------
(In reply to comment #5)
> I received the following response to my followup. It now appears that the bug
> is with BioPython, since GenBank has changed its definition. It seems likely
> that all Bio* flatfile parsers will be affected.
Thanks for chasing this up Joel :)
> I just received the wording that will appear in Section 3.4.2 of gbrel.txt
> for this month's release:
> >
> > ORGANISM - Formal scientific name of the organism (first line)
> > and taxonomic classification levels (second and subsequent lines).
> > Mandatory subkeyword in all annotated entries/two or more records.
> >
> > In the event that the organism name exceeds 68 characters (80-13+1)
> > in length, it will be line-wrapped and continue on a second line,
> > prior to the taxonomic classification. Unfortunately, very long
> > organism names were not anticipated when the fixed-length GenBank
> > flatfile format was defined in the 1980s. The possibility of linewraps
> > makes the job of flatfile parsers more difficult : essentially, one
> > cannot be sure that the second line is truly a classification/lineage
> > unless it consists of multiple tokens, delimited by semi-colons.
> > The long-term solution to this problem is to introduce an additional
> > subkeyword, probably 'LINEAGE' . This might occur sometime in 2009
> > or 2010.
It looks like my guess was right, see comment #1:
> Let's wait and hear what the NCBI says - I expect they will have to change the
> file format definition slightly.
>
> If they say this is a valid file, I hope they will also explain officially
> how we should split up the species and its lineage. One option would be
> something like looking for semi-colons in the following text as indicative
> of the lineage (rather than as more of the ORGANISM).
Now that the NCBI has recommended the semi-colon approach, I've fixed our
parser in CVS:
Bio/GenBank/Record.py revision 1.14
Bio/GenBank/Scanner.py revision 1.26
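
For anyone following along, the semi-colon heuristic can be sketched roughly
as below. This is an illustration only, not the code committed to Scanner.py:
the function name split_organism and the sample lines are made up for this
example, and the input is assumed to be the text lines found under ORGANISM
with the keyword and indentation already removed. Continuation lines are
treated as part of the organism name until the first line containing a
semi-colon; everything from that line onwards is treated as lineage.

```python
def split_organism(lines):
    """Split the lines under ORGANISM into (name, list of lineage levels).

    Hypothetical helper for illustration. The first line always belongs to
    the organism name. Later lines are appended to the name until one
    contains a semi-colon; that line and all following lines are lineage.
    """
    name_parts = [lines[0].strip()]
    lineage_parts = []
    for line in lines[1:]:
        text = line.strip()
        if lineage_parts or ";" in text:
            # Once the lineage has started, every further line belongs to it
            lineage_parts.append(text)
        else:
            # No semi-colon seen yet: assume a line-wrapped organism name
            name_parts.append(text)
    name = " ".join(name_parts)
    # The lineage ends with a full stop, which is not part of the last level
    lineage = " ".join(lineage_parts).rstrip(".")
    taxonomy = [level.strip() for level in lineage.split(";") if level.strip()]
    return name, taxonomy


# Made-up example of an organism name wrapped over two lines
example = [
    "Hypothetical organism with a very long name that exceeds the width of",
    "one flatfile line",
    "Bacteria; Proteobacteria; Gammaproteobacteria;",
    "Enterobacteriales.",
]
name, taxonomy = split_organism(example)
print(name)
print(taxonomy)
```

Note the limitation the NCBI text points out: a lineage with only a single
level (and so no semi-colons) cannot be told apart from a wrapped name, and
this sketch would treat it as part of the name.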
Peter