[Biopython-dev] [Bug 2591] GenBank files misparsed for long organism names
bugzilla-daemon at portal.open-bio.org
Wed Dec 17 23:44:58 UTC 2008
http://bugzilla.open-bio.org/show_bug.cgi?id=2591
------- Comment #5 from joelb at lanl.gov 2008-12-17 18:44 EST -------
I received the following response to my follow-up. It now appears that the bug
is in Biopython, since GenBank has changed its definition of the ORGANISM
field. It seems likely that all Bio* flatfile parsers will be affected.
>I just received the wording that will appear in Section 3.4.2 of gbrel.txt
>for this month's release:
>
> ORGANISM - Formal scientific name of the organism (first line)
>and taxonomic classification levels (second and subsequent lines).
>Mandatory subkeyword in all annotated entries/two or more records.
>
> In the event that the organism name exceeds 68 characters (80 - 13 + 1)
> in length, it will be line-wrapped and continue on a second line,
> prior to the taxonomic classification. Unfortunately, very long
> organism names were not anticipated when the fixed-length GenBank
> flatfile format was defined in the 1980s. The possibility of linewraps
> makes the job of flatfile parsers more difficult: essentially, one
> cannot be sure that the second line is truly a classification/lineage
> unless it consists of multiple tokens, delimited by semi-colons.
> The long-term solution to this problem is to introduce an additional
> subkeyword, probably 'LINEAGE'. This might occur sometime in 2009
> or 2010.
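
To make the semicolon rule quoted above concrete, here is a minimal Python sketch of how a parser might separate a wrapped organism name from the lineage. This is not Biopython's actual code: the function name split_organism_block is hypothetical, and it assumes the caller has already cut off the 12-column keyword/indent area, so each input line holds only the data text. The sample organism name in the usage note is made up.

```python
def split_organism_block(lines):
    """Split the data lines of an ORGANISM block into (name, taxa).

    Heuristic (an assumption based on the release-note wording): the first
    line always belongs to the organism name; a later line is treated as
    lineage once a line containing a semicolon has been seen, and as a
    continuation of the name otherwise.
    """
    name_parts = []
    lineage_parts = []
    for index, line in enumerate(lines):
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        if index == 0:
            name_parts.append(text)  # first line is always the name
        elif lineage_parts or ";" in text:
            lineage_parts.append(text)  # lineage has started
        else:
            # No semicolon yet: assume the name was line-wrapped.
            # A single-token lineage such as "Viruses." lands here too,
            # which is exactly the ambiguity the release note describes.
            name_parts.append(text)
    name = " ".join(name_parts)
    lineage = " ".join(lineage_parts).rstrip(".")
    taxa = [taxon.strip() for taxon in lineage.split(";") if taxon.strip()]
    return name, taxa
```

For example, calling it with ["A made-up organism name long enough to be wrapped by the", "flatfile writer", "Bacteria; Proteobacteria;", "Enterobacteriales."] returns the two name lines joined by a space, plus the list ["Bacteria", "Proteobacteria", "Enterobacteriales"]. A lineage with no semicolon at all cannot be told apart from a wrapped name by this rule, which is why a separate subkeyword would be the cleaner fix.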