[Biopython-dev] [Bug 2591] GenBank files misparsed for long organism names
bugzilla-daemon at portal.open-bio.org
Fri Sep 19 19:05:45 UTC 2008
http://bugzilla.open-bio.org/show_bug.cgi?id=2591
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-09-19 15:05 EST -------
That file starts as follows:
LOCUS       NC_011147            4581797 bp    DNA     circular BCT 29-AUG-2008
DEFINITION  Salmonella enterica subsp. enterica serovar Paratyphi A str.
            AKU_12601, complete genome.
ACCESSION   NC_011147
VERSION     NC_011147.1  GI:197361212
KEYWORDS    complete genome.
SOURCE      Salmonella enterica subsp. enterica serovar Paratyphi A str.
            AKU_12601
  ORGANISM  Salmonella enterica subsp. enterica serovar Paratyphi A str.
            AKU_12601
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Salmonella.
REFERENCE   1
...
The multiline DEFINITION and SOURCE should be fine. However, we expect
ORGANISM to be a single line followed by a multiline taxonomy lineage, so the
second line of the name (AKU_12601) gets treated as the start of the lineage -
hence the problem you observed.
This may well be an NCBI bug but it seems likely this kind of problem will
occur more often in future as more and more (sub)strains of bacteria are
sequenced, requiring longer names.
Let's wait and hear what the NCBI says - I expect they will have to change the
file format definition slightly.
If they say this is a valid file, I hope they will also explain officially how
we should split up the species and its lineage. One option would be something
like looking for semi-colons in the following text as indicative of the lineage
(rather than as more of the ORGANISM).
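As a rough illustration of that semi-colon idea, here is a minimal sketch in
Python. The function name split_organism_block is hypothetical and is not
part of Biopython. It assumes the lines under ORGANISM have already been
collected, and that the lineage starts at the first line after the first one
that contains a semi-colon. If no such line is found it falls back to the
current one-line-name assumption, so a long name followed by a short lineage
with no semi-colon would still be split wrongly.

```python
def split_organism_block(lines):
    """Split the lines of an ORGANISM block into (organism, lineage).

    Hypothetical helper for illustration only, not Biopython's parser.
    Assumption: the lineage begins at the first line (after the first)
    that contains a semi-colon; anything before it is the organism name.
    """
    stripped = [line.strip() for line in lines if line.strip()]
    if not stripped:
        return "", []

    # Fallback: treat only the first line as the organism name.
    split_at = 1
    for index in range(1, len(stripped)):
        if ";" in stripped[index]:
            split_at = index
            break

    organism = " ".join(stripped[:split_at])
    # Join the lineage lines, drop the trailing full stop, split on ';'.
    lineage_text = " ".join(stripped[split_at:]).rstrip(".")
    lineage = [taxon.strip() for taxon in lineage_text.split(";") if taxon.strip()]
    return organism, lineage


# The ORGANISM block from the file quoted above.
example = [
    "Salmonella enterica subsp. enterica serovar Paratyphi A str.",
    "AKU_12601",
    "Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;",
    "Enterobacteriaceae; Salmonella.",
]
organism, lineage = split_organism_block(example)
```

On the example above this should give the full two-line organism name and a
six-entry lineage ending in Salmonella. Whether this is the right rule
depends on what the NCBI says about the format.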
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.