[Biopython-dev] [Bug 2591] GenBank files misparsed for long organism names

Fri Sep 19 19:05:45 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2591

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2008-09-19 15:05 EST -------
That file starts as follows:

LOCUS       NC_011147            4581797 bp    DNA     circular BCT 29-AUG-2008
DEFINITION  Salmonella enterica subsp. enterica serovar Paratyphi A str.
            AKU_12601, complete genome.
ACCESSION   NC_011147
VERSION     NC_011147.1  GI:197361212
KEYWORDS    complete genome.
SOURCE      Salmonella enterica subsp. enterica serovar Paratyphi A str.
            AKU_12601
  ORGANISM  Salmonella enterica subsp. enterica serovar Paratyphi A str.
            AKU_12601
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Salmonella.
REFERENCE   1
...

The multiline DEFINITION and SOURCE should be fine.  However, we expect
ORGANISM to be a single line followed by a multiline taxonomy lineage - hence
the problem you observed.

This may well be an NCBI bug but it seems likely this kind of problem will
occur more often in future as more and more (sub)strains of bacteria are
sequenced, requiring longer names.

Let's wait and hear what the NCBI says - I expect they will have to change the
file format definition slightly.

If they say this is a valid file, I hope they will also explain officially how
we should split up the species and its lineage.  One option would be something
like treating the first following line that contains semi-colons as the start
of the lineage (rather than as more of the ORGANISM name).
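
To make the semi-colon idea concrete, here is a rough sketch in Python.  It is
not the current Biopython parser code; the function name split_organism_block
is made up for illustration, and it assumes the ORGANISM block has already
been collected as a list of raw lines.

```python
def split_organism_block(lines):
    """Split an ORGANISM block into (organism name, lineage list).

    Hypothetical helper, not part of Biopython.  Continuation lines are
    treated as part of the organism name until the first line containing
    a semi-colon, which is taken as the start of the taxonomy lineage.
    """
    stripped = [line.strip() for line in lines if line.strip()]
    if not stripped:
        return "", []

    # Drop the ORGANISM keyword from the first line, if present.
    first = stripped[0]
    if first.startswith("ORGANISM"):
        first = first[len("ORGANISM"):].strip()

    name_parts = [first]
    lineage_lines = []
    for index, text in enumerate(stripped[1:], start=1):
        if ";" in text:
            # Everything from here on is assumed to be lineage.
            lineage_lines = stripped[index:]
            break
        name_parts.append(text)

    name = " ".join(name_parts)
    # Join lineage lines, drop the trailing full stop, split on semi-colons.
    joined = " ".join(lineage_lines).rstrip(".")
    lineage = [taxon.strip() for taxon in joined.split(";") if taxon.strip()]
    return name, lineage


example = [
    "  ORGANISM  Salmonella enterica subsp. enterica serovar Paratyphi A str.",
    "            AKU_12601",
    "            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;",
    "            Enterobacteriaceae; Salmonella.",
]
organism, lineage = split_organism_block(example)
```

Note this heuristic would still go wrong for a very short lineage that fits
on one line with no semi-colon at all, so it is only a stop-gap until the
NCBI clarifies the format.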
