[Biopython-dev] [Bug 2591] New: GenBank files misparsed for long organism names

bugzilla-daemon at portal.open-bio.org
Fri Sep 19 14:26:17 EDT 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2591

           Summary: GenBank files misparsed for long organism names
           Product: Biopython
           Version: 1.47
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: joelb at lanl.gov


I've noticed that Biopython 1.47 mis-parses the organism and lineage in
GenBank files from certain bacteria.  All of the affected organisms have
names longer than 61 characters, so the name is wrapped onto a second line in
the SOURCE and ORGANISM records, and that line wrap causes the mis-parsing.

My reading of the GenBank file docs is that these lines should be of variable
length rather than being split, so this appears to be GenBank's problem
rather than Biopython's.  I have just sent e-mail to info at ncbi.nlm.nih.gov
about the issue.  GenBank doesn't seem to have a bug tracker, though, so I'm
recording the issue here to document it for other people.  It affects a
number of organisms (more than six, though I haven't done an exact count).

One example may be found at
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Paratyphi_A_AKU_12601/NC_011147.gbk
or
http://tinyurl.com/47yg5g 

When parsing this file, the taxonomy list returned begins with 
["AKU_12601 Bacteria","Proteobacteria"...
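The list above is what one would expect from a parser that takes only the first ORGANISM line as the name and treats every continuation line as lineage. The following is a minimal sketch of that failure mode, not Biopython's actual code: the function name `naive_organism_parse` is hypothetical, and the record fragment is an illustrative reconstruction modelled on the example organism rather than text copied from NC_011147.gbk.

```python
# Illustrative ORGANISM block (an assumption, not copied from the real file).
# The organism name is long enough to wrap onto a second line.
SAMPLE_BLOCK = [
    "  ORGANISM  Salmonella enterica subsp. enterica serovar Paratyphi A str.",
    "            AKU_12601",
    "            Bacteria; Proteobacteria; Gammaproteobacteria;",
    "            Enterobacteriales; Enterobacteriaceae; Salmonella.",
]


def naive_organism_parse(block_lines):
    """Hypothetical parser: first line is the name, the rest is lineage."""
    # Strip the ORGANISM keyword from the first line to get the "name".
    name = block_lines[0].strip()[len("ORGANISM"):].strip()
    # Everything after the first line is assumed to be lineage text.
    lineage = " ".join(line.strip() for line in block_lines[1:])
    # The lineage is a ';'-separated list terminated by a period.
    taxonomy = [t.strip() for t in lineage.rstrip(".").split(";") if t.strip()]
    return name, taxonomy


naive_name, naive_taxonomy = naive_organism_parse(SAMPLE_BLOCK)
print(naive_name)
print(naive_taxonomy[:2])
```

With this input the wrapped tail of the name ends up glued to the first lineage entry, which matches the symptom reported above.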

Some of the other examples have made it onto web sites that now display the
mis-parsed data, e.g. SUPERFAMILY
http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/cgi-bin/gen_list.cgi?genome=x6
which shows the error for Salmonella enterica subsp. enterica serovar
Choleraesuis str. SC-B67.

I'll append the response from GenBank to this bug if and when I get one.  If I
don't get one, then I'll try to come up with a workaround.
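One possible shape for such a workaround is sketched below. It rests on an assumption that is a heuristic rather than part of the GenBank specification: lineage lines contain a ';' separator, while continuation lines of a wrapped organism name do not. The function name `split_organism_block` is hypothetical, and the sample block is an illustrative reconstruction, not text copied from a real record. A lineage with a single entry (no ';') would defeat this heuristic.

```python
def split_organism_block(block_lines):
    """Split a GenBank ORGANISM block into (organism name, taxonomy list).

    block_lines holds the ORGANISM line followed by its indented
    continuation lines. Assumption: the lineage starts at the first
    continuation line that contains ';'.
    """
    first = block_lines[0].strip()
    if not first.startswith("ORGANISM"):
        raise ValueError("expected the block to start with an ORGANISM line")
    name_parts = [first[len("ORGANISM"):].strip()]
    lineage_parts = []
    for line in block_lines[1:]:
        text = line.strip()
        if not text:
            continue
        # Once the lineage has started, every later line belongs to it.
        if lineage_parts or ";" in text:
            lineage_parts.append(text)
        else:
            # No ';' seen yet: treat this as a wrapped piece of the name.
            name_parts.append(text)
    organism = " ".join(name_parts)
    lineage = " ".join(lineage_parts).rstrip(".")
    taxonomy = [t.strip() for t in lineage.split(";") if t.strip()]
    return organism, taxonomy


# Illustrative input (an assumption, modelled on the example organism).
wrapped_block = [
    "  ORGANISM  Salmonella enterica subsp. enterica serovar Paratyphi A str.",
    "            AKU_12601",
    "            Bacteria; Proteobacteria; Gammaproteobacteria;",
    "            Enterobacteriales; Enterobacteriaceae; Salmonella.",
]
organism, taxonomy = split_organism_block(wrapped_block)
print(organism)
print(taxonomy)
```

Here the second line rejoins the organism name and the taxonomy list starts with "Bacteria", which is the result one would want from the parser.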



