[Biopython-dev] LocationParserError

Fri Mar 9 06:23:58 EST 2012

On Fri, Mar 9, 2012 at 10:06 AM, Matthias Bernt <MatatTHC at gmx.de> wrote:
> Just in case you need more test cases. I send all cases I found (all
> in mitochondria).

Trying with the current release (Biopython 1.59), I didn't get an
exception with NC_016406, but something wasn't quite right: I was
missing the external exon, which appears to be a bug in Entrez.

Here is NC_016406 from Entrez using GenBank (with parts):
http://www.ncbi.nlm.nih.gov/nuccore/NC_016406.1?report=gbwithparts&log$=seqview

     gene            join(complement(149815..150200),
                     complement(293787..295573),181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /db_xref="GeneID:11447159"
     CDS             join(complement(149815..150200),
                     complement(295492..295573),complement(293787..293978),
                     181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /codon_start=1
                     /transl_except=(pos:complement(150198..150200),aa:Met)
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_004935334.1"
                     /db_xref="GI:357967323"
                     /db_xref="GeneID:11447159"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGSKLILKEPISPSSANFSLFRMAPVTTFMLSLVARAVVPFDYGMVLSD
                     PNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVPYEVSIGLILITVLI
                     CVGPRNSSEIVMAQKQIWSGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGSALFFLGEYANMILMSGLCTLLSPGGWPPILDLPISKKIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGRKVFLPLSLARVVAVSGVLVTFQWLP"

Here is NC_016406 from Entrez using GenBank (default, not with parts):
http://www.ncbi.nlm.nih.gov/nuccore/NC_016406.1?report=genbank&log$=seqview

     gene            join(complement(149815..150200),
                     complement(293787..295573),NC_016402.1:6618..6676,
                     181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /db_xref="GeneID:11447159"
     CDS             join(complement(149815..150200),
                     complement(295492..295573),complement(293787..293978),
                     NC_016402.1:6618..6676,181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /codon_start=1
                     /transl_except=(pos:complement(150198..150200),aa:Met)
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_004935334.1"
                     /db_xref="GI:357967323"
                     /db_xref="GeneID:11447159"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGSKLILKEPISPSSANFSLFRMAPVTTFMLSLVARAVVPFDYGMVLSD
                     PNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVPYEVSIGLILITVLI
                     CVGPRNSSEIVMAQKQIWSGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGSALFFLGEYANMILMSGLCTLLSPGGWPPILDLPISKKIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGRKVFLPLSLARVVAVSGVLVTFQWLP"

Can you see the difference? Using GenBank "with parts", the external
location (NC_016402.1:6618..6676) in this gene and CDS feature has
been lost! I will report this bug to the NCBI.
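
For anyone who would rather not compare the two records by eye, here
is a minimal sketch that diffs the two CDS locations quoted above.
The helper split_join is hypothetical (written for this example, not
part of Biopython) and only assumes a single top-level join(...):

```python
def split_join(location):
    """Split a join(...) location string into its top-level parts.

    Hypothetical helper for illustration only, not a Biopython function.
    """
    inner = location[len("join("):-1]
    parts, depth, current = [], 0, ""
    for ch in inner:
        if ch == "," and depth == 0:
            # A comma outside any brackets separates two parts.
            parts.append(current)
            current = ""
        else:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            current += ch
    parts.append(current)
    return parts


# CDS location from the "with parts" report (line breaks removed).
with_parts = ("join(complement(149815..150200),"
              "complement(295492..295573),complement(293787..293978),"
              "181647..181905)")

# CDS location from the default GenBank report (line breaks removed).
default = ("join(complement(149815..150200),"
           "complement(295492..295573),complement(293787..293978),"
           "NC_016402.1:6618..6676,181647..181905)")

missing = [p for p in split_join(default) if p not in split_join(with_parts)]
print(missing)
```

The only part reported as missing is the external reference to
NC_016402.1.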

With that hurdle out of the way, I found the problem in Biopython:
the regular expression for an external sequence reference did not
allow an underscore, so a reference such as NC_016402.1:6618..6676
failed to parse. The fix is trivial. In Bio/GenBank/__init__.py we
replace this line:

_complex_location = r"([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" \

with:

_complex_location = r"([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" \

The commit does this and adds a few tests (and fixes a typo):
https://github.com/biopython/biopython/commit/16efc7bc51b5ccef7f81f443d4b52f490f6fc354
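
To see the effect of the extra underscore in isolation, here is a
small sketch using only the optional "accession.version:" prefix of
the two patterns quoted above. The location alternatives (the
%s|%s|... part) are deliberately left out, so this is not the full
Biopython expression, and the names old_ref and new_ref are made up
for this example:

```python
import re

# Prefix of the pattern before the fix: no underscore after the first letter.
old_ref = re.compile(r"[a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?\:")

# Prefix of the pattern after the fix: underscore allowed in the accession.
new_ref = re.compile(r"[a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:")

# The external exon from the NC_016406 record.
external = "NC_016402.1:6618..6676"

old_match = old_ref.match(external)
new_match = new_ref.match(external)

print(old_match)            # no match: the underscore stops the old pattern
print(new_match.group(0))   # the accession.version prefix, colon included

# An illustrative accession without an underscore matches either way.
plain = "J00194.1:100..202"
print(old_ref.match(plain).group(0), new_ref.match(plain).group(0))
```

RefSeq-style accessions (NC_, YP_ and so on) all contain an
underscore, which is why only those references tripped the parser.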

If you are happy installing from source, you can download the
latest code from GitHub, or via git at the command line.

Peter